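This post works through the fully-connected network part of Assignment 3 and summarizes what we have covered so far.
In the earlier assignments our networks were built in a fairly simple, non-modular way. A3 guides us through modular implementations of the pieces we used before, such as the linear layer, ReLU layer, and loss layer, plus a dropout layer (not covered earlier in this course, although it appears in cs231n) and several gradient-descent update rules (SGD, SGD+Momentum, RMSprop, Adam).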
class Linear(object):

    @staticmethod
    def forward(x, w, b):
        """
        Computes the forward pass for a linear (fully-connected) layer.
        The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
        examples, where each example x[i] has shape (d_1, ..., d_k). We will
        reshape each input into a vector of dimension D = d_1 * ... * d_k, and
        then transform it to an output vector of dimension M.
        Inputs:
        - x: A tensor containing input data, of shape (N, d_1, ..., d_k)
        - w: A tensor of weights, of shape (D, M)
        - b: A tensor of biases, of shape (M,)
        Returns a tuple of:
        - out: output, of shape (N, M)
        - cache: (x, w, b)
        """
        # Flatten each example into a row of length D, then apply x @ w + b.
        out = x.view(x.shape[0], -1).mm(w) + b
        cache = (x, w, b)
        return out, cache

    @staticmethod
    def backward(dout, cache):
        """
        Computes the backward pass for a linear layer.
        Inputs:
        - dout: Upstream derivative, of shape (N, M)
        - cache: Tuple of:
          - x: Input data, of shape (N, d_1, ... d_k)
          - w: Weights, of shape (D, M)
          - b: Biases, of shape (M,)
        Returns a tuple of:
        - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
        - dw: Gradient with respect to w, of shape (D, M)
        - db: Gradient with respect to b, of shape (M,)
        """
        x, w, b = cache
        db = dout.sum(dim=0)
        # Restore the original input shape after the matrix product.
        dx = dout.mm(w.t()).view(x.shape)
        dw = x.view(x.shape[0], -1).t().mm(dout)
        return dx, dw, db
class ReLU(object):

    @staticmethod
    def forward(x):
        """
        Computes the forward pass for a layer of rectified linear units (ReLUs).
        Input:
        - x: Input; a tensor of any shape
        Returns a tuple of:
        - out: Output, a tensor of the same shape as x
        - cache: x
        """
        # Work on a copy so the caller's tensor is not modified.
        out = x.clone()
        out[out < 0] = 0
        cache = x
        return out, cache

    @staticmethod
    def backward(dout, cache):
        """
        Computes the backward pass for a layer of rectified linear units (ReLUs).
        Input:
        - dout: Upstream derivatives, of any shape
        - cache: Input x, of same shape as dout
        Returns:
        - dx: Gradient with respect to x
        """
        x = cache
        # The gradient passes through only where the input was not negative.
        dx = dout.clone()
        dx[x < 0] = 0
        return dx
class Linear_ReLU(object):

    @staticmethod
    def forward(x, w, b):
        """
        Convenience layer that performs a linear transform followed by a ReLU.
        Inputs:
        - x: Input to the linear layer
        - w, b: Weights for the linear layer
        Returns a tuple of:
        - out: Output from the ReLU
        - cache: Object to give to the backward pass
        """
        a, fc_cache = Linear.forward(x, w, b)
        out, relu_cache = ReLU.forward(a)
        cache = (fc_cache, relu_cache)
        return out, cache

    @staticmethod
    def backward(dout, cache):
        """
        Backward pass for the linear-relu convenience layer
        """
        fc_cache, relu_cache = cache
        # Undo the layers in reverse order: ReLU first, then linear.
        da = ReLU.backward(dout, relu_cache)
        dx, dw, db = Linear.backward(da, fc_cache)
        return dx, dw, db
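As the code above shows, the forward and backward passes of the linear and ReLU layers can be implemented separately; the derivation is discussed in my previous post: https://www.cnblogs.com/dyccyber/p/17764347.html
The one difference is that x must first be reshaped into an N×D matrix before it can be multiplied with the weight matrix.
With linear and ReLU implemented separately, and because network architectures so often place a ReLU immediately after a linear layer, we can add a Linear_ReLU class that chains the forward and backward passes of the two layers.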
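To make the reshape step concrete, here is a minimal plain-Python sketch in which nested lists stand in for tensors. The helper names `flatten_rows` and `linear_forward` are made up for this illustration and are not part of the assignment code.

```python
# Hypothetical helpers; plain Python lists stand in for tensors.
def flatten_rows(x):
    """Flatten each example of a nested list (N, d_1, ..., d_k) into a row of length D."""
    def flat(item):
        if isinstance(item, list):
            out = []
            for sub in item:
                out.extend(flat(sub))
            return out
        return [item]
    return [flat(example) for example in x]


def linear_forward(x, w, b):
    """out[n][m] = sum_d x_flat[n][d] * w[d][m] + b[m]."""
    rows = flatten_rows(x)
    return [[sum(r[d] * w[d][m] for d in range(len(r))) + b[m]
             for m in range(len(b))]
            for r in rows]


x = [[[1.0, 2.0], [3.0, 4.0]],     # example 0, shape (2, 2)
     [[0.0, 1.0], [0.0, -1.0]]]    # example 1, shape (2, 2)
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]  # shape (D=4, M=2)
b = [0.5, -0.5]

out = linear_forward(x, w, b)
print(out)  # [[4.5, 12.5], [0.5, -1.5]]
```

Each 2×2 example becomes a row of length D = 4, so the product with the (4, 2) weight matrix is well defined and the output has shape (N, M) = (2, 2).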
def svm_loss(x, y):
    """
    Computes the loss and gradient for multiclass SVM classification.
    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C
    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    N = x.shape[0]
    correct_class_scores = x[torch.arange(N), y]
    # Hinge margins with delta = 1; the correct class contributes nothing.
    margins = (x - correct_class_scores[:, None] + 1.0).clamp(min=0.)
    margins[torch.arange(N), y] = 0.
    loss = margins.sum() / N
    # Each violated margin adds +1 for that class and -1 for the correct class.
    num_pos = (margins > 0).sum(dim=1)
    dx = torch.zeros_like(x)
    dx[margins > 0] = 1.
    dx[torch.arange(N), y] -= num_pos.to(dx.dtype)
    dx /= N
    return loss, dx


def softmax_loss(x, y):
    """
    Computes the loss and gradient for softmax classification.
    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C
    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted_logits = x - x.max(dim=1, keepdim=True).values
    Z = shifted_logits.exp().sum(dim=1, keepdim=True)
    log_probs = shifted_logits - Z.log()
    probs = log_probs.exp()
    N = x.shape[0]
    loss = (-1.0 / N) * log_probs[torch.arange(N), y].sum()
    # Gradient is (probs - one_hot(y)) / N.
    dx = probs.clone()
    dx[torch.arange(N), y] -= 1
    dx /= N
    return loss, dx
We implemented these loss layers earlier. Their derivation needs some matrix calculus; the following two posts cover it:
http://giantpandacv.com/academic/演演算法科普/深度學習基礎/SVM Loss以及梯度推導/
https://blog.csdn.net/qq_27261889/article/details/82915598
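As a quick sanity check on the softmax gradient, the following plain-Python sketch handles a single example and compares the analytic gradient (probabilities minus the one-hot label) against centered finite differences. The function names are illustrative only, and the scalar code is a stand-in for the batched tensor version, not a replacement for it.

```python
import math


def softmax_loss_single(scores, label):
    """Loss and gradient for one example: loss = -log p[label], grad = p - onehot."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[label])
    grad = [p - (1.0 if j == label else 0.0) for j, p in enumerate(probs)]
    return loss, grad


def numeric_grad(scores, label, h=1e-6):
    """Centered finite differences, used only to check the analytic gradient."""
    grad = []
    for j in range(len(scores)):
        plus = list(scores)
        plus[j] += h
        minus = list(scores)
        minus[j] -= h
        diff = softmax_loss_single(plus, label)[0] - softmax_loss_single(minus, label)[0]
        grad.append(diff / (2 * h))
    return grad


scores, label = [2.0, -1.0, 0.5], 0
loss, analytic = softmax_loss_single(scores, label)
numeric = numeric_grad(scores, label)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(loss, analytic, max_diff)
```

The gradient entries sum to zero, and the entry for the correct class is negative, so a gradient step raises that class's score relative to the others.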
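For the multi-layer network, start with the class initialization. The architecture is {linear - relu - [dropout]} x (L - 1) - linear - softmax: L - 1 blocks of linear, ReLU, and optional dropout, followed by a final linear-softmax stage that produces the output. Initialization therefore loops over the hidden layers to create each weight matrix and bias vector, then initializes the last linear layer; take care that the matrix dimensions line up.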
class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function.
    For a network with L layers, the architecture will be:
    {linear - relu - [dropout]} x (L - 1) - linear - softmax
    where dropout is optional, and the {...} block is repeated L - 1 times.
    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0.0, reg=0.0, weight_scale=1e-2, seed=None,
                 dtype=torch.float, device='cpu'):
        """
        Initialize a new FullyConnectedNet.
        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving the drop probability for networks
          with dropout. If dropout=0 then the network should not use dropout.
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        - dtype: A torch data type object; all computations will be performed using
          this datatype. float is faster but less accurate, so you should use
          double for numeric gradient checking.
        - device: device to use for computation. 'cpu' or 'cuda'
        """
        self.use_dropout = dropout != 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}
        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        ############################################################################
        # Replace "pass" statement with your code
        last_dim = input_dim
        # Hidden layers: W_i has shape (previous size, hidden size).
        for i, hidden_dim in enumerate(hidden_dims, start=1):
            self.params['W{}'.format(i)] = weight_scale * torch.randn(
                last_dim, hidden_dim, dtype=dtype, device=device)
            self.params['b{}'.format(i)] = torch.zeros(
                hidden_dim, dtype=dtype, device=device)
            last_dim = hidden_dim
        # Final linear layer maps the last hidden size to the class scores.
        i = self.num_layers
        self.params['W{}'.format(i)] = weight_scale * torch.randn(
            last_dim, num_classes, dtype=dtype, device=device)
        self.params['b{}'.format(i)] = torch.zeros(
            num_classes, dtype=dtype, device=device)
        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed
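The dimension bookkeeping is the easiest part of initialization to get wrong, so here is a small sketch that only enumerates the parameter shapes. `layer_shapes` is a hypothetical helper written for this post; it mirrors the W1/b1, W2/b2, ... naming used above but creates no tensors.

```python
def layer_shapes(input_dim, hidden_dims, num_classes):
    """Return {name: shape} for W1..WL and b1..bL, where L = len(hidden_dims) + 1."""
    dims = [input_dim] + list(hidden_dims) + [num_classes]
    shapes = {}
    for i in range(1, len(dims)):
        # Layer i maps dims[i - 1] inputs to dims[i] outputs.
        shapes['W{}'.format(i)] = (dims[i - 1], dims[i])
        shapes['b{}'.format(i)] = (dims[i],)
    return shapes


shapes = layer_shapes(3 * 32 * 32, [100, 50], 10)
for name in sorted(shapes):
    print(name, shapes[name])
```

With two hidden layers of sizes 100 and 50, this lists W1 as (3072, 100), W2 as (100, 50), and W3 as (50, 10): each layer's output size is the next layer's input size.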
Next, we can define save and load functions to store and restore the model parameters and related settings:
    def save(self, path):
        checkpoint = {
            'reg': self.reg,
            'dtype': self.dtype,
            'params': self.params,
            'num_layers': self.num_layers,
            'use_dropout': self.use_dropout,
            'dropout_param': self.dropout_param,
        }
        torch.save(checkpoint, path)
        print("Saved in {}".format(path))

    def load(self, path, dtype, device):
        checkpoint = torch.load(path, map_location='cpu')
        self.params = checkpoint['params']
        self.dtype = dtype
        self.reg = checkpoint['reg']
        self.num_layers = checkpoint['num_layers']
        self.use_dropout = checkpoint['use_dropout']
        self.dropout_param = checkpoint['dropout_param']
        # Cast every parameter to the requested dtype and move it to the device.
        for p in self.params:
            self.params[p] = self.params[p].type(dtype).to(device)
        print("load checkpoint file: {}".format(path))
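Finally come the forward and backward passes. These simply reuse the forward and backward functions of the linear and ReLU layers built above; just follow the network structure and keep the layers in the right order.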
    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.
        Input / output: Same as TwoLayerNet above.
        """
        X = X.to(self.dtype)
        mode = 'test' if y is None else 'train'
        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        ############################################################################
        # Replace "pass" statement with your code
        cache_dict = {}
        last_out = X
        # Hidden blocks: linear - relu - [dropout], repeated num_layers - 1 times.
        for i in range(1, self.num_layers):
            last_out, cache_dict['cache_LR{}'.format(i)] = Linear_ReLU.forward(
                last_out, self.params['W{}'.format(i)], self.params['b{}'.format(i)])
            if self.use_dropout:
                last_out, cache_dict['cache_Dropout{}'.format(i)] = Dropout.forward(
                    last_out, self.dropout_param)
        # Final linear layer produces the class scores.
        i = self.num_layers
        last_out, cache_dict['cache_L{}'.format(i)] = Linear.forward(
            last_out, self.params['W{}'.format(i)], self.params['b{}'.format(i)])
        scores = last_out
        # If test mode return early
        if mode == 'test':
            return scores
        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # Replace "pass" statement with your code
        loss, dout = softmax_loss(scores, y)
        # The L2 penalty used here is reg * sum(W^2), whose gradient is 2 * reg * W.
        # With the 0.5 factor named in the TODO above, these terms would instead be
        # 0.5 * reg * sum(W^2) and reg * W.
        W = self.params['W{}'.format(i)]
        loss += (W * W).sum() * self.reg
        last_dout, dw, db = Linear.backward(dout, cache_dict['cache_L{}'.format(i)])
        grads['W{}'.format(i)] = dw + 2 * W * self.reg
        grads['b{}'.format(i)] = db
        # Walk the hidden blocks in reverse: dropout first, then linear-relu.
        for i in reversed(range(1, self.num_layers)):
            if self.use_dropout:
                last_dout = Dropout.backward(
                    last_dout, cache_dict['cache_Dropout{}'.format(i)])
            last_dout, dw, db = Linear_ReLU.backward(
                last_dout, cache_dict['cache_LR{}'.format(i)])
            W = self.params['W{}'.format(i)]
            grads['W{}'.format(i)] = dw + 2 * W * self.reg
            grads['b{}'.format(i)] = db
            loss += (W * W).sum() * self.reg
        return loss, grads
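Since the main pitfall here is layer ordering, the following sketch just lists the order in which layers run in the forward pass and the mirrored order for the backward pass. `forward_order` and the layer labels are invented for this illustration.

```python
def forward_order(num_layers, use_dropout):
    """List layer labels in forward order for {linear - relu - [dropout]} x (L - 1) - linear - softmax."""
    order = []
    for i in range(1, num_layers):
        order.append('linear_relu{}'.format(i))
        if use_dropout:
            order.append('dropout{}'.format(i))
    order.append('linear{}'.format(num_layers))
    order.append('softmax')
    return order


fwd = forward_order(3, use_dropout=True)
bwd = list(reversed(fwd))  # the backward pass visits the same layers in reverse
print(fwd)
print(bwd)
```

In the backward list each dropout entry comes before its matching linear_relu entry, which is why the backward loop calls Dropout.backward before Linear_ReLU.backward.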
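Implementing SGD, SGD+Momentum, RMSprop, and Adam (Momentum + RMSprop + bias correction)
For the underlying theory, see my earlier post: https://www.cnblogs.com/dyccyber/p/17759697.html
One point worth calling out: Adam adds a bias correction term. Because m and v start at zero, the raw moving averages are biased toward zero during the first iterations; the correction compensates for this so that the early updates are not too large.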
def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.
    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    # In-place update: the caller's w is modified.
    w -= config['learning_rate'] * dw
    return w, config


def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.
    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A tensor of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', torch.zeros_like(w))
    next_w = None
    #############################################################################
    # TODO: Implement the momentum update formula. Store the updated value in   #
    # the next_w variable. You should also use and update the velocity v.       #
    #############################################################################
    # Replace "pass" statement with your code
    # Decay the old velocity, add the new gradient step, then move along it.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    config['velocity'] = v
    return next_w, config
def rmsprop(w, dw, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.
    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', torch.zeros_like(w))
    next_w = None
    ###########################################################################
    # TODO: Implement the RMSprop update formula, storing the next value of w #
    # in the next_w variable. Don't forget to update cache value stored in    #
    # config['cache'].                                                        #
    ###########################################################################
    # Replace "pass" statement with your code
    # Moving average of squared gradients, then a per-parameter scaled step.
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw**2
    w += -config['learning_rate'] * dw / (torch.sqrt(config['cache']) + config['epsilon'])
    next_w = w
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return next_w, config
def adam(w, dw, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.
    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', torch.zeros_like(w))
    config.setdefault('v', torch.zeros_like(w))
    config.setdefault('t', 0)
    next_w = None
    #############################################################################
    # TODO: Implement the Adam update formula, storing the next value of w in   #
    # the next_w variable. Don't forget to update the m, v, and t variables     #
    # stored in config.                                                         #
    #                                                                           #
    # NOTE: In order to match the reference output, please modify t _before_    #
    # using it in any calculations.                                             #
    #############################################################################
    # Replace "pass" statement with your code
    config['t'] += 1
    # First moment (momentum-like) and its bias-corrected version.
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
    mt = config['m'] / (1 - config['beta1']**config['t'])
    # Second moment (RMSprop-like) and its bias-corrected version.
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dw * dw)
    vc = config['v'] / (1 - (config['beta2']**config['t']))
    w = w - (config['learning_rate'] * mt) / (torch.sqrt(vc) + config['epsilon'])
    next_w = w
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return next_w, config
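To see what the bias correction does in practice, here is a scalar sketch of the very first Adam step, starting from m = v = 0 with the default hyperparameters used above. `adam_first_step` and its `correct_bias` flag are hypothetical and exist only for this comparison.

```python
import math


def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, correct_bias=True):
    """Size of the very first Adam update for a scalar gradient, starting from m = v = 0."""
    t = 1
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    if correct_bias:
        # Undo the shrinkage caused by initializing the moving averages at zero.
        m = m / (1 - beta1 ** t)
        v = v / (1 - beta2 ** t)
    return lr * m / (math.sqrt(v) + eps)


with_correction = adam_first_step(0.5)
without_correction = adam_first_step(0.5, correct_bias=False)
print(with_correction, without_correction, without_correction / with_correction)
```

With these defaults the corrected first step is essentially equal to the learning rate, while the uncorrected one is about sqrt(10) ≈ 3.16 times larger, because v shrinks by a factor of 1 - beta2 while m shrinks only by 1 - beta1.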
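Note that in the multi-layer fully-connected network above, dropout is applied only during training and is not used at test time.
Dropout is a simple and very effective regularization method. During training, each neuron's output is randomly either kept active or set to zero. In the code here p is the probability of dropping: each output is set to zero with probability p and kept otherwise (the original post illustrates this with a figure).
Seen another way, dropout samples sub-networks from the full network, which reduces complex co-adaptation between neurons.
The original paper: https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
Code implementation: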
class Dropout(object):

    @staticmethod
    def forward(x, dropout_param):
        """
        Performs the forward pass for (inverted) dropout.
        Inputs:
        - x: Input data: tensor of any shape
        - dropout_param: A dictionary with the following keys:
          - p: Dropout parameter. We *drop* each neuron output with probability p.
          - mode: 'test' or 'train'. If the mode is train, then perform dropout;
            if the mode is test, then just return the input.
          - seed: Seed for the random number generator. Passing seed makes this
            function deterministic, which is needed for gradient checking but not
            in real networks.
        Outputs:
        - out: Tensor of the same shape as x.
        - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
          mask that was used to multiply the input; in test mode, mask is None.
        NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
        See http://cs231n.github.io/neural-networks-2/#reg for more details.
        NOTE 2: Keep in mind that p is the probability of **dropping** a neuron
        output; this might be contrary to some sources, where it is referred to
        as the probability of keeping a neuron output.
        """
        p, mode = dropout_param['p'], dropout_param['mode']
        if 'seed' in dropout_param:
            torch.manual_seed(dropout_param['seed'])
        mask = None
        out = None
        if mode == 'train':
            ###########################################################################
            # TODO: Implement training phase forward pass for inverted dropout.       #
            # Store the dropout mask in the mask variable.                            #
            ###########################################################################
            # Replace "pass" statement with your code
            # Keep each entry with probability 1 - p and scale the survivors by
            # 1 / (1 - p), so the expected activation matches test time.
            keep = torch.rand(x.shape, device=x.device) > p
            mask = keep.to(x.dtype) / (1 - p)
            out = x * mask
            ###########################################################################
            #                             END OF YOUR CODE                            #
            ###########################################################################
        elif mode == 'test':
            ###########################################################################
            # TODO: Implement the test phase forward pass for inverted dropout.       #
            ###########################################################################
            # Replace "pass" statement with your code
            out = x
        cache = (dropout_param, mask)
        return out, cache

    @staticmethod
    def backward(dout, cache):
        """
        Perform the backward pass for (inverted) dropout.
        Inputs:
        - dout: Upstream derivatives, of any shape
        - cache: (dropout_param, mask) from Dropout.forward.
        """
        dropout_param, mask = cache
        mode = dropout_param['mode']
        dx = None
        if mode == 'train':
            ###########################################################################
            # TODO: Implement training phase backward pass for inverted dropout       #
            ###########################################################################
            # Replace "pass" statement with your code
            # Same mask as the forward pass; dout itself is left unmodified.
            dx = dout * mask
        elif mode == 'test':
            dx = dout
        return dx
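The docstring above asks for inverted dropout with p as the drop probability. As a rough illustration of why the 1 / (1 - p) scaling matters, here is a plain-Python sketch on a list of ones. `inverted_dropout` is a made-up helper that uses the standard `random` module rather than torch.

```python
import random


def inverted_dropout(values, p, rng):
    """Drop each value with probability p and scale survivors by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    # A draw above p keeps the value, which happens with probability 1 - p.
    mask = [scale if rng.random() > p else 0.0 for _ in values]
    out = [v * m for v, m in zip(values, mask)]
    return out, mask


rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.3
values = [1.0] * 100000
out, mask = inverted_dropout(values, p, rng)

mean_out = sum(out) / len(out)
drop_fraction = sum(1 for m in mask if m == 0.0) / len(mask)
print(mean_out, drop_fraction)
```

Roughly 30% of the entries are zeroed, yet the mean of the output stays close to 1, so no rescaling is needed at test time, where the layer simply returns its input.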