【論文翻譯】Deep learning

2020-08-12 10:16:23

論文題目:Deep Learning
論文來源:Nature
翻譯人:BDML@CQUT實驗室

Deep learning

Yann LeCun, Yoshua Bengio & Geoffrey Hinton

深度學習

Yann LeCun, Yoshua Bengio & Geoffrey Hinton

Abstract

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

摘要

深度學習允許由多個處理層組成的計算模型學習具有多個抽象級別的數據表示。這些方法極大地改善了語音識別、視覺物件識別、物件檢測以及藥物發現和基因組學等許多其他領域的最新技術水平。深度學習通過使用反向傳播演算法來指示機器應如何更改其內部參數(這些參數用於從上一層的表示計算每一層的表示),從而發現大型數據集中的複雜結構。深度卷積網路在處理影象、視訊、語音和音訊方面帶來了突破,而遞回網路則在諸如文字和語音之類的序列數據上表現出色。

Introduction

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

引言

機器學習技術爲現代社會的各個方面提供了強大的動力:從網路搜尋到社羣網路上的內容過濾,再到電子商務網站上的推薦,並且它在諸如相機和智慧手機之類的消費產品中越來越多地出現。機器學習系統用於識別影象中的物件,將語音轉錄爲文字,使新聞項、貼文或產品與使用者的興趣相匹配,以及選擇相關的搜尋結果。這些應用程式越來越多地使用一類稱爲深度學習的技術。

傳統的機器學習技術在處理原始形式的自然數據方面能力有限。幾十年來,構建模式識別或機器學習系統需要認真的工程設計和相當多的領域專業知識,才能設計出一個特徵提取器,將原始數據(例如影象的畫素值)轉換爲合適的內部表示或特徵向量,使學習子系統(通常是分類器)能夠檢測或分類輸入中的模式。

表示學習是一組方法,允許向機器提供原始數據並自動發現檢測或分類所需的表示。深度學習方法是具有多層表示的表示學習方法,它通過組合簡單但非線性的模組而獲得,每個模組都將一個級別(從原始輸入開始)的表示轉換爲更高、更抽象一些的級別的表示。組合足夠多的此類轉換,就可以學習非常複雜的函數。對於分類任務,較高的表示層會放大輸入中對區分很重要的方面,並抑制不相關的變化。例如,影象以畫素值陣列的形式出現,在第一層表示中學習到的特徵通常表示影象中特定方向和位置上邊緣的存在與否。第二層通常通過發現邊緣的特定排列來檢測圖案,而不受邊緣位置微小變化的影響。第三層可以將圖案組裝成與熟悉物件的部件相對應的更大組合,隨後的層則將物件檢測爲這些部件的組合。深度學習的關鍵之處在於,這些特徵層不是由人類工程師設計的:它們是使用通用學習過程從數據中學習到的。

深度學習在解決多年來一直抵制人工智慧社羣最佳嘗試的問題方面取得了重大進展。事實證明,它非常善於發現高維數據中的複雜結構,因此適用於科學,商業和政府的許多領域。除了在影象識別和語音識別方面打破記錄之外,它還在預測潛在藥物分子的活性、分析粒子加速器數據、重建大腦回路以及預測非編碼DNA突變對基因表達和疾病的影響方面超越了其他機器學習技術。更令人驚喜的是,深度學習爲自然語言理解中的各種任務(特別是主題分類,情感分析,問題解答和語言翻譯)產生了非常有前景的結果。

我們認爲深度學習將在不久的將來取得更多的成功,因爲它只需要很少的人工工程,因此可以輕鬆地利用可用計算量和數據量的增長。目前正在爲深度神經網路開發的新學習演算法和體系結構只會加速這一進展。

Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
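To make the procedure above concrete, here is a minimal NumPy sketch of stochastic gradient descent on a two-class linear classifier with a squared-error objective. It is not code from the paper; the toy data, minibatch size and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class problem: 200 examples with 5 hand-engineered features each (assumed data).
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)   # desired outputs, 0 or 1

w, b = np.zeros(5), 0.0                           # adjustable weights: the 'knobs'
lr = 0.1                                          # learning rate (assumed)

def scores(Xb):
    # A two-class linear classifier: a weighted sum of the feature-vector components.
    return Xb @ w + b

for epoch in range(20):
    order = rng.permutation(len(X))
    for i in range(0, len(X), 16):                # small sets of examples ('minibatches')
        idx = order[i:i + 16]
        out = scores(X[idx])
        err = out - y[idx]                        # dE/dout for the objective 0.5*(out - y)^2
        w -= lr * (X[idx].T @ err) / len(idx)     # step opposite to the average gradient
        b -= lr * err.mean()

# In practice generalization is measured on a separate test set; here we only report
# accuracy on the training data to keep the sketch short.
print("training accuracy:", float(((scores(X) > 0.5) == (y > 0.5)).mean()))
```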

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other.

A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

監督學習

不論深度與否,最常見的機器學習形式都是監督學習。想象一下,我們想建立一個可以將影象分類爲包含房屋、汽車、人或寵物的系統。我們首先收集大量的房屋、汽車、人和寵物的影象數據集,每張影象均標有其類別。在訓練過程中,機器會看到一張影象,並以分數向量的形式產生輸出,每個類別對應一個分數。我們希望期望的類別在所有類別中得分最高,但這在訓練之前不太可能發生。我們計算一個目標函數,該函數度量輸出得分與期望得分模式之間的誤差(或距離)。然後機器修改其內部可調參數以減小此誤差。這些可調參數(通常稱爲權重)是實數,可以看作是定義機器輸入輸出函數的「旋鈕」。在典型的深度學習系統中,可能會有數以億計的可調權重,以及數以億計的帶標籤範例,用於訓練機器。

爲了適當地調整權重向量,學習演算法計算一個梯度向量,該梯度向量針對每個權重指示,如果權重增加很小的量,誤差將增加或減少的量。然後沿與梯度向量相反的方向調整權重向量。

在所有訓練樣本上取平均的目標函數,可以看作是權重值高維空間中的一種丘陵地形。負梯度向量指示該地形中最陡下降的方向,使其更接近最小值,在那裏輸出誤差平均較低。

在實踐中,大多數實踐者使用一種稱爲隨機梯度下降(SGD)的程式。它包括:給出幾個範例的輸入向量,計算輸出和誤差,計算這些範例的平均梯度,並相應地調整權重。對訓練集中的許多小樣本集重複這個過程,直到目標函數的平均值停止下降。之所以稱之爲隨機,是因爲每個小樣本集都給出了所有樣本平均梯度的帶噪聲估計。與更精細的優化技術18相比,這個簡單的過程通常能出乎意料地快速找到一組好的權重。訓練之後,系統的效能將在另一組稱爲測試集的範例上進行測量。這用於測試機器的泛化能力——它對訓練中從未見過的新輸入產生合理答案的能力。

目前機器學習的許多實際應用是在手工設計的特徵上使用線性分類器。兩類線性分類器計算特徵向量分量的加權和。如果加權和高於閾值,則輸入被分類爲屬於特定類別。

自20世紀60年代以來,我們就知道線性分類器只能將輸入空間分割成非常簡單的區域,即由超平面19分隔的半空間。但是,像影象和語音識別這樣的問題要求輸入-輸出函數對輸入的不相關變化不敏感,例如物體的位置、方向或光照的變化,或者語音的音高或口音的變化,同時對特定的細微變化非常敏感(例如,白狼與一種外形像狼、名爲薩摩耶的白色犬種之間的區別)。在畫素級別,處於不同姿勢和不同環境中的兩張薩摩耶犬影象可能彼此非常不同,而處於相同位置且背景相似的一張薩摩耶犬影象和一張狼的影象可能彼此非常相似。

線性分類器或任何其他對原始畫素進行操作的「淺層」分類器都不可能在區分後兩者的同時,將前兩者歸爲同一類別。這就是爲什麼淺層分類器需要一個好的特徵提取器來解決選擇性-不變性難題:它產生的表示對影象中對辨別重要的方面具有選擇性,但對諸如動物姿態之類的不相關方面保持不變。爲了使分類器更強大,可以使用通用的非線性特徵,如核方法,但是諸如高斯核所產生的那些通用特徵無法讓學習器在遠離訓練範例的地方很好地泛化。傳統的選擇是手工設計好的特徵提取器,這需要大量的工程技術和領域專業知識。但是,如果可以使用通用學習過程自動學習好的特徵,則可以完全避免這些問題。這是深度學習的關鍵優勢。

深度學習架構是簡單模組的多層堆疊,所有(或大多數)模組都要經過學習,並且其中許多模組都計算非線性的輸入-輸出對映。堆疊中的每個模組都會轉換其輸入,以同時增加表示的選擇性和不變性。系統具有多個非線性層(例如深度爲5到20)時,可以實現其輸入的極爲複雜的函數,這些函數同時對細微的細節敏感(區分薩摩耶犬與白狼),並且對諸如背景、姿勢、光照和周圍物體等大的無關變化不敏感。

Backpropagation to train multilayer architectures

From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier $f(z) = \max(z, 0)$. In past decades, neural nets used smoother non-linearities, such as $\tanh(z)$ or $1/(1 + \exp(-z))$, but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).
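As a small illustration of the layer computation just described (a weighted sum of inputs from the layer below followed by a non-linearity), the following NumPy sketch runs a forward pass through a network with two hidden ReLU layers. The layer sizes and random weights are assumptions made only for this example.

```python
import numpy as np

def relu(z):
    # The half-wave rectifier f(z) = max(z, 0).
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
x = rng.normal(size=8)                       # a fixed-size input vector (assumed size 8)

# Randomly initialized weights and biases for two hidden layers and an output layer.
W1, b1 = 0.1 * rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = 0.1 * rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = 0.1 * rng.normal(size=(4, 16)), np.zeros(4)

h1 = relu(W1 @ x + b1)                       # hidden units: weighted sum, then non-linearity
h2 = relu(W2 @ h1 + b2)
scores = W3 @ h2 + b3                        # one score per output category
print(scores)
```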

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

反向傳播訓練多層架構

從模式識別的早期開始,研究人員的目標就是用可訓練的多層網路取代手工設計的特徵,但儘管解決方案很簡單,直到20世紀80年代中期它才被廣泛理解。事實證明,多層體系結構可以通過簡單的隨機梯度下降來訓練。只要模組是其輸入及其內部權重的相對平滑的函數,就可以使用反向傳播過程來計算梯度。在20世紀70年代和80年代,幾個不同的研究小組各自獨立地發現了這一點是可行且有效的。

用於計算目標函數相對於多層模組堆疊中權重的梯度的反向傳播過程,無非是導數鏈式法則的實際應用。關鍵的洞察是,目標相對於某個模組輸入的導數(或梯度)可以由相對於該模組輸出(或後續模組輸入)的梯度反向計算得到(圖1)。反向傳播方程式可以反覆應用,使梯度通過所有模組傳播,從頂部的輸出(網路產生預測的地方)一直到底部(外部輸入被饋送的地方)。一旦計算出這些梯度,就可以直接計算相對於每個模組權重的梯度。

深度學習的許多應用使用前饋神經網路架構(圖1),該架構學習將固定大小的輸入(例如,影象)對映到固定大小的輸出(例如,幾個類別中每一個的概率)。爲了從一層到下一層,一組單元計算來自上一層輸入的加權和,並將結果傳遞給一個非線性函數。目前最流行的非線性函數是整流線性單元(ReLU),即半波整流器 $f(z)=\max(z,0)$。在過去的幾十年裡,神經網路使用更平滑的非線性函數,例如 $\tanh(z)$ 或 $1/(1+\exp(-z))$,但ReLU通常在多層網路中學習得更快,使得無需無監督預訓練即可訓練深度監督網路。不在輸入層或輸出層中的單元通常稱爲隱藏單元。隱藏層可以被視爲以非線性方式扭曲輸入,使得類別可以被最後一層線性分離(圖1)。

figure 1

Figure 1 | Multilayer neural networks and backpropagation. a, A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of $x$ on $y$, and that of $y$ on $z$) are composed. A small change $\Delta x$ in $x$ gets transformed first into a small change $\Delta y$ in $y$ by getting multiplied by $\partial y/\partial x$ (that is, the definition of partial derivative). Similarly, the change $\Delta y$ creates a change $\Delta z$ in $z$. Substituting one equation into the other gives the chain rule of derivatives: how $\Delta x$ gets turned into $\Delta z$ through multiplication by the product of $\partial y/\partial x$ and $\partial z/\partial y$. It also works when $x$, $y$ and $z$ are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input $z$ to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function $f(\cdot)$ is applied to $z$ to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) $f(z) = \max(0, z)$, commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent $f(z) = (\exp(z) - \exp(-z))/(\exp(z) + \exp(-z))$ and the logistic function $f(z) = 1/(1 + \exp(-z))$. d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of $f(z)$. At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives $y_l - t_l$ if the cost function for unit $l$ is $0.5(y_l - t_l)^2$, where $t_l$ is the target value. Once $\partial E/\partial z_k$ is known, the error derivative for the weight $w_{jk}$ on the connection from unit $j$ in the layer below is just $y_j \, \partial E/\partial z_k$.

**圖1:多層神經網路和反向傳播。** a, 多層神經網路(由連線的點表示)可以使輸入空間變形,以使數據類別(範例位於紅線和藍線上)變得線性可分。請注意輸入空間中的規則網格(如左圖所示)如何也被隱藏單元轉換(如中圖所示)。這是一個僅具有兩個輸入單元、兩個隱藏單元和一個輸出單元的說明性範例,但用於物件識別或自然語言處理的網路包含數萬或數十萬個單元。經C. Olah(http://colah.github.io/)許可轉載。**b,** 導數的鏈式法則告訴我們兩個小的影響($x$ 的微小變化對 $y$ 的影響,以及 $y$ 對 $z$ 的影響)是如何複合的。$x$ 中的小變化 $\Delta x$ 首先通過乘以 $\partial y/\partial x$(即偏導數的定義)轉換爲 $y$ 中的小變化 $\Delta y$。類似地,變化 $\Delta y$ 在 $z$ 中產生變化 $\Delta z$。將一個方程式代入另一個方程式即得出導數的鏈式法則:$\Delta x$ 如何通過乘以 $\partial y/\partial x$ 與 $\partial z/\partial y$ 的乘積而轉換爲 $\Delta z$。當 $x$、$y$ 和 $z$ 是向量(導數是Jacobian矩陣)時,它同樣適用。**c,** 用於計算前向傳播的方程式,該神經網路有兩個隱藏層和一個輸出層,每一層構成一個模組,梯度可以通過這些模組反向傳播。在每一層,我們首先計算每個單元的總輸入 $z$,它是下面一層中各單元輸出的加權和。然後將非線性函數 $f(\cdot)$ 應用於 $z$,以獲得該單元的輸出。爲簡單起見,我們省略了偏置項。神經網路中使用的非線性函數包括近年來常用的整流線性單元(ReLU)$f(z)=\max(0,z)$,以及更傳統的S型函數,例如雙曲正切 $f(z)=(\exp(z)-\exp(-z))/(\exp(z)+\exp(-z))$ 和邏輯函數 $f(z)=1/(1+\exp(-z))$。**d,** 用於計算反向傳播的方程式。在每個隱藏層,我們計算誤差相對於每個單元輸出的導數,它是誤差相對於上面一層中各單元總輸入的導數的加權和。然後,通過乘以 $f(z)$ 的梯度,將相對於輸出的誤差導數轉換爲相對於輸入的誤差導數。在輸出層,通過對成本函數求導來計算誤差相對於單元輸出的導數。如果單元 $l$ 的成本函數爲 $0.5(y_l-t_l)^2$,則該導數爲 $y_l-t_l$,其中 $t_l$ 是目標值。一旦知道了 $\partial E/\partial z_k$,來自下面一層單元 $j$ 的連線上權重 $w_{jk}$ 的誤差導數就是 $y_j\,\partial E/\partial z_k$。
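The equations sketched in Fig. 1c and 1d translate almost line for line into code. Below is a minimal NumPy sketch of one forward and one backward pass through a two-hidden-layer ReLU network with the per-unit cost $0.5(y_l - t_l)^2$. The layer sizes, the random input and target, and the random initial weights are assumptions made only for illustration (bias terms are omitted, as in the figure).

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(z, 0.0)

# Assumed sizes: input 8, two hidden layers of 16 units, 4 linear output units.
x = rng.normal(size=8)                    # input (illustrative)
t = rng.normal(size=4)                    # target values t_l (illustrative)
W1 = 0.1 * rng.normal(size=(16, 8))
W2 = 0.1 * rng.normal(size=(16, 16))
W3 = 0.1 * rng.normal(size=(4, 16))

# Forward pass (Fig. 1c): total input z to each unit, then the non-linearity.
z1 = W1 @ x;  h1 = relu(z1)
z2 = W2 @ h1; h2 = relu(z2)
y = W3 @ h2                               # linear output units y_l

# Backward pass (Fig. 1d): differentiate the cost E = sum_l 0.5*(y_l - t_l)^2,
# then repeatedly apply the chain rule to propagate the error derivatives down.
dE_dy = y - t                             # error derivative at the output units
dE_dz3 = dE_dy                            # output units are linear in this sketch
dE_dh2 = W3.T @ dE_dz3                    # weighted sum of derivatives from the layer above
dE_dz2 = dE_dh2 * (z2 > 0)                # multiply by the gradient of the ReLU
dE_dh1 = W2.T @ dE_dz2
dE_dz1 = dE_dh1 * (z1 > 0)

# Gradients with respect to the weights: for weight w_jk this is y_j * dE/dz_k.
grad_W3 = np.outer(dE_dz3, h2)
grad_W2 = np.outer(dE_dz2, h1)
grad_W1 = np.outer(dE_dz1, x)
print(grad_W1.shape, grad_W2.shape, grad_W3.shape)
```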

在20世紀90年代末,神經網路和反向傳播在很大程度上被機器學習界拋棄,也被計算機視覺和語音識別界忽視。人們普遍認爲,在幾乎沒有先驗知識的情況下學習有用的多階段特徵提取器是不可行的。特別是,人們通常認爲簡單的梯度下降會陷入糟糕的區域性極小值——即任何微小的權重改變都無法降低平均誤差的權重配置。

實際上,糟糕的區域性極小值在大型網路中很少成爲問題。不管初始條件如何,系統幾乎總能得到品質非常相似的解。最近的理論和經驗結果有力地表明,區域性極小值通常不是一個嚴重的問題。相反,這一地形中佈滿了數量組合式增長的鞍點,在這些鞍點處梯度爲零,曲面在大多數維度上向上彎曲,在其餘維度上向下彎曲。分析似乎表明,只有少數幾個向下彎曲方向的鞍點數量非常多,但幾乎所有這些鞍點的目標函數值都非常相似。因此,演算法陷入這些鞍點中的哪一個並不重要。

加拿大高階研究所(CIFAR)召集的一批研究人員在2006年前後重新燃起了人們對深度前饋網路的興趣(參考文獻31–34)。研究人員引入了無監督學習程式,可以在不需要標記數據的情況下建立多層特徵檢測器。學習每一層特徵檢測器的目標,是能夠重建或模擬下面一層特徵檢測器(或原始輸入)的活動。通過使用這個重建目標「預訓練」幾層逐漸複雜的特徵檢測器,可以將深層網路的權重初始化爲合理的值。然後可以在網路頂部新增最後一層輸出單元,並使用標準的反向傳播對整個深層系統進行微調。這對識別手寫數字或檢測行人非常有效,尤其是當標記數據的數量非常有限時。

這種預訓練方法的第一個主要應用是在語音識別中,它是由於快速圖形處理單元的出現而成爲可能的,這種單元便於程式設計,使研究人員能夠以10到20倍的速度訓練網路。在2009年,該方法被用於將從聲波中提取的係數的短時間視窗對映到可能由視窗中心的幀表示的各種語音片段的一組概率。它在使用小詞彙量的標準語音識別基準測試中取得了破紀錄的結果,並很快被開發出來,在大詞彙量的測試中給出了破紀錄的結果。到2012年,許多主要的語音小組已經在開發2009年的深度網路版本,並且已經在安卓手機上部署。對於較小的數據集,無監督的預訓練有助於防止過度擬合,當標記樣本的數量較少時,或者在遷移環境中,對於一些「源」任務,我們有很多樣本,但是對於一些「目標」任務,我們只有很少的樣本,可以顯著提高泛化能力。一旦深度學習得到恢復,結果證明只有小數據集才需要預訓練階段。

然而,有一種特殊型別的深度前饋網路,它比相鄰層之間全連線的網路更容易訓練,泛化能力也更好。這就是卷積神經網路(ConvNet)。在神經網路不被看好的時期,它取得了許多實際的成功,最近已被計算機視覺界廣泛採用。

Convolutional neural networks

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
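To make the filter-bank and pooling operations concrete, here is a small NumPy sketch of a single feature map: one shared 3×3 filter slid over a grey-scale image (local connections and weight sharing), followed by a ReLU and 2×2 max pooling. The image, the filter values and the sizes are illustrative assumptions; a real ConvNet has many filter banks per layer and learns their weights by backpropagation.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D filtering with one shared filter, producing one feature map.
    (As in most deep-learning code, this is implemented as a cross-correlation.)"""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Every output unit applies the same weights to a local patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max over non-overlapping size x size patches: coarse-grains feature positions."""
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    return fmap[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

rng = np.random.default_rng(3)
image  = rng.random((8, 8))                      # one grey-scale channel (illustrative)
kernel = np.array([[1., 0., -1.],                # an edge-like filter, just for demonstration
                   [1., 0., -1.],
                   [1., 0., -1.]])

feature_map = np.maximum(conv2d(image, kernel), 0.0)   # local weighted sum + ReLU
pooled      = max_pool(feature_map)                    # invariance to small shifts
print(feature_map.shape, pooled.shape)                 # (6, 6) -> (3, 3)
```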

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands, and for face recognition.

卷積神經網路

卷積神經網路旨在處理以多個陣列形式出現的數據,例如由三個2D陣列組成的彩色影象,其中包含三個彩色通道中的畫素強度。許多數據形式都是多個陣列的形式:1D用於信號和序列(包括語言);2D用於影象或音訊頻譜圖;3D用於視訊或體積影象。卷積神經網路背後有四個利用自然信號屬性的關鍵思想:區域性連線、權重共用、池化以及多層的使用。

figure 2

Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.

**圖2:卷積網路內部。**將典型卷積網路架構應用於一張薩摩耶犬影象(左下;RGB(紅、綠、藍)輸入見右下)時,每一層(水平方向)的輸出(不是濾波器)。每個矩形影象都是一個特徵圖,對應於在每個影象位置檢測到的某個學習特徵的輸出。資訊自下而上流動,低層特徵充當定向邊緣檢測器,並在輸出中爲每個影象類別計算一個分數。ReLU:整流線性單元。

典型的卷積神經網路(圖2)的體系結構由一系列階段構成。前幾個階段由兩種型別的層組成:卷積層和池化層。卷積層中的單元被組織在特徵圖中,其中每個單元通過一組稱爲濾波器組的權重與前一層特徵圖中的區域性圖塊相連。然後,這個區域性加權和的結果通過諸如ReLU之類的非線性函數。同一特徵圖中的所有單元共用同一個濾波器組。同一層中的不同特徵圖使用不同的濾波器組。採用這種架構的原因有兩個。首先,在影象等陣列數據中,區域性的值羣組通常高度相關,形成易於檢測的獨特區域性圖案。其次,影象和其他信號的區域性統計特性對位置是不變的。換句話說,如果一個圖案可以出現在影象的某個部分,它就可以出現在任何地方,因此讓不同位置的單元共用相同的權重,並在陣列的不同部分檢測相同的圖案。從數學上講,特徵圖執行的過濾操作是離散卷積,故而得名。

卷積層的作用是檢測前一層特徵的區域性組合,而池化層的作用則是將語意相似的特徵合併爲一個。由於構成一個圖案的各特徵的相對位置可能有所變化,因此可以通過對每個特徵的位置進行粗粒化來可靠地檢測該圖案。典型的池化單元計算一個特徵圖(或幾個特徵圖)中一個區域性單元塊的最大值。相鄰的池化單元從移位超過一行或一列的小塊中獲取輸入,從而降低表示的維數,並對小的平移和扭曲建立不變性。兩三個卷積、非線性和池化的階段被堆疊起來,其後是更多的卷積層和全連線層。通過卷積神經網路反向傳播梯度與通過常規深度網路一樣簡單,可以訓練所有濾波器組中的所有權重。

深層神經網路利用了許多自然信號是組合層次結構這一特性,其中較高階的特徵是通過組合較低階的特徵而獲得的。在影象中,邊緣的區域性組合形成圖案,圖案組合成部件,部件構成物體。從聲音到單音、音素、音節、單詞和句子,語音和文字中也存在類似的層次結構。當前一層中的元素在位置和外觀上發生變化時,池化使表示的變化很小。

卷積神經網路中的卷積層和池化層直接受到視覺神經科學中簡單細胞和複雜細胞這一經典概唸的啓發,其總體架構讓人聯想到視覺皮層腹側通路中的LGN–V1–V2–V4–IT層次結構。當給卷積神經網路模型和猴子呈現相同的圖片時,卷積神經網路中高階單元的啓用可以解釋猴子下顳葉皮層中160個神經元隨機集合的一半方差。卷積神經網路的根源是新認知器(neocognitron),其結構有些相似,但沒有反向傳播之類的端到端監督學習演算法。原始的一維卷積神經網路(稱爲時延神經網路)用於識別音素和簡單單詞。

卷積網路的大量應用可以追溯到1990年代初,首先是用於語音識別和文件閱讀的時延神經網路。該文件閱讀系統使用了一個卷積神經網路,並與一個實現語言約束的概率模型聯合訓練。到1990年代後期,該系統已讀取了美國所有支票的10%以上。微軟後來部署了許多基於卷積神經網路的光學字元識別和手寫識別系統。在1990年代初期,卷積神經網路還被用於實驗,以檢測自然影象中的物體(包括臉和手)以及進行人臉識別。

Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.

Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

深度卷積網路的影象理解

自21世紀初以來,卷積神經網路已成功地應用於影象中目標和區域的檢測、分割和識別。這些都是標記數據相對豐富的任務,如交通標誌識別、生物影象分割(尤其是連線組學)以及在自然影象中檢測人臉、文字、行人和人體。卷積神經網路最近一個主要的實際成功是人臉識別。

重要的是,影象可以在畫素級進行標記,這將在自主移動機器人和自動駕駛汽車等技術中得到應用。像Mobileye和NVIDIA這樣的公司正在它們即將推出的汽車視覺系統中使用這種基於卷積神經網路的方法。其他日益重要的應用涉及自然語言理解和語音識別。

儘管取得了這些成功,卷積神經網路在很大程度上仍被主流計算機視覺和機器學習社羣拋棄,直到2012年的ImageNet競賽。當深度卷積網路被應用於一個包含1000個不同類別、約一百萬張網路影象的數據集時,它們取得了驚人的結果,錯誤率幾乎是最佳競爭方法的一半。這一成功來自對GPU、ReLU、一種稱爲dropout的新正則化技術的有效利用,以及通過對現有範例進行形變來生成更多訓練範例的技術。這一成功帶來了計算機視覺領域的一場革命;卷積神經網路現在幾乎是所有識別和檢測任務的主導方法,並在某些任務上接近人類的表現。最近一個驚人的演示結合了卷積神經網路和遞回網路模組來生成影象字幕(圖3)。

figure 3

Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found that it exploits this to achieve better ‘translation’ of images into captions.

**圖3:從影象到文字。**由遞回神經網路(RNN)生成的字幕:RNN以深度卷積神經網路(CNN)從測試影象中提取的表示作爲額外輸入,並被訓練爲將影象的高階表示「翻譯」成字幕(上)。經參考文獻102許可轉載。當RNN在生成每個單詞(粗體)時能夠將注意力集中在輸入影象的不同位置(中和下;較亮的區域獲得了更多注意)時,我們發現它利用這一點更好地將影象「翻譯」成字幕。

最近的卷積神經網路架構有10到20層ReLU、數億個權重以及單元之間的數十億個連線。僅僅兩年前,訓練如此大的網路可能需要幾週時間,而硬體、軟體和演算法並行化的進步已將訓練時間縮短到幾個小時。

基於卷積神經網路的視覺系統的效能已促使包括谷歌、臉書、微軟、IBM、雅虎、推特(Twitter)和奧多比(Adobe)在內的大多數主要技術公司以及數量迅速增長的初創企業啓動研發專案,並部署基於卷積神經網路的影象理解產品和服務。

卷積神經網路很容易在晶片或現場可程式化門陣列(FPGA)上以硬體高效地實現。NVIDIA、Mobileye、英特爾、高通和三星等多家公司正在開發卷積神經網路晶片,以支援智慧手機、相機、機器人和自動駕駛汽車中的實時視覺應用。

Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enable generalization to new combinations of the values of learned features beyond those seen during training (for example, $2^n$ combinations are possible with $n$ binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).

The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’ . Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
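The following sketch shows only the mechanics of the word-vector idea: each word of a tiny vocabulary is presented as a one-of-N vector, and multiplying it by an embedding matrix picks out that word's feature vector. The vocabulary, the dimensionality and the random matrix are assumptions; in a real language model the matrix is learned by backpropagation, and only then would semantically related words such as Tuesday and Wednesday end up with similar vectors.

```python
import numpy as np

vocab = ["tuesday", "wednesday", "sweden", "norway", "car"]   # tiny illustrative vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(4)
embedding = rng.normal(size=(len(vocab), 8))    # one 8-dimensional vector per word;
                                                # in a real model these are learned, not random

def one_of_n(word):
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0                  # one component is 1, the rest are 0
    return v

def word_vector(word):
    # Multiplying the one-of-N vector by the embedding matrix is just a row lookup.
    return one_of_n(word) @ embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After training on real text, cosine('tuesday', 'wednesday') would be high;
# with the random matrix above the value is meaningless and only shows the mechanics.
print(cosine(word_vector("tuesday"), word_vector("wednesday")))
```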

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to $N$ (called $N$-grams). The number of possible $N$-grams is on the order of $V^N$, where $V$ is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. $N$-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).
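For contrast with the neural approach, this short sketch counts bigrams (N = 2) over a made-up toy corpus. With a vocabulary of size V there are on the order of V^N possible N-grams, and any word pair never seen in training simply gets a count of zero, which illustrates the generalization problem described above.

```python
from collections import Counter

# A made-up toy corpus (illustrative assumption).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
V, N = len(vocab), 2

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))        # observed N-grams for N = 2

print(f"V = {V}, possible bigrams on the order of V**N = {V**N}, observed = {len(bigrams)}")
print("P(sat | dog) =", bigrams[("dog", "sat")] / unigrams["dog"])
# A plausible but unseen pair gets a count of zero; the model cannot generalize to it.
print("count(dog, slept) =", bigrams[("dog", "slept")])
```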

分佈式表示和語言處理

深度學習理論表明,與不使用分佈式表示的經典學習演算法相比,深度網路具有兩種不同的指數級優勢。這兩個優勢都源於組合的力量,並且依賴於底層數據生成分佈具有適當的組成結構。首先,學習分佈式表示能夠泛化到訓練期間未曾見過的學習特徵值的新組合(例如,$n$ 個二元特徵可以有 $2^n$ 種組合)。其次,在深度網路中組合多個表示層帶來了另一種潛在的指數級優勢(隨深度呈指數增長)。

多層神經網路的隱藏層學習以易於預測目標輸出的方式來表示網路的輸入。訓練一個多層神經網路根據前面單詞的區域性上下文預測序列中的下一個單詞,可以很好地說明這一點。上下文中的每個單詞都以「N選1」向量的形式呈現給網路,即一個分量的值爲1,其餘爲0。在第一層中,每個單詞都會建立不同的啓用模式,即單詞向量(圖4)。在語言模型中,網路的其他層學習將輸入的單詞向量轉換爲預測的下一個單詞的輸出單詞向量,可用於預測詞彙表中任何單詞作爲下一個單詞出現的概率。網路學習到的單詞向量包含許多活躍成分,每一個成分都可以解釋爲該單詞的一個獨立特徵,正如在學習符號的分佈式表示的背景下首次展示的那樣。這些語意特徵並未顯式地出現在輸入中,而是由學習過程發現的,作爲把輸入和輸出符號之間的結構化關係分解爲多個「微規則」的一種好方法。當單詞序列來自大量真實文字且單個微規則不可靠時,學習單詞向量同樣非常有效。例如,當訓練預測新聞報道中的下一個單詞時,「星期二」和「星期三」學到的單詞向量非常相似,「瑞典」和「挪威」的單詞向量也是如此。這種表示被稱爲分佈式表示,因爲它們的元素(特徵)不是互斥的,它們的許多組態對應於觀測數據中看到的變化。這些單詞向量由學習到的特徵組成,這些特徵不是由專家預先確定的,而是由神經網路自動發現的。從文字中學習到的單詞向量表示現在已在自然語言應用中得到廣泛使用。

表示問題是邏輯啓發和神經網路啓發這兩種認知範式之間爭論的核心。在邏輯啓發的範式中,一個符號範例的唯一屬性是它與其他符號範例相同或不相同。它沒有與其使用相關的內部結構;要用符號進行推理,就必須將它們與經過審慎選擇的推理規則中的變數繫結在一起。相比之下,神經網路僅使用大的活動向量、大的權重矩陣和標量非線性來執行快速的「直覺」推斷,而這種推斷正是支撐毫不費力的常識推理的基礎。

在引入神經語言模型之前,語言統計建模的標準方法並未利用分佈式表示:它基於對長度不超過 $N$ 的短符號序列(稱爲 $N$ 元語法)出現頻率的計數。可能的 $N$ 元語法的數量在 $V^N$ 的量級,其中 $V$ 是詞彙量,因此要考慮多於幾個單詞的上下文,就需要非常龐大的訓練語料庫。$N$ 元語法將每個單詞視爲一個原子單元,因此無法對語意上相關的單詞序列進行泛化,而神經語言模型則可以,因爲它們將每個單詞與一個實值特徵向量相關聯,語意相關的單詞最終在該向量空間中彼此靠近(圖4)。

figure 4

Figure 4 | Visualizing the learned word vectors. On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm. On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network. One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation).

**圖4:視覺化學習的單詞向量。**左邊是爲建模語言學習的單詞表示的圖示,使用t-SNE演算法非線性投影到2D用於視覺化。右邊是由英語-法語編碼器-解碼器遞回神經網路學習的短語的2D表示。人們可以觀察到語意相似的單詞或單詞序列被對映到附近的表示。單詞的分佈式表示是通過使用反向傳播來聯合學習每個單詞的表示和預測目標量的函數來獲得的,所述目標量例如是序列中的下一個單詞(用於語言建模)或整個翻譯單詞序列(用於機器翻譯)。

Recurrent neural networks

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.
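Following the notation of Fig. 5 (matrices U, V and W shared across time steps), here is a minimal NumPy sketch of a plain RNN processing an input sequence one element at a time while carrying a hidden state vector. The dimensions, sequence length and random weights are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_state, d_out, T = 4, 8, 3, 6            # assumed sizes and sequence length

U = 0.1 * rng.normal(size=(d_state, d_in))       # input -> state
W = 0.1 * rng.normal(size=(d_state, d_state))    # state -> state (shared across time steps)
V = 0.1 * rng.normal(size=(d_out, d_state))      # state -> output

xs = rng.normal(size=(T, d_in))                  # input sequence x_1 ... x_T
s = np.zeros(d_state)                            # state vector, summarizing the past

outputs = []
for x_t in xs:                                   # unfolding the network in time
    s = np.tanh(U @ x_t + W @ s)                 # s_t depends on x_t and s_{t-1}
    outputs.append(V @ s)                        # o_t depends on all previous inputs via s

print(np.stack(outputs).shape)                   # (T, d_out)
```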

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.

Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausiblilty to a conclusion.
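A heavily simplified sketch of the encoder-decoder scheme described above: an 'encoder' RNN reads the English words and its final hidden state (the 'thought vector') initializes a 'decoder' RNN, which emits a probability distribution over target words at each step and feeds the chosen word back in until a stop symbol is chosen. The toy vocabularies, sizes and random (untrained) weights are assumptions; the sketch only shows the data flow, not a working translator.

```python
import numpy as np

rng = np.random.default_rng(6)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

src_vocab = ["the", "cat", "sleeps"]             # toy source sentence/vocabulary (assumed)
tgt_vocab = ["le", "chat", "dort", "<stop>"]
d = 8                                            # hidden-state size (assumed)

# Random, untrained parameters, just to show the data flow.
E_src = 0.1 * rng.normal(size=(len(src_vocab), d))
E_tgt = 0.1 * rng.normal(size=(len(tgt_vocab), d))
W_enc = 0.1 * rng.normal(size=(d, d))
W_dec = 0.1 * rng.normal(size=(d, d))
W_out = 0.1 * rng.normal(size=(len(tgt_vocab), d))

# Encoder: read the source words; the final state summarizes the sentence.
s = np.zeros(d)
for w in ["the", "cat", "sleeps"]:
    s = np.tanh(E_src[src_vocab.index(w)] + W_enc @ s)

# Decoder: starting from the thought vector, emit a distribution over target words,
# feed the chosen word back in, and stop when '<stop>' is chosen.
translation = []
for _ in range(10):
    p = softmax(W_out @ s)                       # distribution over the target vocabulary
    word = tgt_vocab[int(np.argmax(p))]          # greedy choice (sampling also works)
    if word == "<stop>":
        break
    translation.append(word)
    s = np.tanh(E_tgt[tgt_vocab.index(word)] + W_dec @ s)

print(translation)                               # meaningless here: the weights are untrained
```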

Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).

RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
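Below is a minimal sketch of the gated accumulator described above, written as a single scalar LSTM-style memory cell: the cell state is copied forward through a self-connection, an input gate decides how much new signal to accumulate, a forget gate learns when to clear the memory, and an output gate controls what is exposed. The scalar weights are random placeholders, not trained values, and a real LSTM layer uses vectors of such cells.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a single-unit (scalar) LSTM cell; params are assumed placeholders."""
    W_i, W_f, W_o, W_g = params                 # each maps [x_t, h_prev, 1] to a scalar
    v = np.array([x_t, h_prev, 1.0])            # input, previous output, bias term
    i = sigmoid(W_i @ v)                        # input gate: how much new signal to accumulate
    f = sigmoid(W_f @ v)                        # forget gate: learns when to clear the memory
    o = sigmoid(W_o @ v)                        # output gate: how much of the cell to expose
    g = np.tanh(W_g @ v)                        # candidate new content
    c = f * c_prev + i * g                      # self-connection copies the state, gated by f
    h = o * np.tanh(c)                          # gated output of the memory cell
    return h, c

rng = np.random.default_rng(7)
params = [rng.normal(size=3) for _ in range(4)]

h, c = 0.0, 0.0
for x_t in [1.0, 0.5, -0.3, 0.8]:               # a toy input sequence (illustrative)
    h, c = lstm_step(x_t, h, c, params)
    print(f"h = {h:+.3f}, c = {c:+.3f}")
```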

LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught 'algorithms'. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as 'where is Frodo now?'.

遞回神經網路

當反向傳播首次被引入時,其最令人興奮的用途是訓練遞回神經網路(RNN)。對於涉及順序輸入的任務,例如語音和語言,通常使用遞回神經網路更好(圖5)。遞回神經網路一次處理輸入序列的一個元素,在它們的隱藏單元中維護一個「狀態向量」,其中隱含地包含序列所有過去元素的歷史資訊。當我們把隱藏單元在不同離散時間步的輸出看作是深層多層網路中不同神經元的輸出時(圖5,右),就很清楚如何應用反向傳播來訓練遞回神經網路了。

遞回神經網路是非常強大的動態系統,但訓練它們被證明是有問題的,因爲反向傳播的梯度在每個時間步要麼增長要麼收縮,因此經過許多時間步後,它們通常會爆炸或消失。

由於其體系結構和訓練方法的進步,人們發現遞回神經網路非常善於預測文字中的下一個字元或序列中的下一個單詞,而且它們也可以用於更復雜的任務。例如,在逐詞讀完一個英語句子後,可以訓練一個英語「編碼器」網路,使其隱藏單元的最終狀態向量很好地表示句子所表達的思想。然後,這個思想向量可以被用作聯合訓練的法語「解碼器」網路的初始隱藏狀態(或作爲其額外輸入),該網路輸出法語翻譯第一個單詞的概率分佈。如果從這個分佈中選擇一個特定的第一個單詞並將其作爲輸入提供給解碼器網路,那麼它將輸出翻譯的第二個單詞的概率分佈,以此類推,直到選擇了句號爲止。總體而言,此過程根據一個依賴於英語句子的概率分佈生成法語單詞序列。這種相當樸素的機器翻譯方式已迅速達到與最新技術相競爭的水平,這引起了人們的嚴重懷疑:理解句子是否需要通過推理規則來操縱內部符號表示式之類的東西。它更符合這樣一種觀點,即日常推理涉及許多同時進行的類比,每一個類比都爲結論增添合理性。

除了將法語句子的含義翻譯成英語句子,我們也可以學習將影象的含義「翻譯」成英語句子(圖3)。這裏的編碼器是一個深度卷積神經網路,它在最後一個隱藏層中將畫素轉換爲活動向量。解碼器是一個遞回神經網路,類似於機器翻譯和神經語言建模中使用的遞回神經網路。最近,人們對此類系統的興趣激增(參見參考文獻86中提到的例子)。

figure 5

Figure 5 | A recurrent neural network and the unfolding in time of the computation involved in its forward computation. The artificial neurons (for example, hidden units grouped under node $s$ with values $s_t$ at time $t$) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements $x_t$ into an output sequence with elements $o_t$, with each $o_t$ depending on all the previous $x_{t'}$ (for $t' \le t$). The same parameters (matrices $U, V, W$) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as inputs for the next time step. The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states $s_t$ and all the parameters.

**圖5:遞回神經網路及其前向計算所涉及計算的時間展開。**人工神經元(例如,在時間 $t$ 取值爲 $s_t$ 的節點 $s$ 下分組的隱藏單元)在之前的時間步從其他神經元獲得輸入(左側的黑色方塊表示一個時間步的延遲)。以這種方式,遞回神經網路可以將元素爲 $x_t$ 的輸入序列對映成元素爲 $o_t$ 的輸出序列,每個 $o_t$ 取決於所有先前的 $x_{t'}$(對於 $t' \le t$)。在每個時間步使用相同的參數(矩陣 $U$、$V$、$W$)。許多其他體系結構也是可能的,包括一種變體,其中網路可以生成一系列輸出(例如單詞),每個輸出都用作下一個時間步的輸入。反向傳播演算法(圖1)可以直接應用於右側展開網路的計算圖,以計算總誤差(例如,生成正確輸出序列的對數概率)相對於所有狀態 $s_t$ 和所有參數的導數。

遞回神經網路一旦在時間上展開(圖5),就可以看作是非常深的前饋網路,其中所有層共用相同的權重。雖然它們的主要目的是學習長期依賴性,但理論和經驗證據表明,學習長期儲存資訊是很困難的。

爲了解決這個問題,一個想法是用顯式記憶體來擴充網路。這類提議中的第一個是使用特殊隱藏單元的長短期記憶(LSTM)網路,其自然行爲是長時間記住輸入。一個叫做記憶細胞的特殊單元就像一個累加器或一個門控泄漏神經元:它在下一個時間步與自身有一個權重爲1的連線,因此它複製自身的實值狀態並累積外部信號,但這個自連線受另一個單元的乘法門控,該單元學會決定何時清除記憶體的內容。

LSTM網路隨後被證明比傳統遞回神經網路更有效,尤其是當它們在每個時間步都有多層時,從而使整個語音識別系統能夠從聲學信號一路處理到轉錄文字中的字元序列。LSTM網路或相關形式的門控單元目前也用於在機器翻譯中表現出色的編碼器和解碼器網路。

在過去的一年中,幾位作者提出了不同的建議,用記憶體模組來擴充遞回神經網路。這些提議包括神經圖靈機(其中網路由一個遞回神經網路可以選擇讀取或寫入的「類磁帶」記憶體擴充)以及記憶網路(其中常規網路由一種關聯記憶體擴充)。記憶網路在標準問答基準上已表現出出色的效能。記憶體用於記住故事,網路隨後被要求回答有關該故事的問題。

除了簡單的記憶之外,神經圖靈機和記憶網路還被用於通常需要推理和符號操作的任務。可以教神經圖靈機學會「演算法」。除其他能力外,當輸入由未排序的序列組成、且每個符號都附帶一個表示其在列表中優先順序的實數值時,它們可以學會輸出已排序的符號列表。記憶網路可以被訓練成在類似於文字冒險遊戲的環境中跟蹤世界的狀態,在閱讀一個故事之後,它們可以回答需要複雜推理的問題。在一個測試範例中,向網路展示了15句話版本的《指環王》,它能正確回答諸如「Frodo現在在哪裏?」之類的問題。

The future of deep learning

Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.

Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.

深度學習的未來

無監督學習在重新激發人們對深度學習的興趣方面起到了催化作用,但此後純監督學習的成功使它黯然失色。儘管我們在本綜述中並未關注這一主題,但我們預計從長遠來看,無監督學習將變得重要得多。人類和動物的學習在很大程度上是無監督的:我們通過觀察來發現世界的結構,而不是通過被告知每個物體的名稱。

人類視覺是一個主動的過程,它使用一個小而高解析度的中央凹和大而低解析度的外圍,以智慧的、針對特定任務的方式對光學陣列進行順序採樣。我們預計視覺領域未來的許多進步將來自端到端訓練的系統,這些系統將卷積神經網路與遞回神經網路結合起來,並使用強化學習來決定看哪裏。結合了深度學習和強化學習的系統尚處於起步階段,但在分類任務上它們已經超過了被動視覺系統,並且在學習玩許多不同的視訊遊戲方面產生了令人印象深刻的結果。

自然語言理解是深度學習在未來幾年有望產生重大影響的另一個領域。我們預計,使用遞回神經網路來理解句子或整個文件的系統,在學會一次有選擇地關注一部分內容的策略後,會變得更好。

最終,人工智慧的重大進展將通過將表示學習與複雜推理相結合的系統來實現。儘管深度學習和簡單推理已被用於語音和手寫識別很長時間了,但仍需要新的範式,用對大型向量的運算來取代基於規則的符號表示式操作。

References

參考文獻

https://www.nature.com/articles/nature14539