How should this formula in the neural network backpropagation (BP) algorithm be understood?

Posted: 2017-09-04 02:38

Neural network backpropagation algorithm

Understanding the Backpropagation Algorithm (Part 1) - Blssel's blog - CSDN
Training neural networks: understanding backpropagation step by step, part 1
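Neural networks are very powerful: trained on enough data, they often reach striking accuracy on many problems. That performance comes from how their parameters are trained. Intuitively, training means setting (fitting) the parameter values sensibly, but what algorithm can fit such a huge number of parameters accurately? That is today's topic: how neural networks are trained.
If you want to understand thoroughly how a neural network is trained, be prepared for some (but not much) mathematical derivation. It only needs basic multivariable calculus and matrix algebra, nothing advanced; read it carefully two or three times and it will make sense.
Let's start with an example. Once you have seen it, the principle behind training a neural network should follow by analogy.
CS students have probably taken a course in artificial intelligence or machine learning (I studied communications engineering and never had one, so I had to teach myself). Failing that, mathematical modeling courses also cover linear regression, which is a very simple problem. Say you are given a data set of Beijing housing prices that relates price to location: {(ring 4.5, 80k), (ring 5, 70k), (ring 6, 50k), (ring 5.2, 71k), (ring 4.8, 78k), (ring 6.5, 40k), …} (made up; only the idea matters). You are then asked to predict the price at a given location, say ring 5.5. How would you solve this?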
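This is the most basic kind of linear regression problem. Each element of the data set is a two-dimensional vector, so the points can be plotted directly in a plane, as in the figure below.
Isn't that more intuitive? How do we then predict the price at an arbitrary location? Intuition says you could draw a smooth continuous curve through the points, like this:
Such a fit is perfectly possible, but linear regression uses something simpler: a straight line, here a first-degree function. The linear model tries to learn
$$f(x_i) = w x_i + b, \quad \text{such that } f(x_i) \approx y_i \tag{1}$$
In words: for any input x_i the model outputs a prediction f(x_i) that should be as close as possible to the true value y_i. The figure below shows roughly what the model looks like:
OK, with this model we can go ahead and use it to predict on other data.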
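But the fitted plots above were produced by Matlab. How are the slope and intercept of this line determined?
This is the important part. Recall the model f(x_i) = w x_i + b: to pin down the model, all we need to do is choose w and b so that the model fits the existing data as well as possible.
How do we choose w and b sensibly? The method is essentially the same as the one used later to train neural networks, and it is simple and blunt: make the accumulated squared deviation between the existing data points and the line as small as possible.
To explain: for input x_i the model outputs f(x_i), while the true output is y_i. If the accumulated error between predictions and true values is minimal, the model is a good one. The true values y_i are known: they are the data given at the start (which we can now call the training data). Treating the parameters w and b as unknowns, we can write down the error function. Here we use the squared error:
$$C(w,b) = \sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)^2 \tag{2}$$
where C(w,b) is the squared-error cost (the mean squared error up to the constant factor 1/m) and m is the number of elements in the training set.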
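Final step: determining the model parameters
This formula is easy to read: it is a function of w and b, and our goal is to find its minimum. Recall from multivariable calculus that finding extrema follows a routine: take the partial derivative with respect to each variable, set each derivative to zero, and solve the equations for the stationary points (there may be more than one); then compare the function values at those points to find the minimum. That point gives the final values of w and b. The expression is simple enough to work out with pen and paper, as in an exam (don't be lazy). Typing the formulas is tedious, and the goal here is intuition for training neural networks, so I won't carry the derivation further.
By now the idea should be clear. For a model (here, a linear one) you can usually write down its expression with the parameters as unknowns, then in principle write down its error function, then use mathematics to find the minimum of the error. The parameter values at that minimum are fixed, and the model is trained. That is the principle of training. The next part gets to the real material: deriving the backpropagation algorithm by hand.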
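The "set both partial derivatives to zero" recipe above can be carried out directly for the made-up housing data. The following is a minimal Python sketch, not the author's code: the helper names `predict` and `cost_gradient` are hypothetical, and prices are taken in thousands.

```python
# Closed-form least-squares fit of f(x) = w*x + b to the made-up data from the text.
xs = [4.5, 5.0, 6.0, 5.2, 4.8, 6.5]        # ring number
ys = [80.0, 70.0, 50.0, 71.0, 78.0, 40.0]  # price in thousands
m = len(xs)

x_mean = sum(xs) / m
y_mean = sum(ys) / m

# Solving dC/dw = 0 and dC/db = 0 for C(w, b) = sum (w*x + b - y)^2
# gives these two expressions.
w = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
     / sum((x - x_mean) ** 2 for x in xs))
b = y_mean - w * x_mean


def predict(x):
    """Prediction of the fitted line at location x (hypothetical helper)."""
    return w * x + b


def cost_gradient(w, b):
    """Partial derivatives (dC/dw, dC/db) of the squared-error cost."""
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys))
    return dw, db


print("w =", w, "b =", b)
print("predicted price at ring 5.5:", predict(5.5))
print("gradient at the solution:", cost_gradient(w, b))  # both close to zero
```

Because the cost is a convex quadratic in w and b, there is only one stationary point here, so no comparison between several candidates is needed.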
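Author: Evan Hoo
Link: /question//answer/
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for permission; for non-commercial reprints, cite the source.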
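Backpropagation is a pivotal algorithm in the training of multi-layer neural networks.
Put simply, it really is just the chain rule for composite functions, but in practical computation it matters far more than the chain rule alone.
To answer the question "How can the back propagation algorithm be explained intuitively?", we first need an intuitive picture of how a multi-layer neural network is trained.
Machine learning can be seen as an application of mathematical statistics, where a common task is fitting: given some sample points, find a suitable curve that describes how they vary with the independent variable.
Deep learning has the same goal, except that the samples are no longer limited to (x, y) pairs; they can be generalized pairs (X, Y) made of vectors, matrices, and so on. The relationship between X and Y then becomes very complex and is unlikely to be captured by one simple function. People found, however, that a multi-layer neural network can represent such a relationship, and a multi-layer neural network is in essence a multi-layer composite function. A figure found online [1] illustrates this composition.
The corresponding expression is as follows:
The W_ij in the expression above are the weights between neurons of adjacent layers. They are the parameters that deep learning has to learn, the counterparts of k and b in the line fit y = k*x + b.
As with line fitting, training in deep learning has an objective function that defines what counts as a "good" set of parameters. In machine learning this is usually a cost function, and the training goal is to adjust every weight W_ij so that the cost is minimized. The cost is itself a composite function of all the weights W_ij, and it is generally non-convex, with many local minima. In practice, though, ordinary gradient descent turns out to solve the cost-minimization problem effectively.
Gradient descent starts from a given initial point, computes the gradient vector there, takes the negative gradient as the search direction, and moves a certain step size along it to obtain the next iterate. It then computes the gradient at the new point, and repeats until the cost converges. So how is the gradient computed?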
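Suppose the cost function is written as $H(W_{11}, W_{12}, \dots, W_{ij}, \dots, W_{mn})$. Its gradient vector [2] is then $\nabla H = \frac{\partial H}{\partial W_{11}}\mathbf{e}_{11} + \dots + \frac{\partial H}{\partial W_{mn}}\mathbf{e}_{mn}$, where the $\mathbf{e}_{ij}$ are orthonormal unit vectors. We therefore need the partial derivative of the cost H with respect to every weight W_ij, and BP is exactly the tool for computing all the partial derivatives of such a multi-layer composite function.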
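The gradient-descent loop described above (start point, negative gradient direction, fixed step, repeat until the cost stops changing) can be sketched in a few lines of Python. The two-parameter cost below is a hypothetical stand-in chosen only so the gradient can be written by hand; it is not the network cost H from the text, and the step size and tolerance are arbitrary assumptions.

```python
def cost(w):
    # Hypothetical cost with its minimum at w = (3, -1); for illustration only.
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def gradient(w):
    # Partial derivatives of the cost above, worked out by hand.
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def gradient_descent(w, step=0.1, tol=1e-12, max_iters=10_000):
    """Move against the gradient until the cost changes by less than tol."""
    prev = cost(w)
    for _ in range(max_iters):
        g = gradient(w)
        w = [wi - step * gi for wi, gi in zip(w, g)]  # step along -gradient
        cur = cost(w)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return w


w_min = gradient_descent([0.0, 0.0])
print(w_min)  # close to [3.0, -1.0]
```

In a real network the hand-written `gradient` function is the part that backpropagation replaces.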
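Take the partial derivatives [3] of e = (a+b)*(b+1) as an example.
Its composition can be drawn as the following graph:
The graph introduces the intermediate variables c and d, with c = a + b and d = b + 1, so that e = c * d.
To find the gradient of e at a=2, b=1, we first use the definition of the partial derivative to obtain the derivative between adjacent nodes of neighbouring layers, as shown in the figure below. Here $\partial c/\partial a = 1$, $\partial c/\partial b = 1$, $\partial d/\partial b = 1$, $\partial e/\partial c = d = 2$ and $\partial e/\partial d = c = 3$.
By the chain rule we know that:
$$\frac{\partial e}{\partial a} = \frac{\partial e}{\partial c}\cdot\frac{\partial c}{\partial a}, \qquad \frac{\partial e}{\partial b} = \frac{\partial e}{\partial c}\cdot\frac{\partial c}{\partial b} + \frac{\partial e}{\partial d}\cdot\frac{\partial d}{\partial b}$$
What does the chain rule mean in the graph above? $\partial e/\partial a$ equals the product of the partial derivatives along the path from a to e, and $\partial e/\partial b$ equals the product along path 1 (b-c-e) plus the product along path 2 (b-d-e). In other words, for an upper node p and a lower node q, to obtain $\partial p/\partial q$ we find every path from q to p, multiply the partial derivatives along each path, and then sum these products over all paths.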
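You may have noticed that this is highly redundant, because many path segments are visited repeatedly. In the graph above, a-c-e and b-c-e both traverse c-e. For a deep model whose network easily has tens of thousands of weights, this redundancy costs a great deal of computation.
BP uses the same chain rule but cleverly avoids the redundancy: it visits each path segment only once and still obtains the partial derivative of the top node with respect to every lower node.
As the name says, backpropagation (BP) searches the paths in reverse, from the top down.
Start at the top node e with initial value 1 and process one layer at a time. For every child of e in the next layer, multiply 1 by the partial derivative on the edge from e to that child and "deposit" the result in the child. Once the layer containing e has propagated in this way, every node in the second layer holds some deposited values; summing everything deposited in a node gives the partial derivative of e with respect to that node. Then each second-layer node becomes a starting node in turn, with its initial value set to the partial derivative of e with respect to it, and the same propagation is repeated layer by layer until the partial derivative of e with respect to the nodes of every layer is known.
In the graph above, node c receives 1*2 from e and deposits it, and node d receives 1*3 from e and deposits it. That completes the second layer; each node totals its deposits and sends on to the next layer. Node c sends 2*1 to a, node c sends 2*1 to b, and node d sends 3*1 to b, each deposited on arrival. That completes the third layer: a holds 2, and b holds 2*1 + 3*1 = 5, so the partial derivative of e with respect to b is 5.
A not entirely apt analogy: let the arrows in the graph denote debts, so that c→e means e owes c money. Computing the partial derivatives of e with respect to a and b directly is like a and b each going out to collect. a asks c, and c says "e owes me, go ask him", so a goes past c to e. b first asks c and is likewise sent on to e; b then asks d and is sent to e again. Debt collection is arduous and repetitive: both a and b get passed from c on to e.
BP is voluntary repayment. e pays what it owes to c and d; c and d happily pass the money on to a and b, and everyone is satisfied.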
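The "deposit and sum" procedure can be replayed in code for e = (a+b)*(b+1) at a = 2, b = 1. This is a minimal Python sketch; the dictionary layout and the names `edges`, `layers` and `grad` are assumptions made for illustration.

```python
# Reverse pass on e = (a + b) * (b + 1) at a = 2, b = 1.
a, b = 2.0, 1.0

# Forward pass: compute the intermediate nodes.
c = a + b
d = b + 1
e = c * d

# Local partial derivative on each edge, keyed as (upper node, lower node).
edges = {
    ("e", "c"): d,    # de/dc = d
    ("e", "d"): c,    # de/dd = c
    ("c", "a"): 1.0,  # dc/da
    ("c", "b"): 1.0,  # dc/db
    ("d", "b"): 1.0,  # dd/db
}
layers = [["e"], ["c", "d"], ["a", "b"]]  # top to bottom

# grad[n] accumulates the "deposits", i.e. de/dn; the top node starts at 1.
grad = {"e": 1.0, "c": 0.0, "d": 0.0, "a": 0.0, "b": 0.0}
for upper_layer in layers[:-1]:
    for (upper, lower), local in edges.items():
        if upper in upper_layer:
            grad[lower] += grad[upper] * local  # each edge is used exactly once

print(grad["a"], grad["b"])  # 2.0 5.0
```

Each of the five edges is multiplied once, whereas enumerating the paths a-c-e, b-c-e and b-d-e separately would reuse the edge c-e twice.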
------------------------------------------------------------------
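[References]
Other recommended pages:
This is probably what the OP was looking for (many figures). Source:
It's just the chain rule for derivatives...
Take pen and paper, set up the simplest three-layer network, and take the partial derivative with respect to each weight.
It can be understood as error assignment, corresponding to the credit assignment problem in reinforcement learning. ...
Whichever node contributed to "me" gets paid back in return →_→
A neural network uses a graph to represent a composite function, so when the function is differentiated, the algebraic chain rule can be expressed as message passing on the graph.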
I think the most direct and clear way to understand how BP works is through the computation graph, so let's go straight to the figure.
Here is an example three-layer neural network (one input layer, one hidden layer and one output layer) with a softmax output layer and a cross-entropy loss. The network can be trained with gradient descent; the key is computing the gradients, i.e. the derivatives of the loss with respect to the parameters, shown in the graph as dloss/dW1, dloss/dW2, dloss/db1 and dloss/db2. These gradients are computed with BP, which is just the chain rule for derivatives.
In each iteration, first run forward propagation, i.e. compute the state of every node in the computation graph:
mul1 = X * W1
add1 = mul1 + b1
tanh1 = tanh(add1)
mul2 = tanh1 * W2
add2 = mul2 + b2
tanh2 = tanh(add2)
loss = softmax_loss(tanh2)
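Then run back propagation, i.e. compute the derivative of the loss with respect to every node in the computation graph, using the chain rule. For derivatives such as dloss/dtanh2 and dloss/dadd2, the numerator is omitted below and they are written simply as dtanh2 and so on.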
dtanh2 = softmax_loss_diff(tanh2) * dloss
dadd2 = tanh_diff(add2) * dtanh2
db2 = 1 * dadd2
dmul2 = 1 * dadd2
dW2 = tanh1 * dmul2
dtanh1 = W2 * dmul2
dadd1 = tanh_diff(add1) * dtanh1
db1 = 1 * dadd1
dmul1 = 1 * dadd1
dW1 = X * dmul1
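All of the variables above can be matrices, so the computation is done directly with matrix operations (with the appropriate transposes in the products). dW1, dW2, db1 and db2 are the parameter gradients we need.
In code, every computation node can define two functions: forward, which computes the output from the inputs, and backward, which takes the incoming gradient and computes the gradient of the whole expression (effectively the loss) with respect to the node's inputs. By the chain rule, this is the node's own gradient with respect to its input (the direct derivative) multiplied by the gradient the node receives from above.
I have a tutorial that implements a neural network step by step in Python, with a configurable number of layers and per-layer dimensions, so it extends easily. It abstracts gates (AddGate, MulGate), layers (Tanh, Sigmoid) and outputs (Softmax); you can also implement other layers such as ReLU, or other outputs (such as Hinge), yourself.
If you're interested, see: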
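The forward/backward node pattern described above can be illustrated with scalar gates. This is a minimal sketch and not the tutorial's code: the class names echo the ones mentioned in the text, but the bodies, the `forward_backward` helper and the input values are assumptions.

```python
import math


class MulGate:
    def forward(self, x, w):
        self.x, self.w = x, w          # cache inputs for the backward pass
        return x * w

    def backward(self, dout):
        # d(x*w)/dx = w and d(x*w)/dw = x, each times the incoming gradient
        return dout * self.w, dout * self.x


class AddGate:
    def forward(self, x, b):
        return x + b

    def backward(self, dout):
        # d(x+b)/dx = d(x+b)/db = 1
        return dout, dout


class Tanh:
    def forward(self, z):
        self.out = math.tanh(z)
        return self.out

    def backward(self, dout):
        # d tanh(z)/dz = 1 - tanh(z)^2
        return dout * (1.0 - self.out ** 2)


def forward_backward(x, w, b):
    """y = tanh(x*w + b); returns y and dy/dx, dy/dw, dy/db."""
    mul, add, act = MulGate(), AddGate(), Tanh()
    y = act.forward(add.forward(mul.forward(x, w), b))
    dadd = act.backward(1.0)           # seed gradient dy/dy = 1
    dmul, db = add.backward(dadd)
    dx, dw = mul.backward(dmul)
    return y, dx, dw, db


y, dx, dw, db = forward_backward(0.5, -1.2, 0.3)
print(y, dx, dw, db)
```

Each gate only knows its own local derivative; chaining the `backward` calls in reverse order of the `forward` calls is what applies the chain rule across the whole expression.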
This one is very easy to follow.
Apply the chain rule to compute the gradient of the loss function with respect to the inputs. (cs231n)
The intuitive demonstration of the BP algorithm in Andrew's Stanford open course.
I think the most intuitive explanation is the name itself: the error is propagated backwards. I'm not sure how much more intuitive you want it...
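Straight to the substance (online deep learning book, chap. 2). After reading that chapter, look at how any layer in torch7's nn is written, or at Caffe's Backward implementation; both frameworks are designed layer by layer. A more flexible alternative is to define a computation graph, define the gradient of the basic operation at each node, and then run BP via automatic differentiation (the approach taken by TensorFlow and MXNet).
z is the output before the activation function, and a is the output after the activation function.
Define the partial derivative of the cost with respect to the layer-l output z as:
$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$
The four fundamental equations of BP then follow (in the book's notation, with $\sigma$ the activation function and $\odot$ the elementwise product):
$$\delta^L = \nabla_a C \odot \sigma'(z^L) \tag{BP1}$$
$$\delta^l = \bigl((w^{l+1})^T \delta^{l+1}\bigr) \odot \sigma'(z^l) \tag{BP2}$$
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \tag{BP3}$$
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{BP4}$$
The online book gives the proofs of the first two equations; the last two can be proved just as easily with the same use of the chain rule.
The BP algorithm and the mini-batch SGD algorithm are also posted here:
mini-batch SGD: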
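BP started from the observation that a small change in any weight or bias changes the output, which led to the idea of using the derivatives of the cost with respect to these parameters to update w and b and so improve the output.
One direct idea is to compute the derivative of the cost with respect to each parameter separately (for example by numerical differentiation, [C(w+dw)-C(w)]/dw), but that needs a separate evaluation of the network for every parameter, which is very expensive. BP was proposed to solve this problem elegantly: a single backward pass carries the error to every learnable parameter (w, b).
(The above is partly taken from the online book; for further reading go to the original site, and see the source code and implementation details of the frameworks mentioned.)
PS: I only recently came to understand BP in some depth; corrections are welcome.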
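The cost difference between numerical differentiation and a single backward pass can be made concrete with a counter. The sketch below uses a hypothetical one-neuron model, y = tanh(w·x + b) with cost C = ½(y − t)²; the input, target and parameter values are made up for illustration.

```python
import math

x = [0.5, -1.0, 2.0]              # hypothetical input
t = 0.2                           # hypothetical target
params = [0.1, 0.4, -0.3, 0.05]   # w1, w2, w3, b

forward_calls = 0                 # counts full evaluations of the model


def cost(p):
    global forward_calls
    forward_calls += 1
    z = sum(wi * xi for wi, xi in zip(p[:3], x)) + p[3]
    return 0.5 * (math.tanh(z) - t) ** 2


def numerical_gradient(p, dw=1e-6):
    """[C(w+dw) - C(w)] / dw, one extra evaluation per parameter."""
    base = cost(p)
    grads = []
    for i in range(len(p)):
        shifted = list(p)
        shifted[i] += dw
        grads.append((cost(shifted) - base) / dw)
    return grads


def backprop_gradient(p):
    """One forward pass and one backward pass give every derivative."""
    z = sum(wi * xi for wi, xi in zip(p[:3], x)) + p[3]
    y = math.tanh(z)
    dz = (y - t) * (1.0 - y * y)          # dC/dz via the chain rule
    return [dz * xi for xi in x] + [dz]   # dC/dw_i = dz * x_i, dC/db = dz


num = numerical_gradient(params)
print("model evaluations used by finite differences:", forward_calls)
bp = backprop_gradient(params)
print(num)
print(bp)
```

With 4 parameters the finite-difference version evaluates the model 5 times; for a network with millions of parameters the same approach would need millions of evaluations, while the backward pass still runs once.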
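The idea of BP: treat the training error E as a high-dimensional function whose variables are the elements of the weight vector, and search for the lowest point of the training error by repeatedly updating the weights in the direction in which the gradient of the error function descends.
The main purpose of backpropagation is to pass the output error backwards and apportion it among all units of every layer, which yields an error signal for each unit; that signal is then used to correct each unit's weights (the process is one of weight adjustment). I forget where this comes from, but it was this sentence that suddenly made things much clearer for me.
The partial derivatives of the error function with respect to the model parameters are solved one by one, in reverse order, through the network. The wiki has a very concise derivation.
It can't get more intuitive than that, unless you haven't studied calculus. Trying to understand an algorithm the way you would understand 1+1 through 1+1=2 won't work, I think. I hope this helps.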
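Backpropagation is pivotal in training multi-layer neural networks. This article focuses on the principle and derivation of the backpropagation algorithm and does not cover basic neural network background. Before backpropagation, one needs to understand the feedforward computation of a neural network.
Feedforward neural network
The figure below is a simple schematic of a multi-layer neural network:
Given a feedforward neural network, we describe it with the following notation:
$L$: the number of layers in the network;
$n_l$: the number of neurons in layer $l$;
$f_l(\cdot)$: the activation function of the neurons in layer $l$;
$W^l \in \mathbb{R}^{n_l \times n_{l-1}}$: the weight matrix from layer $l-1$ to layer $l$;
$b^l \in \mathbb{R}^{n_l}$: the bias from layer $l-1$ to layer $l$;
$z^l \in \mathbb{R}^{n_l}$: the input to the neurons of layer $l$;
$a^l \in \mathbb{R}^{n_l}$: the output of the neurons of layer $l$.
The feedforward network propagates information with the following formulas:
$$z^l = W^l \cdot a^{l-1} + b^l, \qquad a^l = f_l(z^l)$$
The two formulas can be combined into:
$$z^l = W^l \cdot f_{l-1}(z^{l-1}) + b^l$$
Passing information layer by layer in this way gives the final network output y:
$$x = a^0 \to z^1 \to a^1 \to z^2 \to \cdots \to a^{L-1} \to z^L \to a^L = y$$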
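The layer-by-layer propagation above translates almost line for line into code. This is a minimal pure-Python sketch; the helper names `matvec` and `feedforward`, the 2-3-1 layer sizes and all weight values are hypothetical.

```python
import math


def matvec(W, a):
    """Matrix-vector product W·a for a list-of-rows matrix."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) for row in W]


def feedforward(x, weights, biases, activations):
    """Propagate x = a^0 through every layer and return a^L = y."""
    a = x
    for W, b, f in zip(weights, biases, activations):
        z = [zi + bi for zi, bi in zip(matvec(W, a), b)]  # z^l = W^l a^{l-1} + b^l
        a = [f(zi) for zi in z]                            # a^l = f_l(z^l)
    return a


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical network: 2 inputs, 3 hidden units (tanh), 1 output (sigmoid).
weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],  # W^1 has shape n_1 x n_0 = 3 x 2
    [[0.3, -0.1, 0.2]],                      # W^2 has shape n_2 x n_1 = 1 x 3
]
biases = [[0.0, 0.1, -0.1], [0.05]]
y = feedforward([1.0, 2.0], weights, biases, [math.tanh, sigmoid])
print(y)
```

Note how the shape of each W^l (n_l rows, n_{l-1} columns) is what makes the product with a^{l-1} well defined.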
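Backpropagation algorithm
Having seen the structure of a feedforward network, we take its information flow as the basis for deriving the backpropagation algorithm. First, to be clear: backpropagation exists to train a feedforward network better and faster, obtaining the weight and bias parameters of every layer.
Before deriving the theory of backpropagation, first look at a set of figures that illustrate the process intuitively. They are taken from: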
Principles of training multi-layer neural network using backpropagation
The project describes the teaching process of a multi-layer neural network employing the backpropagation algorithm. To illustrate this process, a three-layer neural network with two inputs and one output, shown in the picture below, is used:
Each neuron is composed of two units. The first unit adds the products of weight coefficients and input signals. The second unit realises a nonlinear function, called the neuron activation function. Signal e is the adder output signal, and y = f(e) is the output signal of the nonlinear element. Signal y is also the output signal of the neuron.
To teach the neural network we need a training data set. The training data set consists of input signals (x1 and x2) assigned with a corresponding target (desired output) z. Network training is an iterative process. In each iteration the weight coefficients of nodes are modified using new data from the training data set. The modification is calculated using the algorithm described below. Each teaching step starts with forcing both input signals from the training set. After this stage we can determine the output signal values for each neuron in each network layer. The pictures below illustrate how the signal propagates through the network. Symbols w(xm)n represent weights of connections between network input xm and neuron n in the input layer. Symbols yn represent the output signal of neuron n.
Propagation of signals through the hidden layer. Symbols wmn represent weights of connections between the output of neuron m and the input of neuron n in the next layer.
Propagation of signals through the output layer.
In the next algorithm step the output signal of the network y is compared with the desired output value (the target), which is found in the training data set. The difference is called the error signal δ of the output layer neuron.
It is impossible to compute the error signal for internal neurons directly, because the output values of these neurons are unknown. For many years an effective method for training multilayer networks was unknown. Only in the mid-eighties was the backpropagation algorithm worked out. The idea is to propagate the error signal δ (computed in a single teaching step) back to all neurons whose output signals were inputs to the neuron in question.
The weight coefficients wmn used to propagate errors back are equal to those used when computing the output value. Only the direction of data flow is changed (signals are propagated from output to inputs one after the other). This technique is used for all network layers. If propagated errors come from several neurons, they are added. The illustration is below:
When the error signal for each neuron is computed, the weight coefficients of each neuron input node may be modified. In the formulas below df(e)/de represents the derivative of the activation function of the neuron whose weights are modified.
Coefficient η affects network teaching speed. There are a few techniques to select this parameter. The first method is to start the teaching process with a large value of the parameter. While the weight coefficients are being established, the parameter is decreased gradually. The second, more complicated, method starts teaching with a small parameter value. During the teaching process the parameter is increased as the teaching advances and then decreased again in the final stage. Starting the teaching process with a low parameter value makes it possible to determine the signs of the weight coefficients.
References
Ryszard Tadeusiewicz, "Sieci neuronowe", Kraków 1992
mgr inż. Adam Gołda (2005)
Katedra Elektroniki AGH
Last modified: 06.09.2004
A Step by Step Backpropagation Example
Background
Backpropagation is a common method for training a neural network. There is no shortage of material online that attempts to explain how backpropagation works, but few that include an example with actual numbers. This post is my attempt to explain how it works with a concrete example that folks can compare their own calculations to in order to ensure they understand backpropagation correctly.
For this tutorial, we’re going to use a neural network with two inputs, two hidden neurons, and two output neurons. Additionally, the hidden and output neurons will include a bias.
Here’s the basic structure:
In order to have some numbers to work with, here are the initial weights, the biases, and training inputs/outputs:
The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.
For the rest of this tutorial we’re going to work with a single training set: given inputs 0.05 and 0.10, we want the neural network to output 0.01 and 0.99.
The Forward Pass
To begin, let’s see what the neural network currently predicts given the weights and biases above and inputs of 0.05 and 0.10. To do this we’ll feed those inputs forward through the network.
We figure out the total net input to each hidden layer neuron, squash the total net input using an activation function (here we use the logistic function), then repeat the process with the output layer neurons.
Total net input is also referred to as just net input by some sources.
Here’s how we calculate the total net input for $h_1$:
$$net_{h1} = w_1 \cdot i_1 + w_2 \cdot i_2 + b_1 \cdot 1$$
We then squash it using the logistic function to get the output of $h_1$:
$$out_{h1} = \frac{1}{1 + e^{-net_{h1}}}$$
Carrying out the same process for $h_2$ we get:
We repeat this process for the output layer neurons, using the output from the hidden layer neurons as inputs.
Here’s the output for $o_1$:
$$net_{o1} = w_5 \cdot out_{h1} + w_6 \cdot out_{h2} + b_2 \cdot 1, \qquad out_{o1} = \frac{1}{1 + e^{-net_{o1}}}$$
And carrying out the same process for $o_2$ we get:
Calculating the Total Error
We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:
$$E_{total} = \sum \tfrac{1}{2}(target - output)^2$$
Some sources refer to the target as the ideal and the output as the actual.
The $\tfrac{1}{2}$ is included so that the exponent is cancelled when we differentiate later on. The result is eventually multiplied by a learning rate anyway so it doesn’t matter that we introduce a constant here.
For example, the target output for $o_1$ is 0.01 but the neural network output 0., therefore its error is:
Repeating this process for $o_2$ (remembering that the target is 0.99) we get:
The total error for the neural network is the sum of these errors:
$$E_{total} = E_{o1} + E_{o2}$$
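The Backwards Pass
Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and the network as a whole.
Output Layer
Consider $w_5$. We want to know how much a change in $w_5$ affects the total error, aka $\frac{\partial E_{total}}{\partial w_5}$.
$\frac{\partial E_{total}}{\partial w_5}$ is read as “the partial derivative of $E_{total}$ with respect to $w_5$”. You can also say “the gradient with respect to $w_5$”.
By applying the chain rule we know that:
$$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial w_5}$$
Visually, here’s what we’re doing:
We need to figure out each piece in this equation.
First, how much does the total error change with respect to the output?
$$E_{total} = \tfrac{1}{2}(target_{o1} - out_{o1})^2 + \tfrac{1}{2}(target_{o2} - out_{o2})^2$$
$$\frac{\partial E_{total}}{\partial out_{o1}} = -(target_{o1} - out_{o1})$$
$-(target - out)$ is sometimes expressed as $out - target$.
When we take the partial derivative of the total error with respect to $out_{o1}$, the quantity $\tfrac{1}{2}(target_{o2} - out_{o2})^2$ becomes zero because $out_{o1}$ does not affect it, which means we’re taking the derivative of a constant, which is zero.
Next, how much does the output of $o_1$ change with respect to its total net input?
The partial derivative of the logistic function is the output multiplied by 1 minus the output:
$$\frac{\partial out_{o1}}{\partial net_{o1}} = out_{o1}(1 - out_{o1})$$
Finally, how much does the total net input of $o_1$ change with respect to $w_5$?
$$net_{o1} = w_5 \cdot out_{h1} + w_6 \cdot out_{h2} + b_2 \cdot 1, \qquad \frac{\partial net_{o1}}{\partial w_5} = out_{h1}$$
Putting it all together:
$$\frac{\partial E_{total}}{\partial w_5} = -(target_{o1} - out_{o1}) \cdot out_{o1}(1 - out_{o1}) \cdot out_{h1}$$
You’ll often see this calculation combined in the form of the delta rule.
Alternatively, we have $\frac{\partial E_{total}}{\partial out_{o1}}$ and $\frac{\partial out_{o1}}{\partial net_{o1}}$, which can be written as $\frac{\partial E_{total}}{\partial net_{o1}}$, aka $\delta_{o1}$ (the Greek letter delta), aka the node delta. We can use this to rewrite the calculation above:
$$\delta_{o1} = \frac{\partial E_{total}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}} = \frac{\partial E_{total}}{\partial net_{o1}} = -(target_{o1} - out_{o1}) \cdot out_{o1}(1 - out_{o1})$$
Therefore:
$$\frac{\partial E_{total}}{\partial w_5} = \delta_{o1} \cdot out_{h1}$$
Some sources extract the negative sign from $\delta$ so it would be written as:
$$\frac{\partial E_{total}}{\partial w_5} = -\delta_{o1} \cdot out_{h1}$$
To decrease the error, we then subtract this value from the current weight (optionally multiplied by some learning rate, eta, which we’ll set to 0.5):
$$w_5^{+} = w_5 - \eta \cdot \frac{\partial E_{total}}{\partial w_5}$$
Some sources use $\alpha$ (alpha) to represent the learning rate, others use $\eta$ (eta), and others even use $\epsilon$ (epsilon).
We can repeat this process to get the new weights $w_6$, $w_7$, and $w_8$.
We perform the actual updates in the neural network after we have the new weights leading into the hidden layer neurons (i.e., we use the original weights, not the updated weights, when we continue the backpropagation algorithm below).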
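Hidden Layer
Next, we’ll continue the backwards pass by calculating new values for $w_1$, $w_2$, $w_3$, and $w_4$.
Big picture, here’s what we need to figure out:
$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}$$
We’re going to use a similar process as we did for the output layer, but slightly different to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore error) of multiple output neurons. We know that $out_{h1}$ affects both $out_{o1}$ and $out_{o2}$, therefore $\frac{\partial E_{total}}{\partial out_{h1}}$ needs to take into consideration its effect on both output neurons:
$$\frac{\partial E_{total}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}}$$
Starting with $\frac{\partial E_{o1}}{\partial out_{h1}}$:
$$\frac{\partial E_{o1}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial out_{h1}}$$
We can calculate $\frac{\partial E_{o1}}{\partial net_{o1}}$ using values we calculated earlier:
$$\frac{\partial E_{o1}}{\partial net_{o1}} = \frac{\partial E_{o1}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}}$$
And $\frac{\partial net_{o1}}{\partial out_{h1}}$ is equal to $w_5$:
$$net_{o1} = w_5 \cdot out_{h1} + w_6 \cdot out_{h2} + b_2 \cdot 1, \qquad \frac{\partial net_{o1}}{\partial out_{h1}} = w_5$$
Plugging them in:
$$\frac{\partial E_{o1}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial net_{o1}} \cdot w_5$$
Following the same process for $\frac{\partial E_{o2}}{\partial out_{h1}}$, we get:
$$\frac{\partial E_{o2}}{\partial out_{h1}} = \frac{\partial E_{o2}}{\partial net_{o2}} \cdot w_7$$
Therefore:
$$\frac{\partial E_{total}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial net_{o1}} \cdot w_5 + \frac{\partial E_{o2}}{\partial net_{o2}} \cdot w_7$$
Now that we have $\frac{\partial E_{total}}{\partial out_{h1}}$, we need to figure out $\frac{\partial out_{h1}}{\partial net_{h1}}$ and then $\frac{\partial net_{h1}}{\partial w}$ for each weight:
$$\frac{\partial out_{h1}}{\partial net_{h1}} = out_{h1}(1 - out_{h1})$$
We calculate the partial derivative of the total net input to $h_1$ with respect to $w_1$ the same as we did for the output neuron:
$$net_{h1} = w_1 \cdot i_1 + w_2 \cdot i_2 + b_1 \cdot 1, \qquad \frac{\partial net_{h1}}{\partial w_1} = i_1$$
Putting it all together:
$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} \cdot out_{h1}(1 - out_{h1}) \cdot i_1$$
You might also see this written as:
$$\frac{\partial E_{total}}{\partial w_1} = \Bigl(\sum_{o} \frac{\partial E_{total}}{\partial out_{o}} \cdot \frac{\partial out_{o}}{\partial net_{o}} \cdot \frac{\partial net_{o}}{\partial out_{h1}}\Bigr) \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}$$
We can now update $w_1$:
$$w_1^{+} = w_1 - \eta \cdot \frac{\partial E_{total}}{\partial w_1}$$
Repeating this for $w_2$, $w_3$, and $w_4$: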
Finally, we’ve updated all of our weights! When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.. After this first round of backpropagation, the total error is now down to 0.. It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to 0.. At this point, when we feed forward 0.05 and 0.1, the two output neurons generate 0. (vs 0.01 target) and 0. (vs 0.99 target).
If you’ve made it this far and found any errors in any of the above or can think of any ways to make it clearer for future readers, don’t hesitate to let me know. Thanks!
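The whole walkthrough (forward pass, output-layer deltas, hidden-layer deltas, weight update with η = 0.5) can be condensed into a short Python sketch. The inputs 0.05 and 0.10, the targets 0.01 and 0.99 and the learning rate come from the text; the initial weights and biases are not shown in the text, so the values below are placeholders chosen for this sketch, and the function names are hypothetical. As in the walkthrough, only the weights are updated, and the hidden-layer gradients use the original output-layer weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b1, b2, i1, i2):
    """w holds w1..w8 as w[0]..w[7]; returns out_h1, out_h2, out_o1, out_o2."""
    out_h1 = sigmoid(w[0] * i1 + w[1] * i2 + b1)
    out_h2 = sigmoid(w[2] * i1 + w[3] * i2 + b1)
    out_o1 = sigmoid(w[4] * out_h1 + w[5] * out_h2 + b2)
    out_o2 = sigmoid(w[6] * out_h1 + w[7] * out_h2 + b2)
    return out_h1, out_h2, out_o1, out_o2


def total_error(outs, t1, t2):
    return 0.5 * (t1 - outs[2]) ** 2 + 0.5 * (t2 - outs[3]) ** 2


def backprop_step(w, b1, b2, i1, i2, t1, t2, eta=0.5):
    out_h1, out_h2, out_o1, out_o2 = forward(w, b1, b2, i1, i2)
    # Output-layer node deltas: dE_total/dnet_o
    d_o1 = -(t1 - out_o1) * out_o1 * (1 - out_o1)
    d_o2 = -(t2 - out_o2) * out_o2 * (1 - out_o2)
    # Hidden-layer deltas: each hidden output feeds both output neurons.
    d_h1 = (d_o1 * w[4] + d_o2 * w[6]) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w[5] + d_o2 * w[7]) * out_h2 * (1 - out_h2)
    grads = [d_h1 * i1, d_h1 * i2, d_h2 * i1, d_h2 * i2,
             d_o1 * out_h1, d_o1 * out_h2, d_o2 * out_h1, d_o2 * out_h2]
    new_w = [wi - eta * gi for wi, gi in zip(w, grads)]
    return new_w, grads


# Placeholder initial values (assumptions, not taken from the text).
w = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]
b1, b2 = 0.35, 0.60
i1, i2, t1, t2 = 0.05, 0.10, 0.01, 0.99

before = total_error(forward(w, b1, b2, i1, i2), t1, t2)
new_w, grads = backprop_step(w, b1, b2, i1, i2, t1, t2)
after = total_error(forward(new_w, b1, b2, i1, i2), t1, t2)
print("error before:", before)
print("error after one step:", after)
```

Calling `backprop_step` repeatedly, feeding `new_w` back in each time, reproduces the "repeat this process many times" loop from the text.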
Original article: http://blog.csdn.net/luxialan/article/details/
//a-step-by-step-backpropagation-example/