Author: chen_h

WeChat & QQ: 862251340

WeChat official account: coderpai


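This tutorial is a translation of a neural network tutorial, published with the permission of the original author.

The series is an introduction to neural networks and consists of five parts.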

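Hidden layer

This part of the tutorial covers three topics:

Hidden layer design

Non-linear activation functions

The backpropagation (BP) algorithm

The previous parts of this series covered very simple models: a single regression model or a single classification model. In this part we again build a binary classifier, but this time as a small neural network. The input is one-dimensional, the hidden layer has a single neuron, and that neuron uses a non-linear activation function. In short, an input x feeds a hidden neuron h, which in turn feeds the output y.

We first import the packages used in this tutorial.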

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import colorConverter, ListedColormap
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

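Defining the dataset

In this tutorial the input samples x belong to two classes: the blue class with t = 1 and the red class with t = 0. The red class is bimodal: its samples lie in two clusters, one on each side of the blue class. All samples are one-dimensional, but no single threshold separates the two classes, so they are not linearly separable. The plotting code below shows this.

A purely linear logistic classifier cannot classify these samples correctly. That is why we add a hidden neuron with a non-linear activation function to the model.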

# Define and generate the samples
nb_of_samples_per_class = 20  # The number of samples in each class
blue_mean = [0]  # The mean of the blue class
red_left_mean = [-2]  # The mean of the left red cluster
red_right_mean = [2]  # The mean of the right red cluster
std_dev = 0.5  # Standard deviation of both classes
# Generate samples from both classes
x_blue = np.random.randn(nb_of_samples_per_class, 1) * std_dev + blue_mean
# Use integer division (//): randn only accepts integer dimensions in Python 3
x_red_left = np.random.randn(nb_of_samples_per_class // 2, 1) * std_dev + red_left_mean
x_red_right = np.random.randn(nb_of_samples_per_class // 2, 1) * std_dev + red_right_mean
# Merge samples in set of input variables x, and corresponding set of
# output variables t
x = np.vstack((x_blue, x_red_left, x_red_right))
t = np.vstack((np.ones((x_blue.shape[0], 1)),
               np.zeros((x_red_left.shape[0], 1)),
               np.zeros((x_red_right.shape[0], 1))))

# Plot samples from both classes as lines on a 1D space
plt.figure(figsize=(8, 0.5))
plt.xlim(-3, 3)
plt.ylim(-1, 1)
# Plot samples
plt.plot(x_blue, np.zeros_like(x_blue), 'b|', ms=30)
plt.plot(x_red_left, np.zeros_like(x_red_left), 'r|', ms=30)
plt.plot(x_red_right, np.zeros_like(x_red_right), 'r|', ms=30)
plt.gca().axes.get_yaxis().set_visible(False)
plt.title('Input samples from the blue and red class')
plt.xlabel('$x$', fontsize=15)
plt.show()

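Non-linear activation function

The non-linear transfer function used here is the Gaussian radial basis function (RBF). Outside of RBF networks, the RBF is not often used as an activation function in neural networks; the sigmoid is a more common choice. For the input data x designed here, however, the RBF separates the blue samples from the red samples well. The code below plots the RBF, which is defined as:

$$\text{RBF}(z) = e^{-z^2}$$

The derivative of the RBF is:

$$\frac{d\,\text{RBF}(z)}{dz} = -2 z e^{-z^2} = -2 z \cdot \text{RBF}(z)$$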

# Define the rbf function
def rbf(z):
    return np.exp(-z**2)

# Plot the rbf function
z = np.linspace(-6, 6, 100)
plt.plot(z, rbf(z), 'b-')
plt.xlabel('$z$', fontsize=15)
plt.ylabel('$e^{-z^2}$', fontsize=15)
plt.title('RBF function')
plt.grid()
plt.show()
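The derivative of the RBF can be sanity-checked numerically. The sketch below uses only the standard library so it runs on its own; `rbf_derivative` and `central_difference` are helper names introduced here for illustration and are not part of the tutorial code.

```python
import math

def rbf(z):
    """Gaussian RBF with the same definition as in the tutorial: exp(-z^2)."""
    return math.exp(-z ** 2)

def rbf_derivative(z):
    """Analytic derivative: d/dz exp(-z^2) = -2 * z * exp(-z^2)."""
    return -2.0 * z * rbf(z)

def central_difference(f, z, eps=1e-6):
    """Numerical derivative estimate (illustrative helper, an assumption of this sketch)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Compare the analytic and the numerical derivative at a few points
for z in (-1.5, -0.5, 0.0, 0.5, 1.5):
    analytic = rbf_derivative(z)
    numeric = central_difference(rbf, z)
    print('z = {:+.1f}  analytic = {:+.6f}  numeric = {:+.6f}'.format(z, analytic, numeric))
```

The two columns should agree to several decimal places. The derivative is zero at the peak z = 0 and changes sign there, which is the behavior the weight update of the hidden layer relies on.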

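The backpropagation algorithm

We train the model with the backpropagation (BP) algorithm, a standard way to optimize a neural network. Each iteration of BP consists of two steps:

A forward pass that computes the output of the network.

A backward pass that propagates the error between the network output and the true target back through the network to update its parameters.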

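1. Forward pass

In the forward pass, the input is processed layer by layer until the model produces its output.

Computing the hidden-layer activation

The activation h of the hidden layer is:

$$h = \text{RBF}(x \cdot w_h) = e^{-(x \cdot w_h)^2}$$

where wh is the weight parameter. This is implemented by the function hidden_activations(x, wh).

Computing the output activation

The output layer takes the hidden activation h as its input and uses the logistic function as its activation function:

$$y = \sigma(h \cdot w_o - 1) = \frac{1}{1 + e^{-(h \cdot w_o - 1)}}$$

where wo is the weight of the output layer. This is implemented by the function output_activations(h, wo). Note the bias term -1 in the formula. Without a bias, the logistic function can only learn a decision boundary that passes through the origin. The RBF in the hidden layer only produces values in the range (0, 1], so h is never negative. Without a bias in the output layer, no sample could ever fall below 0, that is, on the left side of the decision boundary, and the model could not learn a useful classifier. We therefore add an intercept, the bias term. Normally the bias is trained like any other weight, but because this model is so simple we use a constant instead.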

# Define the logistic function
def logistic(z):
    return 1 / (1 + np.exp(-z))

# Function to compute the hidden activations
def hidden_activations(x, wh):
    return rbf(x * wh)

# Define output layer feedforward
def output_activations(h, wo):
    return logistic(h * wo - 1)

# Define the neural network function
def nn(x, wh, wo):
    return output_activations(hidden_activations(x, wh), wo)

# Define the neural network prediction function that only returns
#  1 or 0 depending on the predicted class
def nn_predict(x, wh, wo):
    return np.around(nn(x, wh, wo))
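To make the role of the constant bias concrete, here is a small standalone sketch. The hidden outputs and the output weight are made-up illustrative values, not results from the tutorial; the point is only that positive hidden activations all land on the same side of 0.5 unless the bias shifts them.

```python
import math

def logistic(z):
    """Logistic function, re-defined here with the standard library only."""
    return 1.0 / (1.0 + math.exp(-z))

# Example RBF outputs: always positive. These values are assumptions for illustration.
hidden_outputs = [0.02, 0.3, 0.9]
wo = 5.0  # illustrative output weight, not a trained value

# Without the bias, every positive h gives an output above 0.5 (for wo > 0)
without_bias = [logistic(h * wo) for h in hidden_outputs]
# With the constant bias -1, small h falls below 0.5 and large h stays above it
with_bias = [logistic(h * wo - 1) for h in hidden_outputs]

for h, y0, y1 in zip(hidden_outputs, without_bias, with_bias):
    print('h = {:.2f}  no bias: {:.3f}  with bias: {:.3f}'.format(h, y0, y1))
```

With a negative wo the situation is mirrored: without the bias, every output would fall below 0.5. Either way the network would assign all samples to one class.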

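2. Backward pass

In the backward pass we first compute the error between the network output and the true target. This error is then propagated backwards, layer by layer, to update each weight of the network.

In each layer, gradient descent updates every parameter in the direction of the negative gradient.

The parameters wh and wo are updated as w(k+1)=w(k)−Δw(k+1), where Δw=μ∗∂ξ/∂w, μ is the learning rate, and ∂ξ/∂w is the gradient of the cost function ξ with respect to the parameter w.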
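The update rule can be tried out on a toy problem before applying it to the network. The sketch below minimizes the one-parameter cost (w − 3)², which is an assumption chosen purely for illustration and has nothing to do with the tutorial's cross-entropy cost; only the update rule itself is the same.

```python
def cost_gradient(w):
    """Gradient of the toy cost (w - 3)^2. This cost is an assumption for illustration."""
    return 2.0 * (w - 3.0)

mu = 0.1   # learning rate
w = 0.0    # initial parameter value
for k in range(50):
    delta_w = mu * cost_gradient(w)   # delta_w = mu * d(cost)/dw
    w = w - delta_w                   # w(k+1) = w(k) - delta_w(k+1)

print('w after 50 updates: {:.4f}'.format(w))
```

Each update moves w a fraction of the way towards the minimum at w = 3. With a much larger learning rate the same loop would overshoot, which is the issue the tutorial discusses for the steep parts of the cost surface.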

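Computing the cost function

The cost function ξ used in this model is the cross-entropy cost function:

$$\xi(t, y) = -\sum_{i=1}^{n} \left[ t_i \log(y_i) + (1 - t_i) \log(1 - y_i) \right]$$

The plotting code below shows the cost as a function of the parameters wh and wo. The error surface is not convex, and the cost is mirrored along the axis wh = 0: the values wh and -wh give the same cost.

The surface also has a very steep gradient around wh = 0 for wo > 0, and gradient descent has to move along the lower edge of this peak. If the learning rate is too large, an update can jump over the minimum, from one side of the slope to the other, because the steep gradient makes every parameter update very large. We therefore start with a small learning rate.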

# Define the cost function
def cost(y, t):
    return -np.sum(np.multiply(t, np.log(y)) + np.multiply((1 - t), np.log(1 - y)))

# Define a function to calculate the cost for a given set of parameters
def cost_for_param(x, wh, wo, t):
    return cost(nn(x, wh, wo), t)

# Plot the cost as a function of the weights
# Define a vector of weights for which we want to plot the cost
nb_of_ws = 200  # compute the cost nb_of_ws times in each dimension
wsh = np.linspace(-10, 10, num=nb_of_ws)  # hidden weights
wso = np.linspace(-10, 10, num=nb_of_ws)  # output weights
ws_x, ws_y = np.meshgrid(wsh, wso)  # generate grid
cost_ws = np.zeros((nb_of_ws, nb_of_ws))  # initialize cost matrix
# Fill the cost matrix for each combination of weights
for i in range(nb_of_ws):
    for j in range(nb_of_ws):
        cost_ws[i, j] = cost(nn(x, ws_x[i, j], ws_y[i, j]), t)
# Plot the cost function surface
fig = plt.figure()
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib versions
ax = fig.add_subplot(projection='3d')
# plot the surface
surf = ax.plot_surface(ws_x, ws_y, cost_ws, linewidth=0, cmap=cm.pink)
ax.view_init(elev=60, azim=-30)
cbar = fig.colorbar(surf)
ax.set_xlabel('$w_h$', fontsize=15)
ax.set_ylabel('$w_o$', fontsize=15)
ax.set_zlabel('$\\xi$', fontsize=15)
cbar.ax.set_ylabel('$\\xi$', fontsize=15)
plt.title('Cost function surface')
plt.grid()
plt.show()
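The symmetry of the cost around wh = 0 follows from the RBF squaring its input, so x∗wh and x∗(−wh) give the same hidden activation. The standalone sketch below checks this on a tiny hand-made dataset; the sample values and weights are assumptions for illustration, and `cost_for_weights` is a helper name introduced here.

```python
import math

def rbf(z):
    return math.exp(-z ** 2)

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def nn(x, wh, wo):
    """Same network as in the tutorial: logistic(rbf(x * wh) * wo - 1)."""
    return logistic(rbf(x * wh) * wo - 1)

def cost_for_weights(xs, ts, wh, wo):
    """Cross-entropy cost summed over all samples (illustrative helper)."""
    total = 0.0
    for x, t in zip(xs, ts):
        y = nn(x, wh, wo)
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total

# Hand-made samples: red (t = 0) on both sides, blue (t = 1) in the middle
xs = [-2.1, -1.8, -0.3, 0.1, 0.4, 1.9, 2.2]
ts = [0, 0, 1, 1, 1, 0, 0]

cost_pos = cost_for_weights(xs, ts, 1.5, 4.0)
cost_neg = cost_for_weights(xs, ts, -1.5, 4.0)
print('cost at wh = +1.5: {:.6f}'.format(cost_pos))
print('cost at wh = -1.5: {:.6f}'.format(cost_neg))
```

Both lines print the same value. A practical consequence is that the sign of wh found by training is arbitrary: a solution at wh and its mirror image at −wh classify identically.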

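Updating the output layer

∂ξi/∂wo is the gradient at the output for each sample i. Following the same approach as in the earlier parts of this series, it can be derived as:

$$\frac{\partial \xi_i}{\partial w_o} = \frac{\partial z_{oi}}{\partial w_o} \frac{\partial y_i}{\partial z_{oi}} \frac{\partial \xi_i}{\partial y_i} = h_i (y_i - t_i) = h_i \delta_{oi}$$

where zoi=hi∗wo−1 is the input to the logistic function, hi is the hidden activation of sample i, and ∂ξi/∂zoi=δoi is the gradient of the error with respect to the input of the output layer.

gradient_output(y, t) implements δo, and gradient_weight_out(h, grad_output) implements ∂ξ/∂wo.

Updating the hidden layer

∂ξi/∂wh is the gradient at the hidden layer for each sample i. It is computed as:

$$\frac{\partial \xi_i}{\partial w_h} = \frac{\partial z_{hi}}{\partial w_h} \frac{\partial h_i}{\partial z_{hi}} \frac{\partial \xi_i}{\partial h_i}$$

where zhi=xi∗wh is the input to the hidden neuron and hi=RBF(zhi) is its activation.

∂ξi/∂zhi=δhi is the gradient of the error with respect to the input of the hidden layer. It can also be read as the contribution of zhi to the final error. This error gradient δhi is defined as:

$$\delta_{hi} = \frac{\partial \xi_i}{\partial z_{hi}} = \frac{\partial h_i}{\partial z_{hi}} \frac{\partial z_{oi}}{\partial h_i} \frac{\partial \xi_i}{\partial z_{oi}} = -2 z_{hi} h_i \cdot w_o \cdot \delta_{oi}$$

Because ∂zhi/∂wh=xi, the final gradient is:

$$\frac{\partial \xi_i}{\partial w_h} = x_i \delta_{hi}$$

In batch processing, the gradients of all samples are summed per parameter to give the final gradient.

gradient_hidden(wo, grad_output) implements the factor wo∗δo of δh, which is the gradient of the error with respect to the hidden activation h.

gradient_weight_hidden(x, zh, h, grad_hidden) implements ∂ξ/∂wh by multiplying that factor with the RBF derivative -2∗zh∗h and with x.

backprop_update(x, t, wh, wo, learning_rate) implements one iteration of the BP algorithm.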

# Define the error function
def gradient_output(y, t):
    return y - t

# Define the gradient function for the weight parameter at the output layer
def gradient_weight_out(h, grad_output):
    return h * grad_output

# Define the gradient function for the hidden layer
def gradient_hidden(wo, grad_output):
    return wo * grad_output

# Define the gradient function for the weight parameter at the hidden layer
def gradient_weight_hidden(x, zh, h, grad_hidden):
    return x * -2 * zh * h * grad_hidden

# Define the update function to update the network parameters over 1 iteration
def backprop_update(x, t, wh, wo, learning_rate):
    # Compute the output of the network
    # This can be done with y = nn(x, wh, wo), but we need the intermediate
    #  h and zh for the weight updates.
    zh = x * wh
    h = rbf(zh)  # hidden_activations(x, wh)
    y = output_activations(h, wo)
    # Compute the gradient at the output
    grad_output = gradient_output(y, t)
    # Get the delta for wo
    d_wo = learning_rate * gradient_weight_out(h, grad_output)
    # Compute the gradient at the hidden layer
    grad_hidden = gradient_hidden(wo, grad_output)
    # Get the delta for wh
    d_wh = learning_rate * gradient_weight_hidden(x, zh, h, grad_hidden)
    # Return the updated parameters
    return (wh - d_wh.sum(), wo - d_wo.sum())
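A common way to gain confidence in hand-derived gradients is a finite-difference gradient check. The standalone sketch below re-implements the network with the standard library and compares the backpropagation gradients with numerical estimates. The dataset, the weights, and the helper names `analytic_gradients` and `numerical_gradients` are assumptions made for this illustration.

```python
import math

def rbf(z):
    return math.exp(-z ** 2)

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def total_cost(xs, ts, wh, wo):
    """Cross-entropy cost of the network, summed over all samples."""
    total = 0.0
    for x, t in zip(xs, ts):
        y = logistic(rbf(x * wh) * wo - 1)
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total

def analytic_gradients(xs, ts, wh, wo):
    """Backpropagation gradients, summed over the batch."""
    grad_wh, grad_wo = 0.0, 0.0
    for x, t in zip(xs, ts):
        zh = x * wh
        h = rbf(zh)
        y = logistic(h * wo - 1)
        delta_o = y - t                                # error at the output
        grad_wo += h * delta_o                         # d(cost)/d(wo)
        grad_wh += x * (-2.0 * zh * h) * wo * delta_o  # d(cost)/d(wh)
    return grad_wh, grad_wo

def numerical_gradients(xs, ts, wh, wo, eps=1e-6):
    """Central-difference estimates of the same two gradients."""
    grad_wh = (total_cost(xs, ts, wh + eps, wo) - total_cost(xs, ts, wh - eps, wo)) / (2 * eps)
    grad_wo = (total_cost(xs, ts, wh, wo + eps) - total_cost(xs, ts, wh, wo - eps)) / (2 * eps)
    return grad_wh, grad_wo

# Hand-made samples and weights, chosen only for this check
xs = [-2.1, -1.8, -0.3, 0.1, 0.4, 1.9, 2.2]
ts = [0, 0, 1, 1, 1, 0, 0]
a_wh, a_wo = analytic_gradients(xs, ts, 1.5, 4.0)
n_wh, n_wo = numerical_gradients(xs, ts, 1.5, 4.0)
print('d cost / d wh: analytic {:+.6f}, numerical {:+.6f}'.format(a_wh, n_wh))
print('d cost / d wo: analytic {:+.6f}, numerical {:+.6f}'.format(a_wo, n_wo))
```

If the two columns disagree beyond a small tolerance, the usual suspects are a missing factor in the chain rule or a sign error in the activation derivative.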

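Backpropagation updates

The code below runs backpropagation for 50 iterations. The white markers show the parameters wh and wo on the error surface at each iteration k.

During the updates the learning rate is decreased linearly so that it reaches 0 at the last update. This keeps the final parameter updates from oscillating around the minimum.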

# Run backpropagation
# Set the initial weight parameters
wh = 2
wo = -5
# Set the learning rate
learning_rate = 0.2
# Start the gradient descent updates and plot the iterations
nb_of_iterations = 50  # number of gradient descent updates
lr_update = learning_rate / nb_of_iterations  # learning rate update rule
# List to store the weight values over the iterations
w_cost_iter = [(wh, wo, cost_for_param(x, wh, wo, t))]
for i in range(nb_of_iterations):
    learning_rate -= lr_update  # decrease the learning rate
    # Update the weights via backpropagation
    wh, wo = backprop_update(x, t, wh, wo, learning_rate)
    # Store the values for plotting
    w_cost_iter.append((wh, wo, cost_for_param(x, wh, wo, t)))
# Print the final cost
print('final cost is {:.2f} for weights wh: {:.2f} and wo: {:.2f}'.format(cost_for_param(x, wh, wo, t), wh, wo))

On our machine, the final output is:

final cost is 10.81 for weights wh: 1.20 and wo: 5.56

Because the samples are generated at random without a fixed seed, the result on your machine may differ.
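The linear learning-rate schedule is easy to inspect in isolation. This sketch repeats only the schedule from the training loop, with the same starting rate of 0.2 and 50 iterations, and records the rate used at each step; the list `rates` is a name introduced here.

```python
learning_rate = 0.2
nb_of_iterations = 50
lr_update = learning_rate / nb_of_iterations  # amount subtracted before every update

rates = []  # learning rate actually used at each iteration
for i in range(nb_of_iterations):
    learning_rate -= lr_update
    rates.append(learning_rate)

print('first rate used: {:.3f}'.format(rates[0]))
print('last rate used:  {:.3f}'.format(abs(rates[-1])))
```

Because the rate is decreased before each update, the first step already uses 0.196 rather than 0.2, and the last step uses a rate of essentially 0, so it leaves the weights practically unchanged.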

# Plot the weight updates on the error surface
# Plot the error surface
fig = plt.figure()
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib versions
ax = fig.add_subplot(projection='3d')
surf = ax.plot_surface(ws_x, ws_y, cost_ws, linewidth=0, cmap=cm.pink)
ax.view_init(elev=60, azim=-30)
cbar = fig.colorbar(surf)
cbar.ax.set_ylabel('$\\xi$', fontsize=15)
# Plot the updates
for i in range(1, len(w_cost_iter)):
    wh1, wo1, c1 = w_cost_iter[i-1]
    wh2, wo2, c2 = w_cost_iter[i]
    # Plot the weight-cost value and the line that represents the update
    ax.plot([wh1], [wo1], [c1], 'w+')  # Plot the weight cost value
    ax.plot([wh1, wh2], [wo1, wo2], [c1, c2], 'w-')
# Plot the last weights
wh1, wo1, c1 = w_cost_iter[len(w_cost_iter)-1]
ax.plot([wh1], [wo1], [c1], 'w+')
# Show figure
ax.set_xlabel('$w_h$', fontsize=15)
ax.set_ylabel('$w_o$', fontsize=15)
ax.set_zlabel('$\\xi$', fontsize=15)
plt.title('Gradient descent updates on cost surface')
plt.grid()
plt.show()

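Visualizing the classification result

The code below visualizes the final classification. Over the input domain, the blue and red background colors show the class the network assigns to each position. In the resulting figure, all samples are classified correctly.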

# Plot the resulting decision boundary
# Generate a grid over the input space to plot the color of the
#  classification at that grid point
nb_of_xs = 100
xs = np.linspace(-3, 3, num=nb_of_xs)
ys = np.linspace(-1, 1, num=nb_of_xs)
xx, yy = np.meshgrid(xs, ys)  # create the grid
# Initialize and fill the classification plane
classification_plane = np.zeros((nb_of_xs, nb_of_xs))
for i in range(nb_of_xs):
    for j in range(nb_of_xs):
        classification_plane[i, j] = nn_predict(xx[i, j], wh, wo)
# Create a color map to show the classification colors of each grid point
cmap = ListedColormap([
    colorConverter.to_rgba('r', alpha=0.25),
    colorConverter.to_rgba('b', alpha=0.25)])
# Plot the classification plane with decision boundary and input samples
plt.figure(figsize=(8, 0.5))
plt.contourf(xx, yy, classification_plane, cmap=cmap)
plt.xlim(-3, 3)
plt.ylim(-1, 1)
# Plot samples from both classes as lines on a 1D space
plt.plot(x_blue, np.zeros_like(x_blue), 'b|', ms=30)
plt.plot(x_red_left, np.zeros_like(x_red_left), 'r|', ms=30)
plt.plot(x_red_right, np.zeros_like(x_red_right), 'r|', ms=30)
plt.gca().axes.get_yaxis().set_visible(False)
plt.title('Input samples and their classification')
plt.xlabel('x')
plt.show()

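Transformation of the input domain

How can the network classify non-linearly when its final layer is a linear logistic classifier? The key is the non-linear RBF in the hidden layer. The RBF maps samples near the origin (the blue class) to values clearly above 0, and samples far from the origin (the red class) to values close to 0. As the projection plotted by the code below shows, the red samples all end up on the left, near 0, while the blue samples lie further away from 0. This projected data is linearly separable, so the logistic output layer can classify it.

Also note that the peak of the Gaussian function used here has an offset of 0, meaning the Gaussian is centered at the origin.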

# Plot projected samples from both classes as lines on a 1D space
plt.figure(figsize=(8, 0.5))
plt.xlim(-0.01, 1)
plt.ylim(-1, 1)
# Plot projected samples
plt.plot(hidden_activations(x_blue, wh), np.zeros_like(x_blue), 'b|', ms=30)
plt.plot(hidden_activations(x_red_left, wh), np.zeros_like(x_red_left), 'r|', ms=30)
plt.plot(hidden_activations(x_red_right, wh), np.zeros_like(x_red_right), 'r|', ms=30)
plt.gca().axes.get_yaxis().set_visible(False)
plt.title('Projection of the input samples by the hidden layer.')
plt.xlabel('h')
plt.show()
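For this small network the decision boundary in the input space can also be worked out by hand. The output equals 0.5 when h∗wo − 1 = 0, that is when exp(−(x∗wh)²) = 1/wo, which gives |x| = sqrt(ln(wo)) / |wh| for wo > 1. The sketch below evaluates this for the weights printed in the example run above (wh = 1.20, wo = 5.56); `decision_boundary` is a helper name introduced here, not part of the tutorial code.

```python
import math

def rbf(z):
    return math.exp(-z ** 2)

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def decision_boundary(wh, wo):
    """Return |x| at which the network output equals 0.5.

    Derived from h * wo - 1 = 0 with h = exp(-(x * wh)^2).
    Only defined for wo > 1 and wh != 0 (an assumption of this sketch).
    """
    if wo <= 1 or wh == 0:
        raise ValueError('no finite non-zero boundary for these weights')
    return math.sqrt(math.log(wo)) / abs(wh)

# Weights taken from the example output printed earlier in the tutorial
wh, wo = 1.20, 5.56
boundary = decision_boundary(wh, wo)
output_at_boundary = logistic(rbf(boundary * wh) * wo - 1)
print('samples with |x| < {:.2f} are classified as blue'.format(boundary))
print('network output at the boundary: {:.3f}'.format(output_at_boundary))
```

With these weights the boundary lies a little beyond |x| = 1, roughly halfway between the blue mean at 0 and the red means at ±2, which matches the colored regions in the classification plot.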
