10.2 Single Layer Neural Network
Let’s consider a dataset made of \(p\) predictors
\[X=(X_1,X_2,X_3,...,X_p)\]
and build a nonlinear function \(f(X)\) to predict a response \(Y\).
\[f(X)=\beta_0+\sum_{k=1}^K{\beta_kh_k(X)}\]
where \(h_k(X)\), \(k=1,...,K\), are the \(K\) activations of the hidden layer: transformations of the input \(X\), denoted \(A_k\), which are not directly observed.
\[A_k=h_k(X)\] Each activation is a nonlinear transformation \(g(z)\) of a linear combination of the inputs
\[A_k=h_k(X)=g(z)\]
\[A_k=h_k(X)=g(w_{k0}+\sum_{j=1}^p{w_{kj}X_j})\] The output layer is then a linear model that uses these activations \(A_k\) as inputs, resulting in the function \(f(X)\).
\[f(X)=\beta_0+\sum_{k=1}^K{\beta_kA_k}\] Each \(A_k=h_k(X)\) is a different transformation of the original features \(X\).
The parameters \(\beta_0,...,\beta_K\) and \(w_{10},...,w_{Kp}\) need to be estimated from the data.
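For instance, with \(p=3\) predictors and \(K=2\) hidden units, the model written out in full is
\[f(X)=\beta_0+\beta_1\,g(w_{10}+w_{11}X_1+w_{12}X_2+w_{13}X_3)+\beta_2\,g(w_{20}+w_{21}X_1+w_{22}X_2+w_{23}X_3)\]
so there are \(K(p+1)=8\) weights and biases plus \(K+1=3\) coefficients \(\beta_k\), eleven parameters in total.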
What about the activation function \(g(z)\)? There are various options, but the most commonly used ones are:
- sigmoid
\[g(z)=\frac{e^z}{1+e^z}\]
- ReLU (rectified linear unit)
\[g(z)=(z)_+=\left\{ \begin{array}{ll} 0 & \mbox{if } z<0;\\ z & \mbox{otherwise}.\end{array}\right.\]
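To make the two activation functions concrete, here is a minimal NumPy sketch; the function names `sigmoid` and `relu` and the test values are our own choices, not from the book.

```python
import numpy as np

def sigmoid(z):
    # g(z) = e^z / (1 + e^z), equivalently 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # g(z) = (z)_+ : zero for negative inputs, the identity otherwise
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))  # values in (0, 1): approx. [0.119 0.5 0.953]
print(relu(z))     # [0. 0. 3.]
```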
This is the structure of a single layer neural network. Here we can see the input layer, the hidden layer and the output layer.
In this example, we see deep learning applied to a dosage/efficacy study, with the model parameters shown and the activation function in the middle.
The parameters can be estimated with backpropagation, which optimizes the weights for the coefficients \(w_{kj}\) and the biases for the intercepts \(w_{k0}\). We will come back to this later in these notes. For now we assume that the parameter values are known, and we investigate how the deep learning model computes its prediction.
\[f(X)=\beta_0+\sum_{k=1}^K{\beta_k g\left(w_{k0}+\sum_{j=1}^p{w_{kj}X_j}\right)}\]
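A minimal NumPy sketch of this calculation, assuming the parameters \(w_{kj}\), \(w_{k0}\), \(\beta_k\) and \(\beta_0\) are already known; the function name `forward` and all the numbers below are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, w0, beta, beta0, g=sigmoid):
    # x:    input vector of length p
    # W:    K x p matrix of weights w_kj
    # w0:   length-K vector of biases w_k0
    # beta: length-K vector of output coefficients beta_k
    z = w0 + W @ x           # one linear combination per hidden unit
    A = g(z)                 # activations A_k = g(w_k0 + sum_j w_kj x_j)
    return beta0 + beta @ A  # output layer: linear model in the A_k

# hypothetical parameters with p = 3 inputs and K = 2 hidden units
W = np.array([[0.5, -1.0, 0.2],
              [1.5,  0.3, -0.7]])
w0 = np.array([0.1, -0.4])
beta = np.array([2.0, -1.0])
beta0 = 0.5
print(forward(np.array([1.0, 2.0, 0.5]), W, w0, beta, beta0))  # a single prediction
```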
Neural Networks Pt. 1: Inside the Black Box
This is from the book, p. 406, where you can see all the steps for calculating the estimated \(f(X)\), assuming that we know the values of the parameters.
Fitting a neural network to a quantitative response, i.e. estimating the unknown parameters \(w_{kj}\) and \(\beta_{k}\), requires minimizing the squared-error loss.
Squared-error loss:
\[\min\sum_{i=1}^n{(y_i-f(x_i))^2}\]
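As a small illustration, the quantity being minimized can be computed directly from the observed responses and the fitted values; the numbers below are made up.

```python
import numpy as np

def squared_error(y, f_hat):
    # sum over i of (y_i - f(x_i))^2
    return np.sum((y - f_hat) ** 2)

y = np.array([1.2, 0.7, 2.5])      # observed responses y_i
f_hat = np.array([1.0, 0.9, 2.4])  # fitted values f(x_i)
print(squared_error(y, f_hat))     # approx. 0.09 (= 0.04 + 0.04 + 0.01)
```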
To train a neural network for a qualitative response, we instead minimize the negative multinomial log-likelihood, also known as the cross-entropy. We see this explained in the multilayer neural network section.
Negative multinomial log-likelihood (to be minimized):
\[-\sum_{i=1}^n{\sum_{m=0}^K{y_{im}\log(f_m(x_i))}}\]
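A minimal sketch of this loss, where `probs[i, m]` plays the role of \(f_m(x_i)\) and `Y[i, m]` is the one-hot indicator \(y_{im}\); the names and numbers are our own illustration.

```python
import numpy as np

def cross_entropy(Y, probs):
    # -sum_i sum_m y_im * log(f_m(x_i)), with Y one-hot encoded
    return -np.sum(Y * np.log(probs))

Y = np.array([[1, 0, 0],
              [0, 0, 1]])            # two observations, three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])  # predicted class probabilities
print(cross_entropy(Y, probs))       # -(log 0.7 + log 0.6), approx. 0.867
```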
As deep learning models are able to fit flexible, squiggly functions to the data, the estimated parameters produce output values \(Z_m\) that are fed into a special softmax activation function, which converts them into class probabilities:
\[f_m(X)=\Pr(Y=m\mid X)=\frac{e^{Z_m}}{\sum_{k=0}^K{e^{Z_k}}}\]
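A minimal sketch of the softmax function, turning the output-layer values \(Z_m\) into probabilities that sum to one; the shift by the maximum is a standard numerical-stability trick, not part of the formula above, and the numbers are hypothetical.

```python
import numpy as np

def softmax(Z):
    # f_m = e^{Z_m} / sum_k e^{Z_k}
    Z = Z - np.max(Z)        # stability shift; does not change the result
    expZ = np.exp(Z)
    return expZ / np.sum(expZ)

Z = np.array([2.0, 1.0, 0.1])
p = softmax(Z)
print(p)        # approx. [0.659 0.242 0.099]
print(p.sum())  # approx. 1.0
```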