The number of neurons in the input/output layers is related to their application.
1 neuron input with 1 neuron output: Ex. drug dosage and binary response. Like linear regression
N neuron inputs, M neuron outputs: N number of features. M number of categories in classification.
* Example: classification of images into digits. 28x28 input neuron from a 28x28px image and 10 output networks for digits.

It takes several binary inputs \(x_1,x_2,x_3\) and produces a single binary output
Each input has an associated weight \(w_1,w_2,w_3\) indicating the importance of its input to the output.
To calculate the output: \[ output = \left\{ \begin{array}{ll} 0 & \text{if } \sum_jw_jx_j \leq \text{threshold}\\ 1 & \text{if } \sum_jw_jx_j \geq \text{threshold} \\ \end{array} \right. \]
A network of perceptrons could weigh up evidence and make decisions, like computing logical functions with binary operations such as AND, OR or NAND gates.


The output is defined by the sigmoid function:
\[\sigma(z)=\frac{1}{1+e^{-z}}\] \[\sigma(w\cdot x+b)=\frac{1}{1+exp(-\sum_jw_jx_j-b)}\]
Inputs \(x_j\) and single output in the \([0,1]\) range.
Weights, \(w_j\) tell us how important each input is.
Bias \(b\) tell us how high the sum needs to be to activate the neuron.
ANN learning
From: But what is a neural network? | Deep learning chapter 1
From: But what is a neural network? | Deep learning chapter 1 Matrix operations
Goal: find weights and biases so that the output of the network approximates \(f(x)\) for all training inputs.
To evaluate how well we’re achieving this goal, we define a cost function, also referred to as loss or objective function.
Common one: mean squared error
\[ C(w,b) = \frac{1}{2n}\sum_x ||y(x)-a||^2 \] We want \(C(w,b)\approx 0\), so we want to minimize the function
In \(x,y\), the slope of the derivative is the rate of change of a function at a specific point.

With partial derivatives using the chain rule (from Leibniz)
A very simple example:

We get the derivatives of the cost function with respect to each individual \(w\) and \(b\) and update them according to a learning rate.
We repeat until the change is really small or we reach some other condition.
Training Algorithm
What do we need?