The basis of all neural networks is a simplified model of a neuron in the human body.
Perceptron (neural network)
A perceptron is a function defined in two steps: first a weighted linear sum, then an activation function. To define this we need weights $w_i$ for $1 \le i \le n$, a threshold $\theta$, and an activation function $f$, giving $$P(x_1, \dots, x_n) = f\left(\sum_{i=1}^{n} w_i x_i - \theta\right).$$
Define two perceptrons (from before) $P$ and $Q$ with binary step activation, so that $P(x), Q(x) \in \{0, 1\}$, and then feed their outputs into a third perceptron. Now consider weights $w_P = w_Q = 1$ with threshold $\theta = 1.5$: the resulting network computes the conjunction $P(x) \wedge Q(x)$.
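A minimal Python sketch of this construction; the particular example perceptrons P and Q, and all weight values, are my own choices for illustration:

    # A minimal sketch of a binary step perceptron: output 1 iff the
    # weighted sum of the inputs meets the threshold theta.
    def perceptron(weights, theta, xs):
        return 1 if sum(w * x for w, x in zip(weights, xs)) >= theta else 0

    # Two example perceptrons P and Q over two inputs (arbitrary choices).
    P = lambda xs: perceptron([1, 1], 0.5, xs)    # behaves like OR on {0,1}^2
    Q = lambda xs: perceptron([1, -1], 0.5, xs)   # fires iff x1 and not x2

    # Conjunction of the two outputs: weights 1, 1 and threshold 1.5 give AND.
    P_and_Q = lambda xs: perceptron([1, 1], 1.5, [P(xs), Q(xs)])

    for xs in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(xs, P(xs), Q(xs), P_and_Q(xs))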
Perceptron rule
This is a way to train a perceptron that uses the binary step activation function. We want to classify training data $T = \{(x, y)\}$ where $x \in \mathbb{R}^n$ and $y \in \{0, 1\}$. We assume the perceptron we are training is a function defined by weights $w_1, \dots, w_n$ and threshold $\theta$. For this we require a sufficiently small learning rate $\eta > 0$. The idea is to replay the data, nudging the weights $w_i$ of the perceptron slowly towards values that split the data correctly.
To simplify the treatment of $\theta$ we just extend $x$ to $n + 1$ coordinates, where all of $T$ has a $-1$ (i.e. $x_{n+1} = -1$) in the additional position. Then we set $w_{n+1} = \theta$ and compare to zero, i.e. we check $\sum_{i=1}^{n+1} w_i x_i \ge 0$ instead of checking if $\sum_{i=1}^{n} w_i x_i \ge \theta$.
Then, checking a random training example $t = (x, y) \in T$ with the current weights, we set $\hat{y}_t = \mathbb{1}\left[\sum_{i=1}^{n+1} w_i x_i \ge 0\right]$, where $\mathbb{1}$ is the indicator function mapping true to $1$ and false to $0$. Then we change the weights by $\Delta w_i = \eta (y - \hat{y}_t) x_i$ and redefine $w_i := w_i + \Delta w_i$. Note as $y, \hat{y}_t \in \{0, 1\}$ we have $y - \hat{y}_t \in \{-1, 0, 1\}$, so correctly classified examples leave the weights unchanged.
Pseudocode
perceptron_rule(T):
  Input: Training data T = {((x_1, x_2, ..., x_n, x_{n+1} = -1), y)}.
  Output: Perceptron weights w_1, w_2, ..., w_n, w_{n+1} = theta.
  1. Set w_i = 1 for i = 1, 2, ..., n+1.
  2. hat{y}_t = set_y_hat(T, w)
  3. While there exists t in T such that hat{y}_t != y_t, letting count be the number of iterations so far:
     3.1. Choose a random t = ((x_1, ..., x_{n+1}), y) in T such that hat{y}_t != y_t.
     3.2. Set eta = 1/count.
     3.3. For i = 1, ..., n+1:
          3.3.1. Set delta w_i = eta (y - hat{y}_t) x_i.
          3.3.2. Set w_i = w_i + delta w_i.
     3.4. hat{y}_t = set_y_hat(T, w)
  4. Return w_1, ..., w_{n+1}.

set_y_hat(T, w):
  Input: Training data T = {((x_1, x_2, ..., x_n, x_{n+1} = -1), y)} and perceptron weights w_1, w_2, ..., w_n, w_{n+1} = theta.
  Output: For each t in T a new hat{y}_t.
  1. For t = ((x_1, ..., x_{n+1}), y) in T:
     1.1. Set hat{y}_t = bool(sum_{i=1}^{n+1} w_i x_i >= 0).
  2. Return all hat{y}_t.
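The pseudocode translates fairly directly into Python. This is a sketch under the same assumptions (binary labels, the extra $x_{n+1} = -1$ already appended, $\eta = 1/\text{count}$); the helper names are mine:

    import random

    def predict(w, x):
        # The binary step activation: 1 iff the weighted sum is >= 0.
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

    def perceptron_rule(T):
        # T is a list of (x, y) pairs where each x already ends in -1
        # and y is 0 or 1.  Returns weights w_1..w_{n+1}, the last
        # weight playing the role of theta.
        n1 = len(T[0][0])
        w = [1.0] * n1                   # step 1: initialise all weights to 1
        count = 0
        while True:
            count += 1
            # training examples the current weights misclassify
            wrong = [(x, y) for x, y in T if predict(w, x) != y]
            if not wrong:                # stop once every example is correct
                return w
            x, y = random.choice(wrong)  # step 3.1: pick a random mistake
            eta = 1.0 / count            # step 3.2: shrinking learning rate
            y_hat = predict(w, x)
            for i in range(n1):          # step 3.3: nudge each weight
                w[i] += eta * (y - y_hat) * x[i]

    # Hypothetical usage: learn AND on {0,1}^2 (the extra -1 is appended).
    T = [([0, 0, -1], 0), ([0, 1, -1], 0), ([1, 0, -1], 0), ([1, 1, -1], 1)]
    print(perceptron_rule(T))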
Runtime
To make sure $\eta$ is low enough we might reduce it slowly over time (though we have to make sure the sequence of $\eta$ values converges to zero whilst its sum diverges to infinity), and then stop when $\hat{y}_t = y_t$ for all our training data. In the pseudocode above I chose the sequence $\eta = 1/\text{count}$, where count is the iteration count. Note the loop terminates only if the training data is linearly separable; otherwise the perceptron rule can run forever.
Note the error $E$ is a function of the weights in this example. Then we can compute a local derivative to decide on the best direction in which to change the weights. This is very similar to the perceptron rule. To complete the training we would repeatedly travel in the direction opposite to the one found by this derivative. This method is called gradient descent.
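For concreteness, here is this derivative worked out for one standard choice of error function; the squared error over the unthresholded sums $a_t$ is my assumption here, as the original formula is not shown.

$$E(w) = \frac{1}{2} \sum_{t \in T} (y_t - a_t)^2, \qquad a_t = \sum_{i=1}^{n+1} w_i x_{t,i}$$

$$\frac{\partial E}{\partial w_i} = \sum_{t \in T} (y_t - a_t) \cdot (-x_{t,i}) = -\sum_{t \in T} (y_t - a_t)\, x_{t,i}$$

Travelling opposite to this derivative gives the update $\Delta w_i = \eta \sum_{t \in T} (y_t - a_t)\, x_{t,i}$, which has the same shape as the perceptron rule but uses the unthresholded sum $a_t$ in place of $\hat{y}_t$.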
Gradient descent
This algorithm uses differentiation to minimise the error term of a model. Suppose we have a model with some parameters $w$. Further suppose the error function $E$ is differentiable with respect to $w$. Then we set some learning rate $\eta$ and repeatedly perturb the weights by $\Delta w = -\eta \nabla_w E$ until $E$ settles in some local minimum.
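A minimal sketch of this loop in Python; the quadratic example error function and the fixed step count are arbitrary choices of mine for illustration:

    def gradient_descent(grad_E, w, eta=0.1, steps=1000):
        # Repeatedly perturb the weights by -eta * grad E(w).
        for _ in range(steps):
            w = [wi - eta * gi for wi, gi in zip(w, grad_E(w))]
        return w

    # Example: E(w) = (w_0 - 3)^2 + (w_1 + 1)^2 has gradient
    # (2(w_0 - 3), 2(w_1 + 1)) and a unique minimum at (3, -1).
    grad_E = lambda w: [2 * (w[0] - 3), 2 * (w[1] + 1)]
    print(gradient_descent(grad_E, [0.0, 0.0]))   # approaches [3, -1]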
Note the convergence is to a local minimum rather than a global minimum. There are techniques to get around this, but in general it is a hard problem to know whether you are at a global minimum.
Runtime
Runtime is affected by the complexity of the model. There are also different techniques for changing the learning rate over time.
Sigmoid

The sigmoid function $$\sigma(a) = \frac{1}{1 + e^{-a}}$$ looks similar to the binary step function, with one big advantage: it is differentiable.
Its derivative is $$\sigma'(a) = \sigma(a)\,(1 - \sigma(a)),$$ which shows the function is relatively stable when $\sigma(a)$ is close to either $0$ or $1$, but has the largest change when it is close to $1/2$.
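A quick numerical sketch of these two formulas in Python (the sample points are my choice):

    import math

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    def sigmoid_prime(a):
        # sigma'(a) = sigma(a) * (1 - sigma(a))
        s = sigmoid(a)
        return s * (1.0 - s)

    # The derivative peaks at a = 0 (where sigma = 1/2) and
    # flattens out in both tails.
    for a in (-4, -2, 0, 2, 4):
        print(a, round(sigmoid(a), 3), round(sigmoid_prime(a), 3))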
In the modelling framework this would simulate a model with real-valued inputs and outputs between $0$ and $1$.
Neural network

A neural network is a directed graph $G = (V, E)$ with a weight $w_{(u,v)}$ for each edge $(u, v) \in E$, together with distinguished input vertices $v_1, \dots, v_n$ and output vertices $o_1, \dots, o_m$. To run the model you compute a value $a_v$ for each vertex $v \in V$. For input $(x_1, \dots, x_n)$ we associate $x_i$ to vertex $v_i$. Then for all other $v \in V$, once all of $v$'s predecessors have values, we set $$a_v = \sigma\left(\sum_{(u, v) \in E} w_{(u,v)}\, a_u\right)$$ to generate the output vector $(a_{o_1}, \dots, a_{o_m})$.
Normally the topology of $G$ is chosen so that the vertices can be partitioned into layers $V = V_0 \cup V_1 \cup \dots \cup V_k$ for some $k$, where for all $(u, v) \in E$ we have $u \in V_i$ and $v \in V_{i+1}$ for some $i$. Here $V_0$ is the input layer and $V_k$ is the output layer; all other layers are called hidden layers.
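Here is a sketch of the forward pass for such a layered network in Python; the nested-list weight representation and the tiny 2 -> 2 -> 1 example are my own choices, and no thresholds are included beyond what the formula above shows:

    import math

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    def forward(weights, x):
        # Propagate input x through layers V_1, ..., V_k.
        # weights[l][j][i] is the weight on the edge from vertex i of
        # layer l to vertex j of layer l+1.
        values = list(x)                 # the values a_v of the current layer
        for W in weights:
            values = [sigmoid(sum(w_ji * a for w_ji, a in zip(row, values)))
                      for row in W]
        return values                    # the output vector

    # A tiny 2 -> 2 -> 1 network with arbitrary weights.
    W = [
        [[0.5, -0.6], [0.3, 0.8]],       # input layer -> hidden layer
        [[1.0, -1.0]],                   # hidden layer -> output layer
    ]
    print(forward(W, [1.0, 0.0]))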
Whilst it appears that a multilayer perceptron with sigmoid activation functions and two hidden layers has no restriction bias (it can approximate essentially arbitrary functions), once you have picked your hyperparameters, i.e. the number of layers and nodes, you have restricted the set of functions you can represent.
Normally when you start backpropagation the weights are given small random values. Random values help us avoid hitting the same local minima on each run, and small values correspond to low complexity.
Training neural networks normally has a preference bias for low-complexity networks over high-complexity ones.