Week 2 - Neural networks

The basis of all neural networks is a simplified model of a neuron in the human body.

Perceptron (neural network)

A perceptron is a function defined by two steps: first a weighted linear sum, second an activation function. To define this we need weights $w_1, \ldots, w_n \in \mathbb{R}$ and an activation function $g \colon \mathbb{R} \to \mathbb{R}$, where

$$f(x_1, \ldots, x_n) = g\left(\sum_{i=1}^{n} w_i x_i\right).$$

We define an activation function as follows.

Activation function

An activation function is applied in a perceptron after the weighted sum of the inputs. It is the non-linear term. Classic activation functions include the binary step and the sigmoid function.

First we will just take the function $g$ to be the binary step, which introduces an extra parameter $\theta$.

Binary step

The binary step is an activation function that thresholds its value against a parameter $\theta$:

$$g(x) = \begin{cases} 1 & \text{if } x \geq \theta, \\ 0 & \text{otherwise.} \end{cases}$$

Binary operations with perceptrons

Logical and ($\wedge$)

Let $n = 2$ and have $x_1, x_2 \in \{0, 1\}$, then if we set $w_1 = w_2 = 1$ and $\theta = 2$ then $f(x_1, x_2)$ becomes $x_1 \wedge x_2$.

Logical or ($\vee$)

Let $n = 2$ and have $x_1, x_2 \in \{0, 1\}$, then if we set $w_1 = w_2 = 1$ and $\theta = 1$ then $f(x_1, x_2)$ becomes $x_1 \vee x_2$.

Not

Let $n = 1$ and have $x_1 \in \{0, 1\}$, then if we set $w_1 = -1$ and $\theta = 0$ then $f(x_1)$ becomes $\neg x_1$.

Xor

XOR is not linearly separable, so no single perceptron computes it. Instead, define two perceptrons (from before), $f_{\vee}$ with $w_1 = w_2 = 1$, $\theta = 1$ and $f_{\wedge}$ with $w_1 = w_2 = 1$, $\theta = 2$, so that $f_{\vee}(x_1, x_2) = x_1 \vee x_2$ and $f_{\wedge}(x_1, x_2) = x_1 \wedge x_2$. Now consider $x_1 \oplus x_2 = (x_1 \vee x_2) \wedge \neg(x_1 \wedge x_2)$ and compute the conjunction with a third perceptron, fed by $f_{\vee}$ and the negation of $f_{\wedge}$.
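
To sanity-check these constructions, here is a small sketch in Python (the weights and thresholds are the ones chosen above; any values defining the same half-planes would also work):

def perceptron(weights, theta, xs):
    # Binary step applied to the weighted sum: 1 if sum >= theta, else 0.
    return 1 if sum(w * x for w, x in zip(weights, xs)) >= theta else 0

def AND(x1, x2):
    return perceptron([1, 1], 2, [x1, x2])

def OR(x1, x2):
    return perceptron([1, 1], 1, [x1, x2])

def NOT(x1):
    return perceptron([-1], 0, [x1])

def XOR(x1, x2):
    # XOR is not a single perceptron: compose OR with NOT AND.
    return AND(OR(x1, x2), NOT(AND(x1, x2)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b))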

Perceptron rule

This is a way to train a perceptron using the binary step activation function, to classify training data $T = \{(x, y)\}$ where $x \in \mathbb{R}^n$ and $y \in \{0, 1\}$. We assume the perceptron we are training is a function $f$ defined by weights $w_1, \ldots, w_n$ and threshold $\theta$. For this we require a sufficiently small learning rate $\eta > 0$. The idea is to replay the data, nudging the weights ($w_i$) of the perceptron slowly towards a correct split.

To simplify the treatment of $\theta$ we just extend $x$ to $(x_1, \ldots, x_n, x_{n+1})$ where all of $T$ has a $-1$ (i.e. $x_{n+1} = -1$) in the additional position. Then we set $w_{n+1} = \theta$ and compare to zero, i.e. we check $\sum_{i=1}^{n+1} w_i x_i \geq 0$ instead of checking if $\sum_{i=1}^{n} w_i x_i \geq \theta$.

Then, checking a random training example $t = (x, y) \in T$ with current weights, we set $\hat{y}_t = \mathbb{1}\left[\sum_{i=1}^{n+1} w_i x_i \geq 0\right]$, where $\mathbb{1}$ is the indicator function mapping true to $1$ and false to $0$. Then we change the weights by $\Delta w_i = \eta (y - \hat{y}_t) x_i$ and redefine $w_i = w_i + \Delta w_i$. Note as $y, \hat{y}_t \in \{0, 1\}$ we have $y - \hat{y}_t \in \{-1, 0, 1\}$.

Pseudocode

perceptron_rule(T):
	Input: Training data T = {((x_1, x_2, ..., x_n, x_{n+1} = -1), y)}
	Output: perceptron weights w_1, w_2, ...., w_n, w_{n+1} = theta.
1. Set w_i = 1 for i = 1, 2, ..., n+1.
2. hat{y}_t = set_y_hat(T, w_i)
3. While there exists t in T such that hat{y}_t not = y_t, and let count be the number of iterations:
	3.1. Choose random t in T such that hat{y}_t not = y_t
	3.2. Set eta = 1/count
	3.3. for i = 1, ... , n+1
		3.3.1. Set delta w_i = eta (y_t - hat{y}_t) x_i
		3.3.2. Set w_i = w_i + delta w_i
	3.4. hat{y}_t = set_y_hat(T, w_i)
4. Return w_i
 
set_y_hat(T, w_i):
	Input: Training data T = {((x_1, x_2, ..., x_n, x_{n+1} = -1), y)} and
		perceptron weights w_1, w_2, ...., w_n, w_{n+1} = theta.
	Output: For each t in T a new hat{y}_t
1. for t = ((x_1, ..., x_{n+1}), y) in T:
	1.1. Set hat{y}_t = bool(sum_{i=1}^{n+1} w_i x_i >= 0)
2. return all hat{y}_t
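
The pseudocode translates fairly directly into Python. This is a minimal sketch, assuming the data layout above (each input already extended with x_{n+1} = -1) and the eta = 1/count schedule:

import random

def set_y_hat(T, w):
    # Binary step against zero: the threshold theta is folded into the
    # weights as w[n] paired with the constant input x[n] = -1.
    return [int(sum(wi * xi for wi, xi in zip(w, x)) >= 0) for x, _ in T]

def perceptron_rule(T):
    w = [1.0] * len(T[0][0])          # step 1: all weights start at 1
    y_hat = set_y_hat(T, w)           # step 2
    count = 0
    while any(yh != y for yh, (_, y) in zip(y_hat, T)):   # step 3
        count += 1
        eta = 1.0 / count             # step 3.2: decreasing learning rate
        wrong = [i for i, (yh, (_, y)) in enumerate(zip(y_hat, T)) if yh != y]
        j = random.choice(wrong)      # step 3.1
        x, y = T[j]
        for i in range(len(w)):       # steps 3.3.1 and 3.3.2
            w[i] += eta * (y - y_hat[j]) * x[i]
        y_hat = set_y_hat(T, w)       # step 3.4
    return w

# Example: learn logical and (inputs extended with x_3 = -1).
data = [([0, 0, -1], 0), ([0, 1, -1], 0), ([1, 0, -1], 0), ([1, 1, -1], 1)]
print(perceptron_rule(data))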

Runtime

To make sure $\eta$ is low enough we might reduce it slowly over time (though we have to make sure the sequence of $\eta$'s converges to zero whilst the sum of the $\eta$'s diverges to infinity) and then stop when $\hat{y}_t = y_t$ for all our training data. In the example above I chose the sequence $\eta = 1/\text{count}$ for our $\eta$, where $\text{count}$ is the iteration count.

This will only stop if $T$ is linearly separable. Moreover, the perceptron rule using binary step converges in finite time if the dataset is linearly separable.

Gradient descent

With the perceptron rule we applied an activation function that made $f$ non-differentiable. Let's get rid of this for now and see if we can use differentiation to get another training method.

Let our training data be $T = \{(x, y)\}$ with $x \in \mathbb{R}^{n+1}$ (extended as before) and $y \in \{0, 1\}$, and let $w_1, \ldots, w_{n+1}$ be our weights for a perceptron as before. Then define the following error function:

$$E(w) = \frac{1}{2} \sum_{(x, y) \in T} \left(y - \sum_{i=1}^{n+1} w_i x_i\right)^2.$$

Note $E$ is a function of the weights $w$ in this example.

Then we can find the partial derivatives to decide on the best direction to change the weights:

$$\frac{\partial E}{\partial w_i} = \sum_{(x, y) \in T} \left(y - \sum_{j=1}^{n+1} w_j x_j\right)(-x_i).$$

This is very similar to the perceptron rule. Now to complete the training using this we would repeatedly travel in the direction opposite to the gradient found by this derivative. This method is called gradient descent.

Gradient descent

This algorithm uses differentiation to minimise error terms for models. Suppose we have a model $f$ with some parameters $w$. Further suppose the error function $E$ is differentiable with respect to $w$. Then we set some learning rate $\eta > 0$ and we perturb the weights by $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$ until $E$ settles in some local minimum.

Note the convergence is to a local minimum rather than a global minimum. There are techniques to get around this, but it is hard to know whether you are in a global minimum.

Run time

Run time can be affected by the complexity of the model. Also there are different techniques to change the learning rate over time.

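As a sketch, one batch gradient-descent step on the error function $E$ defined above looks like the following (Python; the fixed learning rate and the logical-and data are illustrative assumptions):

def gradient_descent_step(T, w, eta=0.05):
    # Batch update w_i <- w_i - eta * dE/dw_i for the squared error
    # E(w) = 1/2 * sum_{(x, y) in T} (y - sum_i w_i x_i)^2.
    grad = [0.0] * len(w)
    for x, y in T:
        a = sum(wi * xi for wi, xi in zip(w, x))  # linear output, no activation
        for i, xi in enumerate(x):
            grad[i] += (y - a) * (-xi)            # dE/dw_i contribution
    return [wi - eta * gi for wi, gi in zip(w, grad)]

# The squared error of a linear unit is convex, so repeated steps approach
# its (here global) minimum; thresholding the output then classifies.
T = [([0, 0, -1], 0), ([0, 1, -1], 0), ([1, 0, -1], 0), ([1, 1, -1], 1)]
w = [1.0, 1.0, 1.0]
for _ in range(500):
    w = gradient_descent_step(T, w)
print(w)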

Comparison

Whilst the perceptron rule converges in finite time for linearly separable datasets, it is unstable on datasets that are not linearly separable. The advantage of gradient descent is that it is stable on all datasets, but it has the issue of converging only to a local minimum.

Sigmoid function

The sigmoid function is an attempt to continuously approximate the binary step activation function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

Sigmoid function. (2024, January 18). In Wikipedia. https://en.wikipedia.org/wiki/Sigmoid_function

This function looks similar to the binary step function with one big advantage: it is differentiable, with

$$\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right),$$

which shows the function is relatively stable when $\sigma(x)$ is close to either $0$ or $1$, but will have the largest change when it is close to $1/2$.

This allows us to use gradient descent on a perceptron that is using the sigmoid function as its activation function.
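
A minimal sketch of the sigmoid and its derivative in Python:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # sigma'(x) = sigma(x) (1 - sigma(x)): near zero when sigma(x) is close
    # to 0 or 1, and largest at x = 0 where sigma(x) = 1/2.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_prime(0.0))  # 0.25, the maximum slope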

Neural network

A neural network consists of the following information:

  • A directed acyclic graph $G = (V, E)$ where the source vertices $I \subseteq V$ are called the input layer and the sink vertices $O \subseteq V$ are called the output layer, and
  • for each $v \in V \setminus I$ we have a perceptron $p_v$.

In the Modelling framework this would simulate something with input space $\mathbb{R}^{|I|}$ and output space $\mathbb{R}^{|O|}$. To run the model you compute a value $a_v$ for each vertex $v \in V$. For input $(x_1, \ldots, x_{|I|})$ we associate $x_i$ to the $i$-th input vertex $v_i \in I$, setting $a_{v_i} = x_i$. Then, for all $v \in V \setminus I$, once all of the predecessors $u$ with $(u, v) \in E$ have values, we set $a_v = p_v\left((a_u)_{(u, v) \in E}\right)$, to generate the output vector $(a_v)_{v \in O}$.

Normally the topology of $G$ is chosen so that the vertices can be partitioned into layers $V = L_0 \cup L_1 \cup \cdots \cup L_k$ for some $k$, where for all $(u, v) \in E$ we have $u \in L_i$ and $v \in L_{i+1}$ for some $i$. Here $L_0 = I$ is the input layer and $L_k = O$ is the output layer; all other layers are called hidden layers.

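To make the layer-by-layer computation concrete, here is a sketch of the forward pass in Python (the list-of-weight-matrices layout is an assumption for illustration, and thresholds are omitted for brevity):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, x):
    # layers[k][j] holds the weights of the j-th perceptron in layer k+1,
    # one weight per vertex of layer k; values flow layer by layer.
    a = list(x)
    for W in layers:
        a = [sigmoid(sum(w * ai for w, ai in zip(ws, a))) for ws in W]
    return a

# Two inputs -> two hidden perceptrons -> one output perceptron.
layers = [[[1.0, -1.0], [-1.0, 1.0]],
          [[1.0, 1.0]]]
print(forward(layers, [0.5, 0.25]))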

If we use the sigmoid function, or some other differentiable activation function, for each of the perceptrons' activation functions, then the function this neural network represents is also differentiable. In this case we can run gradient descent, which in this setting is called backward propagation of errors (back propagation).

Backward propagation of errors (Back propagation)

This is the specific implementation of gradient descent applied to neural networks. There are two stages to this calculation: a forward pass that computes the value at every vertex for the current weights, and a backward pass that propagates derivatives of the error from the output layer back through the network using the chain rule.

For this to work all the perceptrons need to have differentiable activation functions.

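A minimal sketch of both stages for a network with one hidden layer, in Python (the squared-error loss, network size, and learning rate are illustrative assumptions; thresholds use the $-1$ trick from before):

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_xor(steps=20000, eta=0.5, seed=0):
    random.seed(seed)
    # 2 inputs -> 2 hidden perceptrons -> 1 output; small random starting
    # weights (see "Starting weights" below); index 2 is the -1 threshold input.
    W1 = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
    W2 = [random.uniform(-0.5, 0.5) for _ in range(3)]
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    for _ in range(steps):
        x, y = random.choice(data)
        xs = x + [-1]
        # Forward stage: compute every vertex value.
        h = [sigmoid(sum(w * xi for w, xi in zip(ws, xs))) for ws in W1]
        hs = h + [-1]
        o = sigmoid(sum(w * hi for w, hi in zip(W2, hs)))
        # Backward stage: chain rule on E = (y - o)^2 / 2.
        d_o = (o - y) * o * (1 - o)
        d_h = [d_o * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
        for i in range(3):
            W2[i] -= eta * d_o * hs[i]
        for j in range(2):
            for i in range(3):
                W1[j][i] -= eta * d_h[j] * xs[i]
    return W1, W2

W1, W2 = train_xor()

Depending on the seed, this can settle in a local minimum that never separates XOR, which is exactly the local-versus-global issue discussed next.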

This will still have the issue of local optima versus the global optimum. Therefore there are some more advanced techniques people use:

  • Momentum: letting the weight updates build up speed in a consistent direction so the search can push through local minima (see the sketch after this list),
  • Higher order derivatives,
  • Random optimization, or
  • Penalty for complexity.
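
For instance, a common form of the momentum update looks like the following sketch (the decay factor 0.9 is an assumed typical value, not from these notes):

def momentum_step(w, grad, velocity, eta=0.05, alpha=0.9):
    # Keep a decaying running sum of past updates ("velocity") so the search
    # keeps moving in a consistent direction through shallow local minima.
    velocity = [alpha * v - eta * g for v, g in zip(velocity, grad)]
    return [wi + vi for wi, vi in zip(w, velocity)], velocity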

Complexity in a neural network is defined by:

  • more nodes,
  • more layers, and
  • larger values for the parameters (weights).

Bias for neural networks

Restriction bias

This is the set of functions neural networks can represent.

Whilst it appears that a multi-layer perceptron with sigmoid activation functions and two hidden layers has no restriction bias, once you have picked your hyperparameters, i.e. the number of layers and nodes, you have restricted the set of functions you can represent.

Danger of overfitting

It is important to use ideas like cross validation to avoid overfitting. For neural networks it is also important to avoid large weight values.

Preference bias

Starting weights

Normally when you start back propagation the weights start with small random values. Random values help us avoid the same local minima each run; small values have low complexity.

  • Training neural networks normally has a preference for low-complexity networks over high-complexity ones.
  • We prefer correct answers to incorrect ones.