11 Reinforcement learning rule
- Strengthen synapses responsible for behaviour that leads to a better-than-expected outcome.
- Weaken synapses responsible for behaviour that leads to a worse-than-expected outcome.
- Do not change synapses at all if the outcome was fully expected.
Whether or not an outcome was expected is captured by the prediction error, which is usually denoted by \(\delta\).
A simple RL learning rule can be obtained by modifying the simple Hebbian rule as follows:
\[ \begin{align} w_{ij}(n+1) &= w_{ij}(n) + \alpha A_j(n) A_i(n) \delta(n) \\ \end{align} \]
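As a rough illustration, here is a minimal sketch of this update in Python. The function name, the value of \(\alpha\), and the example activities are illustrative choices, not part of the notes:

```python
# Minimal sketch of the reward-modulated Hebbian update above.
# The function name and default alpha are illustrative choices.
def rl_hebbian_update(w, a_pre, a_post, delta, alpha=0.1):
    """w(n+1) = w(n) + alpha * A_j(n) * A_i(n) * delta(n)"""
    return w + alpha * a_post * a_pre * delta

w = 0.5
w = rl_hebbian_update(w, a_pre=1.0, a_post=1.0, delta=0.5)   # better than expected: strengthen
w = rl_hebbian_update(w, a_pre=1.0, a_post=1.0, delta=-0.5)  # worse than expected: weaken
w = rl_hebbian_update(w, a_pre=1.0, a_post=1.0, delta=0.0)   # fully expected: no change
```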
Outcomes are usually expressed in terms of reward, which is denoted below by \(r\).
To compute a prediction error, we simply need to know what reward we obtained – \(r_{\text{obtained}}\) – and what reward we predicted – \(r_{\text{predicted}}\). The prediction error is just the difference between these two things.
\[ \begin{align} \delta = r_{\text{obtained}} - r_{\text{predicted}} \end{align} \]
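For example, if the agent predicted a reward of 0.5 but obtained a reward of 1, then \(\delta = 1 - 0.5 = 0.5\): the outcome was better than expected, so the responsible synapses are strengthened.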
- \(r_{\text{obtained}}\) is specific to the agent's behaviour and the environment it is acting within. We will play around with the various ways this can be structured a bit later.
\[ \begin{align} r_{\text{obtained}} = \text{to be determined by the experiment} \end{align} \]
- \(r_{\text{predicted}}\) is something the agent learns – i.e., it is the agent’s estimate of the reward that will be obtained.
A good estimate of \(r_{\text{obtained}}\) is the sample mean of all previously obtained rewards.
\[ \begin{align} r_{\text{predicted}} = \frac{1}{n} \sum_{k=1}^{n} r_{\text{obtained}}(k) \end{align} \]
- The sample mean can be computed iteratively with the following:
\[ \begin{align} r_{\text{predicted}}(t) = r_{\text{predicted}}(t-1) + \gamma \delta \end{align} \]
Here \(\delta = r_{\text{obtained}}(t) - r_{\text{predicted}}(t-1)\). Choosing \(\gamma = 1/t\) reproduces the sample mean exactly, whereas a fixed \(\gamma\) gives a recency-weighted average of past rewards.
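Putting the pieces together, here is a minimal sketch of the iterative estimate in Python. The Bernoulli reward is just a stand-in, since the notes leave \(r_{\text{obtained}}\) to be determined by the experiment, and the value of \(\gamma\) is an illustrative choice:

```python
import random

gamma = 0.1          # learning rate (gamma = 1/t would reproduce the exact sample mean)
r_predicted = 0.0    # agent's initial estimate of the reward

for t in range(1, 101):
    r_obtained = 1.0 if random.random() < 0.7 else 0.0   # stand-in reward: assumed 70% chance of reward
    delta = r_obtained - r_predicted                      # prediction error
    r_predicted = r_predicted + gamma * delta             # iterative update of the estimate

print(round(r_predicted, 2))   # approaches ~0.7 as trials accumulate
```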