11 Reinforcement learning rule

  • Strengthen synapses responsible for behaviour that led to a better-than-expected outcome.

  • Weaken synapses responsible for behaviour that led to a worse-than-expected outcome.

  • Do not change synapses at all if the outcome was fully expected.

  • Whether an outcome was expected is captured by the prediction error, which is usually denoted by \(\delta\).

  • A simple RL learning rule can be obtained by modifying the simple Hebbian rule as follows:

\[ \begin{align} w_{ij}(n+1) = w_{ij}(n) + \alpha A_j(n) A_i(n) \delta(n) \end{align} \]
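As a rough sketch of how this update might be applied in code (Python/numpy here, with an assumed weight-matrix layout and an illustrative learning rate \(\alpha\)):

```python
import numpy as np

def rl_weight_update(w, a_pre, a_post, delta, alpha=0.1):
    """One step of the reward-modulated Hebbian rule.

    w      -- weight matrix w_ij, shape (n_post, n_pre)  (assumed layout)
    a_pre  -- presynaptic activities A_j, shape (n_pre,)
    a_post -- postsynaptic activities A_i, shape (n_post,)
    delta  -- scalar reward prediction error
    alpha  -- learning rate (value assumed for illustration)
    """
    # outer(a_post, a_pre)[i, j] = A_i * A_j; the Hebbian product is
    # scaled by the learning rate and gated by the prediction error.
    return w + alpha * np.outer(a_post, a_pre) * delta
```

Note that when \(\delta = 0\) (a fully expected outcome) the weights are left unchanged, exactly as the verbal rule above requires.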

  • Outcomes are usually expressed in terms of reward, which is denoted below by \(r\).

  • To compute a prediction error, we simply need to know what reward we obtained – \(r_{\text{obtained}}\) – and what reward we predicted – \(r_{\text{predicted}}\). The prediction error is just the difference between these two quantities.

\[ \begin{align} \delta = r_{\text{obtained}} - r_{\text{predicted}} \end{align} \]
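For example, if the agent predicted a reward of 0.5 but obtained a reward of 1, then \(\delta = 1 - 0.5 = 0.5\): the outcome was better than expected, so the synapses active on that trial are strengthened. Had it obtained 0 instead, \(\delta\) would be \(-0.5\) and those synapses would be weakened.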

  • \(r_{\text{obtained}}\) is specific to the agent's behaviour and the environment it is acting within. We will play around with the various ways this can be structured a bit later.

\[ \begin{align} r_{\text{obtained}} = \text{to be determined by the experiment} \end{align} \]

  • \(r_{\text{predicted}}\) is something the agent learns – i.e., it is the agent’s estimate of the reward that will be obtained.

  • A good estimate of \(r_{\text{obtained}}\) is the sample mean of all previously obtained rewards.

\[ \begin{align} r_{\text{predicted}} = \frac{1}{n} \sum_{k=1}^{n} r_{\text{obtained}}(k) \end{align} \]
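As a minimal sketch (Python, with a made-up reward history), the prediction is just the mean of the rewards seen so far:

```python
import numpy as np

# Hypothetical rewards obtained on the first n trials.
rewards_so_far = np.array([1.0, 0.0, 1.0, 1.0])

# The prediction for the next trial is the sample mean of past rewards.
r_predicted = rewards_so_far.mean()   # 0.75

# Prediction error if the next obtained reward turns out to be 1.0.
delta = 1.0 - r_predicted             # 0.25
```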

  • The sample mean can be computed iteratively with the following update, where \(\gamma\) is a step-size (learning-rate) parameter; choosing \(\gamma = 1/n\) on trial \(n\) reproduces the sample mean exactly, while a fixed \(\gamma\) weights recent rewards more heavily:

\[ \begin{align} r_{\text{predicted}}(t) = r_{\text{predicted}}(t-1) + \gamma \delta \end{align} \]
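A minimal sketch of the iterative form (Python; the reward sequence is made up). With \(\gamma = 1/n\) on trial \(n\), the loop recovers the batch sample mean from the previous equation:

```python
rewards = [1.0, 0.0, 1.0, 1.0]

r_predicted = 0.0                     # initial prediction (assumed)
for n, r_obtained in enumerate(rewards, start=1):
    delta = r_obtained - r_predicted  # prediction error on this trial
    gamma = 1.0 / n                   # 1/n step size -> exact sample mean
    r_predicted = r_predicted + gamma * delta

print(r_predicted)                    # 0.75, the mean of the rewards list
```

Replacing the \(1/n\) schedule with a small fixed \(\gamma\) turns this into a recency-weighted running average, which lets the prediction track rewards that change over time.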