
6.1 Rewards, punishment, reinforcement learning

Reward prediction errors

  • RPEs
    • The difference between the reward actually received and the reward that was expected
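    • In symbols (a standard formulation, assumed here rather than quoted from the lecture): $$\delta_{t}=r_{t}-\hat{r}_{t}$$ , i.e. the received reward minus the expected reward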

Lateral habenula: negative reward signals

  • No reward, or a negative RPE --> lateral habenula firing rate increases, dopamine neuron activity decreases
  • Positive RPE --> dopamine neuron activity increases

Blocking effect

  • Once a conditioned stimulus has been established, adding a second stimulus alongside it makes it hard for the added stimulus to become a new conditioned stimulus: the established stimulus already predicts the reward, so little prediction error remains to drive learning (see the sketch below)
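
A prediction-error account of blocking can be illustrated with a small Rescorla-Wagner-style simulation. This is only a sketch: the learning rate, reward magnitude, and trial counts are assumptions for illustration, not values from the lecture.

``` python
# Rescorla-Wagner-style sketch of blocking (alpha, reward size, and trial
# counts are illustrative assumptions, not values from the lecture).
alpha = 0.1                      # learning rate
w = {"A": 0.0, "B": 0.0}         # associative strengths of stimuli A and B

# Phase 1: stimulus A alone is repeatedly paired with a reward of 1.
for _ in range(200):
    rpe = 1.0 - w["A"]           # prediction error: reward minus prediction
    w["A"] += alpha * rpe        # A comes to fully predict the reward

# Phase 2: the compound A+B is paired with the same reward.
for _ in range(200):
    rpe = 1.0 - (w["A"] + w["B"])  # summed prediction is already ~1, so RPE ~0
    w["A"] += alpha * rpe
    w["B"] += alpha * rpe          # B learns almost nothing: blocking

print(w)   # w["B"] stays near 0, so B does not become a conditioned stimulus
```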

Reinforcement learning

  • Q-value: the expected cumulative discounted future reward that can be received if action $a_t$ is performed in state $s_t$ (a minimal update rule is sketched below). $$Q\left(s_{t}, a_{t}\right)=E\left[r(t)+\gamma r(t+1)+\gamma^{2} r(t+2)+\dots \mid s_{t}, a_{t}\right]$$
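
A minimal tabular Q-learning update that estimates this quantity might look like the following sketch; the learning rate alpha, discount gamma, and the action set are assumptions for illustration.

``` python
from collections import defaultdict

# Tabular Q-learning sketch (alpha, gamma, and the action set are assumptions).
alpha, gamma = 0.1, 0.9
actions = [0, 1]
Q = defaultdict(float)                      # Q[(state, action)] -> estimate

def q_update(state, action, reward, next_state):
    # One-step TD target: immediate reward plus the discounted value of the
    # best next action, bootstrapping the expectation in the Q-value definition.
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error

# Example: one observed transition (state 0, action 1) -> reward 1, state 2.
q_update(0, 1, 1.0, 2)
```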

The exploration-exploitation dilemma

  • Exploitation:
    • harvesting the maximum reward given current knowledge
  • Exploration:
    • harvesting information, which may lead to even higher rewards later
  • Several strategies address this dilemma (sketched in code after this list):
    • Greedy strategy
      • always choose the action with the highest estimated reward (Q-value)
    • $\epsilon$ -greedy strategy
      • choose the best-valued action most of the time, but with a small probability $\epsilon$ choose another action (exploration)
    • Optimistic greedy
      • start with Q-values that are larger than realistically achievable, so every action looks promising and gets tried before the estimates settle
    • Softmax action selection
      • use a softmax over the Q-values to set the probability of choosing each action, trading off exploration and exploitation
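
The strategies above can be sketched as action-selection rules over a table of Q-values. The Q-values, $\epsilon$, and the temperature tau below are illustrative assumptions.

``` python
import math
import random

# Illustrative Q-values; epsilon and tau are assumed parameters.
Q = {"left": 1.0, "right": 0.5}

def greedy(Q):
    # Always pick the action with the highest estimated Q-value.
    return max(Q, key=Q.get)

def epsilon_greedy(Q, epsilon=0.1):
    # With probability epsilon pick a random action (explore),
    # otherwise pick the greedy action (exploit).
    if random.random() < epsilon:
        return random.choice(list(Q))
    return greedy(Q)

def softmax_action(Q, tau=1.0):
    # Choose each action with probability proportional to exp(Q/tau);
    # a higher temperature tau means more exploration.
    weights = [math.exp(q / tau) for q in Q.values()]
    return random.choices(list(Q), weights=weights, k=1)[0]

print(greedy(Q), epsilon_greedy(Q), softmax_action(Q))
```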