6.1 Rewards, punishment, reinforcement learning
Reward prediction errors
Lateral habenula: negative reward signals
No reward, or a negative RPE --> lateral habenula firing rate increases, dopamine neuron activity decreases
Positive RPE --> dopamine neuron activity increases
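The sign conventions above can be summarized in a tiny sketch. The function and the numeric values are illustrative assumptions, not data from the lecture; the point is only that the RPE is reward minus expectation.

```python
# Sketch of a reward prediction error (RPE); the numbers are made up.
def rpe(reward: float, expected: float) -> float:
    """RPE = actual reward - expected reward."""
    return reward - expected

# Reward better than expected -> positive RPE -> dopamine activity increases
assert rpe(1.0, 0.2) > 0
# Expected reward omitted -> negative RPE -> lateral habenula activity
# increases, dopamine activity decreases
assert rpe(0.0, 0.8) < 0
```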
Blocking effect
Once one conditioned stimulus has been established, adding a second stimulus alongside it makes it hard to establish that second stimulus as a new conditioned stimulus, because the reward is already fully predicted and there is little prediction error left to drive learning
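Blocking falls out of error-driven learning: if the compound stimulus already predicts the reward, the prediction error is near zero and the added stimulus acquires almost no associative strength. A minimal Rescorla-Wagner-style simulation sketches this; the learning rate, trial counts, and stimulus names are arbitrary assumptions.

```python
# Minimal Rescorla-Wagner sketch of the blocking effect (illustrative
# parameters; lr = learning rate).
def train(weights, stimuli, reward=1.0, lr=0.3, trials=50):
    for _ in range(trials):
        prediction = sum(weights[s] for s in stimuli)
        error = reward - prediction          # shared prediction error
        for s in stimuli:
            weights[s] += lr * error         # each present cue gets the update
    return weights

w = {"A": 0.0, "B": 0.0}
train(w, ["A"])            # phase 1: A alone is paired with reward
train(w, ["A", "B"])       # phase 2: compound AB gets the same reward
# A already predicts the reward, so the error is ~0 and B learns little
print(w)                   # w["A"] ≈ 1.0, w["B"] ≈ 0.0
```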
Reinforcement learning
Q-value: the expected cumulative future reward received if action $a_t$ is performed in state $s_t$.
$$Q\left(s_{t}, a_{t}\right)=E\left[r(t)+\gamma r(t+1)+\gamma^{2} r(t+2)+... | s_{t}, a_{t}\right]$$
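The Q-value definition above can be estimated by tabular Q-learning, where each sample update moves $Q(s_t, a_t)$ toward $r_t + \gamma \max_a Q(s_{t+1}, a)$. The toy two-state environment and all parameters below are illustrative assumptions:

```python
# Tabular Q-learning sketch for the Q-value defined above.
# Toy chain: action 1 moves to state 1, which pays reward 1 (an assumption).
import random

random.seed(0)
gamma, alpha = 0.9, 0.1                  # discount factor, learning rate
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def step(state, action):
    next_state = action                  # hypothetical dynamics
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

s = 0
for _ in range(2000):
    a = random.choice((0, 1))            # act randomly to keep exploring
    s_next, r = step(s, a)
    # TD target: immediate reward plus discounted best future Q-value
    target = r + gamma * max(Q[(s_next, b)] for b in (0, 1))
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s_next
```

After training, action 1 (which leads to the rewarding state) has the higher Q-value in both states, matching the expected-cumulative-reward definition.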
The Exploration-exploitation dilemma
Exploitation: choosing the action currently believed to be best, harvesting known reward
Exploration: harvesting information; may lead to even higher rewards later
Several strategies model this dilemma:
Greedy strategy
always choose the action with the highest estimated reward
$\epsilon$ -greedy strategy
with high probability choose the best-known action; with a small probability $\epsilon$ choose another action at random (exploration)
Optimistic greedy
start with initial Q-values that are deliberately too large, so every action gets tried before the estimates settle down
Softmax action selection
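The action-selection strategies above can be sketched side by side. The Q-values and temperature below are made-up illustrative numbers; softmax picks each action with probability proportional to $e^{Q(a)/\tau}$.

```python
# Sketch of greedy, epsilon-greedy, and softmax action selection
# (illustrative Q-values and parameters).
import math
import random

Q = [1.0, 2.0, 0.5]                       # hypothetical action values

def greedy(Q):
    # always exploit: pick the action with the highest Q-value
    return max(range(len(Q)), key=lambda a: Q[a])

def epsilon_greedy(Q, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(Q))   # explore uniformly at random
    return greedy(Q)                      # otherwise exploit

def softmax(Q, tau=1.0):
    # P(a) = exp(Q[a]/tau) / sum_b exp(Q[b]/tau); tau is the temperature
    weights = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=weights)[0]

assert greedy(Q) == 1                     # action 1 has the largest Q-value
```

Higher temperatures make softmax choices more uniform (more exploration); lower temperatures approach greedy selection.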