MDP
$\max_\pi \mathbb{E}_\pi[G_t]$
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
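For reference, the return $G_t$ above is the discounted sum of future rewards (the standard definition, stated here since the notes do not repeat it):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$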
connect all state values
$v_\pi(s_i) = \mathbb{E}_\pi[G_t \mid s_i] = \sum_{a} \pi(a \mid s_i)\, q_\pi(s_i, a) = \sum_{a} \pi(a \mid s_i)\, \mathbb{E}_\pi[G_t \mid s_i, a]$
For any optimal policy $\pi_*$, $\forall s \in S$, $\forall a \in A$:
$v_*(s) = \max_a q_*(s, a)$
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma v_*(s') \big]$
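For illustration only (the next line notes that the dynamics are usually unknown), here is a minimal value-iteration sketch that applies these Bellman optimality backups when $p(s', r \mid s, a)$ *is* available; the transition-table format `P[s][a] = [(prob, next_state, reward), ...]` is an assumption.

```python
import numpy as np

def value_iteration(P, gamma=0.99, tol=1e-6):
    """Repeated Bellman optimality backups for a known tabular MDP.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples,
    i.e. an explicit model of p(s', r | s, a) (hypothetical format).
    """
    V = np.zeros(len(P))
    while True:
        V_new = np.array([
            max(                                    # v*(s) = max_a q*(s, a)
                sum(prob * (r + gamma * V[s_next])  # q*(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma v*(s')]
                    for prob, s_next, r in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```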
We do not know $p(s', r \mid s, a)$
Goal: learn $\pi \approx \pi_*$ from sampled experience
$\mathbb{E}_\pi[G_t \mid S_t = s] \approx \dfrac{1}{C(s)} \sum_{m=1}^{M} \sum_{\tau=0}^{T_m - 1} \mathbb{I}(s^m_\tau = s)\, g^m_\tau$, where $C(s)$ counts the visits to $s$ across the $M$ episodes
* step size $\alpha$ for the update rule:
$V(s^m_t) \leftarrow V(s^m_t) + \alpha \big( g^m_t - V(s^m_t) \big)$
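A minimal every-visit Monte Carlo sketch of this update, assuming episodes are given as lists of (state, reward) pairs; the function name and episode format are illustrative.

```python
from collections import defaultdict

def mc_value_estimation(episodes, alpha=0.05, gamma=1.0):
    """Every-visit Monte Carlo with a constant step size alpha.

    episodes: list of episodes, each a list of (state, reward) pairs, where
    reward is the reward received after leaving that state (assumed format).
    """
    V = defaultdict(float)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the discounted return g_t^m.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # V(s_t^m) <- V(s_t^m) + alpha * (g_t^m - V(s_t^m))
            V[state] += alpha * (g - V[state])
    return V
```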
we must explore all state-action pairs
we must exploit known high-value pairs
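One common way to balance the two (not specified in these notes, so treat it as an assumed choice) is an $\varepsilon$-greedy policy over the current action-value estimates:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit
    the action with the highest current estimate Q[(state, action)]."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```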
Monte Carlo (MC) solving the blackjack game
image credit: Mutual Information
10 million games played
Markov Reward Process: A Markov decision process, but w/o actions
MC requires an episode to complete before updating
but what if an episode is long?
Replace $g^m_t$ with
$g^m_{t:t+n} = r^m_{t+1} + \gamma r^m_{t+2} + \cdots + \gamma^{n-1} r^m_{t+n} + \gamma^n V(s^m_{t+n})$
updates are applied during the episode with an n-step delay
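A sketch of n-step TD prediction applying this update online with the n-step delay; the Gym-style `env.reset()`/`env.step()` interface and the `policy(state)` callable are assumptions.

```python
def n_step_td(env, policy, V, n=4, alpha=0.1, gamma=0.99, episodes=100):
    """n-step TD prediction: update V(s_tau) once the n-step return is known.

    Assumes a Gym-like env whose step() returns (state, reward, done, info),
    a policy(state) -> action, and V as a dict of value estimates (updated in place).
    """
    for _ in range(episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[i] is the reward entering states[i]
        T, t = float("inf"), 0
        while True:
            if t < T:
                s_next, r, done, _ = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # the time step being updated (n-step delay)
            if tau >= 0:
                # g_{tau:tau+n}: discounted rewards plus a bootstrap from V(s_{tau+n})
                g = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    g += gamma ** n * V.get(states[tau + n], 0.0)
                s_tau = states[tau]
                V[s_tau] = V.get(s_tau, 0.0) + alpha * (g - V.get(s_tau, 0.0))
            if tau == T - 1:
                break
            t += 1
    return V
```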
Compared to MC, TD has lower variance (at the cost of bias from bootstrapping) and can update before the episode ends
$r^m_{t+1} + \gamma \max_a Q(s^m_{t+1}, a)$
updates $Q$ after each SARSA tuple (each n-step delay)
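A minimal one-step tabular backup using this target; the dict-of-tuples representation of $Q$ is an assumption.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One backup toward the target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```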
$v_\pi(s) \approx \hat{v}(s, w), \quad w \in \mathbb{R}^d$
* caution: updating $w$ updates the values of many states $s$, not just the “visited states”
$\overline{VE}(w) = \sum_{s \in S} \mu(s)\,\big[ v_\pi(s) - \hat{v}(s, w) \big]^2$
$w \leftarrow w + \alpha \big[ U_t - \hat{v}(S_t, w) \big] \nabla \hat{v}(S_t, w)$
To find the target $U_t$, we can use the Monte Carlo return $G_t$ or bootstrap with a TD target such as $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, w)$
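A sketch of this update with a linear approximator $\hat{v}(s, w) = w^\top x(s)$ and the TD(0) target for $U_t$; the `features(state)` helper and Gym-style interface are assumptions.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, d, alpha=0.01, gamma=0.99, episodes=100):
    """Semi-gradient TD(0) with a linear value function v_hat(s, w) = w . x(s),
    for which grad_w v_hat(s, w) = x(s).

    Assumes a Gym-like env, a policy(state) -> action, and
    features(state) -> np.ndarray of length d (hypothetical helpers).
    """
    w = np.zeros(d)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done, _ = env.step(policy(s))
            x = features(s)
            # Target U_t: bootstrap with r + gamma * v_hat(S_{t+1}, w); just r at termination.
            u = r if done else r + gamma * (w @ features(s_next))
            # w <- w + alpha * [U_t - v_hat(S_t, w)] * grad v_hat(S_t, w)
            w += alpha * (u - w @ x) * x
            s = s_next
    return w
```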