Markov Decision Process

The Markov Decision Process (MDP) essentially defines the standard formulation of reinforcement learning. With its symbolic representation, almost any sequential decision problem can be translated into an MDP, which makes the MDP the best starting point for learning reinforcement learning.

Definition of Reinforcement Learning#

The usual neural network training process feeds data to the model directly; its goal is to improve perception (as in a CNN) rather than decision-making capability. Reinforcement Learning (RL) emerged to let an agent learn from feedback given by the environment. Guided by the environment's rewards, the decisions made by the agent can gradually meet human expectations. In machine learning terms, given a trainable agent, we hope it can find the optimal strategy for a certain decision problem.

Formally, any reinforcement learning problem can be viewed as an MDP, and the solution to that MDP is also the solution to the RL problem.

Markov Process#

In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, “What happens next depends only on the state of affairs now.” A countably infinite sequence, in which the chain moves state at discrete time steps, gives a discrete-time Markov chain (DTMC). A continuous-time process is called a continuous-time Markov chain (CTMC). Markov processes are named in honor of the Russian mathematician Andrey Markov1.

A Markov process is a stochastic process that satisfies the Markov property, which is simply "memorylessness": predictions about future outcomes can be made based solely on the present state, and, most importantly, such predictions are just as good as the ones that could be made knowing the process's full history. Formally:

$$P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, s_2, \ldots, s_t).$$
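To make the property concrete, here is a minimal Python sketch (the two-state weather chain and its numbers are made up for illustration): the next state is sampled using only the current state, never the history.

```python
import random

# Hypothetical two-state Markov chain: states "sunny" and "rainy".
# transition[s][s2] = P(next state = s2 | current state = s); rows sum to 1.
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """Sample the next state using only the current state (Markov property)."""
    nxt, probs = zip(*transition[state].items())
    return random.choices(nxt, weights=probs)[0]

state = "sunny"
trajectory = [state]
for _ in range(10):
    state = step(state)
    trajectory.append(state)
print(trajectory)
```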

Markov Decision Process#

Formally, an MDP is a 4-tuple $(S, A, P_a, R_a)$ (a small code sketch of this tuple follows the list), where:

  • $S$: the state space.
  • $A$: the action space ($A_s$ is the set of actions available from state $s$).
  • $P_a(s, s')$: the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$. In general, $\Pr(s_{t+1} \in S' \mid s_t = s, a_t = a) = \int_{S'} P_a(s, s')\,ds'$ for every $S' \subseteq S$.
  • $R_a(s, s')$: the immediate reward received after transitioning from state $s$ to state $s'$ due to action $a$.
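As a hedged sketch of how this 4-tuple might look in code, the following uses a made-up two-state, two-action MDP; the states, actions, probabilities, and rewards are all illustrative, not taken from the post.

```python
# Hypothetical MDP with S = {0, 1} and A = {"stay", "move"}.
S = [0, 1]
A = ["stay", "move"]

# P[a][s][s2] = probability of reaching s2 after taking a in s (each row sums to 1).
P = {
    "stay": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},
    "move": {0: {0: 0.2, 1: 0.8}, 1: {0: 0.8, 1: 0.2}},
}

# R[a][s][s2] = immediate reward for the transition s -> s2 under action a.
R = {
    "stay": {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    "move": {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}},
}
```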

Usually, the optimization objective of an MDP is to find a good policy for the decision maker. To that end, there exists a policy function $\pi$, a (potentially probabilistic) mapping from $S$ to $A$. Once combined with the MDP, the policy fixes how the resulting system behaves. The objective is to choose a policy $\pi$ that maximizes some cumulative function of the random rewards:

$$E_{a_t \sim \pi(s_t),\, s_{t+1} \sim P_{a_t}(s_t, s_{t+1})}\Big[\sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})\Big],$$

where $\gamma$ is the discount factor satisfying $0 \leq \gamma \leq 1$, which is usually close to 1 (for example, $\gamma = 1/(1+r)$ for some discount rate $r$). A lower discount factor motivates the decision maker to favor taking actions early, rather than postpone them indefinitely2.
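In code, the quantity inside the expectation for a single sampled trajectory is just a discounted sum. A minimal sketch (the reward sequence and $\gamma$ are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum_t gamma**t * r_t for one sampled trajectory of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Later rewards count for less because gamma < 1.
print(discounted_return([1.0, 0.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.81 + 0.729
```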

The algorithms that compute optimal policies for finite state and action MDPs require storage for two arrays indexed by state: the value $V$ and the policy $\pi$. At the end of the algorithm, $\pi$ will contain the solution and $V(s)$ will contain the discounted sum of the rewards to be earned by following that solution from state $s$.

To simplify the presentation, we introduce some definitions and notation that will help later:

  • $R(s)$: the expected reward received at the next time step from state $s$, $R(s) = E_{a \in A_s}[R_a(s, s')]$.
  • $G_t$: the return, the sum of discounted rewards from state $s_t$ onward in a Markov chain, $G_t = \sum_{i=0}^{\infty} \gamma^i R(s_{t+1+i})$.

Usually, the state-indexed value function $V(s)$ is called the state value function; it is the expected return from a given state $s$ in a Markov chain, $V(s) = E[G_t \mid S_t = s]$.
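One hedged way to make $G_t$ and $V(s)$ concrete is to compute returns backward along sampled episodes, using $G_t = R_{t+1} + \gamma G_{t+1}$, and then average them per state; this is a basic Monte Carlo estimate of $V(s)$ (the episodes below are made up).

```python
from collections import defaultdict

def returns_from_episode(rewards, gamma=0.9):
    """G_t for every step t, computed backward with G_t = r_{t+1} + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def monte_carlo_v(episodes, gamma=0.9):
    """Average the returns observed from each visited state."""
    totals, counts = defaultdict(float), defaultdict(int)
    for states, rewards in episodes:
        for s, G in zip(states, returns_from_episode(rewards, gamma)):
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Hypothetical episodes: (visited states, reward received after each state).
episodes = [([0, 1, 1], [0.0, 1.0, 1.0]), ([1, 0, 1], [1.0, 0.0, 1.0])]
print(monte_carlo_v(episodes))
```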

Bellman Equation#

A Bellman equation, named after Richard E. Bellman, is a technique in dynamic programming that breaks an optimization problem into a sequence of simpler subproblems, as Bellman's "principle of optimality" prescribes3.

Introducing the Bellman equation is what makes a dynamic-programming solution possible. The state value function can be written as:

$$
\begin{aligned}
v(s_t) &= E[G_t \mid S_t = s_t]\\
&= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s_t]\\
&= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s_t]\\
&= E[R_{t+1} + \gamma v(s_{t+1}) \mid S_t = s_t]\\
&= \underbrace{E[R_{t+1} \mid S_t = s_t]}_{\text{current reward}} + \underbrace{\gamma E[v(s_{t+1}) \mid S_t = s_t]}_{\text{expected value at the next time step}}
\end{aligned}
$$
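Because $v$ appears on both sides, this equation can be solved by simple fixed-point iteration. A minimal sketch on a hypothetical two-state Markov reward process, using the $R(s)$ notation defined earlier (all numbers illustrative):

```python
# Hypothetical Markov reward process: P[s][s2] transition probs, R[s] expected reward.
P = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
R = {0: 0.0, 1: 1.0}
gamma = 0.9

v = {s: 0.0 for s in P}  # initial guess
for _ in range(200):     # repeat the Bellman update until it converges
    v = {s: R[s] + gamma * sum(p * v[s2] for s2, p in P[s].items()) for s in P}
print(v)
```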

Algorithm#

The mapping description is less expressive from a statistical perspective. Here we need a slight abuse of notation: $\pi(a \mid s)$, $a \in A_s$, denotes the probability of taking action $a$ in state $s$ (so $\pi(s)$ can be read as the probability distribution over the actions available at state $s$).
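Concretely, such a stochastic policy can be stored as one probability distribution per state and sampled from; a small sketch with made-up numbers:

```python
import random

# Hypothetical stochastic policy: pi[s][a] = pi(a | s); each row sums to 1.
pi = {
    0: {"stay": 0.3, "move": 0.7},
    1: {"stay": 0.9, "move": 0.1},
}

def sample_action(s):
    """Draw an action a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

print(sample_action(0))
```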

NOTE

The policy defines the complete behaviour of the agent, including every possible action and its probability.

The policy depends only on the current state, not on time or on the history.

The agent can update the policy over time.

Under a given policy $\pi$, the agent follows a fixed pattern (a small code sketch of both quantities follows this list):

  • State Transition Probability: $P_\pi(s, s') = \sum_{a \in A_s} \pi(a \mid s)\, P_a(s, s')$
  • Reward Function: $R_\pi(s) = \sum_{a \in A_s} \pi(a \mid s)\, R_a(s, s')$
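Given $P_a$, $R_a$, and $\pi$, both quantities are plain $\pi$-weighted sums. In the sketch below (the same kind of made-up two-state MDP as before), the reward term is additionally averaged over the next state $s'$, since $R_a(s, s')$ depends on it.

```python
# Hypothetical two-state MDP and stochastic policy (illustrative numbers only).
P = {"stay": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},
     "move": {0: {0: 0.2, 1: 0.8}, 1: {0: 0.8, 1: 0.2}}}
R = {"stay": {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
     "move": {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}}}
pi = {0: {"stay": 0.3, "move": 0.7}, 1: {"stay": 0.9, "move": 0.1}}

def P_pi(s, s2):
    """P_pi(s, s') = sum_a pi(a|s) * P_a(s, s')."""
    return sum(pi[s][a] * P[a][s][s2] for a in pi[s])

def R_pi(s):
    """R_pi(s) = sum_a pi(a|s) * E_{s'}[R_a(s, s')] (reward averaged over s')."""
    return sum(pi[s][a] * sum(P[a][s][s2] * R[a][s][s2] for s2 in P[a][s])
               for a in pi[s])

print(P_pi(0, 1), R_pi(0))
```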

With the policy involved, we can express the value for the agent from either the state perspective or the action perspective:

  • $v_\pi(s) = E_{a \sim \pi(s)}[G_t \mid S_t = s]$
  • $q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$

Still, with the Bellman equation, we have $q_\pi(s, a) = E_{s' \sim P_a(s, \cdot)}\big[R_a(s, s') + \gamma \sum_{a' \in A_{s'}} \pi(a' \mid s')\, q_\pi(s', a')\big]$.

By the law of total expectation, we also have $v_\pi(s) = E_{a \sim \pi(s)}[q_\pi(s, a)]$.
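A small sketch of these two relations: perform one Bellman backup of a current estimate of $v_\pi$ to obtain $q_\pi(s, a)$, then take the $\pi$-weighted average of $q_\pi(s, \cdot)$ to get back a state value (the MDP, the policy, and the estimate of $v_\pi$ are all made up).

```python
# Hypothetical MDP, policy, and a current estimate of v_pi (all made up).
P = {"stay": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},
     "move": {0: {0: 0.2, 1: 0.8}, 1: {0: 0.8, 1: 0.2}}}
R = {"stay": {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
     "move": {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}}}
pi = {0: {"stay": 0.3, "move": 0.7}, 1: {"stay": 0.9, "move": 0.1}}
gamma = 0.9
v_pi = {0: 4.0, 1: 6.0}  # pretend this estimate came from an earlier evaluation

def q(s, a):
    """One Bellman backup: q_pi(s, a) = E_{s'}[R_a(s, s') + gamma * v_pi(s')]."""
    return sum(P[a][s][s2] * (R[a][s][s2] + gamma * v_pi[s2]) for s2 in P[a][s])

def v_from_q(s):
    """v_pi(s) = E_{a ~ pi(s)}[q_pi(s, a)]."""
    return sum(pi[s][a] * q(s, a) for a in pi[s])

print(q(0, "move"), v_from_q(0))
```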

Solving a reinforcement learning problem means finding an optimal policy that lets the agent consistently gain more from interacting with the environment than any other policy would. This optimal policy is denoted $\pi^*$. In practice we may only find a solution that is optimal relative to the other policies considered. Naturally, we look for the largest expectation of the state value function, i.e. $\pi^* = \arg\max_\pi E_{s \in S}[v_\pi(s)]$.

With the Bellman equation, the problem can be solved through the Bellman optimality equation. The algorithm alternates between two steps: (1) a value update and (2) a policy update:

$$
\begin{aligned}
V(s) &:= \sum_{s'} P_{\pi(s)}(s, s')\big(R_{\pi(s)}(s, s') + \gamma V(s')\big)\\
\pi(s) &:= \arg\max_a \Big\{\sum_{s'} P_a(s, s')\big(R_a(s, s') + \gamma V(s')\big)\Big\}
\end{aligned}
$$
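A runnable sketch of this two-step loop on a hypothetical two-state, two-action MDP (illustrative numbers only): step (1) updates the values under the current policy with one Bellman backup per state, and step (2) makes the policy greedy with respect to the new values.

```python
# Hypothetical two-state, two-action MDP (illustrative numbers only).
P = {"stay": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},
     "move": {0: {0: 0.2, 1: 0.8}, 1: {0: 0.8, 1: 0.2}}}
R = {"stay": {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
     "move": {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}}}
states, actions, gamma = [0, 1], ["stay", "move"], 0.9

def backup(s, a, V):
    """sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))."""
    return sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in states)

V = {s: 0.0 for s in states}
policy = {s: "stay" for s in states}
for _ in range(200):
    # (1) Value update under the current policy.
    V = {s: backup(s, policy[s], V) for s in states}
    # (2) Policy update: act greedily with respect to the new V.
    policy = {s: max(actions, key=lambda a: backup(s, a, V)) for s in states}

print(V, policy)
```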

Example#

A classic example of an MDP is the pole-balancing (cart-pole) model, which comes from classical control theory. In this example, the state describes the cart and the pole (position, velocity, pole angle, angular velocity), the actions push the cart left or right, and a reward is received for every time step the pole stays upright.
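As an illustration only, and assuming the third-party `gymnasium` package with its `CartPole-v1` environment, the standard interaction loop looks like the sketch below; the agent here acts randomly, purely to show the MDP loop of state, action, and reward.

```python
import gymnasium as gym  # assumed dependency: pip install gymnasium

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # random policy, just to show the loop
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:         # pole fell over or time limit reached
        obs, info = env.reset()
print(total_reward)
```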

Footnotes#

  1. wiki/Markov_chain.

  2. wiki/Markov_decision_process.

  3. wiki/Bellman_equation.
