The Markov Decision Process (MDP) defines the standard formalism of reinforcement learning. With a symbolic representation, almost any sequential decision problem can be translated into an MDP, which makes the MDP the best starting point for learning reinforcement learning.
Definition of Reinforcement Learning
Ordinary neural network training feeds data to the model directly; the goal is to improve its perception (as with a CNN) rather than its decision-making capability. Reinforcement Learning (RL) emerged to let an agent learn from feedback provided by the environment. Guided by the reward from the environment, the decisions made by the agent can gradually meet human expectations. In machine learning terms, given a trainable agent, we hope it can find the optimal strategy for a given decision problem.

Formally, any reinforcement learning problem can be modeled as an MDP, and the solution to that MDP is also the solution to the RL problem.
Markov Process
In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, “What happens next depends only on the state of affairs now.” A countably infinite sequence, in which the chain moves state at discrete time steps, gives a discrete-time Markov chain (DTMC). A continuous-time process is called a continuous-time Markov chain (CTMC). Markov processes are named in honor of the Russian mathematician Andrey Markov1.
A Markov process is a stochastic process that satisfies the Markov property, which is simply "memorylessness": predictions about future outcomes can be made based solely on the present state, and such predictions are just as good as the ones that could be made knowing the process's full history. Formally:

$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, S_2, \dots, S_t).$$
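To make the memorylessness concrete, here is a minimal Python sketch; the states and the transition matrix are invented purely for illustration. The next state is sampled from a distribution that depends only on the current state, never on the rest of the trajectory.

```python
import numpy as np

# A toy discrete-time Markov chain (states and numbers are illustrative):
# P[i, j] = Pr(next state = j | current state = i), each row sums to 1.
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.40, 0.30],
    [0.2, 0.30, 0.50],
])

rng = np.random.default_rng(0)

def step(current: int) -> int:
    """Sample the next state using only the current state (Markov property)."""
    return rng.choice(len(states), p=P[current])

# Roll out a short trajectory; the history is never consulted.
s = 0
trajectory = [states[s]]
for _ in range(10):
    s = step(s)
    trajectory.append(states[s])
print(" -> ".join(trajectory))
```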
Markov Decision Process
Formally, an MDP is a 4-tuple $(S, A, P_a, R_a)$ (a small code sketch of this tuple follows the list), where:
- $S$: the state space.
- $A$: the action space ($A_s$ is the set of actions available from state $s$).
- $P_a(s, s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$: the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$. Generally, $\sum_{s' \in S} P_a(s, s') = 1$ for every $s \in S$ and $a \in A_s$.
- $R_a(s, s')$: the immediate reward received after transitioning from state $s$ to state $s'$ due to action $a$.
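As a concrete illustration of this 4-tuple, the sketch below encodes a tiny two-state MDP as plain Python dictionaries; the state names, actions, probabilities, and rewards are all invented, chosen only to show the structure.

```python
# A tiny hand-made MDP (S, A, P, R); every number here is illustrative.
S = ["s0", "s1"]                       # state space
A = {"s0": ["stay", "go"],             # A_s: actions available in each state
     "s1": ["stay", "go"]}

# P[(s, a)] maps next states s' to probabilities P_a(s, s'); each row sums to 1.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# R[(s, a, s')] is the immediate reward R_a(s, s').
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "go",   "s0"): 0.0, ("s0", "go",   "s1"): 2.0,
    ("s1", "stay", "s0"): 0.0, ("s1", "stay", "s1"): 0.5,
    ("s1", "go",   "s0"): 1.0, ("s1", "go",   "s1"): 0.0,
}

# Sanity check: every (s, a) transition distribution sums to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in P.values())
```

The same toy MDP is reused (re-declared inline) in the later sketches, so the numbers stay comparable throughout.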
Usually, the optimization objective of an MDP is to find a good policy for the decision maker. Thus, there exists a policy function $\pi$, a (potentially probabilistic) mapping from $S$ to $A$. Once combined with the MDP, the policy fixes the action taken in each state, and the resulting combination behaves like a Markov chain. The objective is to choose a policy $\pi$ that maximizes some cumulative function of the random rewards, typically the expected discounted sum:

$$E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})\right], \quad \text{where } a_t = \pi(s_t),$$

where $\gamma$ is the discount factor satisfying $0 \le \gamma \le 1$, which is usually close to 1 (for example, $\gamma = 1/(1+r)$ for some discount rate $r$). A lower discount factor motivates the decision maker to favor taking actions early, rather than postponing them indefinitely2.
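The effect of the discount factor is easy to see numerically. The following sketch computes the discounted sum for an arbitrary reward sequence and two values of $\gamma$, both chosen purely for illustration.

```python
# Discounted cumulative reward for an illustrative reward sequence.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]    # r_0, r_1, ..., r_4 (made up)

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.99))  # ~4.91, close to the undiscounted sum 5
print(discounted_return(rewards, gamma=0.50))  # ~1.94, later rewards count far less
```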
The standard algorithms for calculating optimal policies for finite-state and finite-action MDPs require storage for two arrays indexed by state: the value $V$ and the policy $\pi$. At the end of the algorithm, $\pi$ will contain the solution and $V(s)$ will contain the discounted sum of the rewards to be earned (on average) by following that solution from state $s$.
To simplify the presentation, we introduce some new definitions and notation to aid understanding:
- $R_s$: the expectation of the reward gained at the next time step given state $s$, i.e. $R_s = E[R_{t+1} \mid S_t = s]$.
- $G_t$: the return, the discounted sum of rewards from a state onwards in a Markov chain: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
Usually, the state-indexed value function is called the state value function; it is the expectation of the return from a given state in a Markov chain:

$$v(s) = E[G_t \mid S_t = s].$$
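Because $v(s)$ is an expectation over random trajectories, it can be estimated by Monte Carlo sampling: roll out many episodes from $s$ and average their returns. The sketch below does this for an invented two-state Markov reward process (the transition matrix, rewards, and $\gamma$ are all assumptions made for illustration).

```python
import numpy as np

# Invented Markov reward process: transition matrix and expected rewards.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])    # P[s, s'] = Pr(S_{t+1} = s' | S_t = s)
R = np.array([0.0, 1.0])      # R[s] = R_s, expected reward received after leaving s
gamma = 0.9
rng = np.random.default_rng(0)

def sample_return(start, horizon=200):
    """Roll out one trajectory and compute its (truncated) discounted return."""
    s, g, discount = start, 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s]
        discount *= gamma
        s = rng.choice(2, p=P[s])
    return g

# Monte Carlo estimate of v(s0): the average return over many sampled episodes.
estimate = np.mean([sample_return(0) for _ in range(2000)])
print(f"estimated v(s0) ~ {estimate:.3f}")
```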
Bellman Equation
A Bellman equation, named after Richard E. Bellman, is a technique in dynamic programming that breaks an optimization problem into a sequence of simpler subproblems, as Bellman's "principle of optimality" prescribes3.
The introduction of the Bellman equation makes dynamic programming applicable. The state value function can be written recursively as:

$$v(s) = E[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s] = R_s + \gamma \sum_{s' \in S} P_{ss'}\, v(s'),$$

where $P_{ss'}$ is the state transition probability of the underlying Markov chain.
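In matrix form this reads $v = R + \gamma P v$, so for a small chain the state values can be obtained exactly by solving a linear system. The sketch below does this for the same invented two-state chain used in the Monte Carlo sketch above.

```python
import numpy as np

# Same invented two-state Markov reward process as in the Monte Carlo sketch.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # P[s, s'] = Pr(S_{t+1} = s' | S_t = s)
R = np.array([0.0, 1.0])     # R[s] = R_s, expected reward received after leaving s
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma P v  <=>  (I - gamma P) v = R.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)   # exact state values; v[0] should match the Monte Carlo estimate
```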
Algorithm
The mapping description is less expressive from a statistical perspective. Here we need a slight abuse of notation: $\pi(a \mid s)$ represents the probability of taking action $a$ in state $s$ ($\pi(\cdot \mid s)$ can be a probability distribution over the actions available in state $s$).
NOTE: The policy $\pi$ completely defines the agent's behaviour, including every possible action and its probability.
The policy depends only on the current state, not on the time or on any history information.
The agent can update its policy over time.
Under a given policy $\pi$, the agent follows a fixed pattern:
- State transition probability: $P^{\pi}(s, s') = \sum_{a \in A_s} \pi(a \mid s)\, P_a(s, s')$
- Reward function: $R^{\pi}(s) = \sum_{a \in A_s} \pi(a \mid s) \sum_{s' \in S} P_a(s, s')\, R_a(s, s')$
With the policy involved, we can represent the agent's value from either the state perspective or the action perspective: the state value function $v_{\pi}(s) = E_{\pi}[G_t \mid S_t = s]$ and the action value function $q_{\pi}(s, a) = E_{\pi}[G_t \mid S_t = s, A_t = a]$.
Still, with the Bellman equation, we have:

$$v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s], \qquad q_{\pi}(s, a) = E_{\pi}[R_{t+1} + \gamma\, q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a].$$

By the law of total expectation, the two value functions are related:

$$v_{\pi}(s) = \sum_{a \in A_s} \pi(a \mid s)\, q_{\pi}(s, a), \qquad q_{\pi}(s, a) = \sum_{s' \in S} P_a(s, s') \left[ R_a(s, s') + \gamma\, v_{\pi}(s') \right].$$
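A minimal sketch tying these pieces together (all numbers invented, reusing the shape of the earlier toy MDP): it first collapses the MDP and an assumed stochastic policy into $P^{\pi}$ and $R^{\pi}$, then solves the Bellman expectation equation for $v_{\pi}$, and finally recovers $q_{\pi}$ from $v_{\pi}$.

```python
import numpy as np

# Illustrative two-state MDP (same numbers as the earlier sketch) and an invented policy.
S = ["s0", "s1"]
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "go"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}, ("s1", "go"): {"s0": 0.7, "s1": 0.3}}
R = {("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
     ("s0", "go", "s0"): 0.0, ("s0", "go", "s1"): 2.0,
     ("s1", "stay", "s0"): 0.0, ("s1", "stay", "s1"): 0.5,
     ("s1", "go", "s0"): 1.0, ("s1", "go", "s1"): 0.0}
pi = {"s0": {"stay": 0.3, "go": 0.7}, "s1": {"stay": 0.6, "go": 0.4}}
gamma = 0.9
idx = {s: i for i, s in enumerate(S)}

# Collapse MDP + policy into the induced chain: P^pi(s, s') and R^pi(s).
P_pi = np.zeros((len(S), len(S)))
R_pi = np.zeros(len(S))
for s in S:
    for a, p_a in pi[s].items():
        for s2, p in P[(s, a)].items():
            P_pi[idx[s], idx[s2]] += p_a * p
            R_pi[idx[s]] += p_a * p * R[(s, a, s2)]

# Bellman expectation equation in matrix form: v_pi = R^pi + gamma P^pi v_pi.
v_pi = np.linalg.solve(np.eye(len(S)) - gamma * P_pi, R_pi)

# q_pi(s, a) = sum_{s'} P_a(s, s') [R_a(s, s') + gamma v_pi(s')].
q_pi = {(s, a): sum(p * (R[(s, a, s2)] + gamma * v_pi[idx[s2]])
                    for s2, p in P[(s, a)].items())
        for s in S for a in pi[s]}

# Consistency check: v_pi(s) = sum_a pi(a|s) q_pi(s, a).
for s in S:
    assert abs(sum(pi[s][a] * q_pi[(s, a)] for a in pi[s]) - v_pi[idx[s]]) < 1e-9

print(v_pi, q_pi)
```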
Solving the reinforcement learning problem means finding an optimal policy that lets the agent consistently gain more from interacting with the environment than any other policy. This optimal policy is denoted $\pi_*$. In practice we can often only find a policy that is locally optimal compared with the other policies considered. Naturally, the optimal policy is the one that attains the largest expected state value, i.e.:

$$v_*(s) = \max_{\pi} v_{\pi}(s).$$
With the Bellman equation, the problem can be solved via the Bellman optimality equation. The algorithm alternates two steps, (1) a value update and (2) a policy update:

$$V(s) \leftarrow \sum_{s' \in S} P_{\pi(s)}(s, s') \left[ R_{\pi(s)}(s, s') + \gamma V(s') \right],$$

$$\pi(s) \leftarrow \arg\max_{a \in A_s} \sum_{s' \in S} P_a(s, s') \left[ R_a(s, s') + \gamma V(s') \right].$$
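A compact way to realize both steps is value iteration, where the value update takes a greedy maximum over actions and the policy is extracted afterwards. The sketch below runs it on the same invented two-state MDP; the convergence threshold is an arbitrary choice.

```python
# Value iteration on the illustrative two-state MDP (all numbers invented).
S = ["s0", "s1"]
A = {"s0": ["stay", "go"], "s1": ["stay", "go"]}
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "go"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}, ("s1", "go"): {"s0": 0.7, "s1": 0.3}}
R = {("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
     ("s0", "go", "s0"): 0.0, ("s0", "go", "s1"): 2.0,
     ("s1", "stay", "s0"): 0.0, ("s1", "stay", "s1"): 0.5,
     ("s1", "go", "s0"): 1.0, ("s1", "go", "s1"): 0.0}
gamma = 0.9

def backup(s, a, V):
    """One-step lookahead: sum_{s'} P_a(s, s') [R_a(s, s') + gamma V(s')]."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())

V = {s: 0.0 for s in S}
while True:
    # Value update: Bellman optimality backup at every state.
    V_new = {s: max(backup(s, a, V) for a in A[s]) for s in S}
    delta = max(abs(V_new[s] - V[s]) for s in S)
    V = V_new
    if delta < 1e-8:
        break

# Policy update / extraction: act greedily with respect to the converged values.
pi_star = {s: max(A[s], key=lambda a: backup(s, a, V)) for s in S}
print(V, pi_star)
```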
Example
An example of an MDP is the pole-balancing problem, which comes from classic control theory. In this example (a runnable sketch follows the list):
- The state describes the cart and pole configuration: cart position, cart velocity, pole angle, and pole angular velocity.
- The actions are the forces applied to the cart: push left or push right.
- The reward is +1 for every time step the pole stays upright; the episode ends when the pole falls past a threshold or the cart leaves the track.
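For hands-on experimentation with this MDP, the CartPole environment from the Gymnasium library is a common stand-in; the sketch below assumes Gymnasium is installed and simply runs one episode under a random policy to show the agent-environment interaction loop.

```python
import gymnasium as gym

# CartPole-v1 is Gymnasium's pole-balancing environment.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)   # state: cart position/velocity, pole angle/angular velocity
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                        # push left (0) or right (1)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                                    # +1 per step the pole stays up
    done = terminated or truncated
env.close()
print(f"episode return under a random policy: {total_reward}")
```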
