The Markov Decision Process (MDP) provides a mathematical framework for the RL problem: almost all RL problems can be modeled as MDPs, and MDPs are widely used for sequential optimization problems more broadly. In this section, we will look at what an MDP is and how it is used in RL.
To understand an MDP, we first need to learn about the Markov property and Markov chains.
State Transition (Probability) Matrix
- For a state s and successor state s', the state transition probability is defined by
  P_ss' = ℙ[S_{t+1} = s' | S_t = s]
- The state transition matrix P defines the transition probabilities from all states s to all successor states s': entry (i, j) of P is the probability of moving from state i to state j, where each row of the matrix sums to 1.
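A minimal sketch of this in code (the three states and their probabilities are made up for illustration): the matrix is a NumPy array whose rows are probability distributions over successor states.

```python
import numpy as np

# Hypothetical 3-state chain: 0 = Sunny, 1 = Cloudy, 2 = Rainy.
# Row s holds P[S_{t+1} = s' | S_t = s] for each successor s'.
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.40, 0.30],
    [0.2, 0.30, 0.50],
])

# Each row must be a valid probability distribution.
assert np.allclose(P.sum(axis=1), 1.0)
```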
Markov Property
- A state S_t is Markov (i.e. has the Markov property) if and only if
  ℙ[S_{t+1} | S_t] = ℙ[S_{t+1} | S_1, ..., S_t]
- This is the memoryless property of a stochastic process.
- The future is independent of the past given the present: once the current state is known, the history may be thrown away.
Markov Process
- A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.
- Definition: a Markov process (or Markov chain) is a tuple ⟨S, P⟩, where S is a (finite) set of states and P is a state transition probability matrix with P_ss' = ℙ[S_{t+1} = s' | S_t = s].
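Sampling from a Markov chain only ever looks at the current state. A short sketch, continuing the hypothetical weather chain above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(P, start, steps):
    """Sample a state sequence S_1, S_2, ... from a Markov chain.

    The successor is drawn from the current state's row of P, so the
    next state depends only on the present state (the Markov property).
    """
    states = [start]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_chain(P, start=0, steps=10))
```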
Markov Reward Process
- A Markov reward process is a Markov chain with values (rewards) attached to it.
- Definition: a Markov reward process is a tuple ⟨S, P, R, γ⟩, where S is a (finite) set of states, P is a state transition probability matrix, R is a reward function with R_s = 𝔼[R_{t+1} | S_t = s], and γ ∈ [0, 1] is a discount factor.
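As data, an MRP is just the chain from before plus a reward vector and a discount factor; the numbers below are again made up for illustration:

```python
import numpy as np

# Hypothetical MRP <S, P, R, gamma> built on the 3-state chain above.
R = np.array([1.0, 0.0, -1.0])  # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9                     # discount factor in [0, 1]
```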
Episode
- Episode and sampling
- An episode is one run of the process, from the start state to a terminal state.
- Sampling means generating example episodes from the process.
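A sketch of sampling one episode from the hypothetical MRP above; `terminal` is an assumed set of absorbing state indices, and `rng` is the generator from the earlier sketch:

```python
def sample_episode(P, R, start, terminal, rng, max_steps=100):
    """Sample one episode as a list of (state, reward) pairs."""
    s, episode = start, []
    for _ in range(max_steps):
        episode.append((s, R[s]))
        if s in terminal:
            break  # reached a terminal state: the episode ends
        s = int(rng.choice(len(P), p=P[s]))
    return episode

# e.g. treat state 2 as terminal for this illustration
print(sample_episode(P, R, start=0, terminal={2}, rng=rng))
```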
Return
- The purpose of reinforcement learning is to maximize the return G_t, not the individual rewards.
- Definition: the return G_t is the total discounted reward from time-step t,
  G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
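A minimal sketch of computing the return for a finite, sampled reward sequence:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward list."""
    g = 0.0
    # Fold from the back so each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.9))  # 1 + 0.9 + 0.81 = 2.71
```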
Why discount?
- Mathematical convenience
  - Avoids infinite returns in cyclic Markov processes
  - It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate
- Human preference
  - Animal/human behavior shows a preference for immediate reward
  - If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Future uncertainty
  - Uncertainty about the future may not be fully represented
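A quick worked example of the first point: with a constant reward of 1 at every step, G_t = 1 + γ + γ² + ... = 1/(1 − γ), which is 10 for γ = 0.9, whereas with γ = 1 the same sum diverges.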
(State) Value Function
- The value function v(s) gives the long-term value of state s
- Definition: v(s) = 𝔼[G_t | S_t = s], the expected return starting from state s
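For an MRP the value function satisfies v(s) = R_s + γ Σ_{s'} P_ss' v(s'), i.e. v = R + γPv in matrix form, so for small state spaces it can be solved directly as a linear system. A sketch using the hypothetical P, R, and gamma from above:

```python
import numpy as np

# (I - gamma * P) v = R  =>  solve directly for v
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)  # long-term value of each state
```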
Markov Decision Process
- A Markov decision process (MDP) is a Markov reward process with decisions (made by an agent in a sequential decision-making problem)
- It is an environment in which all states are Markov.
- Definition: an MDP is a tuple ⟨S, A, P, R, γ⟩, where S is a (finite) set of states, A is a (finite) set of actions, P is a state transition probability matrix with P^a_ss' = ℙ[S_{t+1} = s' | S_t = s, A_t = a], R is a reward function with R^a_s = 𝔼[R_{t+1} | S_t = s, A_t = a], and γ ∈ [0, 1] is a discount factor.
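One simple way to hold such an MDP in code is a pair of arrays indexed by action and state; the 2-state, 2-action numbers below are an assumption for illustration:

```python
import numpy as np

# P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
P = np.array([
    [[0.9, 0.1],   # action 0, from state 0
     [0.2, 0.8]],  # action 0, from state 1
    [[0.5, 0.5],   # action 1, from state 0
     [0.0, 1.0]],  # action 1, from state 1
])
# R[a, s] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[1.0,  0.0],
              [0.5, -1.0]])
gamma = 0.9
```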
History and State
- The history is the sequence of observations, actions, and rewards: H_t = O_1, R_1, A_1, ..., A_{t−1}, O_t, R_t
- The state is the information used to determine what happens next
- Formally, the state is a function of the history: S_t = f(H_t)
What is an MDP (Markov Decision Process)?
- Markov decision processes formally describe an environment for reinforcement learning
- Where the environment is fully observable
- The current state completely characterizes the process
- Almost all RL problems can be formalized as MDPs
Model
- A model predicts what the environment will do next
- P predicts the next state
- R predicts the next (immediate) reward
Policy
- A policy is the agent's behavior
- Deterministic policy: π(s) = a
- Stochastic policy: π(a|s) = ℙ[A_t = a | S_t = s]
- Definition: a policy π is a distribution over actions given states (see the sketch after this list)
- A policy fully defines the behavior of an agent
- MDP policies depend on the current state (not the history)
- i.e. policies are stationary (time-independent)
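A minimal sketch of both policy types for the hypothetical 2-state, 2-action MDP above:

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_policy(s):
    """pi(s) = a: a fixed action per state (here: action 0 in state 0, 1 in state 1)."""
    return [0, 1][s]

# Stochastic policy as a table: pi[s, a] = P[A_t = a | S_t = s]
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

def sample_action(s):
    """Draw an action from pi(.|s)."""
    return int(rng.choice(pi.shape[1], p=pi[s]))
```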
Value Function
- Value function is a prediction of future reward
- Used to evaluate the goodness/badness of states (and actions)
State-value function
- Definition: v_π(s) = 𝔼_π[G_t | S_t = s], the expected return starting from state s and then following policy π
Action-value function
- Definition: q_π(s, a) = 𝔼_π[G_t | S_t = s, A_t = a], the expected return starting from state s, taking action a, and then following policy π
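As a sketch, v_π(s) can be estimated by Monte Carlo: sample many episodes under the stochastic policy pi above and average the returns (episodes are truncated at a fixed length for simplicity):

```python
def mc_state_value(P, R, pi, gamma, s0, rng, episodes=1000, steps=50):
    """Monte Carlo estimate of v_pi(s0) as the average of sampled returns."""
    total = 0.0
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(steps):
            a = int(rng.choice(pi.shape[1], p=pi[s]))   # a ~ pi(.|s)
            g += discount * R[a, s]                     # accumulate gamma^k * R
            discount *= gamma
            s = int(rng.choice(P.shape[2], p=P[a, s]))  # s' ~ P[a, s, .]
        total += g
    return total / episodes

print(mc_state_value(P, R, pi, gamma, s0=0, rng=rng))
```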
Prediction and Control Tasks in RL Problem
- Prediction: given a policy π, evaluate the value of each state
- Control: find the optimal policy