
The Markov Decision Process (MDP) provides a mathematical framework for solving the RL problem. Almost all RL problems can be modeled as an MDP. MDPs are widely used for solving various optimization problems. In this section, we will understand what an MDP is and how it is used in RL.

 

To understand an MDP, first, we need to learn about the Markov property and Markov chain.

Agent-Environment Interface

  • At each time step, the agent observes the current state, selects an action, and the environment responds with a reward and the next state; this interaction loop is how the RL problem is framed.

State Transition (Probability) Matrix

  • For a state s and successor state s', the state transition probability is defined by

    P_ss' = ℙ[S_{t+1} = s' | S_t = s]

  • The state transition matrix P defines transition probabilities from all states s to all successor states s': the entry in row s and column s' is P_ss',

where each row of the matrix sums to 1.
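
As a minimal sketch (assuming NumPy; the two states and their probabilities are invented for illustration), a transition matrix is just a stochastic matrix we can sample from:

```python
import numpy as np

# Toy two-state weather chain; states and numbers are invented.
states = ["sunny", "rainy"]

# P[i, j] = P[S_{t+1} = j | S_t = i]
P = np.array([[0.8, 0.2],    # from "sunny"
              [0.4, 0.6]])   # from "rainy"

# Each row is a probability distribution over successor states.
assert np.allclose(P.sum(axis=1), 1.0)

# Sample a successor state of "sunny" (index 0).
rng = np.random.default_rng(0)
s_next = rng.choice(len(states), p=P[0])
print(states[s_next])
```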

Markov Property

  • A state S_t is Markov (i.e. has the Markov property) if and only if

    ℙ[S_{t+1} | S_t] = ℙ[S_{t+1} | S_1, …, S_t]

    (the memoryless property of a stochastic process)

  • The future is independent of the past given the present; once the current state is known, the history may be thrown away.

Markov Process

  • A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, … with the Markov property.
  • Definition: a Markov process (or Markov chain) is a tuple ⟨S, P⟩
    • S is a (finite) set of states
    • P is a state transition probability matrix, P_ss' = ℙ[S_{t+1} = s' | S_t = s]
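
A sketch of sampling such a sequence (reusing the invented two-state chain from above; each step looks only at the current state, which is exactly the Markov property):

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

def sample_chain(P, start, n_steps, rng):
    """Sample S1, S2, ...: the next state depends only on the current one."""
    path = [start]
    for _ in range(n_steps):
        path.append(int(rng.choice(len(P), p=P[path[-1]])))
    return path

print(sample_chain(P, start=0, n_steps=8, rng=np.random.default_rng(1)))
```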

Markov Reward Process

  • A Markov reward process is a Markov chain with values (rewards)
  • Definition: a Markov reward process is a tuple ⟨S, P, R, γ⟩
    • S is a finite set of states
    • P is a state transition probability matrix, P_ss' = ℙ[S_{t+1} = s' | S_t = s]
    • R is a reward function, R_s = 𝔼[R_{t+1} | S_t = s]
    • γ is a discount factor, γ ∈ [0, 1]

 

Episode

  • An episode is one run of the process, from an initial state to a terminal state.

  • Sampling means generating such example episodes from the process, as in the sketch below.
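
A minimal sketch, assuming an invented three-state MRP whose last state is terminal:

```python
import numpy as np

# Invented MRP: state 2 is terminal (absorbing, zero reward).
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])   # R[s] = E[R_{t+1} | S_t = s]

def sample_episode(P, R, start, terminal, rng):
    """One episode: the (state, reward) pairs from start to terminal."""
    s, episode = start, []
    while s != terminal:
        episode.append((s, R[s]))
        s = int(rng.choice(len(P), p=P[s]))
    return episode

print(sample_episode(P, R, start=0, terminal=2, rng=np.random.default_rng(0)))
```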

Return

  • The purpose of reinforcement learning is to maximize the return, not the individual rewards.
  • Definition: the return G_t is the total discounted reward from time step t,

    G_t = R_{t+1} + γR_{t+2} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}, where γ ∈ [0, 1]
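
A short sketch of computing the return from a sampled reward sequence (the rewards and γ are arbitrary example values):

```python
def compute_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + ..., computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(compute_return([-1.0, -2.0, 10.0], gamma=0.9))  # -1 + 0.9*(-2) + 0.81*10
```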

Why discount?

  • Mathematically convenient
    • Avoids infinite returns in cyclic Markov processes
    • It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate
  • Human preference
    • Animal/human behavior shows a preference for immediate reward
    • If the reward is financial, immediate rewards may earn more interest than delayed rewards
  • Future uncertainty
    • Uncertainty about the future may not be fully represented

(State) Value Function

  • The value function v(s) gives the long-term value of state s
  • Definition: the state value function v(s) of an MRP is the expected return starting from state s,

    v(s) = 𝔼[G_t | S_t = s]
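
As a sketch, v(s) can be estimated by sampling many episodes from a state and averaging their returns (reusing the invented MRP above; state 2 is terminal):

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing/terminal
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

def estimate_value(start, n_episodes=5_000, seed=0):
    """Monte Carlo estimate of v(start) = E[G_t | S_t = start]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        while s != 2:                 # run each episode until the terminal state
            g += discount * R[s]
            discount *= gamma
            s = rng.choice(3, p=P[s])
        total += g
    return total / n_episodes

print(estimate_value(0))
```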

Markov Decision Process

  • A Markov decision process (MDP) is a Markov reward process with decisions (by an agent in a sequential decision making problem)
  • It is an environment in which all states are Markov.
  • Definition: a Markov decision process is a tuple ⟨S, A, P, R, γ⟩
    • S is a finite set of states
    • A is a finite set of actions
    • P is a state transition probability matrix, P^a_ss' = ℙ[S_{t+1} = s' | S_t = s, A_t = a]
    • R is a reward function, R^a_s = 𝔼[R_{t+1} | S_t = s, A_t = a]
    • γ is a discount factor, γ ∈ [0, 1]

History and State

  • The history is the sequence of observations, actions, and rewards: H_t = O_1, R_1, A_1, …, A_{t−1}, O_t, R_t
  • State is the information used to determine what happens next
  • Formally, the state is a function of the history: S_t = f(H_t)

What is an MDP (Markov Decision Process)?

  • Markov decision processes formally describe an environment for reinforcement learning
  • Where the environment is fully observable
  • The current state completely characterizes the process
  • Almost all RL problems can be formalized as MDPs

Model

  • A model predicts what the environment will do next
    • P predicts the next state
    • R predicts the next (immediate) reward

Policy

  • A policy is the agent's behavior
    • Deterministic policy: π(s) = a
    • Stochastic policy: π(a|s) = ℙ[A_t = a | S_t = s]

  • Definition: a policy π is a distribution over actions given states,

    π(a|s) = ℙ[A_t = a | S_t = s]

  • A policy fully defines the behavior of an agent
    • MDP policies depend on the current state (not the history)
    • i.e. Policies are stationary (time-independent)
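
A minimal sketch of the two kinds of policy for a toy problem with 2 states and 2 actions (all numbers invented):

```python
import numpy as np

n_actions = 2

# Deterministic policy: pi(s) = a, one action per state.
pi_det = np.array([1, 0])              # state 0 -> action 1, state 1 -> action 0

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], one distribution per state.
pi_sto = np.array([[0.3, 0.7],
                   [0.9, 0.1]])
assert np.allclose(pi_sto.sum(axis=1), 1.0)

rng = np.random.default_rng(0)
s = 0
print("deterministic action:", pi_det[s])
print("stochastic action:  ", rng.choice(n_actions, p=pi_sto[s]))
```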

Value Function

  • The value function is a prediction of future reward
  • It is used to evaluate the goodness/badness of states

State-Value Function

  • Definition: the state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π,

    v_π(s) = 𝔼_π[G_t | S_t = s]

Action-Value Function

  • Definition: the action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π,

    q_π(s, a) = 𝔼_π[G_t | S_t = s, A_t = a]
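
As a sketch, v_π(s) can again be estimated by Monte Carlo: sample actions from π and transitions from the environment, and average the returns (the three-state, two-action MDP below is invented; state 2 is terminal):

```python
import numpy as np

# P[a][s] is the next-state distribution for action a in state s.
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.0, 0.0, 1.0]],   # action 0
    [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[-1.0, -2.0, 0.0],      # R[a][s] = E[R_{t+1} | S_t = s, A_t = a]
              [-0.5, -3.0, 0.0]])
pi = np.array([[0.5, 0.5],            # pi[s][a] = pi(a|s)
               [0.2, 0.8],
               [0.5, 0.5]])
gamma = 0.9

def estimate_v(start, n_episodes=5_000, seed=0):
    """Monte Carlo estimate of v_pi(start) = E_pi[G_t | S_t = start]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        while s != 2:
            a = rng.choice(2, p=pi[s])        # sample an action from the policy
            g += discount * R[a][s]
            discount *= gamma
            s = rng.choice(3, p=P[a][s])      # sample the next state
        total += g
    return total / n_episodes

print(estimate_v(0))
```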

Prediction and Control Tasks in RL Problem

  • Prediction: given a policy π, evaluate the value of each state

  • Control: find the optimal policy
