Markov Decision Processes (MDPs) have been extensively studied and form the basis of a wide range of reinforcement learning algorithms.

Imagine you are extremely sleepy early in the morning at work and you need to figure out how to stay awake. Further, imagine that you (the agent) don't have the prerequisite knowledge of what you can eat or drink to stay awake.

So, we can be in either a sleepy state or a wide-awake state. Let this be denoted as follows: \( S = \{ sleepy, awake \}\). Further, we can take the following consumption actions: \( A = \{ Walnuts, Turkey, Rice, Coffee\}\). Next, let \( P(s' \mid s, a) \) denote the probability of transitioning to state \( s' \) from state \( s \) after taking action \( a \), where \( a \in A \) and \( s, s' \in S \). Essentially, we need to learn a policy that tells us, given our sleepy state, which action to take in order to stay awake (i.e., to collect our reward). Here, the policy is nothing but the probability of taking a particular action in a given state, i.e., \( \pi(s, a)\) is the probability of taking action \(a\) in state \( s\).
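To make the notation concrete, here is a minimal Python sketch of this toy MDP. The transition probabilities and rewards below are made-up numbers chosen purely for illustration, not part of the original example.

```python
# A minimal sketch of the sleepy/awake MDP.
# All transition probabilities and rewards are assumed values for illustration only.
states = ["sleepy", "awake"]
actions = ["Walnuts", "Turkey", "Rice", "Coffee"]

# P[(s, a)][s'] = probability of landing in s' after taking action a in state s.
P = {
    ("sleepy", "Walnuts"): {"sleepy": 0.7,  "awake": 0.3},
    ("sleepy", "Turkey"):  {"sleepy": 0.9,  "awake": 0.1},
    ("sleepy", "Rice"):    {"sleepy": 0.8,  "awake": 0.2},
    ("sleepy", "Coffee"):  {"sleepy": 0.1,  "awake": 0.9},
    ("awake",  "Walnuts"): {"sleepy": 0.2,  "awake": 0.8},
    ("awake",  "Turkey"):  {"sleepy": 0.6,  "awake": 0.4},
    ("awake",  "Rice"):    {"sleepy": 0.4,  "awake": 0.6},
    ("awake",  "Coffee"):  {"sleepy": 0.05, "awake": 0.95},
}

# R[(s, a)] = expected immediate reward; here: +1 for every step that ends up awake.
R = {(s, a): P[(s, a)]["awake"] for s in states for a in actions}

# A policy pi(s, a): start with a uniform random policy over the four actions.
pi = {(s, a): 1.0 / len(actions) for s in states for a in actions}
```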

Now, we can define the value function as the following: $$ V^{\pi}(s) = \mathbb{E}\{r_{t+1} + \gamma r_{t+2} + \gamma^{2}r_{t+3} + \dots \mid s_t = s, \pi\}$$ The intuition behind the above equation is that if a particular policy \(\pi\) is followed starting from state \(s\), the expected cumulative discounted reward is \(V^{\pi}(s)\). Here, \(V^{\pi}(s)\) is also known as the value of \(s\) under the policy \( \pi \), and \(\gamma \in [0, 1)\) is the discount factor that weighs immediate rewards more heavily than distant ones.
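As a quick sanity check on this definition, here is a sketch that estimates \(V^{\pi}(s)\) by Monte Carlo sampling on the toy MDP defined above. The fixed horizon of 50 steps is an assumption to truncate the infinite sum; with \(\gamma = 0.9\) the tail beyond that contributes very little.

```python
import random

def sample_episode(s, pi, P, R, horizon=50):
    """Roll out one truncated episode under policy pi, returning the reward sequence."""
    rewards = []
    for _ in range(horizon):
        # Sample an action from the policy's distribution in state s.
        act = random.choices(actions, weights=[pi[(s, a)] for a in actions])[0]
        rewards.append(R[(s, act)])
        # Sample the next state from the transition distribution.
        s = random.choices(states, weights=[P[(s, act)][s2] for s2 in states])[0]
    return rewards

def discounted_return(rewards, gamma=0.9):
    """Compute r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Monte Carlo estimate of V^pi("sleepy"): average the discounted return over many rollouts.
n_rollouts = 5000
v_sleepy = sum(discounted_return(sample_episode("sleepy", pi, P, R))
               for _ in range(n_rollouts)) / n_rollouts
print(v_sleepy)
```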

Now, we have the tools to define the problem statement: find the optimal policy \( \pi^{*}\), i.e., the policy such that \(V^{\pi^{*}}(s)\) is maximized \(\forall s \in S\).

Let's try to decompose the value function a little bit: $$ V^{\pi}(s) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k}r_{t+k+1} \,\Big|\, s_t = s, \pi\Big]$$ This can be further broken down into the one-step lookahead reward plus the discounted sum of all subsequent rewards: $$ V^{\pi}(s) = \mathbb{E}\Big[r_{t+1} + \sum_{k=1}^{\infty} \gamma^{k}r_{t+k+1} \,\Big|\, s_t = s, \pi\Big]$$ Now, expanding the expectation over the actions chosen by the policy and the resulting next states, we get: $$ V^{\pi}(s) = \sum_{a \in A} \pi(s, a)\Big[R(s, a) + \gamma \sum_{s' \in S}P(s'|s, a)V^{\pi}(s')\Big]$$ The above is the famous Bellman equation.
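For a finite MDP, the Bellman equation is a linear system in the values \(V^{\pi}(s)\), so it can be solved directly. Below is a minimal sketch that does this with NumPy for the toy MDP defined earlier (the iterative-backup version of policy evaluation appears later in the post); the function name and \(\gamma = 0.9\) are my own choices for illustration.

```python
import numpy as np

def evaluate_policy_exact(states, actions, P, R, pi, gamma=0.9):
    """Solve the Bellman equation V = R_pi + gamma * P_pi V as a linear system."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    R_pi = np.zeros(n)         # expected one-step reward under pi
    P_pi = np.zeros((n, n))    # state-to-state transition matrix under pi
    for s in states:
        for a in actions:
            R_pi[idx[s]] += pi[(s, a)] * R[(s, a)]
            for s2, p in P[(s, a)].items():
                P_pi[idx[s], idx[s2]] += pi[(s, a)] * p
    # (I - gamma * P_pi) V = R_pi  =>  V = (I - gamma * P_pi)^{-1} R_pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    return {s: V[idx[s]] for s in states}

print(evaluate_policy_exact(states, actions, P, R, pi))
```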

Further, just the way we assigned a value to a state, we can assign a value to a state-action pair \( (s, a)\). This value is defined as follows: $$ Q^{\pi}(s, a) = \mathbb{E}\{r_{t+1} + \gamma r_{t+2} + \dots \mid s_{t} = s, a_{t} = a, \pi\}$$
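The two value functions are linked by the standard identity \(Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^{\pi}(s')\): take action \(a\) once, then follow \(\pi\). A small sketch computing \(Q^{\pi}\) from the exact \(V^{\pi}\) obtained above (again on the toy MDP, with assumed \(\gamma = 0.9\)):

```python
def q_from_v(V, P, R, gamma=0.9):
    """Q^pi(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * V^pi(s')."""
    return {
        (s, a): R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
        for (s, a) in R
    }

V_pi = evaluate_policy_exact(states, actions, P, R, pi)
Q_pi = q_from_v(V_pi, P, R)
# Compare the value of drinking coffee vs. eating turkey while sleepy.
print(Q_pi[("sleepy", "Coffee")], Q_pi[("sleepy", "Turkey")])
```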

Now, we can do a similar expansion as above for the optimal policy, where the action at the next state \(s'\) is always the one that maximizes the action value, to get the Bellman optimality equation: $$ Q^{*}(s, a) = R(s, a) + \gamma \sum_{s' \in S}P(s'|s, a) \max_{a' \in A_{s'}}Q^{*}(s', a')$$

Value Function Approximation: As an example DP algorithm, we consider value iteration. This iterative update is also known as a backup because we are approximating the value of the current state by backing up information from its possible successor states. $$V_{k+1}(s) = \max_{a \in A_s}\Big[R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{k}(s')\Big]$$ Similarly, we can approximate the Q-value function by using iterative backup updates: $$Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \max_{a' \in A_{s'}}Q_{k}(s', a')$$
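Here is a minimal sketch of both backup updates on the toy MDP from above. Running for a fixed number of iterations instead of checking convergence, and \(\gamma = 0.9\), are simplifying assumptions of mine.

```python
def value_iteration(states, actions, P, R, gamma=0.9, iters=100):
    """V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) * V_k(s') ]."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                   for a in actions)
            for s in states
        }
    return V

def q_iteration(states, actions, P, R, gamma=0.9, iters=100):
    """Q_{k+1}(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q_k(s',a')."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q = {
            (s, a): R[(s, a)] + gamma * sum(p * max(Q[(s2, a2)] for a2 in actions)
                                            for s2, p in P[(s, a)].items())
            for s in states for a in actions
        }
    return Q

V_star = value_iteration(states, actions, P, R)
Q_star = q_iteration(states, actions, P, R)
# The greedy policy reads off the best action in each state, e.g. in the sleepy state:
print(max(actions, key=lambda a: Q_star[("sleepy", a)]))
```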

The following snippet is an example of policy evaluation using the aforementioned iterative backup method.