Markov Chains

Published 2023-11-19 17:34:02 · Author: kaleidopink

1. Introduction

​ Let \(\{X_n,n=0,1,2,...\}\) be a stochastic process that takes on a finite number of possible values. If \(X_n=i\), then the process is said to be in state \(i\) at time \(n\).

We suppose that whenever the process is in state \(i\), there is a fixed probability \(P_{ij}\) that it will next be in state \(j\).

2. Chapman-Kolmogorov Equations

​ We now define the \(n\)-step transition probabilities \(P_{ij}^n\) to be the probability that a process in state \(i\) will be in state \(j\) after \(n\) additional transitions. That is,

\[P_{ij}^n = P\{X_{n+k}=j\ |\ X_k=i\},\quad n\ge0,i,j\ge0 \]

The Chapman-Kolmogorov equations tell us that

\[P_{ij}^{n+m} = \sum_{k=0}^\infin P_{ik}^nP_{kj}^m\quad\mathrm{for\ all}\ n,m\ge0,\ \mathrm{all}\ i,j\tag{2.1} \]
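As a quick numerical check (a minimal sketch; the two-state transition matrix below is made up), the matrix of \(n\)-step transition probabilities \([P_{ij}^n]\) is just the \(n\)th power of the one-step transition matrix, so Equation (2.1) is the matrix identity \(P^{n+m}=P^nP^m\):

```python
import numpy as np

# Hypothetical two-state chain (rows sum to 1); any stochastic matrix works here.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def n_step(P, n):
    """n-step transition matrix P^n (P^0 is the identity)."""
    return np.linalg.matrix_power(P, n)

n, m = 3, 5
# Chapman-Kolmogorov: P^{n+m} = P^n @ P^m
assert np.allclose(n_step(P, n + m), n_step(P, n) @ n_step(P, m))
print(n_step(P, n + m))
```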

3. Classification of States

State \(j\) is said to be accessible from state \(i\) if \(P_{ij}^n>0\) for some \(n\ge0\). Two states \(i\) and \(j\) that are accessible to each other are said to communicate, and we write \(i\lrarr j\).

​ Obviously, the relation of communication has the following properties:

  • if \(i\lrarr j\), then \(j\lrarr i\)
  • if \(i\lrarr j\) and \(j\lrarr k\), then \(i\lrarr k\).

​ Two states that communicate are said to be in the same class. Any two classes of states are either identical or disjoint.

​ The concept of communication divides the state space into a number of separate classes. The Markov chain is said to be irreducible if there is only one class, that is, if all states communicate with each other. In other words, it is possible to move from any state to any other state.

​ For any state \(i\), we let \(f_i\) denote the probability that, starting in state \(i\), the process will ever reenter state \(i\). State \(i\) is said to be recurrent if \(f_i=1\) and transient if \(f_i<1\).

​ Each time the process enters a transient state \(i\), there will be a positive probability \(1-f_i\) that the process will never enter state \(i\) again. Therefore, starting in state \(i\), the probability that the process will be in state \(i\) for exactly \(n\) time periods equals \(f_i^{n-1}(1-f_i),n\ge1\). In other words, starting in state \(i\), the number of time periods that the process will be in state \(i\) has a geometric distribution with finite mean \(1/(1-f_i)\).

A transient state will only be visited a finite number of times. Consequently, in a finite-state Markov chain not all states can be transient.

Corollary 3.1 If state \(i\) is recurrent, and state \(i\) communicates with state \(j\), then state \(j\) is recurrent. Conversely, if \(i\) is transient and communicates with state \(j\), then state \(j\) must also be transient.

4. Limiting Probabilities

An alternative way to describe periodicity: the states in a recurrent class are periodic if they can be grouped into \(d>1\) disjoint groups so that all transitions from one group lead to the next group.

State \(i\) is said to have period \(d\) if \(P_{ii}^n=0\) whenever \(n\) is not divisible by \(d\), and \(d\) is the largest integer with this property. A state with period 1 is said to be aperiodic. Periodicity is a class property, that is, if state \(i\) has period \(d\) and \(i\) communicates with \(j\), then \(j\) also has period \(d\).

​ If state \(i\) is recurrent, then it is said to be positive recurrent if, starting in \(i\), the expected time until the process returns to state \(i\) is finite. Positive recurrent, aperiodic states are called ergodic.

Theorem 4.1 For an irreducible ergodic Markov chain \(\lim_{n\rarr\infin}P_{ij}^n\) exists and is independent of \(i\). Furthermore, letting

\[\pi_j = \lim_{n\rarr\infin}P_{ij}^n,\quad j\ge0 \]

then \(\pi_j\) is the unique nonnegative solution of

\[\begin{align} &\pi_j = \sum_{i=0}^\infin\pi_iP_{ij},\quad j\ge0,\\ &\sum_{j=0}^\infin\pi_j=1 \end{align}\tag{4.1} \]

The equation can be interpreted as follows: the probability of being in state \(j\) is the sum, over all states \(i\), of the probability of being in state \(i\) on the previous step weighted by the probability \(P_{ij}\) of a transition from \(i\) to \(j\).

\(\pi_j\) can also be interpreted as the long-run proportion of time that the chain spends in state \(j\) as the number of transitions \(N\rarr\infin\).
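A minimal sketch of solving Equation (4.1) numerically (the transition matrix below is made up): stack the balance equations with the normalization constraint and solve the resulting linear system; the rows of a high power of \(P\) converge to the same \(\pi\).

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])  # hypothetical transition matrix
n = P.shape[0]

# pi P = pi  <=>  (P^T - I) pi = 0; append the constraint sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)                                   # stationary probabilities
print(np.linalg.matrix_power(P, 100)[0])    # rows of P^n converge to pi
```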


The workers problem

Consider an organization whose workers are of \(r\) distinct types. Suppose that a worker who is currently type \(i\) will in the next period become type \(j\) with probability \(q_{ij}\) for \(j=1,...,r\) or will leave the organization with probability \(1-\displaystyle\sum_{j=1}^rq_{ij}\). In addition, suppose that new workers are hired each period, and that the numbers of types \(1,...,r\) workers hired are independent Poisson random variables with means \(\lambda_1,...,\lambda_r\). If we let \(\mathrm{\textbf{X}}_n=(X_n(1),...,X_n(r))\), where \(X_n(i)\) is the number of type \(i\) workers in the organization at the beginning of period \(n\), then \(\mathrm{\textbf{X}}_n,n\ge0\) is a Markov chain.

To compute its stationary probability distribution, suppose that the initial state is chosen so that the numbers of workers of different types are independent Poisson random variables, with \(\alpha_i\) being the mean number of type \(i\) workers. Also, let \(N_j,j=1,...,r\) be the number of new type \(j\) workers hired during the initial period. Now, fix \(i\), and for \(j=1,...,r\), let \(M_i(j)\) be the number of the \(X_0(i)\) type \(i\) workers who become type \(j\) in the next period. Then

\[X_1(j)=N_j+\sum_{i=1}^rM_i(j),\quad j=1,...,r \]

are independent Poisson random variables (for fixed \(i\), the thinned counts \(M_i(j)\) are independent Poissons with means \(\alpha_iq_{ij}\), and independent sums of Poissons are again Poisson) with means

\[E[X_1(j)] = \lambda_j+\sum_{i=1}^r\alpha_iq_{ij},\quad j=1,...,r \]

Hence, if we let

\[\alpha_j^o=\lambda_j+\sum_{i=1}^r\alpha_iq_{ij},\quad j=1,...,r \]

then the Markov chain is stationary when the \(\alpha_j\) are chosen so that \(\alpha_j^o=\alpha_j\) for all \(j\). That is, the stationary distribution takes the numbers of workers of each type to be independent Poisson random variables whose means \(\alpha_1^o,...,\alpha_r^o\) satisfy \(\alpha_j^o=\lambda_j+\displaystyle\sum_{i=1}^r\alpha_i^oq_{ij}\), and

\[\lim_{n\rarr\infin}P\{X_n=(k_1,...,k_r)\} = \prod_{i=1}^r\frac{e^{-\alpha^o_i}(\alpha_i^o)^{k_i}}{k_i!} \]
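A small numerical sketch of this fixed point, with made-up values for \(r=2\), \(q_{ij}\), and \(\lambda_j\): the stationary means satisfy \(\alpha^o=\lambda+Q^\top\alpha^o\), which can be solved directly or by iterating the recursion.

```python
import numpy as np

# Hypothetical data: r = 2 worker types.
Q = np.array([[0.6, 0.2],    # Q[i, j] = q_ij; rows may sum to less than 1,
              [0.1, 0.7]])   # the remainder being the probability of leaving
lam = np.array([3.0, 1.0])   # Poisson means of new hires per period

# Fixed point alpha = lam + Q^T alpha, solved directly...
alpha_direct = np.linalg.solve(np.eye(2) - Q.T, lam)

# ...or by iterating the recursion until it stabilizes.
alpha = np.zeros(2)
for _ in range(200):
    alpha = lam + Q.T @ alpha

print(alpha_direct, alpha)   # both give the stationary Poisson means
```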


Mean Pattern Times in Markov Chain Generated Data

​ Consider an irreducible Markov chain \(\{X_n,n\ge0\}\) with transition probabilities \(P_{ij}\) and stationary probabilities \(\pi_j,j\ge0\). Starting in state \(r\), we are interested in determining the expected number of transitions until the pattern \(i_1,i_2,...,i_k\) appears. That is, with

\[N(i_1,i_2,...,i_k)=\min\{n\ge k\ |\ X_{n-k+1}=i_1,...,X_n=i_k \} \]

we are interested in

\[E[N(i_1,i_2,...,i_k)\ |\ X_0=r] \]

​ Let \(\mu(i,i_1)\) be the mean number of transitions for the chain to enter state \(i_1\), given that the initial state is \(i,i\ge0\). The quantities \(\mu(i,i_1)\) can be determined as the solution of the following set of equations, obtained by conditioning on the first transition out of state \(i\):

\[\mu(i,i_1)=1+\sum_{j\ne i_1}P_{ij}\mu(j,i_1),\quad i\ge0 \]

For the Markov chain \(\{X_n,n\ge0\}\), associate a corresponding Markov chain, which we will refer to as the \(k\)-chain, whose state at any time is the sequence of the most recent \(k\) states of the original chain. Let \(\pi(j_1,...,j_k)\) be the stationary probabilities for the \(k\)-chain. Because \(\pi(j_1,...,j_k)\) is the proportion of time that, \(k\) units ago, the original Markov chain was in state \(j_1\) and the following \(k-1\) states, in sequence, were \(j_2,...,j_k\), we can conclude that

\[\pi(j_1,...,j_k) = \pi_{j_1}\prod_{l=1}^{k-1}P_{j_{l}j_{l+1}} \]

Moreover, because the mean number of transitions between successive visits of the \(k\)-chain to the state \(i_1,i_2,...,i_k\) is equal to the inverse of the stationary probability of that state, we have that

\[\begin{align} E[&\mathrm{number\ of\ transitions\ between\ visits\ to}\ i_1,i_2,...,i_k]\\ &=\frac1{\pi(i_1,...,i_k)} \end{align} \]
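A sketch of this last identity for a made-up two-state chain and the pattern \(0,1,0\): compute \(\pi(i_1,...,i_k)=\pi_{i_1}\prod_lP_{i_li_{l+1}}\) and invert it to get the mean number of transitions between successive occurrences of the pattern.

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # hypothetical chain
pi = np.array([4/7, 3/7])           # its stationary probabilities

pattern = [0, 1, 0]
# pi(i_1,...,i_k) = pi_{i_1} * product of one-step transition probabilities along the pattern
p_pattern = pi[pattern[0]] * np.prod([P[a, b] for a, b in zip(pattern, pattern[1:])])
print(1.0 / p_pattern)   # mean number of transitions between visits to the pattern
```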

Let \(A(i_1,...,i_m)\) be the additional number of transitions needed until the pattern appears, given that the first \(m\) transitions have taken the chain into states \(X_1=i_1,...,X_m=i_m\).

​ Now we consider whether the pattern has overlaps, where we say that the pattern \(i_1,i_2,...,i_k\) has an overlap of size \(j,j<k\), if the sequence of its final \(j\) elements is the same as that of its first \(j\) elements.

Case 1

Later


Proposition 4.1 Let \(\{X_n,n\ge1\}\) be an irreducible Markov chain with stationary probabilities \(\pi_j,j\ge0\), and let \(r\) be a bounded function on the state space. Then

\[\lim_{N\rarr\infin}\frac{\displaystyle\sum_{n=1}^Nr(X_n)}N=\sum_{j=0}^\infin r(j)\pi_j \]

Proof If we let \(a_j(N)\) be the amount of time the Markov chain spends in state \(j\) during periods \(1,...,N\), then

\[\sum_{n=1}^Nr(X_n)=\sum_{j=0}^\infin a_j(N)r(j) \]

Since \(a_j(N)/N\rarr\pi_j\), the result follows by dividing both sides by \(N\) and letting \(N\rarr\infin\).
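A Monte Carlo sanity check of Proposition 4.1 (a sketch; the chain and the bounded function \(r\) below are made up): the time average of \(r(X_n)\) along a long simulated path approaches \(\sum_jr(j)\pi_j\).

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])       # hypothetical chain
pi = np.array([4/7, 3/7])        # its stationary probabilities
r = np.array([1.0, 5.0])         # a bounded function on the state space

N = 200_000
x, total = 0, 0.0
for _ in range(N):
    x = rng.choice(2, p=P[x])    # one transition of the chain
    total += r[x]

print(total / N, r @ pi)         # time average vs. sum_j r(j) pi_j
```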

5. Some Applications

The Gambler's Ruin Problem
A Model for Algorithmic Efficiency

6. Mean Time Spent in Transient States

For transient states \(i\) and \(j\), let \(s_{ij}\) denote the expected number of time periods that the Markov chain is in state \(j\), given that it starts in state \(i\). Let \(\delta_{ij}=1\) when \(i=j\) and let it be 0 otherwise. Condition on the initial transition to obtain

\[\begin{align} s_{ij} &= \delta_{ij}+\sum_kP_{ik}s_{kj}\\ &= \delta_{ij}+\sum_{k=1}^tP_{ik}s_{kj} \end{align} \]

where the final equality follows since it is impossible to go from a recurrent to a transient state, implying that \(s_{kj}=0\) when \(k\) is a recurrent state.

Letting \(S\) denote the matrix of \(s_{ij},i,j=1,...,t\), and \(P_T\) the matrix of transition probabilities restricted to the transient states, the above equation can be written in matrix form as

\[S=I+P_TS\\ S=(I-P_T)^{-1} \]
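A sketch of this matrix formula for a made-up gambler's-ruin-like chain with five transient states between two absorbing barriers: \(S=(I-P_T)^{-1}\) gives the expected numbers of visits \(s_{ij}\).

```python
import numpy as np

# Hypothetical gambler's-ruin chain: 5 transient states; winning/losing are absorbing.
p = 0.4                      # probability of moving "up" one state
t = 5                        # number of transient states
P_T = np.zeros((t, t))       # transitions among the transient states only
for i in range(t):
    if i + 1 < t:
        P_T[i, i + 1] = p
    if i - 1 >= 0:
        P_T[i, i - 1] = 1 - p

S = np.linalg.inv(np.eye(t) - P_T)
# S[i, j] = expected number of periods spent in transient state j, starting from i
print(S.round(3))
```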

7. Branching Processes

Consider a population consisting of individuals able to produce offspring of the same kind. Suppose that each individual will have produced \(j\) new offspring with probability \(P_j,j\ge0\), independently of the numbers produced by other individuals. We suppose that \(P_j<1\) for all \(j\ge0\). The size of the \(n\)th generation is denoted by \(X_n,n=0,1,...\). It follows that \(\{X_n,n=0,1,...\}\) is a Markov chain having as its state space the set of nonnegative integers.

Note that state \(0\) is a recurrent state, since clearly \(P_{00}=1\). Also, if \(P_0>0\), all other states are transient, since \(P_{i0}=P_0^i>0\), which implies that starting with \(i\) individuals there is a positive probability that no later generation will ever consist of exactly \(i\) individuals. This leads to the important conclusion that, if \(P_0>0\), the population will either die out or its size will converge to infinity.

​ Let \(\mu=\displaystyle\sum_{j=0}^\infin jP_j\) denote the mean number of offspring of a single individual, and let \(\sigma^2=\displaystyle\sum_{j=0}^\infin(j-\mu)^2P_j\) be the variance of the number of offspring produced by a single individual.

\(X_n\) can be written as \(X_n=\displaystyle\sum_{i=1}^{X_{n-1}}Z_i\), where \(Z_i\) represents the number of offspring produced by the \(i\)th individual of the \((n-1)\)st generation. By conditioning on \(X_{n-1}\), we obtain

\[\begin{align} E[X_n] &= E[E[X_n|X_{n-1}]]\\ &= E\left[E\left[\sum_{i=1}^{X_{n-1}}Z_i|X_{n-1}\right]\right]\\ &=E\left[\sum_{i=1}^{X_{n-1}}E[Z_i|X_{n-1}] \right]\\ &=E[\mu X_{n-1}]\\ &=\mu E[X_{n-1}] \end{align} \]

Since \(E[X_0]=1\), the equation yields that \(E[X_n]=\mu^n\).

​ Similarly, \(\mathrm{Var}(X_n)\) may be obtained by using the conditional variance formula

\[\mathrm{Var}(X_n)=E[\mathrm{Var}(X_n|X_{n-1})] + \mathrm{Var}(E[X_n|X_{n-1}]) \]

The law of total variance, also known as the conditional variance formula, states that if \(X\) and \(Y\) are random variables on the same probability space, and the variance of \(Y\) is finite, then

\[\mathrm{Var}(Y) = E[\mathrm{Var}(Y|X)] + \mathrm{Var}(E[Y|X]) \]

Given \(X_{n-1}\), \(X_n\) is just the sum of \(X_{n-1}\) independent random variables each having the distribution \(\{P_j,j\ge0\}\). Hence,

\[E[X_n|X_{n-1}]=\mu X_{n-1},\mathrm{Var}(X_n|X_{n-1})=X_{n-1}\sigma^2 \]

Therefore,

\[\mathrm{Var}(X_n)=\left\{\begin{align}&\sigma^2\mu^{n-1}\left(\frac{1-\mu^n}{1-\mu} \right),&&\mu\ne1\\&n\sigma^2,&&\mu=1 \end{align} \right. \]
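A simulation sketch of these two moment formulas, using a hypothetical offspring distribution \(\{P_j\}\) with \(\mu=1.1\): averaging many simulated populations reproduces \(E[X_n]=\mu^n\) and the variance formula.

```python
import numpy as np

rng = np.random.default_rng(1)
p_offspring = np.array([0.25, 0.4, 0.35])   # hypothetical P_0, P_1, P_2
mu = np.dot(np.arange(3), p_offspring)      # mean number of offspring (= 1.1)
var = np.dot((np.arange(3) - mu) ** 2, p_offspring)

def simulate(n_gen):
    x = 1                                    # X_0 = 1
    for _ in range(n_gen):
        # each of the x individuals independently produces offspring
        x = rng.choice(3, size=x, p=p_offspring).sum() if x > 0 else 0
    return x

n, reps = 5, 20_000
samples = np.array([simulate(n) for _ in range(reps)])
print(samples.mean(), mu ** n)               # E[X_n] = mu^n
print(samples.var(), var * mu**(n - 1) * (1 - mu**n) / (1 - mu))
```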

​ Let \(\pi_0\) denote the probability that the population will die out (under the assumption that \(X_0=1\)).

\[\pi_0=\lim_{n\rarr\infin}P(X_n=0 |X_0=1) \]

Suppose that \(X_1=j\). Since the \(j\) families started by these individuals evolve independently of one another, the probability that all of them eventually die out is \(\pi_0^j\), that is

\[\lim_{n\rarr\infin}P(X_n=0|X_1=j)=\pi_0^j \]

Thus the following equation holds:

\[\pi_0=\sum_{j=0}^\infin\pi_0^jP_j \]

In fact, when \(\mu>1\), it can be shown that \(\pi_0\) is the smallest positive number satisfying the equation.
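A sketch of computing \(\pi_0\) numerically for the same hypothetical offspring distribution as above: iterating the map \(\pi\mapsto\sum_j\pi^jP_j\) (the probability generating function of the offspring distribution) starting from 0 converges to the smallest positive fixed point.

```python
import numpy as np

p_offspring = np.array([0.25, 0.4, 0.35])   # hypothetical P_0, P_1, P_2 (mu = 1.1 > 1)

def pgf(x):
    """Probability generating function sum_j x^j P_j."""
    return sum(pj * x**j for j, pj in enumerate(p_offspring))

pi0 = 0.0
for _ in range(1000):
    pi0 = pgf(pi0)          # iterating from 0 converges to the smallest fixed point

print(pi0)                  # extinction probability, strictly less than 1 since mu > 1
```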

8. Time Reversible Markov Chains

Consider a stationary ergodic Markov chain (that is, an ergodic Markov chain that has been in operation for a long time) with transition probabilities \(P_{ij}\) and stationary probabilities \(\pi_i\). Suppose that starting at some time we trace the sequence of states going backward in time. That is, starting at time \(n\), consider the sequence \(X_n,X_{n-1},...\). It turns out that this sequence of states is itself a Markov chain with transition probabilities \(Q_{ij}\) defined by

\[\begin{align} Q_{ij} &= P(X_m=j|X_{m+1}=i)\\ &= \frac{P(X_m=j,X_{m+1}=i)}{P(X_{m+1}=i)}\\ &= \frac{P(X_{m+1}=i|X_m=j)P(X_m=j)}{P(X_{m+1}=i)}\\ &= \frac{P_{ji}P(X_m=j)}{P(X_{m+1}=i)}\\ &= \frac{P_{ji}\pi_j}{\pi_i} \end{align} \]

To prove that the reversed process is indeed a Markov chain, we must verify that

\[P\{X_m=j|X_{m+1}=i,X_{m+2}=i_2,... \}=P\{X_m=j|X_{m+1}=i \} \]

Since conditional independence is a symmetric relationship, the Markov property, which says that given the present state \(X_m\) the future states \(X_{m+1},X_{m+2},...\) are independent of the past state \(X_{m-1}\), also implies that given the present state \(X_m\) the past states \(X_{m-1},X_{m-2},...\) are independent of the future state \(X_{m+1}\).

Thus, the reversed process is also a Markov chain with transition probabilities given by \(Q_{ij}=\displaystyle\frac{P_{ji}\pi_j}{\pi_i}\). If \(Q_{ij}=P_{ij}\) for all \(i,j\), the Markov chain is said to be time reversible, which means that

\[P_{ij}\pi_i=P_{ji}\pi_j\tag{8.1} \]

This condition can be interpreted as follows: for all states \(i\) and \(j\), the rate at which the process goes from \(i\) to \(j\) (namely \(\pi_iP_{ij}\)) equals the rate at which it goes from \(j\) to \(i\) (namely \(\pi_jP_{ji}\)).

Theorem 8.1 A Markov chain with transition probabilities \(P_{ij}\) is said to be time reversible if for any two states \(i\) and \(j\), the equation \(P_{ij}\pi_i=P_{ji}\pi_j\) holds.

If we can find nonnegative numbers, summing to one, that satisfy Equation (8.1), then the Markov chain is time reversible and those numbers are its stationary probabilities. Indeed, summing Equation (8.1) over \(i\) yields

\[\sum_iP_{ij}\pi_i=\pi_j\sum_{i}P_{ji}=\pi_j,\quad\sum_{i}\pi_i=1 \]

which shows that the \(\pi_i\) satisfy Equation (4.1) and hence are the stationary probabilities.
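A sketch of Equation (8.1) on a classical example of a time reversible chain, a random walk on a weighted undirected graph (the weights below are made up): with \(P_{ij}=w_{ij}/\sum_kw_{ik}\) and \(\pi_i\) proportional to \(\sum_jw_{ij}\), detailed balance holds for every pair of states.

```python
import numpy as np

# Hypothetical symmetric weight matrix of an undirected graph (W[i, j] = W[j, i]).
W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])

P = W / W.sum(axis=1, keepdims=True)     # P_ij = w_ij / sum_k w_ik
pi = W.sum(axis=1) / W.sum()             # pi_i proportional to the total weight at i

# Detailed balance: pi_i P_ij = pi_j P_ji for all i, j  (Equation 8.1)
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
print(pi)
```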

Theorem 8.2 An ergodic Markov chain for which \(P_{ij}=0\) whenever \(P_{ji}=0\) is time reversible if and only if starting in state \(i\), any path back to \(i\) has the same probability as the reversed path. That is, if

\[P_{ii_1}P_{i_1i_2}\cdots P_{i_ki}=P_{ii_k}P_{i_ki_{k-1}}\cdots P_{i_1i} \]

for all states \(i,i_1,...,i_k\).

Proposition 8.1 Consider an irreducible Markov chain with transition probabilities \(P_{ij}\). If we can find positive numbers \(\pi_i,i\ge0\), summing to one, and a transition probability matrix \(Q=[Q_{ij}]\) such that

\[P_{ij}\pi_i=Q_{ji}\pi_j \]

then the \(Q_{ij}\) are the transition probabilities of the reversed chain and the \(\pi_i\) are the stationary probabilities both for the original and reversed chain.

9. Markov Chain Monte Carlo Methods

​ A Monte Carlo method usually follows the following pattern:

  1. Define a domain of possible inputs
  2. Generate inputs randomly from a probability distribution over the domain
  3. Perform a deterministic computation on the inputs
  4. Aggregate the results

​ However, when the inputs are vectors, it is difficult to generate a random vector having the specified probability mass function. In these cases, we can generate a sequence, not of independent random vectors, but of the successive states of a vector-valued Markov chain \(\mathrm{\textbf X}_1,\mathrm{\textbf X}_2,...\) whose stationary probabilities are \(P\{\mathrm{\textbf X}=\mathrm{\textbf x}_j\},j\ge1\).

Let \(b(j),j=1,2,...\) be positive numbers whose sum \(B=\displaystyle\sum_{j=1}^\infin b(j)\) is finite. The following, known as the Hastings–Metropolis algorithm, can be used to generate a time reversible Markov chain whose stationary probabilities are

\[\pi(j)=\frac{b(j)}B,\quad j=1,2,... \]

To begin, let \(Q\) be any specified irreducible Markov transition probability matrix on the integers, with \(q(i,j)\) representing the row \(i\), column \(j\) element of \(Q\). Now define a Markov chain \(\{X_n,n\ge0\}\) as follows. When \(X_n=i\), generate a random variable \(Y\) such that \(P\{Y=j\}=q(i,j),j=1,2,...\). If \(Y=j\), then set \(X_{n+1}\) equal to \(j\) with probability \(\alpha(i,j)\), and set it equal to \(i\) with probability \(1-\alpha(i,j)\). Under these conditions, it is easy to see that the sequence of states constitutes a Markov chain with transition probabilities \(P_{ij}\) given by

\[\begin{align} &P_{ij}=q(i,j)\alpha(i,j),& j\ne i\\ &P_{ii}=q(i,i)+\sum_{k\ne i}q(i,k)(1-\alpha(i,k)) \end{align} \]

​ This Markov chain will be time reversible and have stationary probabilities \(\pi(j)\) if

\[\pi(i)P_{ij}=\pi(j)P_{ji},\quad i\ne j \]

which is equivalent to

\[\pi(i)q(i,j)\alpha(i,j)=\pi(j)q(j,i)\alpha(j,i) \]

If we take \(\pi(j)=b(j)/B\) and set

\[\alpha(i,j)=\min\left(\frac{\pi(j)q(j,i)}{\pi(i)q(i,j)},1 \right) \]

then the preceding condition is satisfied: if \(\alpha(i,j)=\displaystyle\frac{\pi(j)q(j,i)}{\pi(i)q(i,j)}\), then \(\alpha(j,i)=1\); and if \(\alpha(i,j)=1\), then \(\alpha(j,i)=\displaystyle\frac{\pi(i)q(i,j)}{\pi(j)q(j,i)}\). In either case the condition holds.

Since \(\pi(j)=b(j)/B\), we see that

\[\alpha(i,j)=\min\left(\frac{b(j)q(j,i)}{b(i)q(i,j)},1 \right) \]

which shows that the value of \(B\) is not needed to define the Markov chain.
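Below is a minimal sketch of the Hastings–Metropolis chain just described; the unnormalized weights \(b(j)\) and the proposal chain \(q\) are made up. A symmetric \(\pm1\) random-walk proposal is used, so the \(q\) terms cancel in \(\alpha(i,j)\), and \(B\) is never computed.

```python
import numpy as np

rng = np.random.default_rng(2)

def b(j):
    """Unnormalized target weights on the states 1..50 (hypothetical)."""
    return np.exp(-0.1 * j) * j

def propose(i):
    """Symmetric random-walk proposal q(i, j) on {1, ..., 50}."""
    j = i + rng.choice([-1, 1])
    return j if 1 <= j <= 50 else i      # propose staying put at the boundary

x, samples = 25, []
for _ in range(100_000):
    y = propose(x)
    # acceptance probability alpha(i, j) = min(b(j) q(j, i) / (b(i) q(i, j)), 1);
    # q is symmetric here, so the q terms cancel
    if rng.random() < min(b(y) / b(x), 1.0):
        x = y
    samples.append(x)

# Empirical frequencies approximate pi(j) = b(j) / B without ever knowing B.
freq = np.bincount(samples, minlength=51)[1:] / len(samples)
target = np.array([b(j) for j in range(1, 51)])
print(np.round(freq[:5], 4), np.round(target[:5] / target.sum(), 4))
```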

10. Markov Decision Processes

Consider a process with \(M\) possible states; after each transition, an action must be chosen. The next state of the system is determined according to the transition probabilities \(P_{ij}(a)\). Let \(X_n\) denote the process state at time \(n\), and let \(a_n\) denote the action chosen at time \(n\). Then

\[P\{X_{n+1}=j|X_0,a_0,X_1,a_1,...,X_n=i,a_n=a \}=P_{ij}(a) \]

By a policy, we mean a rule for choosing actions. We restrict attention to rules under which the action chosen at time \(n\) depends only on the time \(n\) and the state of the process at time \(n\). On the other hand, we allow the policy to be "randomized", in that it may choose actions according to a probability distribution.

Under any given policy \(\beta\), where \(\beta_i(a)\) denotes the probability that \(\beta\) chooses action \(a\) when the process is in state \(i\), the sequence of states constitutes a Markov chain with transition probabilities \(P_{ij}(\beta)\) defined by

\[\begin{align} P_{ij}(\beta) &= \sum_{a}P_{ij}(a)\beta_i(a) \end{align} \]

Let us suppose for every choice of a policy \(\beta\), the resultant Markov chain is irreducible.

​ For any policy \(\beta\), let \(\pi_{ia}\) denote the stationary probability that the process will be in state \(i\) and action \(a\) will be chosen if policy \(\beta\) is employed.

\[\pi_{ia}=\lim_{n\rarr\infin}P_\beta\{X_n=i,a_n=a \} \]

The vector \(\pi=(\pi_{ia})\) must satisfy

  1. \(\pi_{ia}\ge0\) for all \(i,a\)
  2. \(\sum_i\sum_a\pi_{ia}=1\)
  3. \(\sum_a\pi_{ja}=\sum_i\sum_a\pi_{ia}P_{ij}(a)\) for all \(j\)

It turns out that for any policy \(\beta\), a vector \(\pi\) satisfying these three conditions exists. The converse is also true: suppose that \(\pi=(\pi_{ia})\) is a vector that satisfies the three conditions, and let the policy \(\beta=(\beta_i(a))\) be

\[\beta_i(a)=\frac{\pi_{ia}}{\sum_a\pi_{ia}} \]

we see that \((P_{ia})\), the vector of limiting probabilities of being in state \(i\) and choosing action \(a\) under this policy, is the unique solution of

\[\begin{align} &P_{ia}\ge0,\\ \sum_i\sum_a&P_{ia}=1,\\ &P_{ja}=\sum_i\sum_{a'}P_{ia'}P_{ij}(a')\frac{\pi_{ja}}{\sum_a\pi_{ja}} \end{align} \]

Hence, to show that \(P_{ia}=\pi_{ia}\), we need only show that

\[\begin{align} &P_{ia}\ge0,\\ \sum_i\sum_a&P_{ia}=1,\\ &\pi_{ja}=\sum_i\sum_{a'}\pi_{ia'}P_{ij}(a')\frac{\pi_{ja}}{\sum_a\pi_{ja}} \end{align} \]

The third equation is equivalent to

\[\sum_a\pi_{ja}=\sum_i\sum_{a'}\pi_{ia'}P_{ij}(a') \]

Thus we have shown that a vector \(\pmb\pi=(\pi_{ia})\) will satisfy the three conditions if and only if there exists a policy \(\pmb\beta\) such that \(\pi_{ia}\) is equal to the steady-state probability of being in state \(i\) and choosing action \(a\) when \(\pmb\beta\) is used. In fact, the policy is defined by \(\beta_i(a)=\pi_{ia}/\sum_a\pi_{ia}\).

​ The above conclusion is quite important in the determination of "optimal" policies. Consider that a reward \(R(i,a)\) is earned whenever action \(a\) is chosen in state \(i\). Let \(R(X_i,a_i)\) denote the reward earned at time \(i\). The expected average reward per time unit under policy \(\pmb\beta\) can be expressed as

\[\lim_{n\rarr\infin}E_{\pmb\beta}\left[\frac{\sum_{i=1}^nR(X_i,a_i)}n \right]=\lim_{n\rarr\infin} E[R(X_n,a_n)] \]

The limiting expected reward at time \(n\) equals

\[\lim_{n\rarr\infin}E[R(X_n,a_n)]=\sum_i\sum_a\pi_{ia}R(i,a) \]

Hence, the problem of determining the policy that maximizes the expected average reward is

\[\begin{align} \max\quad &\sum_i\sum_a\pi_{ia}R(i,a)\\ \mathrm{subject\ to}\quad &\pi_{ia}\ge0,\quad\mathrm{for\ all}\ i,a,\\ &\sum_i\sum_a\pi_{ia}=1,\\ &\sum_a\pi_{ja} = \sum_i\sum_a\pi_{ia}P_{ij}(a),\quad \mathrm{for\ all}\ j \end{align} \]
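A sketch of solving this linear program for a small made-up MDP (two states, two actions) with scipy.optimize.linprog; the decision variables are the \(\pi_{ia}\), and the optimal policy is recovered as \(\beta_i(a)=\pi_{ia}/\sum_a\pi_{ia}\).

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical MDP: 2 states, 2 actions, transition probabilities P[a][i][j], rewards R[i][a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # action 0
              [[0.5, 0.5], [0.6, 0.4]]])    # action 1
R = np.array([[1.0, 3.0],                   # R[i, a]
              [0.0, 2.0]])
nS, nA = 2, 2
idx = lambda i, a: i * nA + a               # flatten (i, a) into a variable index

# Objective: maximize sum_{i,a} pi_{ia} R(i, a)  ->  minimize the negative.
c = -np.array([R[i, a] for i in range(nS) for a in range(nA)])

# Equality constraints: normalization, and balance sum_a pi_{ja} = sum_{i,a} pi_{ia} P_{ij}(a).
A_eq, b_eq = [np.ones(nS * nA)], [1.0]
for j in range(nS):
    row = np.zeros(nS * nA)
    for a in range(nA):
        row[idx(j, a)] += 1.0
    for i in range(nS):
        for a in range(nA):
            row[idx(i, a)] -= P[a, i, j]
    A_eq.append(row)
    b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
pi = res.x.reshape(nS, nA)
beta = pi / pi.sum(axis=1, keepdims=True)    # beta_i(a) = pi_{ia} / sum_a pi_{ia}
print(pi.round(4), beta.round(4), -res.fun)  # stationary probs, optimal policy, average reward
```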

11. Hidden Markov Chains

​ Let \(\{X_n,n=1,2,...\}\) be a Markov chain with transition probabilities \(P_{i,j}\) and initial state probabilities \(p_i=P(X_1=i),i\ge0\). Suppose that there is a finite set \(\mathscr I\) of signals, and that a signal from \(\mathscr I\) is emitted each time the Markov chain enters a state. Further, suppose that when the Markov chain enters state \(j\) then, independently of previous Markov chain states, the signal emitted is \(s\) with probability \(p(s|j),\displaystyle\sum_{s\in\mathscr I}p(s|j)=1\).

A model of the preceding type, in which the sequence of signals \(S_1,S_2,...\) is observed while the sequence of underlying Markov chain states \(X_1,X_2,...\) is unobserved, is called a hidden Markov chain. That is, if \(S_n\) represents the \(n\)th signal emitted, then

\[\begin{align} P(S_1=s|X_1=j)&=p(s|j),\\ P(S_n=s|S_1,X_1,...,S_{n-1},X_{n-1},X_n=j) &=p(s|j) \end{align} \]

Let \(S^n=(S_1,...,S_n)\) denote the random vector of the first \(n\) signals. For a fixed sequence of signals \(s_1,...,s_n\), let \(\mathrm{\pmb s}_k=(s_1,...,s_k),k\le n\). To begin, we determine the conditional probability of the Markov chain state at time \(n\) given that \(S^n=\mathrm{\pmb s}_n\). Let

\[F_n(j) = P(S^n=\mathrm{\pmb s}_n,X_n=j) \]

Then

\[\begin{align} &F_n(j)\\ =\ &P(S^{n-1}=\mathrm{\pmb s}_{n-1},S_n=s_n, X_n=j)\\ =\ &\sum_i P(S^{n-1}=\mathrm{\pmb s}_{n-1},X_{n-1}=i,S_n=s_n, X_n=j)\\ =\ &\sum_i P(S_n=s_n, X_n=j\left|S^{n-1}=\mathrm{\pmb s}_{n-1},X_{n-1}=i \right.)F_{n-1}(i)\\ =\ &\sum_i P(S_n=s_n, X_n=j\left|X_{n-1}=i \right.)F_{n-1}(i)\\ =\ &\sum_iP_{ij}P(S_n=s_n|X_{n}=j,X_{n-1}=i)F_{n-1}(i)\\ =\ &\sum_iP_{ij}P(S_n=s_n|X_{n}=j)F_{n-1}(i)\\ =\ &p(s_n|j)\sum_iP_{ij}F_{n-1}(i)\tag{11.1} \end{align} \]

In case the manipulation of conditional probabilities is unclear, in the preceding we used that

\[P(AB|C) = P(B|C)P(A|BC) \]

So

\[\begin{align} &P(S_n=s_n,X_n=j|X_{n-1}=i)\\ =\ &P(X_n=j|X_{n-1}=i)P(S_n=s_n|X_n=j,X_{n-1}=i)\\ =\ &P_{ij}\ p(s_n|j) \end{align} \]

Based on \(F_{n-1}(i)\), we can thus recursively determine \(F_n(i)\). Starting with \(F_1(i)=p_i\,p(s_1|i)\), we compute \(F_2\), then \(F_3\), and so on up to \(F_n\), which yields \(P\{S^n=\mathrm{\pmb s}_n\}=\displaystyle\sum_jF_n(j)\).

​ This computation of \(P\{S^n=\mathrm{\pmb s}_n\}\) by recursively determining the functions \(F_k(i)\) is known as the forward approach. There is also a backward approach, which is based on the quantities \(B_k(i)\) defined by

\[B_k(i)=P\{S_{k+1}=s_{k+1},...,S_n=s_n|X_k=i \} \]

A recursive formula for \(B_k(i)\) can be obtained by conditioning on \(X_{k+1}\).

\[\begin{align} &B_k(i)\\ = &P\{S_{k+1}=s_{k+1},...,S_n=s_n|X_k=i \}\\ = &\sum_jP\{S_{k+1}=s_{k+1},...,S_n=s_n|X_k=i,X_{k+1}=j\}P_{ij}\\ = &\sum_jP\{S_{k+1}=s_{k+1},...,S_n=s_n|X_{k+1}=j\}P_{ij}\\ = &\sum_{j}P(S_{k+1}=s_{k+1}|X_{k+1}=j)\\ &\times P\{S_{k+2}=s_{k+2},...,S_n=s_n|S_{k+1}=s_{k+1},X_{k+1}=j \}P_{ij}\\ = &\sum_jp(s_{k+1}|j)B_{k+1}(j)P_{ij}\tag{11.2} \end{align} \]

​ Starting with

\[B_{n-1}(i) = P\{S_n=s_n|X_{n-1}=i\}=\sum_jP_{ij}p(s_n|j) \]

we would then use Equation (11.2) to determine the function \(B_{n-2}(i)\), then recursively down to \(B_1(i)\). This would then yield \(P\{S^n=\mathrm{\pmb s}_n\}\) via

\[\begin{align} &P(S^n=\mathrm{\pmb s}_n)\\ = &\sum_iP\{S_1=s_1,...,S_n=s_n|X_1=i \}p_i\\ = &\sum_iP\{S_1=s_1|X_1=i\}\times P\{S_2=s_2,...,S_n=s_n|S_1=s_1,X_1=i \}p_i\\ = &\sum_ip(s_1|i)P\{S_2=s_2,...,S_n=s_n|X_1=i \}p_i\\ = &\sum_ip(s_1|i)B_1(i)p_i \end{align} \]

​ Another approach to obtaining \(P\{S^n=\mathrm{\pmb s}_n\}\) is to combine the forward and backward approaches. Suppose that for some \(k\) we have computed both functions \(F_k(j)\) and \(B_k(j)\). Because

\[\begin{align} &P\{S^n=\mathrm{\pmb s}_n,X_k=j \}\\ =\ &P\{S^k=\mathrm{\pmb s}_k,X_k=j \}\\ \times\ &P\{S_{k+1}=s_{k+1},...,S_n=s_n|S^k=\mathrm{\pmb s}_k,X_k=j \}\\ =\ &P\{S^k=\mathrm{\pmb s}_k,X_k=j \}P\{S_{k+1}=s_{k+1},...,S_n=s_n|X_k=j \}\\ =\ &F_k(j)B_k(j) \end{align} \]

we see that

\[P(S^n=\mathrm{\pmb s}_n) =\sum_jF_k(j)B_k(j) \]

Now we may simultaneously compute the sequence of forward functions, starting with \(F_1\), as well as the sequence of backward functions, starting at \(B_{n-1}\). The parallel computations can then be stopped once we have computed both \(F_{k}\) and \(B_k\) at some \(k\).
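A sketch of the forward recursion (11.1) and the backward recursion (11.2) for a small hypothetical hidden Markov chain; the three ways of computing \(P\{S^n=\mathrm{\pmb s}_n\}\) (summing \(F_n\), propagating \(B\) down to \(B_1\), or combining \(F_k\) and \(B_k\) at some \(k\)) all agree.

```python
import numpy as np

# Hypothetical hidden Markov chain: 2 states, 2 signals.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])           # transition probabilities P_ij
p0 = np.array([0.5, 0.5])            # initial state probabilities p_i
E = np.array([[0.9, 0.1],            # emission probabilities p(s | j); rows are states
              [0.2, 0.8]])
signals = [0, 1, 1, 0]               # an observed signal sequence s_1,...,s_n (0-indexed below)
n = len(signals)

# Forward: F[k] holds F_{k+1}(j) = P(S^{k+1} = s_{k+1}, X_{k+1} = j), via Equation (11.1).
F = np.zeros((n, 2))
F[0] = p0 * E[:, signals[0]]
for k in range(1, n):
    F[k] = E[:, signals[k]] * (F[k - 1] @ P)

# Backward: B[k] holds B_{k+1}(i); B[n-1] = 1 by convention (no signals left to emit).
B = np.ones((n, 2))
for k in range(n - 2, -1, -1):
    B[k] = P @ (E[:, signals[k + 1]] * B[k + 1])   # Equation (11.2)

print(F[-1].sum())                              # P(S^n = s_n) from the forward pass
print((p0 * E[:, signals[0]] * B[0]).sum())     # same, from the backward pass
print((F[1] * B[1]).sum())                      # same, combining F_k and B_k at k = 2
```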

11.1 Predicting the States

Suppose that the first \(n\) observed signals are \(s_1,...,s_n\), and that given this data, we want to predict the first \(n\) states of the Markov chain. The best predictor depends on what we are trying to accomplish. If our objective is to maximize the expected number of correctly predicted states, then for each \(k=1,...,n\) we compute \(P(X_k=j|S^n=\mathrm{\pmb s}_n)\) and let the value of \(j\) that maximizes this quantity be the predictor of \(X_k\).

\[\begin{align} P(X_k=j|S^n=\mathrm{\pmb s}_n) &= \frac{P(S^n=\mathrm{\pmb s}_n,X_k=j)}{P(S^n=\mathrm{\pmb s}_n)}\\ &= \frac{F_k(j)B_k(j)}{\sum_iF_k(i)B_k(i)} \end{align} \]

If instead we consider the sequence of states as a single entity, then our objective is to choose the sequence of states whose conditional probability, given the sequence of signals, is maximal. Letting \(\mathrm{\pmb X}_k=(X_1,...,X_k)\) be the vector of the first \(k\) states, the problem of interest is to find the sequence of states \(i_1,...,i_n\) that maximizes \(P\{\mathrm{\pmb X}_n=(i_1,...,i_n)|S^n=\mathrm{\pmb s}_n\}\). Because

\[P\{\mathrm{\pmb X}_n=(i_1,...,i_n)|\mathrm{\pmb S}^n=\mathrm{\pmb s}_n \} = \frac{P\{\mathrm{\pmb X}_n=(i_1,...,i_n),\mathrm{\pmb S}^n=\mathrm{\pmb s}_n\} }{P(\mathrm{\pmb S}^n=\mathrm{\pmb s}_n)} \]

this is equivalent to finding the sequence \(i_1,...,i_n\) that maximizes \(P\{\mathrm{\pmb X}_n=(i_1,...,i_n),\mathrm{\pmb S}^n=\mathrm{\pmb s}_n\}\).

​ Let, for \(k\le n\),

\[V_k(j)=\max_{i_1,...,i_{k-1}}P\{\mathrm{\pmb X}_{k-1}=(i_1,...,i_{k-1}),X_k=j,\mathrm{\pmb S}^k=\mathrm{\pmb s}_k \} \]

To derive a recursion for \(V_k(j)\), note that

\[\begin{align} &V_k(j)\\ =\ &\max_i\max_{i_1,...,i_{k-2}}P\{\mathrm{\pmb X}_{k-2}=(i_1,...,i_{k-2}),X_{k-1}=i,\mathrm{\pmb S}^{k-1}=\mathrm{\pmb s}_{k-1},\\&X_k=j,S_k=s_k\}\\ =\ &\max_i\max_{i_1,...,i_{k-2}}P\left\{\mathrm{\pmb X}_{k-2}=(i_1,...,i_{k-2}),X_{k-1}=i,\mathrm{\pmb S}^{k-1}=\mathrm{\pmb s}_{k-1} \right\}\\ \times\ &P\left\{X_k=j,S_k=s_k\left|\mathrm{\pmb X}_{k-2}=(i_1,...,i_{k-2}),X_{k-1}=i,\mathrm{\pmb S}^{k-1}=\mathrm{\pmb s}_{k-1} \right.\right\}\\ =\ &\max_i\max_{i_1,...,i_{k-2}}P\left\{\mathrm{\pmb X}_{k-2}=(i_1,...,i_{k-2}),X_{k-1}=i,\mathrm{\pmb S}^{k-1}=\mathrm{\pmb s}_{k-1} \right\}\\ \times\ &P\left\{X_k=j,S_k=s_k|X_{k-1}=i\right\}\\ =\ &\max_i P_{ij}\ p(s_k|j)\ V_{k-1}(i)\\ =\ &p(s_k|j)\max_iP_{ij}V_{k-1}(i) \end{align} \]

​ To obtain the maximizing sequence of states, we work in the reverse direction. Let \(j_n\) be the value that maximizes \(V_n(j)\). Also, for \(k<n\), let \(i_k(j)\) be a value of \(i\) that maximizes \(P_{ij}V_k(i)\). Then

\[\begin{align} &\max_{i_1,...,i_n}P\left\{\mathrm{\pmb X}_n=(i_1,...,i_n),\mathrm{\pmb S}^n=\mathrm{\pmb s}_n \right\}\\ =\ &V_n(j_n)\\ =\ &\max_{i_1,...,i_{n-1}}P\left\{\mathrm{\pmb X}_n=(i_1,...,i_{n-1},j_n),\mathrm{\pmb S}^n=\mathrm{\pmb s}_n \right\}\\ =\ &p(s_n|j_n)\max_iP_{i,j_n}V_{n-1}(i)\\ =\ &p(s_n|j_n)P_{i_{n-1}(j_n)j_n}V_{n-1}(i_{n-1}(j_n)) \end{align} \]

Thus, \(i_{n-1}(j_n)\) is the next-to-last state of the maximizing sequence. Continuing in this manner, the state before that is \(i_{n-2}(i_{n-1}(j_n))\), and so on.

​ The preceding approach to finding the most likely sequence of states given a prescribed sequence of signals is known as the Viterbi Algorithm.
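A sketch of the Viterbi recursion, reusing the hypothetical hidden Markov chain from the previous code block: compute \(V_k(j)\) forward while recording the maximizing predecessors \(i_k(j)\), then backtrack from \(j_n\).

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])           # same hypothetical chain as above
p0 = np.array([0.5, 0.5])
E = np.array([[0.9, 0.1],
              [0.2, 0.8]])
signals = [0, 1, 1, 0]
n = len(signals)

# V[k, j] = max over i_1..i_k of P(X_1,...,X_k, X_{k+1} = j, S^{k+1} = s_{k+1})  (0-indexed)
V = np.zeros((n, 2))
back = np.zeros((n, 2), dtype=int)   # back[k, j] = maximizing predecessor i_k(j)
V[0] = p0 * E[:, signals[0]]
for k in range(1, n):
    scores = V[k - 1][:, None] * P            # scores[i, j] = V_{k-1}(i) P_ij
    back[k] = scores.argmax(axis=0)
    V[k] = E[:, signals[k]] * scores.max(axis=0)

# Backtrack: start from the j maximizing V_n(j), then follow the recorded predecessors.
states = [int(V[-1].argmax())]
for k in range(n - 1, 0, -1):
    states.append(int(back[k, states[-1]]))
states.reverse()
print(states, V[-1].max())           # most likely state sequence and its joint probability
```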