Chapter 6: Hidden Markov Models. The Stanford tutorial is arguably the best HMM tutorial I have read. I wish reading the English original were as easy as reading Chinese; given the cultural gap, even when I know every word, it can still be hard to follow. So I am re-summarizing it here myself~

### Markov chains

states | hot (1) | cold (2) | warm (3) |
------ | -------- | -------- | -------- |
hot (1) | $a_{11}$ | $a_{12}$ | $a_{13}$ |
cold (2) | $a_{21}$ | $a_{22}$ | $a_{23}$ |
warm (3) | $a_{31}$ | $a_{32}$ | $a_{33}$ |

start (0): $\{a_{01},a_{02},a_{03}\}$

end (4): $\{a_{14},a_{24},a_{34}\}$

$Q$ is the set of states and $A$ is the state-transition matrix, where $a_{ij}$ is the probability of moving from state $q_i$ to state $q_j$, so $\sum_{j=1}^N a_{ij}=1$ for $i=1,2,\dots,N$.

$q_0$ and $q_F$ are the initial and final states.

1. The limited horizon assumption

Markov assumption: $P(q_i\mid q_1\dots q_{i-1})=P(q_i\mid q_{i-1})$

1. Stationary process assumption

The conditional distribution over the next state given the current state does not change over time; in other words, there is a single transition matrix shared by every time step.

$P(q_t\mid q_{t-1})=P(q_2\mid q_1),\quad t\in\{2,\dots,T\}$
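Both assumptions are easy to see in code. Below is a minimal Python sketch of such a chain; the transition matrix and initial distribution are made-up numbers for the hot/cold/warm states above, not values from the tutorial.

```python
import random

# Made-up transition probabilities for the hot/cold/warm states above.
STATES = ["hot", "cold", "warm"]
A = {"hot":  {"hot": 0.6, "cold": 0.1, "warm": 0.3},
     "cold": {"hot": 0.1, "cold": 0.7, "warm": 0.2},
     "warm": {"hot": 0.3, "cold": 0.3, "warm": 0.4}}
PI = {"hot": 0.5, "cold": 0.2, "warm": 0.3}  # the start (0) row, a_{0j}

def sample_chain(length, seed=0):
    """Sample a state sequence q_1..q_length; each row of A sums to 1."""
    rng = random.Random(seed)
    seq = [rng.choices(STATES, weights=[PI[s] for s in STATES])[0]]
    while len(seq) < length:
        row = A[seq[-1]]  # stationarity: the same matrix at every time step
        seq.append(rng.choices(STATES, weights=[row[s] for s in STATES])[0])
    return seq
```

The limited horizon assumption shows up as the fact that only `seq[-1]` is consulted when drawing the next state; stationarity shows up as the same `A` being used at every step.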

### Hidden Markov models

Imagine that you are a climatologist in the year 2799 studying the history of global warming. You cannot find any records of the weather in Baltimore, Maryland, for the summer of 2007, but you do find Jason Eisner’s diary, which lists how many ice creams Jason ate every day that summer. Our goal is to use these observations to estimate the temperature every day. We’ll simplify this weather task by assuming there are only two kinds of days: cold (C) and hot (H). So the Eisner task is as follows:

Given a sequence of observations O, each observation an integer corresponding to the number of ice creams eaten on a given day, figure out the correct ‘hidden’ sequence Q of weather states (H or C) which caused Jason to eat the ice cream.

(1) First-order Markov assumption:

Markov assumption: $P(q_i\mid q_1\dots q_{i-1})=P(q_i\mid q_{i-1})$

(2) Output independence: conditioned on the state $q_i$, the observation $o_i$ depends only on $q_i$ and is independent of all other states and observations.

Output independence: $P(o_i\mid q_1,\dots,q_i,\dots,q_T,o_1,\dots,o_i,\dots,o_T)=P(o_i\mid q_i)$

1. Problem 1: computing the likelihood.
• The forward/backward algorithm
1. Problem 2: decoding, also called prediction. Given the model parameters and an observation sequence, find the most probable state sequence.
• The Viterbi algorithm
1. Problem 3: learning. Given an observation sequence, estimate the model parameters that maximize the probability of the observation sequence under the model.
• Maximum-likelihood estimation, Baum-Welch, the EM algorithm

### Likelihood computation: the forward algorithm

#### If the states are known, this is supervised learning

$P(O|Q)=\prod_{i=1}^TP(o_i|q_i)$

#### If the states cannot be observed, this is unsupervised learning

$P(O,Q)=P(O|Q)\times P(Q)=\prod_{i=1}^TP(o_i|q_i)\times \prod_{i=1}^TP(q_i|q_{i-1})\tag{9.10}$

$P(O)=\sum_QP(O,Q)=\sum_QP(O|Q)P(Q)\tag{9.12}$

#### The forward algorithm

The forward algorithm is a kind of dynamic programming algorithm, that is, an algorithm that uses a table to store intermediate values as it builds up the probability of the observation sequence. The forward algorithm computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence, but it does so efficiently by implicitly folding each of these paths into a single forward trellis.

$\alpha_t(j)=P(o_1,o_2,...,o_t,q_t=j|\lambda)\tag{9.13}$

$\alpha_t(j)=[\sum_{i=1}^N\alpha_{t-1}(i)a_{ij}]b_j(o_t)$

$\alpha_2(1)=\alpha_1(1)\times P(H|H)\times P(1|H) + \alpha_1(2)\times P(H|C)\times P(1|H)$

1. Initialization step:

$\alpha_1(j)=a_{0j}b_j(o_1),1\le j \le N$

1. Recursion (since states 0 and F are non-emitting):

$\alpha_t(j)=[\sum_{i=1}^N\alpha_{t-1}(i)a_{ij}]b_j(o_t)$

forward[s,t] = $\sum_{s'=1}^N$ forward[s',t-1] * $a_{s',s}$ * $b_s(o_t)$

1. Termination:

$P(O|\lambda)=\alpha_T(q_F)=\sum_{i=1}^N\alpha_T(i)a_{iF}$

forward[$q_F$,T] = $\sum_{s=1}^N$ forward[s,T] * $a_{s,q_F}$

return forward[$q_F$,T], the probability of being in state $q_F$ at time T. It feels like this should be time T+1, but since $q_F$ is non-emitting it is reached right after the last observation and is conventionally indexed at T.
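The three steps above (initialization, recursion, termination) fit in a few lines of Python. This is only a sketch: the two-state ice-cream HMM below and all of its probabilities are made-up numbers for illustration, with the non-emitting start and end states folded into `A0` and `AF`.

```python
# Toy two-state ice-cream HMM; all probabilities are made up.
STATES = ["H", "C"]
A0 = {"H": 0.8, "C": 0.2}            # a_{0j}: start -> state
AF = {"H": 0.1, "C": 0.1}            # a_{iF}: state -> end
A = {"H": {"H": 0.6, "C": 0.3},      # each row sums to 1 together with a_{iF}
     "C": {"H": 0.4, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},  # b_j(o): P(ice creams eaten | state)
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    """P(O | lambda) via the forward trellis alpha[t][j]."""
    T = len(obs)
    alpha = [{j: A0[j] * B[j][obs[0]] for j in STATES}]        # initialization
    for t in range(1, T):                                      # recursion
        alpha.append({j: sum(alpha[t-1][i] * A[i][j] for i in STATES)
                         * B[j][obs[t]] for j in STATES})
    return sum(alpha[T-1][i] * AF[i] for i in STATES)          # termination
```

A handy sanity check is that the result equals the brute-force sum of $P(O,Q)$ over all $N^T$ state paths, which is exactly the computation the trellis avoids.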

### The prediction problem: the Viterbi algorithm

Decoding: The Viterbi Algorithm

$v_t(j)=\max_{q_0,q_1,\dots,q_{t-1}}P(q_0q_1\dots q_{t-1},o_1,o_2,\dots,o_t,q_t=j\mid\lambda)$

$v_t(j)=\max_{i=1}^N v_{t-1}(i)\,a_{ij}\,b_j(o_t)$

$v_2(2)=\max(v_1(1)\cdot a_{12}\cdot b_2(1),\; v_1(2)\cdot a_{22}\cdot b_2(1))$

$v_2(2)=\max(v_1(1)\cdot P(H|C)\cdot P(1|H),\; v_1(2)\cdot P(H|H)\cdot P(1|H))$

The Viterbi backtrace.

1. Initialization:

$v_1(j)=a_{0j}b_j(o_1)\tag{9.20}$

$bt_1(j)=0\tag{9.21}$

1. Recursion (recall that states 0 and $q_F$ are non-emitting):

$v_t(j)=\max_{i=1}^Nv_{t-1}(i)\,a_{ij}\,b_j(o_t);\quad 1\le j \le N,\ 1\le t\le T\tag{9.22}$

$bt_t(j)=\mathop{\arg\max}_{i=1}^Nv_{t-1}(i)\,a_{ij}\,b_j(o_t);\quad 1\le j \le N,\ 1\le t\le T\tag{9.23}$

$bt_t(j)$ is, among the $N$ paths $(q_1,q_2,\dots,q_{t-1})$ that end in state $j$ at time $t$, the state at time $t-1$ on the most probable one.

1. Termination:

The best score:

$P^*=v_T(q_F)=\max_{i=1}^Nv_T(i)\,a_{iF}\tag{9.24}$

The start of backtrace:

$q_T^*=bt_T(q_F)=\mathop{\arg\max}_{i=1}^Nv_T(i)\,a_{iF}\tag{9.25}$
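A small Python sketch of the recursion and backtrace; the two-state ice-cream HMM and all of its probabilities are made up for illustration, and `bt` plays the role of the backpointers $bt_t(j)$.

```python
# Toy two-state ice-cream HMM; all probabilities are made up.
STATES = ["H", "C"]
A0 = {"H": 0.8, "C": 0.2}
AF = {"H": 0.1, "C": 0.1}
A = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    """Most probable state path and its probability P*."""
    T = len(obs)
    v = [{j: A0[j] * B[j][obs[0]] for j in STATES}]    # initialization
    bt = [{j: None for j in STATES}]                   # bt_1(j) = 0
    for t in range(1, T):                              # recursion
        v.append({}); bt.append({})
        for j in STATES:
            best = max(STATES, key=lambda i: v[t-1][i] * A[i][j])
            bt[t][j] = best                            # argmax backpointer
            v[t][j] = v[t-1][best] * A[best][j] * B[j][obs[t]]
    last = max(STATES, key=lambda i: v[T-1][i] * AF[i])
    p_star = v[T-1][last] * AF[last]                   # termination
    path = [last]
    for t in range(T - 1, 0, -1):                      # follow backpointers
        path.append(bt[t][path[-1]])
    return p_star, path[::-1]
```

The only differences from the forward algorithm are the `max` replacing the `sum` and the backpointers kept for the backtrace.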

### Learning: HMM training with the forward-backward algorithm

Learning: Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.

#### First consider a visible Markov chain

$a_{ij}=\dfrac{C(i\rightarrow j)}{\sum_{q\in Q}C(i\rightarrow q)}\tag{9.26}$
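For a fully observed state sequence, eq. 9.26 is plain relative-frequency counting. A minimal sketch (the function name is my own):

```python
from collections import Counter

def mle_transitions(state_seq):
    """Count-based MLE of a_{ij} (eq. 9.26) for a fully observed chain."""
    pairs = Counter(zip(state_seq, state_seq[1:]))  # C(i -> j)
    outgoing = Counter(state_seq[:-1])              # C(i -> anything)
    return {(i, j): c / outgoing[i] for (i, j), c in pairs.items()}
```

For example, the sequence H, H, C, H has two transitions out of H (one to H, one to C) and one out of C, giving $a_{HH}=a_{HC}=0.5$ and $a_{CH}=1$.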

#### Hidden Markov models: the Baum-Welch algorithm

The Baum-Welch algorithm uses two neat intuitions to solve this problem. The first idea is to iteratively estimate the counts. We will start with an estimate for the transition and observation probabilities and then use these estimated probabilities to derive better and better probabilities. The second idea is that we get our estimated probabilities by computing the forward probability for an observation and then dividing that probability mass among all the different paths that contributed to this forward probability.

##### The backward algorithm

$\beta_t(i)=P(o_{t+1},o_{t+2},\dots,o_T|q_t=i,\lambda)\tag{9.27}$

1. Initialization:

$\beta_T(i)=a_{iF},1\le i \le N$

1. Recursion (again, since states 0 and $q_F$ are non-emitting):

$\beta_t(i)=\sum_{j=1}^N\beta_{t+1}(j)a_{ij}b_j(o_{t+1}),\quad 1\le i \le N,\ 1\le t< T$

1. Termination:

$P(O|\lambda)=\alpha_T(q_F)=\beta_1(q_0)=\sum_{j=1}^N\beta_1(j)a_{0j}b_j(o_1)$
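The backward trellis runs right to left. Below is a sketch on a made-up two-state ice-cream HMM; the termination step recovers the same $P(O|\lambda)$ that the forward algorithm computes left to right.

```python
# Toy two-state ice-cream HMM; all probabilities are made up.
STATES = ["H", "C"]
A0 = {"H": 0.8, "C": 0.2}
AF = {"H": 0.1, "C": 0.1}
A = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def backward(obs):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    T = len(obs)
    beta = [None] * T
    beta[T-1] = {i: AF[i] for i in STATES}             # init: beta_T(i)=a_{iF}
    for t in range(T - 2, -1, -1):                     # recursion, right to left
        beta[t] = {i: sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                          for j in STATES) for i in STATES}
    # termination: P(O|lambda) = sum_j a_{0j} b_j(o_1) beta_1(j)
    return sum(A0[j] * B[j][obs[0]] * beta[0][j] for j in STATES)
```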

#### Under the initial model parameters, estimate new parameters from relative frequencies (the law of large numbers), i.e., maximum-likelihood estimation

$\hat a_{ij}=\dfrac{\text{expected number of transitions from state }i\text{ to state }j}{\text{expected number of transitions from state }i}$

$\zeta_t(i,j)=P(q_t=i,q_{t+1}=j|O,\lambda)\tag{9.32}$

$\text{not-quite-}\zeta_t(i,j)=P(q_t=i,q_{t+1}=j,O|\lambda)\tag{9.33}$

$\alpha_t(i)$ and $\beta_t(j)$ are the quantities defined in the forward and backward algorithms. Combining them with the transition and emission probabilities between $t$ and $t+1$:

$\text{not-quite-}\zeta_t(i,j)=\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)\tag{9.34}$

$P(X|Y,Z)=\dfrac{P(X,Y,Z)}{P(Y,Z)}=\dfrac{P(X,Y,Z)}{P(Z)P(Y|Z)}=\dfrac{P(X,Y|Z)}{P(Y|Z)}\tag{9.35}$

$P(q_t=i,q_{t+1}=j|O,\lambda)=\dfrac{P(q_t=i,q_{t+1}=j,O|\lambda)}{P(O|\lambda)}=\dfrac{\text{not-quite-}\zeta_t(i,j)}{P(O|\lambda)}\tag{9.36}$

$P(O|\lambda)=\alpha_T(q_F)=\beta_1(q_0)=\sum_{j=1}^N\alpha_t(j)\beta_t(j)\tag{9.37}$

$\zeta_t(i,j)=\dfrac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\alpha_T(q_F)}$

$\zeta_t$ refers to a single time step $t$, so the estimate of $a_{ij}$ is the total count of $i\to j$ transitions over all time steps, divided by the total count of transitions out of $i$ (to any $k=1,2,\dots,N$):

$\hat a_{ij}=\dfrac{\sum_{t=1}^{T-1}\zeta_t(i,j)}{\sum_{t=1}^{T-1}\sum_{k=1}^N\zeta_t(i,k)}\tag{9.38}$

$\gamma_t(j)=P(q_t=j|O,\lambda)\tag{9.40}$

$\gamma_t(j)=\dfrac{P(q_t=j,O|\lambda)}{P(O|\lambda)}\tag{9.41}$

$\gamma_t(j)=\dfrac{\alpha_t(j)\beta_t(j)}{P(O|\lambda)}$

$\hat b_j(v_k)=\dfrac{\sum_{t=1,\,o_t=v_k}^T\gamma_t(j)}{\sum_{t=1}^T\gamma_t(j)}$
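Putting the pieces together, one iteration computes $\gamma$ and $\zeta$ from the current parameters (E step) and turns the expected counts into new estimates $\hat a_{ij}$ and $\hat b_j(v_k)$ (M step). This is only a sketch on a made-up toy ice-cream HMM; the start and end probabilities are held fixed for simplicity, which a real implementation need not do.

```python
# Toy two-state ice-cream HMM; all probabilities are made up.
STATES = ["H", "C"]
VOCAB = [1, 2, 3]
A0 = {"H": 0.8, "C": 0.2}
AF = {"H": 0.1, "C": 0.1}
A = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def baum_welch_step(obs):
    """One EM iteration; returns re-estimated (a_hat, b_hat)."""
    T = len(obs)
    # E step, part 1: forward and backward trellises under current parameters.
    alpha = [{j: A0[j] * B[j][obs[0]] for j in STATES}]
    for t in range(1, T):
        alpha.append({j: sum(alpha[t-1][i] * A[i][j] for i in STATES)
                         * B[j][obs[t]] for j in STATES})
    beta = [None] * T
    beta[T-1] = {i: AF[i] for i in STATES}
    for t in range(T - 2, -1, -1):
        beta[t] = {i: sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                          for j in STATES) for i in STATES}
    p_obs = sum(alpha[T-1][i] * AF[i] for i in STATES)
    # E step, part 2: posteriors gamma_t(j) and zeta_t(i, j).
    gamma = [{j: alpha[t][j] * beta[t][j] / p_obs for j in STATES}
             for t in range(T)]
    zeta = [{(i, j): alpha[t][i] * A[i][j] * B[j][obs[t+1]]
                     * beta[t+1][j] / p_obs
             for i in STATES for j in STATES} for t in range(T - 1)]
    # M step: ratios of expected counts (eq. 9.38 and the b-hat formula).
    a_hat = {i: {j: sum(z[(i, j)] for z in zeta)
                    / sum(z[(i, k)] for z in zeta for k in STATES)
                 for j in STATES} for i in STATES}
    b_hat = {j: {v: sum(g[j] for t, g in enumerate(gamma) if obs[t] == v)
                    / sum(g[j] for g in gamma)
                 for v in VOCAB} for j in STATES}
    return a_hat, b_hat
```

Because the numerators of $\hat a$ and $\hat b$ sum over exactly the events counted in the denominators, each re-estimated row is automatically normalized.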

#### The EM algorithm

E step:

• Using the initial model parameters (or the parameters produced by the previous M step), compute the posterior probabilities.

M step:

• Using the probabilities from the E step, estimate new model parameters. Here this is done directly with relative frequencies, but the underlying principle is maximum-likelihood estimation: find the model parameters that maximize the probability.

• That probability should be $P(O|\lambda)$, the likelihood of the observation sequence given the model parameters, which is exactly what the forward algorithm computes.

### Summary

1. Problem 1: computing the likelihood
• Given the model parameters $\lambda$ and an observation sequence $O$, compute the probability of that observation sequence under the model.

• Use the forward algorithm, a dynamic-programming algorithm that turns the probability of a length-$T$ sequence into sums of step-wise $t$-to-$t+1$ probabilities.

• This problem mainly lays the groundwork for the other two: in most real scenarios the states are unknown, let alone the model parameters.

1. The prediction problem, also called the decoding problem
• Given the model parameters $\lambda$ and an observation sequence $O$, find the state sequence with the highest probability.

• Use the Viterbi algorithm, which is similar to the forward algorithm except that each step takes a max instead of a sum, and backpointers must be kept for the backtrace.

• This problem has much broader applications.

1. The learning problem: estimating the model parameters
• Given an observation sequence $O$, estimate the model parameters $\lambda$ that maximize the probability $P(O|\lambda)$ of the observations.

• Use the Baum-Welch (forward-backward) algorithm, an instance of EM, iterating between computing expected counts and re-estimating the parameters.