# Implementing a Gaussian Mixture Model in Code

Notes from reading the sklearn source code: fitting a Gaussian mixture model (GMM) with the EM algorithm.

GitHub source code

### Class initialization

• n_components=1: The number of mixture components, i.e., the number of Gaussian distributions in the mixture.

• covariance_type='full': The type of covariance matrix, one of {'full', 'tied', 'diag', 'spherical'}: a full covariance matrix (no elements constrained to zero); a single full covariance matrix shared by all components ('tied', also used in HMMs); a diagonal covariance matrix (off-diagonal entries are zero); or a spherical covariance matrix (off-diagonal entries are zero and the diagonal entries are all equal). Defaults to 'full'.

• tol=1e-3: Convergence threshold. EM iterations stop when the average gain of the lower bound falls below this threshold. The lower bound here is the convex lower-bound function of equation (3). Recall that the EM algorithm alternates two steps: the E step computes the expectation (making the inequality tight), and the M step maximizes the resulting bound.

• reg_covar=1e-6: Non-negative regularization added to the diagonal of the covariance matrices. Ensures that the covariance matrices are all positive.

• max_iter=100: The maximum number of EM iterations.

• n_init: int, defaults to 1. The number of initializations to perform.

• init_params: {'kmeans', 'random'}, defaults to 'kmeans'. The method used to initialize the weights, the means and the precisions. Must be one of:

- 'kmeans' : responsibilities are initialized using k-means.

- 'random' : responsibilities are initialized randomly.

- The initialization here refers to the class proportions of the latent variable z, i.e., weights_init. 'kmeans' corresponds to a "hard" guess (each point assigned to one of {1, …, k}), while 'random' presumably corresponds to a "soft" guess.

• weights_init : array-like, shape (n_components,), optional. The user-provided initial weights, defaults to None. If None, weights are initialized using the init_params method.

• means_init : array-like, shape (n_components, n_features), optional. The user-provided initial means, defaults to None. If None, means are initialized using the init_params method. This initializes the means of the mixture's Gaussians; note the shape (n_components, n_features): there are n_components multivariate Gaussians, each of dimension n_features.
• precisions_init : The user-provided initial precisions (inverses of the covariance matrices), defaults to None. If None, precisions are initialized using the init_params method. The shape depends on covariance_type:

• (n_components,) if 'spherical',

• (n_features, n_features) if 'tied',

• (n_components, n_features) if 'diag',

• (n_components, n_features, n_features) if 'full'.

• This initializes the covariance matrices of the Gaussian components. A covariance matrix captures how each of the n_features dimensions of a feature vector relates to the others; for a single Gaussian it is n_features × n_features, and stacking n_components of them gives the 'full' shape. By symmetry, each matrix has (n_features + 1) · n_features / 2 parameters to learn. See the earlier post for more on covariance matrices.

• random_state : int, RandomState instance or None, optional (default=None). Controls the random number generator.
• warm_start : bool, defaults to False. If True, the solution of the last call to fit() is used as the initialization for the next call. This can speed up convergence when fit() is called several times on similar problems.
• verbose : int, defaults to 0. Enables verbose output. If 1, prints the current initialization and each iteration step; if greater than 1, also prints the log probability and the time needed for each step.
• verbose_interval: Used together with verbose; when verbose output is enabled, sets how many iterations pass between printed messages. Defaults to 10.
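To see these parameters in context, here is a minimal fitting sketch on toy data (the argument values simply restate the defaults discussed above; the toy blobs are made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two well-separated 2-D Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 2)),
])

# The constructor arguments mirror the parameters described above.
gmm = GaussianMixture(
    n_components=2,
    covariance_type="full",
    tol=1e-3,
    reg_covar=1e-6,
    max_iter=100,
    n_init=1,
    init_params="kmeans",
    random_state=0,
)
gmm.fit(X)
print(gmm.weights_)      # mixing proportions, sum to 1
print(gmm.means_)        # shape (n_components, n_features)
print(gmm.converged_)    # whether EM reached the tol threshold
```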

### E step

$$w_j^{(i)}:=p(z^{(i)}=j|x^{(i)};\phi,\mu,\Sigma)=\dfrac{p(x^{(i)}|z^{(i)}=j;\mu,\Sigma)p(z^{(i)}=j;\phi)}{\sum_{l=1}^kp(x^{(i)}|z^{(i)}=l;\mu,\Sigma)p(z^{(i)}=l;\phi)}$$

• Since both the numerator and the denominator involve Gaussian densities, i.e., exponentials, we first compute them in log form, and exponentiate again when substituting into the M step.

• For the covariance matrix, when n_features is large, computing its inverse and determinant directly is expensive; instead we can work with the precision matrix and take its Cholesky decomposition to speed up the computation.

#### First compute the numerator in log form, as the sum of two logarithms:

$$\log p(x^{(i)}|z^{(i)}=j;\mu,\Sigma)+\log p(z^{(i)}=j;\phi)$$

1. The first term is the log of the Gaussian density, computed from the mean and the Cholesky decomposition of the covariance matrix.

2. The second term is the mixing weight of each Gaussian component, i.e., $\phi_j$.
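The responsibility update above can be sketched in NumPy/SciPy as follows (a minimal illustration of the formula, not sklearn's actual implementation; `e_step` is a name chosen here):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, phi, means, covs):
    """Responsibilities w[i, j] = p(z_i = j | x_i; phi, mu, Sigma)."""
    n, k = X.shape[0], len(phi)
    log_num = np.empty((n, k))
    for j in range(k):
        # log p(x | z=j; mu, Sigma) + log p(z=j; phi): the numerator in log form.
        log_num[:, j] = (multivariate_normal.logpdf(X, means[j], covs[j])
                         + np.log(phi[j]))
    # The denominator is the log-sum-exp over components; subtracting it
    # normalizes each row, and exponentiating recovers the responsibilities.
    return np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))
```

Each row of the returned matrix sums to 1, since the responsibilities for a given point are a probability distribution over the k components.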

### M step

$$\Sigma_j:=\dfrac{\sum_{i=1}^mw_j^{(i)}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^mw_j^{(i)}}$$

#### Then the estimates of the remaining parameters $\mu_j, \phi_j$:

$$\phi_j:=\frac{1}{m}\sum_{i=1}^mw_j^{(i)}$$

$$\mu_j:=\dfrac{\sum_{i=1}^mw_j^{(i)}x^{(i)}}{\sum_{i=1}^mw_j^{(i)}}$$
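The three updates above can be written as one NumPy function (an illustrative `m_step` sketch following the formulas, not sklearn's internal code):

```python
import numpy as np

def m_step(X, w):
    """Given responsibilities w of shape (n, k), re-estimate phi, mu, Sigma."""
    n, d = X.shape
    nk = w.sum(axis=0)                    # effective sample count per component
    phi = nk / n                          # phi_j = (1/m) * sum_i w_j^(i)
    means = (w.T @ X) / nk[:, None]       # mu_j = weighted mean of the data
    covs = np.empty((len(nk), d, d))
    for j in range(len(nk)):
        diff = X - means[j]               # (x^(i) - mu_j)
        # Sigma_j = weighted outer products, normalized by sum of weights.
        covs[j] = (w[:, j, None] * diff).T @ diff / nk[j]
    return phi, means, covs
```

With uniform responsibilities every component collapses to the global mean and covariance, which is a handy sanity check.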

### p(x|z=j) in the E step

Precision matrix (the inverse of the covariance matrix): https://www.statlect.com/glossary/precision-matrix
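A quick numerical check of the definition, plus the Cholesky factor mentioned in the E-step discussion (plain NumPy; a sketch with a made-up 2-D covariance, not sklearn's code):

```python
import numpy as np

# A valid 2-D covariance matrix and its precision (inverse).
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])
prec = np.linalg.inv(cov)

# By definition of the inverse, their product is the identity.
print(np.allclose(cov @ prec, np.eye(2)))   # True

# Cholesky factor of the precision matrix: a lower-triangular L with
# L @ L.T = prec, which lets the log density be evaluated without
# recomputing the inverse or the determinant of cov each time.
L = np.linalg.cholesky(prec)
print(np.allclose(L @ L.T, prec))           # True
```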

### sklearn中实例

Although GMMs are often used for clustering, we can compare the obtained clusters with the actual classes from the dataset. We initialize the means of the Gaussians with the means of the classes from the training set to make this comparison valid.

We plot predicted labels on both training and held out test data using a variety of GMM covariance types on the iris dataset. We compare GMMs with spherical, diagonal, full, and tied covariance matrices in increasing order of performance. Although one would expect full covariance to perform best in general, it is prone to overfitting on small datasets and does not generalize well to held out test data.

On the plots, train data is shown as dots, while test data is shown as crosses. The iris dataset is four-dimensional. Only the first two dimensions are shown here, and thus some points are separated in other dimensions.

Xie Pan

2018-03-29

2021-06-29