PyTorch Loss Functions

Notes on PyTorch loss functions.

Cross Entropy

In short, cross entropy measures, under a given true distribution \(p_k\), how much effort is needed to remove the uncertainty of a system when following the strategy f(x) specified by a non-true distribution \(q_k\). The lower the cross entropy, the better the strategy, so we always minimize it: a smaller cross entropy means the strategy produced by the algorithm is closer to the optimal one, which in turn means the non-true distribution computed by our algorithm is closer to the true distribution. From the information-theory point of view, the cross-entropy loss actually comes from the KL divergence; it is just that the final derived form is equivalent to the cross-entropy formula:

Understanding it from the information-theory perspective: information content / information entropy (entropy) / cross entropy / relative entropy.

Information content: the information content of an event is the negative log of the probability that the event occurs; the larger the probability, the less information it carries. As for why it is the negative log, you would have to ask Shannon; at the very least it must be 0 when \(P(X)=1\) and never negative: \[-\log P(X)\]

Information entropy, i.e. entropy, is a measure of the uncertainty of a random variable and depends on the probability distribution of the event X. In other words, entropy is the expectation of the information content, taken over the discrete distribution: \[H(p) = -\sum_{i=1}^np_i\log p_i\]

Cross entropy: back to the classification problem, the score function gives a result of shape (10, 1), which the softmax function squashes into a probability distribution between 0 and 1, call it \(q_i=\dfrac{e^{f_i}}{\sum_je^{f_j}}\): \[H(p,q) = -\sum_{i=1}^np_i\log q_i\] This is what we call the cross entropy. By Gibbs' inequality, \(H(p,q)\ge H(p)\) always holds, with equality if and only if the distribution \(q_i\) is identical to \(p_i\).

Relative entropy: closely related to cross entropy, \(D(p||q)=H(p,q)-H(p)=-\sum_{i=1}^np(i)\log {\dfrac{q(i)}{p(i)}}\), also known as the KL divergence. It characterizes the difference between two functions or probability distributions: the larger the difference, the larger the relative entropy.
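To make these relationships concrete, here is a minimal numerical sketch (the distributions p and q are made-up values) checking \(D(p||q)=H(p,q)-H(p)\) and Gibbs' inequality:

import torch

p = torch.tensor([0.7, 0.2, 0.1])        # assumed "true" distribution
q = torch.tensor([0.5, 0.3, 0.2])        # assumed "non-true" / model distribution

H_p  = -(p * torch.log(p)).sum()         # entropy H(p)
H_pq = -(p * torch.log(q)).sum()         # cross entropy H(p, q)
KL   =  (p * torch.log(p / q)).sum()     # relative entropy D(p||q)

print(H_p, H_pq, KL)
# H(p,q) >= H(p) (Gibbs' inequality), and KL equals H(p,q) - H(p)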

Maximum likelihood estimation, Negative Log Likelihood (NLL), the KL divergence and Cross Entropy are essentially equivalent and can be derived from one another; MSE can likewise be derived from the same maximum-likelihood viewpoint (see the Deep Learning Book, p. 132).

BCELoss

Creates a criterion that measures the Binary Cross Entropy between the target and the output
The loss function for binary classification, i.e. the loss function of logistic regression.

For binary classification we only need to predict the probability p of the positive class; the corresponding (1-p) is then the probability of the negative class. p can be obtained with the sigmoid function.

\[sigmoid(x) = \dfrac{1}{1+e^{(-x)}}\]

The corresponding loss function can be derived via maximum likelihood estimation:

Assume there are n independent training samples \(\{(x_1,y_1), \dots,(x_n, y_n)\}\),
where y is the true label, \(y\in \{0,1\}\). The likelihood of each sample is then: \[P(x_i, y_i)=P(y_i=1|x_i)^{y_i}P(y_i=0|x_i)^{1-y_i}\] \[=P(y_i=1|x_i)^{y_i}(1-P(y_i=1|x_i))^{1-y_i}\]

Taking the negative log gives: \[-y_i\log P(y_i=1|x_i)-(1-y_i)\log(1-P(y_i=1|x_i))\]

It is easy to see that this is consistent with the usual softmax multi-class loss computation.
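As a quick sanity check of that claim, the following sketch (logits and variable names are made up) compares BCE on \(sigmoid(x)\) with the two-class softmax cross entropy on the logits \([0, x]\); since \(softmax([0,x])_1 = sigmoid(x)\), the two losses should coincide:

import torch
import torch.nn.functional as F

x = torch.randn(4)                                # raw scores for the positive class
logits = torch.stack([torch.zeros(4), x], dim=1)  # equivalent two-class logits [0, x]
target = torch.ones(4)                            # all samples labelled positive

bce = F.binary_cross_entropy(torch.sigmoid(x), target)
ce = F.cross_entropy(logits, target.long())
print(bce, ce)  # both reduce to -mean(log(sigmoid(x))), so they should match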

class BCELoss(_WeightedLoss):
    def __init__(self, weight=None, size_average=None, reduce=None, reduction='elementwise_mean'):
        """
        - weight: manual rescaling weight for each element (its effect is tested in the example below)
        - size_average, reduce: deprecated, just look at reduction
        - reduction: "elementwise_mean" | "sum" | "none", the names speak for themselves
        """
        super(BCELoss, self).__init__(weight, size_average, reduce, reduction)

    def forward(self, input, target):
        """
        - input: predicted probabilities, any shape, but values must lie in [0, 1]
        - target: true probabilities, same shape as input
        """
        return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)

\[loss(p,t)=-\dfrac{1}{N}\sum_{i=1}^{N}[t_i\log(p_i)+(1-t_i)\log(1-p_i)]\]

example:

import torch
import torch.nn as nn
import torch.nn.functional as F

loss_fn = nn.BCELoss(reduction="elementwise_mean")
input = torch.randn(5)
target = torch.ones(5)

loss = loss_fn(torch.sigmoid(input), target)

# compute BCE by hand
my_loss = torch.mean(-target * torch.log(torch.sigmoid(input)) - (1 - target) * torch.log(1 - torch.sigmoid(input)))

# test the weight parameter
loss1 = F.binary_cross_entropy(torch.sigmoid(input), target, reduction="none", weight=torch.Tensor([0, 0, 0, 0, 1]))
loss2 = F.binary_cross_entropy(torch.sigmoid(input), target, weight=torch.Tensor([0, 0, 0, 0, 1]))
print(my_loss, loss)

print(loss1, loss2 * 5)

# tensor(0.7590) tensor(0.7590)
# tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.3104]) tensor(0.3104)

Usually, with the sigmoid function we predict the probability of the positive class, and then we have to set a threshold manually: only when the probability reaches the threshold is the sample classified as positive, which feels a bit like a hinge loss.
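A minimal sketch of that thresholding step (the 0.5 threshold is just an assumed choice):

import torch

logits = torch.randn(5)                # raw scores for the positive class
probs = torch.sigmoid(logits)          # positive-class probabilities
threshold = 0.5                        # assumed decision threshold
preds = (probs > threshold).long()     # 1 = positive class, 0 = negative class
print(probs, preds)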

torch.nn.CrossEntropyLoss

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
The multi-class cross-entropy loss, which can be seen as an extension of binary_cross_entropy. The computation can be split into two steps: log_softmax() and nn.NLLLoss().

It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
On an unbalanced dataset the weight argument is very useful.
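A small sketch of the weight argument (the class counts and weight values here are made up): a rare class gets a larger weight, which scales its per-sample loss; with mean reduction the weighted sum is divided by the total weight of the targets rather than by the batch size:

import torch
import torch.nn as nn

class_weights = torch.tensor([1.0, 1.0, 5.0])   # assume class 2 is rare, so weight it more
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

input = torch.randn(4, 3)
target = torch.tensor([0, 1, 2, 2])
print(loss_fn(input, target))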

class CrossEntropyLoss(_WeightedLoss):
    def __init__():
        """
        - weight: a weight for each class.
        - reduction: "elementwise_mean" | "sum" | "none".
        """
    def forward():
        """
        - input: [batch, C] or [batch, C, d_1, d_2, ..., d_k]
        - target: [batch], 0 <= target[i] <= C-1, or [batch, d_1, d_2, ..., d_k], k >= 2.
        """

example:

input = torch.randn(2, 3)
target = torch.Tensor([0, 2]).long()

# use loss function
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(input, target)

# compute loss step by step
score = torch.log_softmax(input, dim=1)
score1 = torch.log(F.softmax(input, dim=1))
print(score)
print(score1)

# use nll loss
nll_loss_fn = nn.NLLLoss()
nll_loss = nll_loss_fn(score, target)

# compute nll loss by hand: mean of the negative log-probabilities at the true labels
my_nll = (-score[0][0] - score[1][2]) / 2
print(nll_loss, loss, my_nll)
tensor([[-0.8413, -0.7365, -2.4073],
        [-0.4626, -2.0660, -1.4120]])
tensor([[-0.8413, -0.7365, -2.4073],
        [-0.4626, -2.0660, -1.4120]])
tensor(1.1266) tensor(1.1266) tensor(1.1266)

torch.nn.NLLLoss

The negative log likelihood loss. It is useful to train a classification problem with C classes.

The input is expected to have already gone through a log_softmax layer. The loss for each sample is the negative of the input value at the sample's true label.

class NLLLoss(_WeightedLoss):
    def __init__():
        """
        The arguments are basically the same as for CrossEntropyLoss.
        """

NLLLoss:
\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_{y_n} x_{n,y_n}, \quad w_{c} = \text{weight}[c] \cdot \mathbb{1}\{c \not= \text{ignore_index}\}\]
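The \(\text{ignore_index}\) term in the formula masks out samples with that target value; a quick sketch (the value -1 is an arbitrary choice), where with mean reduction the loss is averaged over the non-ignored samples only:

import torch
import torch.nn as nn

loss_fn = nn.NLLLoss(ignore_index=-1)      # samples whose target is -1 contribute nothing

score = torch.log_softmax(torch.randn(3, 5), dim=1)
target = torch.tensor([1, -1, 4])          # the second sample is ignored

manual = (-score[0, 1] - score[2, 4]) / 2  # mean over the two non-ignored samples
print(loss_fn(score, target), manual)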

example:

loss = nn.NLLLoss()
# input is of size N x C = 3 x 5
input = torch.randn(3, 5, requires_grad=True)
# each element in target has to have 0 <= value < C
target = torch.tensor([1, 0, 4])
output = loss(torch.log_softmax(input, dim=1), target)

score = torch.log_softmax(input, dim=1)
output2 = (-score[0, 1]-score[1, 0]-score[2, 4])/3
output.backward()
# output2.backward()

print(output, output2)

# tensor(1.5658, grad_fn=<NllLossBackward>) tensor(1.5658, grad_fn=<DivBackward0>)

MultiMarginLoss

\[loss = \dfrac{1}{N}\sum_{j\ne y_i}\max(0,\, s_j - s_{y_i}+\Delta)\]

\(s_{y_i}\) is the score of the true label; any non-true class whose score is larger than \(s_{y_i}-\Delta\) contributes to the final \(loss\), while classes scoring below that value contribute nothing. (In PyTorch the per-sample sum is divided by N, the number of classes, and the per-sample losses are then averaged over the batch.)

Compared with the softmax loss: softmax takes every wrong class into account, while the margin loss only considers wrong classes whose scores are large enough to violate the margin.

class MultiMarginLoss(_WeightedLoss):
    def __init__(self, p=1, margin=1, weight=None, size_average=None, reduce=None, reduction='elementwise_mean'):
        """
        - p (int, optional): Has a default value of `1`. `1` and `2` are the only supported values
        - margin (float, optional): Has a default value of `1`.
        """
        super(MultiMarginLoss, self).__init__(weight, size_average, reduce, reduction)
        if p != 1 and p != 2:
            raise ValueError("only p == 1 and p == 2 supported")
        assert weight is None or weight.dim() == 1
        self.p = p
        self.margin = margin

    def forward(self, input, target):
        return F.multi_margin_loss(input, target, p=self.p, margin=self.margin,
                                   weight=self.weight, reduction=self.reduction)

example:

loss = nn.MultiMarginLoss()
input = torch.FloatTensor([[0, 3, 1], [0, 4, 2], [1, 5, 2], [3, 5, 1]])
target = torch.ones(4).long()

out = loss(input, target)

print(out)  # clearly 0, since every true-label score exceeds the other scores by at least 1 (the margin)
# tensor(0.)
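For a case with non-zero loss, here is a sketch (scores made up) that also reproduces the value by hand, reusing the MultiMarginLoss instance loss from above; note that PyTorch divides each sample's sum by the number of classes C before averaging over the batch:

input = torch.FloatTensor([[0., 3., 1.], [4., 2., 1.]])
target = torch.tensor([1, 1])
out = loss(input, target)

# per sample: sum over j != y of max(0, 1 - s_y + s_j), divided by C = 3
l0 = (torch.clamp(1 - input[0, 1] + input[0, 0], min=0) + torch.clamp(1 - input[0, 1] + input[0, 2], min=0)) / 3
l1 = (torch.clamp(1 - input[1, 1] + input[1, 0], min=0) + torch.clamp(1 - input[1, 1] + input[1, 2], min=0)) / 3
print(out, (l0 + l1) / 2)
# both should be tensor(0.5000)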

nn.L1Loss

\[L1(\hat{y}, y)=\dfrac{1}{m}\sum_{i=1}^{m}|\hat{y}_i-y_i|\]

nn.MSELoss

\[L2(\hat{y}, y)=\dfrac{1}{m}\sum_{i=1}^{m}|\hat{y}_i-y_i|^2\]

loss = nn.L1Loss()
loss2 = nn.MSELoss()

input = torch.FloatTensor([1,2,3])
target = torch.FloatTensor([1,2,9])

output = loss(input, target)
output2 = loss2(input, target)

print(output, output2)
# tensor(2.) tensor(12.)
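As a quick check of the two formulas, computing them by hand (reusing input and target from the example above) gives the same values:

l1_manual = torch.mean(torch.abs(input - target))
l2_manual = torch.mean((input - target) ** 2)
print(l1_manual, l2_manual)
# tensor(2.) tensor(12.)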