These notes collect, piece by piece, the scattered terminology of deep learning, for engineers who want to get started with the field.
[TOC]
# 1. Neural Network Basics
- gradient descent
- fine-tuning
- dropout
- backpropagation
- feature map = features and their locations
- NMS (non-maximum suppression)
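Of the terms above, NMS is the most algorithmic. A minimal sketch, assuming boxes in `[x1, y1, x2, y2]` form and a single IoU threshold (the function names here are illustrative, not any particular library's API):

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Greedily keep the highest-scoring box, then drop any remaining
    # box that overlaps it more than the threshold; repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # overlapping box 1 is suppressed: [0, 2]
```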
# 3. Training
# 4. Evaluation
# 5. Basic Concepts
## 1. Computational Graphs
TensorFlow: Static Graphs[^1]
[^1]: Justin Johnson. Learning PyTorch with Examples. <https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-defining-new-autograd-functions>, 2018.6.7
PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph, and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow’s computational graphs are static and PyTorch uses dynamic computational graphs.
In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.
Static graphs are nice because you can optimize the graph up front; for example a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be amortized as the same graph is rerun over and over.
One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example a recurrent network might be unrolled for different numbers of time steps for each data point; this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal imperative flow control to perform computation that differs for each input.
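The last point can be illustrated without any framework at all. A minimal pure-Python sketch (not PyTorch's actual autograd; all names are illustrative) in which the "graph" is just the tape of operations actually executed for one input, so an ordinary loop with a data-dependent trip count builds a different graph per example:

```python
class Var:
    # A value plus the record of how it was produced.
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local gradient)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=((self, other.value), (other, self.value)))

def backward(out):
    # Propagate gradients back along the recorded tape.
    # (The nodes form a chain here, so a plain stack traversal suffices.)
    out.grad = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad
            stack.append(parent)

def power_by_loop(x, n):
    # An ordinary Python loop: the number of multiply nodes recorded
    # depends on the runtime value of n, i.e. a new graph per input.
    out = Var(1.0)
    for _ in range(n):
        out = out * x
    return out

x = Var(3.0)
y = power_by_loop(x, n=4)
backward(y)
print(y.value, x.grad)   # 81.0, and dy/dx = 4 * x**3 = 108.0
```

In a static-graph framework the same loop would have to be expressed with a graph operator such as tf.scan, because the graph is fixed before any data is seen.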
## 2. Mixed Precision Training[^2]
[^2]: http://blog.sina.com.cn/s/blog_e6c12bed0102xal3.html
Paper: *Mixed Precision Training*
Weights are stored in FP32; during training they are rounded to FP16, both weights and activations are computed in FP16, and the final update is still applied to the FP32 weights.
Two methods control the information loss of half precision:
1. Keep a single-precision (FP32) master copy of the weights, rounded to half precision during training. (Faster training? Lower hardware cost? The number of stored parameters is not reduced, though.)
2. Scale the loss appropriately.
Results:
- No loss of accuracy, memory consumption roughly halved, and faster training.
- The whole pipeline runs in half-precision FP16; no hyperparameters need special tuning; accuracy matches single precision; the method applies to most models (and to large datasets).
**Highlight**: two methods to control the information loss of half precision.
- Keep a single-precision (FP32) master copy of the weights, rounded to half precision during training (faster training and lower hardware cost; the stored parameters grow by 50%, but since the intermediate activations are halved, overall memory consumption still drops);
FP16 is also faster in hardware implementations.
Training naively in pure FP16 loses accuracy badly (a drop of 80% is cited), so the weights are kept and updated in 32-bit, while the forward and backward computation runs in 16-bit.
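The update scheme above can be sketched in a few lines of NumPy (a toy loss whose gradient is just the input; variable names are illustrative): keep an FP32 master copy, cast to FP16 for compute, and apply the update to the master copy.

```python
import numpy as np

# FP32 master weights, FP16 compute, update applied to the FP32 copy.
# The toy "loss" is w . x, so the gradient w.r.t. w is simply x.
rng = np.random.default_rng(0)
master_w = rng.standard_normal(4).astype(np.float32)   # FP32 master copy
init_w = master_w.copy()
lr = np.float32(0.1)

for step in range(3):
    w16 = master_w.astype(np.float16)      # round master weights to FP16
    x16 = np.ones(4, dtype=np.float16)     # toy FP16 activations
    loss16 = (w16 * x16).sum()             # FP16 forward pass
    grad16 = x16                           # FP16 backward pass: d(w.x)/dw = x
    # The update itself is carried out on the FP32 master weights.
    master_w -= lr * grad16.astype(np.float32)

print(master_w.dtype)                      # float32
```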
Why FP32 is needed:
- Some gradients are too small and become 0 in FP16; around 5% of the values are flushed to zero (this seems debatable: 2^-24 is so small that its influence should be minor anyway)
- Some values, even though representable in FP16, can still become 0 when the addition right-shifts them to align the binary points
- The remedy: scale the values appropriately.
Loss scaling effectively shifts the whole gradient histogram (Figure 3 in the paper) to the right; scaling up by a factor of 8, for instance, shifts it by three binary positions.
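Both failure modes and the loss-scaling fix can be demonstrated numerically with NumPy (the scale factor 1024 is an arbitrary power of two chosen for illustration):

```python
import numpy as np

# Failure mode 1: a gradient of 2**-27 is below FP16's smallest
# subnormal (2**-24) and is flushed to zero when cast down.
grad_fp32 = np.float32(2.0 ** -27)
assert np.float16(grad_fp32) == 0.0

# Failure mode 2: a small addend, although representable in FP16,
# vanishes when the addition aligns binary points with a larger term.
assert np.float16(2048.0) + np.float16(0.5) == np.float16(2048.0)

# The fix: scale the loss (and hence every gradient) up by a power of
# two before casting to FP16, then unscale in FP32 before the update.
scale = np.float32(1024.0)                  # 2**10, illustrative choice
scaled16 = np.float16(grad_fp32 * scale)    # 2**-17: representable in FP16
recovered = np.float32(scaled16) / scale    # unscale in FP32
assert recovered == grad_fp32               # the tiny gradient survives
```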
