Based on my years of experience developing deep learning algorithms, I have compiled the following ten must-knows to share.
Reference links are included, and some items come with code implementations.
Shared joy beats solitary joy; I hope this is useful to you.
When I find time, I will go into more detail about the tricks of the trade in deep learning.
If you have technical needs you would like solved on a paid basis, you can also reach me by email or QQ.
Email and QQ share the same ID: [email protected]
Of course, beyond these ten there are certainly other "must-knows";
feel free to share more in the comments. These are just the ten I have drafted for now, so don't take the list too literally.
The point is to learn the underlying ideas. Keep in mind that some of them do not apply in certain scenarios.
1. Data Echoing
[1907.05550] Faster Neural Network Training with Data Echoing
def data_echoing(factor):
    # Repeat each loaded (image, label) pair `factor` times before it moves down the pipeline.
    return lambda image, label: tf.data.Dataset.from_tensors((image, label)).repeat(factor)
What it does:
After the dataset is loaded, the current batch is fed to the model repeatedly (echoed before or after data augmentation), which cuts the time spent on data loading.
It is equivalent to letting the model see the current data n times, or see n augmented versions of the sample.
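A minimal usage sketch (the toy dataset and the echoing factor of 2 are assumptions): flat_map applies the echoing right after loading, so downstream augmentation and batching see each example twice while the loader is read only once.

import tensorflow as tf

# Hypothetical toy dataset of (image, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([8, 32, 32, 3]), tf.zeros([8], dtype=tf.int32)))

# Echo each example twice right after loading.
dataset = dataset.flat_map(data_echoing(2))
dataset = dataset.batch(4)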
2. AMP (Automatic Mixed Precision)
Using mixed precision and XLA to accelerate training in bert4keras - 科學空間|Scientific Spaces
# Enable TensorFlow's auto_mixed_precision graph optimizer.
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})
What it does:
Reduces GPU memory usage and speeds up training by converting part of the network's computation to equivalent lower-precision arithmetic, lowering the compute cost.
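As a complement to the graph-optimizer switch above, a minimal sketch of the Keras mixed-precision route (assuming TF 2.4+; the optimizer choice is arbitrary). With model.fit, loss scaling is handled automatically; the explicit wrapper is only needed for custom training loops.

import tensorflow as tf

# Compute in float16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# For custom training loops, wrap the optimizer so float16 gradients
# are loss-scaled to avoid underflow.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam(1e-3))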
3. Memory-Efficient Optimizers
3.1 [1804.04235] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
mesh/optimize.py at master · tensorflow/mesh · GitHub
3.2 [1901.11150] Memory-Efficient Adaptive Optimization
google-research/sm3 at master · google-research/google-research (github.com)
What it does:
Saves GPU memory and speeds up training,
mainly by decomposing the second-moment statistics into a special factored form so that less optimizer state has to be stored.
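An illustrative sketch of the core idea only (the helper below is hypothetical and omits bias correction, update clipping, and the other parts of Adafactor/SM3): for an (m, n) weight, keep an m-vector and an n-vector instead of the full (m, n) second-moment accumulator, and reconstruct a rank-1 approximation when computing the update.

import tensorflow as tf

def factored_second_moment_update(grad, row_acc, col_acc, beta2=0.999, eps=1e-30):
    # grad has shape (m, n); row_acc has shape (m,); col_acc has shape (n,).
    sq = tf.square(grad) + eps
    new_row = beta2 * row_acc + (1.0 - beta2) * tf.reduce_mean(sq, axis=1)
    new_col = beta2 * col_acc + (1.0 - beta2) * tf.reduce_mean(sq, axis=0)
    # Rank-1 reconstruction of the (m, n) second-moment estimate.
    v_hat = tf.einsum('i,j->ij', new_row, new_col) / tf.reduce_mean(new_row)
    update = grad * tf.math.rsqrt(v_hat)
    return update, new_row, new_col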
4. Weight Standardization (Normalization)
[2102.06171] High-Performance Large-Scale Image Recognition Without Normalization
deepmind-research/nfnets at master · deepmind/deepmind-research · GitHub
import numpy as np
import tensorflow as tf


class WSConv2D(tf.keras.layers.Conv2D):
    def __init__(self, *args, **kwargs):
        super(WSConv2D, self).__init__(
            *args,
            kernel_initializer=tf.keras.initializers.VarianceScaling(
                scale=1.0, mode='fan_in', distribution='untruncated_normal'),
            use_bias=False,
            kernel_regularizer=tf.keras.regularizers.l2(1e-4),
            **kwargs)
        # Per-filter learnable gain applied on top of the standardized kernel.
        self.gain = self.add_weight(
            name='gain', shape=(self.filters,), initializer='ones',
            trainable=True, dtype=self.dtype)

    def standardize_weight(self, eps):
        mean, var = tf.nn.moments(self.kernel, axes=[0, 1, 2], keepdims=True)
        fan_in = np.prod(self.kernel.shape[:-1])
        # Manually fused normalization, eq. to (w - mean) * gain / sqrt(N * var)
        scale = tf.math.rsqrt(
            tf.math.maximum(var * fan_in,
                            tf.convert_to_tensor(eps, dtype=self.dtype))) * self.gain
        shift = mean * scale
        return self.kernel * scale - shift

    def call(self, inputs):
        eps = 1e-4
        weight = self.standardize_weight(eps)
        out = tf.nn.conv2d(
            inputs, weight, strides=self.strides,
            padding=self.padding.upper(), dilations=self.dilation_rate)
        return out if self.bias is None else tf.nn.bias_add(out, self.bias)
What it does:
Standardizing or normalizing the kernel acts as a prior constraint on the weights, which speeds up training convergence.
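A hypothetical usage sketch: WSConv2D is intended as a drop-in replacement for Conv2D. Note that call() above does not apply an activation, so it is added as a separate layer here.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    WSConv2D(filters=32, kernel_size=3, padding='same'),
    tf.keras.layers.ReLU(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])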
5. Adaptive Gradient Clipping
deepmind-research/agc_optax.py at master · deepmind/deepmind-research · GitHub
import tensorflow as tf


def unitwise_norm(x):
    if len(tf.squeeze(x).shape) <= 1:  # Scalars and vectors
        axis = None
        keepdims = False
    elif len(x.shape) in [2, 3]:  # Linear layers of shape IO
        axis = 0
        keepdims = True
    elif len(x.shape) == 4:  # Conv kernels of shape HWIO
        axis = [0, 1, 2]
        keepdims = True
    else:
        raise ValueError(f'Got a parameter with shape not in [1, 2, 3, 4]! {x}')
    square_sum = tf.reduce_sum(tf.square(x), axis, keepdims=keepdims)
    return tf.sqrt(square_sum)


def gradient_clipping(grad, var):
    clipping = 0.01
    max_norm = tf.maximum(unitwise_norm(var), 1e-3) * clipping
    grad_norm = unitwise_norm(grad)
    trigger = (grad_norm > max_norm)
    # Rescale factor applied where the gradient norm exceeds the threshold.
    clipped_grad = (max_norm / tf.maximum(grad_norm, 1e-6))
    return grad * tf.where(trigger, clipped_grad, tf.ones_like(clipped_grad))
What it does:
Prevents exploding gradients and stabilizes training. The gradient is clipped according to its ratio to the parameter norm, which effectively constrains the learning rate.
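A minimal custom-training-step sketch (model, optimizer, and loss_fn are assumed to exist) showing where the unit-wise clipping hooks in, between gradient computation and the optimizer update:

import tensorflow as tf

@tf.function
def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip each gradient against its own variable's norm.
    grads = [gradient_clipping(g, v)
             for g, v in zip(grads, model.trainable_variables)]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss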
6. recompute_grad
[1604.06174] Training Deep Nets with Sublinear Memory Cost
google-research/recompute_grad.py at master · google-research/google-research (github.com)
bojone/keras_recompute: saving memory by recomputing for keras (github.com)
What it does:
Saves GPU memory through gradient recomputation (gradient checkpointing): activations are recomputed during the backward pass instead of being stored.
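A hedged sketch of the same idea using TensorFlow's built-in tf.recompute_grad (the linked repositories provide more complete drop-in versions; the block and batch sizes here are arbitrary):

import tensorflow as tf

# A memory-hungry block whose intermediate activations we do not want to keep.
dense_block = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dense(4096, activation='relu'),
])
dense_block.build([None, 1024])  # create variables before wrapping

@tf.recompute_grad
def checkpointed_block(x):
    # Activations inside are recomputed in the backward pass.
    return dense_block(x)

x = tf.random.normal([8, 1024])
with tf.GradientTape() as tape:
    y = tf.reduce_sum(checkpointed_block(x))
grads = tape.gradient(y, dense_block.trainable_variables)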
7. Normalization
[2003.05569] Extended Batch Normalization (arxiv.org)
import tensorflow as tf
from keras.layers.normalization.batch_normalization import BatchNormalizationBase


class ExtendedBatchNormalization(BatchNormalizationBase):
    def __init__(self,
                 axis=-1,
                 momentum=0.99,
                 epsilon=1e-3,
                 center=True,
                 scale=True,
                 beta_initializer='zeros',
                 gamma_initializer='ones',
                 moving_mean_initializer='zeros',
                 moving_variance_initializer='ones',
                 beta_regularizer=None,
                 gamma_regularizer=None,
                 beta_constraint=None,
                 gamma_constraint=None,
                 renorm=False,
                 renorm_clipping=None,
                 renorm_momentum=0.99,
                 trainable=True,
                 name=None,
                 **kwargs):
        # Currently we only support aggregating over the global batch size.
        super(ExtendedBatchNormalization, self).__init__(
            axis=axis,
            momentum=momentum,
            epsilon=epsilon,
            center=center,
            scale=scale,
            beta_initializer=beta_initializer,
            gamma_initializer=gamma_initializer,
            moving_mean_initializer=moving_mean_initializer,
            moving_variance_initializer=moving_variance_initializer,
            beta_regularizer=beta_regularizer,
            gamma_regularizer=gamma_regularizer,
            beta_constraint=beta_constraint,
            gamma_constraint=gamma_constraint,
            renorm=renorm,
            renorm_clipping=renorm_clipping,
            renorm_momentum=renorm_momentum,
            fused=False,
            trainable=trainable,
            virtual_batch_size=None,
            name=name,
            **kwargs)

    def _calculate_mean_and_var(self, x, axes, keep_dims):
        with tf.keras.backend.name_scope('moments'):
            y = tf.cast(x, tf.float32) if x.dtype == tf.float16 else x
            replica_ctx = tf.distribute.get_replica_context()
            if replica_ctx:
                local_sum = tf.math.reduce_sum(y, axis=axes, keepdims=True)
                local_squared_sum = tf.math.reduce_sum(
                    tf.math.square(y), axis=axes, keepdims=True)
                batch_size = tf.cast(tf.shape(y)[0], tf.float32)
                y_sum = replica_ctx.all_reduce(
                    tf.distribute.ReduceOp.SUM, local_sum)
                y_squared_sum = replica_ctx.all_reduce(
                    tf.distribute.ReduceOp.SUM, local_squared_sum)
                global_batch_size = replica_ctx.all_reduce(
                    tf.distribute.ReduceOp.SUM, batch_size)
                axes_vals = [(tf.shape(y))[i] for i in range(1, len(axes))]
                multiplier = tf.cast(tf.reduce_prod(axes_vals), tf.float32)
                multiplier = multiplier * global_batch_size
                mean = y_sum / multiplier
                y_squared_mean = y_squared_sum / multiplier
                # var = E(x^2) - E(x)^2
                variance = y_squared_mean - tf.math.square(mean)
            else:
                # Compute true mean while keeping the dims for proper broadcasting.
                mean = tf.math.reduce_mean(y, axes, keepdims=True, name='mean')
                variance = tf.math.reduce_mean(
                    tf.math.squared_difference(y, tf.stop_gradient(mean)),
                    axes, keepdims=True, name='variance')
            if not keep_dims:
                mean = tf.squeeze(mean, axes)
                variance = tf.squeeze(variance, axes)
            # Extended BN: replace the per-channel variance with its mean over
            # channels, i.e. a single shared variance.
            variance = tf.math.reduce_mean(variance)
            if x.dtype == tf.float16:
                return (tf.cast(mean, tf.float16),
                        tf.cast(variance, tf.float16))
            else:
                return mean, variance
What it does:
A simple improved version of Batch Normalization; the idea is straightforward and effective.
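A hypothetical usage sketch, treating it as a drop-in replacement for the standard BatchNormalization layer:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, padding='same', use_bias=False),
    ExtendedBatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])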
8. Learning Rate Schedules
[1506.01186] Cyclical Learning Rates for Training Neural Networks (arxiv.org)
What it does:
A recommended learning-rate schedule; in certain settings it achieves better generalization.
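A minimal sketch of the paper's triangular policy written as a Keras LearningRateSchedule (the class name and default values are assumptions, not the paper's reference code):

import tensorflow as tf

class TriangularCLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr=1e-4, max_lr=1e-3, step_size=2000):
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.step_size = step_size  # half-cycle length, in steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        cycle = tf.floor(1.0 + step / (2.0 * self.step_size))
        x = tf.abs(step / self.step_size - 2.0 * cycle + 1.0)
        # Oscillates linearly between base_lr and max_lr.
        return self.base_lr + (self.max_lr - self.base_lr) * tf.maximum(0.0, 1.0 - x)

optimizer = tf.keras.optimizers.SGD(learning_rate=TriangularCLR())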
9. Re-parameterization
https://zhuanlan.zhihu.com/p/361090497
What it does:
Improves model generalization by training several sets of parameters in parallel and then merging them into a single set of weights.
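The linked article covers structural re-parameterization (RepVGG and related work). A hedged sketch of the inference-time merge for one simple case, assuming parallel 3x3 and 1x1 conv branches that share the same input with stride 1 and SAME padding (the identity branch and BatchNormalization folding are omitted):

import tensorflow as tf

def merge_branches(kernel_3x3, bias_3x3, kernel_1x1, bias_1x1):
    # kernel_3x3: (3, 3, C_in, C_out); kernel_1x1: (1, 1, C_in, C_out).
    # Zero-pad the 1x1 kernel to 3x3 so both branches collapse into
    # a single equivalent convolution.
    padded_1x1 = tf.pad(kernel_1x1, [[1, 1], [1, 1], [0, 0], [0, 0]])
    return kernel_3x3 + padded_1x1, bias_3x3 + bias_1x1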
10. Long-Tailed Learning
[2110.04596] Deep Long-Tailed Learning: A Survey (arxiv.org)
Jorwnpay/A-Long-Tailed-Survey: a Chinese translation of "Deep Long-Tailed Learning: A Survey" (github.com)
What it does:
Addresses the long-tail problem; it can speed up convergence, improve generalization, and stabilize training.
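As one concrete technique from the family covered by the survey, a hedged sketch of logit adjustment, which adds the scaled log class prior to the logits during training so that rare classes receive larger effective margins (the helper name and the class_counts argument are assumptions):

import tensorflow as tf

def logit_adjusted_loss(labels, logits, class_counts, tau=1.0):
    # class_counts: per-class sample counts of the training set.
    counts = tf.cast(class_counts, tf.float32)
    prior = counts / tf.reduce_sum(counts)
    # Shift logits by the log prior: head classes get a handicap,
    # tail classes an effective margin.
    adjusted = logits + tau * tf.math.log(prior)[tf.newaxis, :]
    return tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=adjusted)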