Hessian Free Optimization (Truncated Newton)

A Hessian-free optimizer built on TensorFlow

Hessian-free optimization (HF), also known as truncated Newton, is a family of algorithms for optimizing non-linear functions with large numbers of independent variables. Unlike Newton’s method, which minimizes its quadratic approximation by inverting the curvature matrix, HF solves this sub-problem with the linear conjugate gradient algorithm (CG), which only needs curvature-matrix–vector products and therefore never forms the curvature matrix explicitly, let alone inverts it. HF thus belongs to the broad class of approximate Newton methods that remain practical for high-dimensional optimization problems such as neural network training.
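
For intuition, the CG sub-solver only ever needs a function mapping a vector v to the curvature-matrix product (such as the Hessian- or Gauss-Newton-vector products shown below). A minimal NumPy-only sketch of linear CG, given purely as an illustration (it is not part of this implementation, all names are placeholders, and b is assumed to be a float vector):

import numpy as np

def conjugate_gradient(matvec, b, max_iters=50, tol=1e-6):
    # Approximately solve A x = b when A is only available through the
    # mapping matvec: v -> A v (e.g. a Hessian- or Gauss-Newton-vector
    # product). HF runs a truncated, damped variant of this loop to
    # minimize its local quadratic model.
    x = np.zeros_like(b)
    r = b - matvec(x)               # residual
    d = r.copy()                    # search direction
    rs_old = r @ r
    for _ in range(max_iters):
        Ad = matvec(d)
        alpha = rs_old / (d @ Ad)   # step length along d
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if rs_new ** 0.5 < tol:     # residual small enough: stop early
            break
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return x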

This implementation includes the main Hessian-free optimization features from Martens (2010) and Martens and Sutskever (2011):

Main computational features

Computing the Hessian-vector product:

# self.W: weights
# grads: gradients of the loss with respect to the weights
# self.damp_pl: Tikhonov damping coefficient
# vec: vector that is multiplied by the Hessian
grad_v = [tf.reduce_sum(g * v) for g, v in zip(grads, vec)]
Hv = tf.gradients(grad_v, self.W, stop_gradients=vec)
Hv = [hv + self.damp_pl * v for hv, v in zip(Hv, vec)]
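
Here Hv is obtained by double backpropagation: grad_v is the scalar product of the gradient with vec, and differentiating it a second time with respect to self.W yields the Hessian-vector product without ever materializing the Hessian; the damp_pl * v term implements Tikhonov damping, i.e. (H + λI)v.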

Computing the Jacobian-vector product (R-operator):

# f: function (tensor) whose Jacobian is taken, e.g. the network output
# x: parameters with respect to which the Jacobian is computed
# vec: Vector that is multiplied by the Jacobian.
r = [tf.reduce_sum([tf.reduce_sum(v * tf.gradients(f, x)[i])
  for i, v in enumerate(vec)])
  for f in tf.unstack(f)]
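
The same Jacobian-vector product can also be obtained with the well-known two-pass trick (a vector-Jacobian product differentiated once more), which avoids the Python loop over the unstacked outputs. A minimal sketch, assuming TF1-style graph mode; Rop_sketch and the dummy tensor u are illustrative names, not part of this code:

import tensorflow as tf

def Rop_sketch(f, x, vec):
    # Jacobian-vector product J(f w.r.t. x) * vec via two reverse-mode passes.
    # f: output tensor; x: list of parameter tensors;
    # vec: list of tensors with the same shapes as x.
    u = tf.ones_like(f)                 # dummy tensor; its value does not matter
    g = tf.gradients(f, x, grad_ys=u)   # first pass: J^T u
    return tf.gradients(g, u, grad_ys=vec)[0]   # second pass: J * vec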

Computing the Gauss-Newton matrix (JᵀHJ) and vector product:

# self.output: activations of the output layer
# self.W: weights
# self.damp_pl: Tikhonov damping coefficient
# vec: vector that is multiplied by the Gauss-Newton matrix
Jv = self.__Rop(self.output, self.W, vec)
Jv = tf.reshape(tf.stack(Jv), [-1, 1])
HJv = tf.gradients(tf.matmul(tf.transpose(tf.gradients(self.loss,
  self.output)[0]), Jv), self.output, stop_gradients=Jv)[0]
JHJv = tf.gradients(tf.matmul(tf.transpose(HJv), self.output), self.W,
  stop_gradients=HJv)
JHJv = [gv + self.damp_pl * v for gv, v in zip(JHJv, vec)]
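
Following Martens (2010), the Gauss-Newton matrix is usually preferred to the Hessian as the curvature matrix because, for a matching loss and output non-linearity (e.g. cross-entropy with a softmax output), it is positive semi-definite, so the CG sub-problem stays well posed even where the true Hessian is indefinite.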

Tested on: