Deep Learning (H2O)
H2O’s Deep Learning is based on a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with selected activation functions. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously), and contributes periodically to the global model via model averaging across the network.
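The model-averaging step can be sketched in a few lines: each node trains its own copy of the weights on local data, and the global model is the element-wise mean across nodes. This is an illustrative sketch, not H2O's internal implementation; the function name and flat weight lists are assumptions.

```python
def average_models(node_weights):
    # node_weights: one flat list of model weights per compute node.
    # The global model is the element-wise mean across the nodes
    # (illustrative sketch of model averaging, not H2O's actual code).
    n_nodes = len(node_weights)
    return [sum(ws) / n_nodes for ws in zip(*node_weights)]

node_a = [1.0, 2.0, 3.0]   # weights on node A after a local training pass
node_b = [3.0, 4.0, 5.0]   # weights on node B after a local training pass
global_weights = average_models([node_a, node_b])  # element-wise mean
```

In the real system each node then continues training from the averaged global parameters, so the copies periodically re-synchronize.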
Hidden layer sizes (e.g., 20, 10 trains a network with two hidden layers of 20 and 10 neurons, respectively).
Number of passes over the dataset (epochs) used to train the model.
L1 regularization (can add stability through sparsity and improve generalization; drives many weights to exactly 0).
L2 regularization (can add stability and improve generalization; penalizes large weights, shrinking them without making them exactly 0).
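The two penalties above add a term to the training loss. The helper below is a hypothetical illustration of that term (H2O applies it internally per weight update; this function and its name are not part of the H2O API):

```python
def regularization_penalty(weights, l1, l2):
    # Term added to the loss: the l1 part (sum of |w|) promotes sparsity,
    # the l2 part (sum of w^2) discourages individually large weights.
    return (l1 * sum(abs(w) for w in weights)
            + l2 * sum(w * w for w in weights))

# Small example: the zero weight contributes nothing to either penalty.
penalty = regularization_penalty([0.5, -2.0, 0.0], l1=1e-5, l2=1e-3)
```

Because the L1 term is non-smooth at 0, it tends to push small weights all the way to 0, while the L2 term only shrinks them, which matches the behavior described for the two parameters.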
Activation function for the neurons.
Maximum gradient norm (can improve training stability, especially when using a rectifier-like activation function).
Use dropout (causes sparse random activation of neurons, improving model generalization by forcing all neurons to learn a more global view of the data).
Input layer dropout ratio (probability of a neuron not activating). Values lower than 0.2 are recommended.
Hidden layer dropout ratios (probability of a neuron not activating). Values lower than 0.5 are recommended. Enter one value per hidden layer, separated by commas. Example: for two hidden layers, 0.1, 0.5.
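A minimal sketch of how a dropout ratio is applied to one layer's activations, using standard "inverted" dropout (surviving activations are rescaled so the expected sum is unchanged). This is the textbook scheme, not necessarily H2O's exact implementation:

```python
import random

def apply_dropout(activations, ratio, rng):
    # Each neuron is dropped (output forced to 0) with probability `ratio`;
    # survivors are scaled by 1/(1 - ratio) so the expected activation
    # level of the layer is preserved (inverted dropout).
    scale = 1.0 / (1.0 - ratio)
    return [a * scale if rng.random() >= ratio else 0.0
            for a in activations]

rng = random.Random(0)           # fixed seed for a reproducible example
out = apply_dropout([1.0] * 8, ratio=0.5, rng=rng)
```

At prediction time no neurons are dropped, and because of the rescaling no further correction is needed.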
Adaptive learning rate decay factor: controls the "memory" of past gradient information used to compute the adaptive learning rate; the running average decays at this rate. Values between 0.9 and 0.999 are recommended.
Adaptive learning rate smoothing factor (prevents division by zero and allows progress). Values between 1e-10 and 1e-4 are recommended.
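The two parameters above correspond to the rho (decay) and epsilon (smoothing) of an ADADELTA-style adaptive learning rate. A sketch of one such update, assuming the standard ADADELTA recurrences (the function and its state layout are illustrative, not H2O internals):

```python
import math

def adadelta_step(g, state, rho=0.99, eps=1e-8):
    # One ADADELTA-style update for a single weight.
    # rho  = decay factor: the "memory" of past squared gradients/updates.
    # eps  = smoothing factor: prevents division by zero and lets the
    #        very first steps make progress.
    eg2, edx2 = state                          # running means of g^2, dx^2
    eg2 = rho * eg2 + (1 - rho) * g * g
    dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * g
    edx2 = rho * edx2 + (1 - rho) * dx * dx
    return dx, (eg2, edx2)

dx, state = adadelta_step(g=0.1, state=(0.0, 0.0))  # first step from rest
```

A larger rho keeps more history (smoother, slower to adapt); a larger eps makes early steps bigger but caps how aggressively the rate adapts.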
Learning rate (higher => less stable, lower => slower convergence)
Learning rate annealing: the effective learning rate is rate / (1 + rate_annealing * samples), where samples is the number of training samples processed so far.
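The annealing formula above can be evaluated directly; the helper name below is illustrative:

```python
def annealed_rate(rate, rate_annealing, samples):
    # Effective learning rate after `samples` training samples,
    # per the formula rate / (1 + rate_annealing * samples).
    return rate / (1.0 + rate_annealing * samples)

annealed_rate(0.01, 1e-6, 0)          # full rate at the start of training
annealed_rate(0.01, 1e-6, 1_000_000)  # roughly halves the rate (~0.005)
```

With rate_annealing = 0, the learning rate stays constant for the whole run.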
The learning rate decay parameter controls the change of learning rate across layers. For example, assume the rate parameter is set to 0.01, and the rate decay parameter is set to 0.5. The learning rate for the weights connecting the input and first hidden layer is 0.01, the learning rate for the weights connecting the first and the second hidden layer is 0.005, etc.
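The per-layer rates in the example above follow a simple geometric rule, which can be sketched as (the helper name is illustrative):

```python
def layer_rate(rate, rate_decay, layer_index):
    # Learning rate for the weights feeding layer `layer_index`,
    # where index 0 is the input -> first-hidden-layer weights.
    return rate * rate_decay ** layer_index

# rate = 0.01, rate_decay = 0.5, three weight layers:
rates = [layer_rate(0.01, 0.5, i) for i in range(3)]  # [0.01, 0.005, 0.0025]
```

Setting rate_decay = 1.0 gives every layer the same learning rate.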
The amount of momentum at the beginning of training.
The amount of training for which momentum increases (assuming momentum_stable is larger than momentum_start). The ramp is measured in the number of training samples.
The momentum_stable parameter controls the final momentum value, reached after momentum_ramp training samples. Momentum stays at this value for the remainder of training.
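The three momentum parameters describe a linear ramp, which can be sketched as follows (the function name and the parameter values used in the example are illustrative, not H2O's defaults):

```python
def momentum(samples, momentum_start=0.5, momentum_ramp=1_000_000,
             momentum_stable=0.99):
    # Momentum rises linearly from momentum_start to momentum_stable
    # over momentum_ramp training samples, then holds at momentum_stable.
    if samples >= momentum_ramp:
        return momentum_stable
    frac = samples / momentum_ramp
    return momentum_start + frac * (momentum_stable - momentum_start)

momentum(0)            # momentum_start at the beginning of training
momentum(500_000)      # halfway up the ramp (~0.745 here)
momentum(2_000_000)    # momentum_stable once the ramp is complete
```

Ramping momentum up gradually keeps early training stable while still allowing fast convergence later, which is why the ramp is expressed in training samples rather than epochs.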