Fine-Tuning Your Multi-Layer Perceptron: Testing the Impact of Model Parameters
Simple automated testing and analysis of the effect of different tuning parameters on a baseline Multi-Layer Perceptron.
✨ Introduction
This is an experiment conducted to test the effect of different tuning parameters on a baseline Multi-Layer Perceptron. The code developed for model analysis and testing is a collection of automated scripts to test different configurations of a model with minimal changes to the code.
You can get the code from this repository.
The code allows testing for the following model parameters:
- Activation Functions
- Batch Size
- Epochs
- Kernel Initializers
- Learning Rate
- Loss Functions
- Network Architecture
- Optimizers
The MLP is implemented with Keras, so the functions and options tested are limited to what the library provides.
✨ File Structure
The main directory contains the Python notebook for data processing, the dataset used, and sub-directories for all the parameters mentioned above. Each parameter sub-directory has a Python notebook with the code to test that parameter and another sub-directory called figures to store all the plots generated from the Python notebook.
✨ Data Processing
The dataset used here is already pre-processed, so little needs to be done beyond splitting it into training, validation and test sets. If you use a different dataset, the code to process it can be added to this file.
The dataset 'online_shop_data' has a total of 16,626 samples. 20% of this is used as the test set (~3,326 samples). The remainder is further split into training and validation sets of ~10,640 and ~2,660 samples respectively.
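The exact processing code isn't reproduced here, but a minimal sketch of the split, assuming the data lives in a CSV file with a label column (both names are assumptions) and using scikit-learn's train_test_split, looks like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the pre-processed dataset (file name and label column are assumptions).
data = pd.read_csv("online_shop_data.csv")
X = data.drop(columns=["label"]).values
y = data["label"].values

# Hold out 20% as the test set (~3,326 of 16,626 samples).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remaining 80% again into ~10,640 training and ~2,660 validation samples.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42)
```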
✨ Activation Function
The effect of 9 different activation functions is tested in this experiment. These activations apply only to the input and hidden layers, not to the output layer. The activation functions used are:
relu, sigmoid, softmax, softplus, softsign, tanh, selu, elu, exponential
All the other parameters were kept constant to get a better idea of the effect of different activation functions:
activation function for the output layer: softmax, batch size: 32, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
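As a rough idea of what one of these sweeps looks like, here is a minimal sketch built around a hypothetical build_model helper; it assumes the data splits from the section above, with labels one-hot encoded (e.g. via keras.utils.to_categorical) to match the 2-unit softmax output:

```python
from tensorflow import keras

ACTIVATIONS = ["relu", "sigmoid", "softmax", "softplus", "softsign",
               "tanh", "selu", "elu", "exponential"]

def build_model(hidden_activation):
    # 512-128-96-2 architecture; only the hidden-layer activation varies.
    model = keras.Sequential([
        keras.Input(shape=(X_train.shape[1],)),
        keras.layers.Dense(512, activation=hidden_activation, kernel_initializer="he_uniform"),
        keras.layers.Dense(128, activation=hidden_activation, kernel_initializer="he_uniform"),
        keras.layers.Dense(96, activation=hidden_activation, kernel_initializer="he_uniform"),
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.03),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

for act in ACTIVATIONS:
    model = build_model(act)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=32, epochs=50, verbose=0)
    print(act, model.evaluate(X_test, y_test, verbose=0))
```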
Following are the graphs demonstrating the effect of different activation functions on the MLP model:
Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.
Visualizing this for loss and accuracy separately, we get the following plots.
We can observe from the data frame and the graphs that the softmax activation function leads to the highest test-set accuracy at 84.39%, followed closely by softsign at 84.18%.
✨ Batch Size
6 different mini-batch sizes are tested in this experiment:
16, 32, 64, 128, 256, and 512.
All the other parameters were kept constant to get a better idea of the effect of different batch sizes:
activation function: relu, activation function for the output layer: softmax, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
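A sketch of the corresponding sweep, reusing the hypothetical build_model helper from the activation-function section; only the batch_size argument of model.fit changes:

```python
BATCH_SIZES = [16, 32, 64, 128, 256, 512]

for bs in BATCH_SIZES:
    model = build_model("relu")   # fixed 512-128-96-2 model from the earlier sketch
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=bs, epochs=50, verbose=0)
    loss, acc = model.evaluate(X_test, y_test, verbose=0)
    print(f"batch_size={bs}: test accuracy {acc:.4f}")
```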
Following are the graphs demonstrating the effect of different batch sizes on the MLP model:
Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.
Visualizing this for loss and accuracy separately, we get the following plots.
As observed from the data, a batch size of 128 gives the highest test accuracy at 85.26%, with batch sizes of 512 and 16 following close behind at 83.61% and 83.13% respectively.
✨ Epochs
Similar to batch size, 6 variations of the number of training epochs are considered in this experiment:
10, 20, 50, 100, 200, 500
All the other parameters were kept constant to get a better idea of the effect of different numbers of epochs:
activation function: relu, activation function for the output layer: softmax, batch size: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
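A sketch of the epoch sweep, again reusing the hypothetical build_model helper; keeping each run's History object makes it easy to plot the per-epoch training and validation curves:

```python
EPOCH_COUNTS = [10, 20, 50, 100, 200, 500]
histories = {}

for n_epochs in EPOCH_COUNTS:
    model = build_model("relu")
    history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                        batch_size=50, epochs=n_epochs, verbose=0)
    histories[n_epochs] = history.history   # per-epoch loss/accuracy curves for plotting
    loss, acc = model.evaluate(X_test, y_test, verbose=0)
    print(f"epochs={n_epochs}: test accuracy {acc:.4f}")
```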
Following are the graphs demonstrating the effect of different numbers of epochs on the MLP model:
Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.
Visualizing this for loss and accuracy separately, we get the following plots.
As we can observe from the data frame and the plots, training for 10 epochs gives the highest test accuracy at 85.23%, followed by 82.08% for 20 epochs, while all the other variations sit around the 50% mark.
✨ Kernel Initializers
There are 13 different kernel initializers available in the Keras library that can be used to configure the model:
random_normal, random_uniform, truncated_normal, zeros, ones, glorot_normal, glorot_uniform, he_normal, he_uniform, identity, orthogonal, constant, variance_scaling
All the other parameters were kept constant to get a better idea of the effect of different kernel initializers:
activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
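A sketch of how the initializer can be swept while everything else stays fixed; only the kernel_initializer argument of the Dense layers varies:

```python
INITIALIZERS = ["random_normal", "random_uniform", "truncated_normal", "zeros", "ones",
                "glorot_normal", "glorot_uniform", "he_normal", "he_uniform",
                "identity", "orthogonal", "constant", "variance_scaling"]

def build_model_with_init(init):
    # Same 512-128-96-2 layout as before; only the weight initializer changes.
    model = keras.Sequential([
        keras.Input(shape=(X_train.shape[1],)),
        keras.layers.Dense(512, activation="relu", kernel_initializer=init),
        keras.layers.Dense(128, activation="relu", kernel_initializer=init),
        keras.layers.Dense(96, activation="relu", kernel_initializer=init),
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.03),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

for init in INITIALIZERS:
    model = build_model_with_init(init)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=50, epochs=50, verbose=0)
    print(init, model.evaluate(X_test, y_test, verbose=0))
```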
Following are the graphs demonstrating the effect of different kernel initializers on the MLP model:
Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.
Visualizing this for loss and accuracy separately, we get the following plots.
We can observe from the data and plots above that we get the best accuracy of 85.35% using variance_scaling. A few other kernel initializers, such as orthogonal, identity, he_uniform, he_normal, truncated_normal, random_uniform, and random_normal, follow behind with accuracies in a similar range.
✨ Learning Rate
6 different learning rates, ranging from small to large step sizes, are tested in this experiment:
0.001, 0.003, 0.01, 0.03, 0.1, 0.3
All the other parameters were kept constant to get a better idea of the effect of different learning rates:
activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, kernel initializer: he_uniform, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
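A sketch of the learning-rate sweep, reusing the hypothetical build_model helper and simply recompiling with a differently configured Adam optimizer:

```python
LEARNING_RATES = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]

for lr in LEARNING_RATES:
    model = build_model("relu")
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=50, epochs=50, verbose=0)
    print(lr, model.evaluate(X_test, y_test, verbose=0))
```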
Following are the graphs demonstrating the effect of different learning rates on the MLP model:
Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.
Visualizing this for loss and accuracy separately, we get the following plots.
It can be observed from the data that a learning rate of 0.003 achieves the highest accuracy of 86.68%, with 0.001 matching it at 86.68% and 0.01 and 0.03 following at 85.95% and 81.62% respectively. However, the tradeoff for smaller learning rates is the additional time and computation required.
✨ Loss Functions
A total of 7 distinct loss functions are tested in this experiment: 4 categorized as probabilistic losses and 3 as hinge losses.
probabilistic losses: binary_crossentropy, categorical_crossentropy, poisson, kl_divergence; hinge losses: hinge, squared_hinge, categorical_hinge
All the other parameters were kept constant to get a better idea of the effect of different loss functions:
activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, optimizer: Adam, network architecture: 512–128–96–2
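A sketch of the loss-function sweep; only the loss argument of model.compile changes (note that the hinge-family losses assume a different label convention than one-hot 0/1 targets, which is worth keeping in mind when reading their results):

```python
LOSSES = ["binary_crossentropy", "categorical_crossentropy", "poisson",
          "kl_divergence", "hinge", "squared_hinge", "categorical_hinge"]

for loss_fn in LOSSES:
    model = build_model("relu")
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.03),
                  loss=loss_fn, metrics=["accuracy"])
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=50, epochs=50, verbose=0)
    print(loss_fn, model.evaluate(X_test, y_test, verbose=0))
```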
Following are the graphs demonstrating the effect of different loss functions on the MLP model:
Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.
Visualizing this for loss and accuracy separately, we get the following plots.
As apparent from the data and plots above, the categorical_crossentropy loss function achieves the highest accuracy at 83.73%, followed only by the hinge loss at 74.32%, while the rest of the loss functions stay around 50%.
✨ Network Architectures
There are countless variations of depth and width that can be implemented in an MLP; however, more complexity doesn't always mean better performance. 5 different network architectures are implemented in this experiment:
128–2, 256–2, 256–128–2, 128–256–128–2, 512–256–128–96–2
All the other parameters were kept constant to get a better idea of the effect of different network architectures:
activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam
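A sketch of how a model can be built from a tuple of layer widths so that the architectures above can be swept with a single helper:

```python
ARCHITECTURES = [(128, 2), (256, 2), (256, 128, 2), (128, 256, 128, 2),
                 (512, 256, 128, 96, 2)]

def build_from_widths(widths):
    layers = [keras.Input(shape=(X_train.shape[1],))]
    # Hidden layers use relu; the final 2-unit layer uses softmax.
    for units in widths[:-1]:
        layers.append(keras.layers.Dense(units, activation="relu",
                                         kernel_initializer="he_uniform"))
    layers.append(keras.layers.Dense(widths[-1], activation="softmax"))
    model = keras.Sequential(layers)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.03),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

for widths in ARCHITECTURES:
    model = build_from_widths(widths)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=50, epochs=50, verbose=0)
    print(widths, model.evaluate(X_test, y_test, verbose=0))
```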
Following are the graphs demonstrating the effect of different network architectures on the MLP model:
Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.
Visualizing this for loss and accuracy separately, we get the following plots.
We can observe that the simplest architecture tested (128–2) performs the best with an accuracy of 85.23%, while the deepest and most complex network (512–256–128–96–2) consistently has the lowest accuracy, around 50%.
✨ Optimizers
10 different optimizers available in the Keras library are tested in this experiment:
SGD, RMSprop, Adam, AdamW, Adadelta, Adagrad, Adamax, Adafactor, Nadam, Ftrl
All the other parameters were kept constant to get a better idea of the effect of different optimizers:
activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, network architecture: 512–128–96–2
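A sketch of the optimizer sweep; each optimizer class is instantiated with the same learning rate (AdamW and Adafactor are available in more recent Keras releases and can be added to the mapping):

```python
OPTIMIZERS = {
    "SGD": keras.optimizers.SGD,
    "RMSprop": keras.optimizers.RMSprop,
    "Adam": keras.optimizers.Adam,
    "Adamax": keras.optimizers.Adamax,
    "Adagrad": keras.optimizers.Adagrad,
    "Adadelta": keras.optimizers.Adadelta,
    "Nadam": keras.optimizers.Nadam,
    "Ftrl": keras.optimizers.Ftrl,
}

for name, opt_cls in OPTIMIZERS.items():
    model = build_model("relu")
    model.compile(optimizer=opt_cls(learning_rate=0.03),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=50, epochs=50, verbose=0)
    print(name, model.evaluate(X_test, y_test, verbose=0))
```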
Following are the graphs demonstrating the effect of different optimizers on the MLP model:
Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.
Visualizing this for loss and accuracy separately, we get the following plots.
From the above data and plots, we can discern that the Adamax optimizer performs the best with a test-set accuracy of 87.01%, closely followed by Adagrad, Ftrl, Adadelta and Adam with accuracies of 86.16%, 85.59%, 84.93% and 83.07% respectively.
✨ How to Test Different Variations?
Each Python notebook has a code cell that holds the variables for all the constant parameters and the variable parameters. These can then be changed to fit the use case and run consistent tests.
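A rough sketch of what such a cell can look like, here using the learning rate as the variable parameter (the variable names are illustrative rather than the exact ones from the repository):

```python
# Constant parameters (held fixed across the whole test run).
ACTIVATION_HIDDEN = "relu"
ACTIVATION_OUTPUT = "softmax"
BATCH_SIZE = 50
EPOCHS = 50
KERNEL_INITIALIZER = "he_uniform"
LOSS_FUNCTION = "categorical_crossentropy"
OPTIMIZER = "adam"
ARCHITECTURE = (512, 128, 96, 2)

# Variable parameter: the list of values the test iterates over.
learning_rates_to_test = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]
```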
As you can see from the example snippet above, all the constant parameters are depicted by variable names in all caps, while the variable parameter is a list. The items in the list can be changed to test different variations of that parameter.
Each test iterates through the variable parameter list and builds a model using a particular instance of the variable parameter and all the constant parameters. This enables testing different variations with minimal changes to the code and significantly reduces the chances of human error.
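Putting it together, here is a hedged sketch of such a test loop, collecting each run's loss and accuracy on the training, validation and test sets into a data frame (again using the hypothetical build_model helper; column names are illustrative):

```python
import pandas as pd
from tensorflow import keras

records = []
for lr in learning_rates_to_test:
    model = build_model(ACTIVATION_HIDDEN)   # hypothetical helper from the earlier sketches
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss=LOSS_FUNCTION, metrics=["accuracy"])
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=0)

    # Record loss and accuracy for every data subset.
    row = {"learning_rate": lr}
    for split, (X_s, y_s) in {"train": (X_train, y_train),
                              "val": (X_val, y_val),
                              "test": (X_test, y_test)}.items():
        loss, acc = model.evaluate(X_s, y_s, verbose=0)
        row[f"{split}_loss"], row[f"{split}_acc"] = loss, acc
    records.append(row)

results = pd.DataFrame(records)
print(results)
```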
✨ Footnote
Hey there, hope you liked the blog post. This was originally a homework assignment to test different variations and their effect, but I decided to automate it (because, of course, I'm lazy), and it turned out to make testing accessible with minimal changes to the code.
Consider following me on Medium, Twitter and other platforms to read more about Productivity, Design and Code.