Fine-Tuning Your Multi-Layer Perceptron: Testing the Impact of Model Parameters

Satwik Gawand
13 min read · Apr 24, 2023

Simple automated testing and analysis of the effect of different tuning parameters on a baseline Multi-Layer Perceptron.


✨ Introduction

This experiment tests the effect of different tuning parameters on a baseline Multi-Layer Perceptron (MLP). The code developed for the analysis is a collection of automated scripts that test different model configurations with minimal changes to the code.

You can get the code from this repository.

The code allows testing for the following model parameters:

  1. Activation Functions
  2. Batch Size
  3. Epochs
  4. Kernel Initializers
  5. Learning Rate
  6. Loss Functions
  7. Network Architecture
  8. Optimizers

The MLP is implemented with the Keras library, so the functions and options tested are limited to what Keras provides.
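For reference, below is a minimal sketch (not the author's actual notebook code) of a Keras helper that builds the baseline MLP held constant in the later sections: a 512–128–96–2 network with relu hidden layers, he_uniform initialization, a softmax output, categorical cross-entropy loss and Adam at a 0.03 learning rate. The function name and keyword arguments are my own; the sketches in the sections below reuse it.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(input_dim,
              architecture=(512, 128, 96, 2),
              activation="relu",
              kernel_initializer="he_uniform",
              learning_rate=0.03,
              loss="categorical_crossentropy",
              optimizer=None):
    """Build and compile an MLP from a tuple of layer widths (hypothetical helper)."""
    model = keras.Sequential([keras.Input(shape=(input_dim,))])
    # hidden layers: every width except the last entry
    for units in architecture[:-1]:
        model.add(layers.Dense(units, activation=activation,
                               kernel_initializer=kernel_initializer))
    # output layer: softmax is kept fixed across all the experiments below
    model.add(layers.Dense(architecture[-1], activation="softmax"))
    model.compile(
        optimizer=optimizer or keras.optimizers.Adam(learning_rate=learning_rate),
        loss=loss,
        metrics=["accuracy"],
    )
    return model
```

Keeping the builder parameterized this way is what lets each experiment change exactly one argument while everything else stays at the baseline.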

✨ File Structure

The main directory contains the Python notebook for data processing, the dataset used, and sub-directories for all the parameters mentioned above. Each parameter sub-directory has a Python notebook with the code to test that parameter and another sub-directory called figures to store all the plots generated from the Python notebook.

File Structure Visualization

✨ Data Processing

The dataset used here is already pre-processed, so little needs to be done beyond splitting it into training, validation and test sets. If you use a different dataset, the code to process it can be added to this file.

The ‘online_shop_data’ dataset has a total of 16,626 samples. 20% of it (~3,326 samples) is held out as the test set, and the remainder is further split into training (~10,640 samples) and validation (~2,660 samples) sets.
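A rough sketch of that split, assuming the processed features and one-hot labels are already loaded into arrays X and y (placeholder names, not the notebook's actual variables):

```python
from sklearn.model_selection import train_test_split

# X and y stand in for the processed feature matrix and one-hot labels
# of 'online_shop_data' (16,626 samples in total).
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)                      # ~3,326 test samples
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, random_state=42)  # ~10,640 train / ~2,660 val
```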

✨ Activation Function

This experiment tests the effect of 9 different activation functions. These apply only to the input and hidden layers, not the output layer. The activation functions tested are:

relu, sigmoid, softmax, softplus, softsign, tanh, selu, elu, exponential

All the other parameters were kept constant to get a better idea of the effect of different activation functions:

activation function for the output layer: softmax, batch size: 32, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
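A sketch of how this test can be wired up, reusing the build_mlp helper from the introduction and the X_train/X_val splits from the data-processing step (the list and dictionary names are illustrative):

```python
ACTIVATIONS = ["relu", "sigmoid", "softmax", "softplus", "softsign",
               "tanh", "selu", "elu", "exponential"]

histories = {}
for act in ACTIVATIONS:
    # only the hidden-layer activation changes; the output stays softmax
    model = build_mlp(input_dim=X_train.shape[1], activation=act)
    histories[act] = model.fit(X_train, y_train, batch_size=32, epochs=50,
                               validation_data=(X_val, y_val), verbose=0)
```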

Following are the graphs demonstrating the effect of different activation functions on the MLP model:

loss and accuracy for activation functions across epochs

Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.

mean loss and accuracy for activation functions across training, validation and test sets

Visualizing this for loss and accuracy separately, we get the following plots.

Visualization for loss and accuracy metrics for all activation functions

We can observe from the data frame and the graphs that the softmax activation function leads to the highest test accuracy at 84.39%, followed closely by the softsign activation function at 84.18%.

✨ Batch Size

6 different mini-batch sizes are tested in this experiment:

16, 32, 64, 128, 256, and 512.

All the other parameters were kept constant to get a better idea of the effect of different batch sizes:

activation function: relu, activation function for the output layer: softmax, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
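A sketch of the corresponding loop, again reusing build_mlp; only the batch_size argument passed to fit() changes between runs:

```python
BATCH_SIZES = [16, 32, 64, 128, 256, 512]

for batch_size in BATCH_SIZES:
    model = build_mlp(input_dim=X_train.shape[1])   # all other settings at baseline
    model.fit(X_train, y_train, batch_size=batch_size, epochs=50,
              validation_data=(X_val, y_val), verbose=0)
```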

Following are the graphs demonstrating the effect of different batch sizes on the MLP model:

loss and accuracy for batch sizes across epochs

Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.

mean loss and accuracy for batch sizes across training, validation and test sets

Visualizing this for loss and accuracy separately, we get the following plots.

Visualization for loss and accuracy metrics for all batch sizes

As observed from the data, a batch size of 128 gives the highest test accuracy at 85.26%, with batch sizes of 512 and 16 following close behind at 83.61% and 83.13% respectively.

✨ Epochs

Similar to batch size, 6 different epoch counts are considered in the experiment:

10, 20, 50, 100, 200, 500

All the other parameters were kept constant to get a better idea of the effect of different epoch sizes:

activation function: relu, activation function for the output layer: softmax, batch size: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
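The loop for this test mirrors the batch-size one; only the epochs argument to fit() varies (a sketch, not the original code):

```python
EPOCH_COUNTS = [10, 20, 50, 100, 200, 500]

for n_epochs in EPOCH_COUNTS:
    model = build_mlp(input_dim=X_train.shape[1])
    model.fit(X_train, y_train, batch_size=50, epochs=n_epochs,
              validation_data=(X_val, y_val), verbose=0)
```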

Following are the graphs demonstrating the effect of different epoch sizes on the MLP model:

loss and accuracy across different epoch sizes

Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.

mean loss and accuracy for epoch sizes across training, validation and test sets

Visualizing this for loss and accuracy separately, we get the following plots.

Visualization for loss and accuracy metrics for all epoch sizes

As we can observe from the data frame and the plots, training for 10 epochs gives the highest test accuracy at 85.23%, followed by 20 epochs at 82.08%. All the other variations hover around the 50% mark.

✨ Kernel Initializers

13 different kernel initializers from the Keras library are tested in this experiment:

random_normal, random_uniform, truncated_normal, zeros, ones, glorot_normal, glorot_uniform, he_normal, he_uniform, identity, orthogonal, constant, variance_scaling

All the other parameters were kept constant to get a better idea of the effect of different kernel initializers:

activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
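In Keras the initializer is set per Dense layer, so the sketch below simply passes each string identifier to the hidden layers through build_mlp:

```python
KERNEL_INITIALIZERS = ["random_normal", "random_uniform", "truncated_normal",
                       "zeros", "ones", "glorot_normal", "glorot_uniform",
                       "he_normal", "he_uniform", "identity", "orthogonal",
                       "constant", "variance_scaling"]

for initializer in KERNEL_INITIALIZERS:
    model = build_mlp(input_dim=X_train.shape[1], kernel_initializer=initializer)
    model.fit(X_train, y_train, batch_size=50, epochs=50,
              validation_data=(X_val, y_val), verbose=0)
```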

Following are the graphs demonstrating the effect of different kernel initializers on the MLP model:

loss and accuracy for kernel initializers across epochs

Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.

mean loss and accuracy for kernel initializers across training, validation and test sets

Visualizing this for loss and accuracy separately, we get the following plots.

Visualization for loss and accuracy metrics for all kernel initializers

We can observe from the data and plots above that variance_scaling gives the best test accuracy at 85.35%. Several other kernel initializers, such as orthogonal, identity, he_uniform, he_normal, truncated_normal, random_uniform, and random_normal, follow with accuracy in a similar range.

✨ Learning Rate

6 different learning rates, ranging from small to large step sizes, are tested in this experiment:

0.001, 0.003, 0.01, 0.03, 0.1, 0.3

All the other parameters were kept constant to get a better idea of the effect of different learning rates:

activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, kernel initializer: he_uniform, loss function: categorical_crossentropy, optimizer: Adam, network architecture: 512–128–96–2
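The learning rate is a property of the optimizer rather than the model, so the sketch below just constructs Adam with each rate via the build_mlp helper:

```python
LEARNING_RATES = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]

for lr in LEARNING_RATES:
    model = build_mlp(input_dim=X_train.shape[1], learning_rate=lr)
    model.fit(X_train, y_train, batch_size=50, epochs=50,
              validation_data=(X_val, y_val), verbose=0)
```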

Following are the graphs demonstrating the effect of different learning rates on the MLP model:

loss and accuracy for learning rates across epochs

Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.

mean loss and accuracy for learning rates across training, validation and test sets

Visualizing this for loss and accuracy separately, we get the following plots.

Visualization for loss and accuracy metrics for all learning rates

It can be observed from the data that a learning rate of 0.003 demonstrates the highest accuracy at 86.68%, followed closely by learning rates of 0.001, 0.01, and 0.03 at 86.68%, 85.95%, and 81.62% respectively. However, the trade-off with smaller learning rates is the additional time and computation required.

✨ Loss Functions

A total of 7 distinct loss functions are tested in this experiment: 4 categorized as probabilistic losses and the other 3 as hinge losses.

probabilistic losses: binary_crossentropy, categorical_crossentropy, poisson, kl_divergence; hinge losses: hinge, squared_hinge, categorical_hinge

All the other parameters were kept constant to get a better idea of the effect of different loss functions:

activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, optimizer: Adam, network architecture: 512–128–96–2
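Each loss is passed as a string identifier to compile(), which build_mlp forwards; a sketch of the loop:

```python
LOSS_FUNCTIONS = ["binary_crossentropy", "categorical_crossentropy", "poisson",
                  "kl_divergence", "hinge", "squared_hinge", "categorical_hinge"]

for loss_fn in LOSS_FUNCTIONS:
    model = build_mlp(input_dim=X_train.shape[1], loss=loss_fn)
    model.fit(X_train, y_train, batch_size=50, epochs=50,
              validation_data=(X_val, y_val), verbose=0)
```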

Following are the graphs demonstrating the effect of different loss functions on the MLP model:

loss and accuracy for loss functions across epochs

Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.

mean loss and accuracy for loss functions across training, validation and test sets

Visualizing this for loss and accuracy separately, we get the following plots.

Visualization for loss and accuracy metrics for all loss functions

As is apparent from the data and plots above, the categorical_crossentropy loss function demonstrates the highest test accuracy at 83.73%, followed by the hinge loss function at 74.32%, while the rest of the loss functions stay around 50%.

✨ Network Architectures

There are countless variations of depth and width that can be implemented in an MLP; however, more complexity doesn't always mean better performance. 5 different network architectures are tested in this experiment:

128–2, 256–2, 256–128–2, 128–256–128–2, 512–256–128–96–2

All the other parameters were kept constant to get a better idea of the effect of different network architectures:

activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, optimizer: Adam
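Since build_mlp (sketched in the introduction) takes the architecture as a tuple of layer widths, this test reduces to looping over those tuples; a sketch:

```python
ARCHITECTURES = [(128, 2), (256, 2), (256, 128, 2),
                 (128, 256, 128, 2), (512, 256, 128, 96, 2)]

for arch in ARCHITECTURES:
    model = build_mlp(input_dim=X_train.shape[1], architecture=arch)
    model.fit(X_train, y_train, batch_size=50, epochs=50,
              validation_data=(X_val, y_val), verbose=0)
```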

Following are the graphs demonstrating the effect of different network architectures on the MLP model:

loss and accuracy for network architectures across epochs

Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.

mean loss and accuracy for network architectures across training, validation and test sets

Visualizing this for loss and accuracy separately, we get the following plots.

Visualization for loss and accuracy metrics for all network architectures

We can observe that the simplest architecture, 128–2, performs the best with a test accuracy of 85.23%, while the deepest and most complex network, 512–256–128–96–2, consistently has the lowest accuracy at around 50%.

✨ Optimizers

10 different optimizers available in the Keras library are tested in this experiment:

SGD, RMSprop, Adam, AdamW, Adadelta, Adagrad, Adamax, Adafactor, Nadam, Ftrl

All the other parameters were kept constant to get a better idea of the effect of different optimizers:

activation function: relu, activation function for the output layer: softmax, batch size: 50, epochs: 50, kernel initializer: he_uniform, learning rate: 0.03, loss function: categorical_crossentropy, network architecture: 512–128–96–2
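A sketch of the loop, looking each optimizer class up by name and giving it the shared 0.03 learning rate (this assumes a Keras version recent enough to ship AdamW and Adafactor):

```python
OPTIMIZER_NAMES = ["SGD", "RMSprop", "Adam", "AdamW", "Adadelta", "Adagrad",
                   "Adamax", "Adafactor", "Nadam", "Ftrl"]

for name in OPTIMIZER_NAMES:
    # resolve the class by name, e.g. keras.optimizers.Adamax(learning_rate=0.03)
    optimizer = getattr(keras.optimizers, name)(learning_rate=0.03)
    model = build_mlp(input_dim=X_train.shape[1], optimizer=optimizer)
    model.fit(X_train, y_train, batch_size=50, epochs=50,
              validation_data=(X_val, y_val), verbose=0)
```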

Following are the graphs demonstrating the effect of different optimizers on the MLP model:

loss and accuracy for optimizers across epochs

Additionally, we can observe the mean values for loss and accuracy across the different subsets of data: training, validation and test.

mean loss and accuracy for optimizers across training, validation and test sets

Visualizing this for loss and accuracy separately, we get the following plots.

Visualization for loss and accuracy metrics for all optimizers

From the data and plots above, we can discern that the Adamax optimizer performs the best with a test accuracy of 87.01%, closely followed by Adagrad, Ftrl, Adadelta and Adam at 86.16%, 85.59%, 84.93% and 83.07% respectively.

✨ How to Test Different Variations?

Each Python notebook has a code cell that holds the variables for all the constant parameters and the variable parameter. These can be changed to fit the use case while keeping the tests consistent.

code snippet example

As you can see from the example snippet above, all the constant parameters are represented by variable names in all caps, while the variable parameter is a list. The items in this list can be changed to set which variations of the parameter are tested.
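For the batch-size notebook, for example, such a configuration cell might look roughly like this (a sketch with illustrative names, not the author's exact code):

```python
# constant parameters (upper-case names)
ACTIVATION = "relu"
OUTPUT_ACTIVATION = "softmax"
EPOCHS = 50
KERNEL_INITIALIZER = "he_uniform"
LEARNING_RATE = 0.03
LOSS_FUNCTION = "categorical_crossentropy"
OPTIMIZER = "Adam"
ARCHITECTURE = (512, 128, 96, 2)

# variable parameter: edit this list to change which values get tested
BATCH_SIZES = [16, 32, 64, 128, 256, 512]
```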

Each test iterates through the variable parameter list and builds a model using a particular instance of the variable parameter and all the constant parameters. This enables testing different variations with minimal changes to the code and significantly reduces the chances of human error.

✨ Footnote

Hey there, hope you liked the blog post. This started out as a homework assignment to test different variations and their effects, but I decided to automate it (because, of course, I'm lazy), and it turned out to make testing accessible with minimal changes to the code.

Consider following me on Medium, Twitter and other platforms to read more about Productivity, Design and Code.

Twitter | Medium | LinkedIn | Bio Link
