Using Autoencoders in TensorFlow Keras for Anomaly Detection

In anomaly detection, we train an artificial neural network through unsupervised learning and detect anomalies in a dataset using the reconstruction error. In this article, we will do this with the help of autoencoders.

An autoencoder is trained on normal data only and then makes predictions on all data. In other words, abnormal data is not used during the training phase. For this reason, outliers are expected to have higher reconstruction errors, because they differ from the normal data the model learned to reconstruct.

In today’s article, we will use TensorFlow Keras to identify outliers with an autoencoder.

If you’re ready, let’s start.

1. Import Libraries
# For synthetic dataset
from sklearn.datasets import make_classification
# for data preprocessing
import numpy as np
import pandas as pd
from collections import Counter
# for visualization
import seaborn as sns
import matplotlib.pyplot as plt
# for model building and evaluation
import tensorflow as tf
from tensorflow.keras import layers, losses
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
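If you want runs to be repeatable, you can optionally fix the random seeds right after the imports (an optional addition, not required for the rest of the walkthrough):

# Optional: fix the random seeds so results are reproducible across runs
np.random.seed(0)
tf.random.set_seed(0)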

2. Building a Dataset With Anomalies

We will not include any redundant or repeated features in this dataset. We will set the number of samples, the class weights, and the number of features.

# building imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=32, n_informative=32,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)
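Counter was imported earlier; a quick check confirms the imbalance we just created. With weights=[0.995, 0.005], roughly 0.5% of the 100,000 labels should be 1 (the anomalies):

# Check the class balance of the generated labels
print(Counter(y))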

3. Train Test Data Split

We will split the dataset into 80% training and 20% test data.

# Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
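Because anomalies make up only about 0.5% of the data, a plain random split can leave slightly different anomaly ratios in the two sets. If you want the ratio preserved exactly, scikit-learn's stratify argument is one option (a variation on the split above, not part of the original code):

# Optional: a stratified split keeps the anomaly ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)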

4. Autoencoder Algorithm

Anomaly detection with an autoencoder has six stages. The first three stages are for model training, and the last three stages are for model prediction.

  1. Stage 1 is the encoder stage. Here, a neural network compresses the input and extracts its essential information.
  2. Stage 2 is the decoder stage. Here, the model reconstructs the data from the extracted information.
  3. Stage 3: Repeat stages 1 and 2, adjusting the model to minimize the difference between the input and the reconstructed output, until the training dataset is reconstructed well.
  4. Stage 4: Make predictions on a dataset that contains outliers.
  5. Stage 5: Set an outlier/anomaly threshold by comparing the differences between the autoencoder's reconstructed values and the actual values.
  6. Stage 6: Flag data points whose difference is higher than the threshold as outliers or anomalies (a small numerical sketch follows this list).
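Stages 4 to 6 boil down to a few lines of array arithmetic. The sketch below illustrates the idea with NumPy before we build the real model; the reconstructions and the threshold here are made-up placeholder values:

# Illustration only: reconstruction error and thresholding (stages 4-6)
x = np.array([[0.1, 0.2], [0.9, 0.8]])        # original data points
x_hat = np.array([[0.12, 0.19], [0.4, 0.3]])  # reconstructions; the second is poor
reconstruction_error = np.mean(np.abs(x - x_hat), axis=1)
threshold = 0.1                               # placeholder threshold
print(reconstruction_error > threshold)       # [False  True]: second point is flagged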

5. Model Training

The autoencoder trains on the normal dataset, so we will first separate the normal data from the anomaly data. Next, we will create the input layer, the encoder layers, and the decoder layers. In the input layer, we specify the shape of the dataset; since the modeling dataset has 32 features, the shape here is 32.

The encoder consists of 3 layers with 16, 8, and 4 neurons, respectively. Note that in the encoder, the number of neurons decreases from layer to layer. The last layer in the encoder is the size of the encoded representation, also called the bottleneck.

The decoder consists of 3 layers with 8, 16, and 32 neurons, respectively. In contrast to the encoder, the decoder's number of neurons increases from layer to layer. The output layer in the decoder has the same size as the input layer.

The ReLU activation function is used for every layer except the decoder output layer. ReLU is a popular activation function, but you can try other activation functions and compare model performance. This part is up to you.

After defining the input, encoder and decoder layers, we create the autoencoder model to combine the layers. Let’s move on to the code.

# Keep only the normal data for the training dataset
X_train_normal = X_train[np.where(y_train == 0)]
# Input layer
input = tf.keras.layers.Input(shape=(32,))
# Encoder layers
encoder = tf.keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(4, activation='relu')])(input)
# Decoder layers
decoder = tf.keras.Sequential([
    layers.Dense(8, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(32, activation="sigmoid")])(encoder)
# Create the autoencoder
autoencoder = tf.keras.Model(inputs=input, outputs=decoder)
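At this point it can help to confirm the architecture before training. Keras models provide a summary() method that prints each layer and its output shape:

# Inspect the autoencoder architecture
autoencoder.summary()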

After creating the autoencoder model, we compile it with the adam optimizer and the mae loss (Mean Absolute Error).

When fitting the autoencoder model, notice that the input and output datasets are the same: the dataset containing only the normal data points. We talked about this at the beginning of the article.

The validation data is the test dataset containing both normal and abnormal data points.

A batch_size of 64 means the model uses 64 data points to update the weights at each iteration, and 20 epochs means the model will pass through the entire training dataset 20 times.

shuffle=True shuffles the dataset before each epoch.

# Compile the autoencoder
autoencoder.compile(optimizer='adam', loss='mae')
# Fit the autoencoder
history = autoencoder.fit(X_train_normal, X_train_normal,
                          epochs=20,
                          batch_size=64,
                          validation_data=(X_test, X_test),
                          shuffle=True)
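To visualize the loss curves, we can plot the history object returned by fit(); a minimal sketch:

# Plot the training and validation loss per epoch
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('MAE loss')
plt.legend()
plt.show()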

The x-axis is the number of epochs and the y-axis represents the loss. We can see that both training and validation losses decrease as the number of epochs increases.

6. Threshold

We now have a trained autoencoder model; let's use it to predict outliers.

First, we use .predict to get the reconstruction values for the test dataset, which contains both normal data points and outliers. Next, we calculate the loss between the actual values and the reconstructions using the mean absolute error. After that, a threshold is set to detect outliers. This threshold may be based on a percentile, the standard deviation, or other methods; this stage is problem-specific, and different techniques apply in different cases. In this example, we use the 98th percentile of the loss as the threshold, identifying 2% of the data as outliers.

# Predict on the test dataset, which contains anomalies/outliers
prediction = autoencoder.predict(X_test)
# Get the mean absolute error between actual values and reconstructions/predictions
prediction_loss = tf.keras.losses.mae(prediction, X_test)
# Check the prediction loss threshold for 2% of outliers
loss_threshold = np.percentile(prediction_loss, 98)
print(f'The prediction loss threshold for 2% of outliers is {loss_threshold:.2f}')
# Visualize the threshold
sns.histplot(prediction_loss, bins=30, alpha=0.8)
plt.axvline(x=loss_threshold, color='orange')

The resulting histogram shows that the prediction loss is close to a normal distribution, with a mean of around 2.5. For 2% of outliers, the prediction loss threshold is approximately 3.5.
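The 98th-percentile rule is only one of the threshold options mentioned above; a mean-plus-three-standard-deviations rule is another common choice. A sketch of that alternative (not used in the rest of this article):

# Alternative: flag points more than 3 standard deviations above the mean loss
loss_values = prediction_loss.numpy()  # convert the TensorFlow tensor to a NumPy array
std_threshold = loss_values.mean() + 3 * loss_values.std()
print(f'The 3-sigma prediction loss threshold is {std_threshold:.2f}')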

7. Model Performance

Based on the threshold we determined in the previous step, we predict a data point as normal if its prediction loss is less than the threshold; otherwise, we predict it as an outlier or anomaly. We label normal predictions as 0 and outlier predictions as 1, to be consistent with the ground truth labels.

# Label each data point using the 2% threshold
threshold_prediction = [0 if i < loss_threshold else 1 for i in prediction_loss]
# Check the prediction performance
print(classification_report(y_test, threshold_prediction))

A recall value of 0.01 indicates that approximately 1% of outliers are caught by the autoencoder. Let’s look at the table below.

              precision    recall  f1-score   support

           0       0.99      0.98      0.98     19803
           1       0.01      0.01      0.01       197

    accuracy                           0.97     20000
   macro avg       0.50      0.50      0.50     20000
weighted avg       0.98      0.97      0.98     20000
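For a complementary view with raw counts of true and false positives, a confusion matrix can be printed alongside the report (an optional addition; confusion_matrix also lives in sklearn.metrics):

# Raw counts of normal vs. anomaly predictions against the ground truth
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, threshold_prediction))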

Summary
We built an imbalanced dataset, split it for modeling, and set up the autoencoder algorithm. We trained the model on normal data only, made predictions on test data containing both normal and abnormal points, set a threshold on the reconstruction loss, and finally examined the model's performance.

