# Using Autoencoders in Tensorflow Keras for Anomaly Detection

In anomaly detection, we train artificial neural networks through unsupervised learning and try to detect anomalies in a dataset using the reconstruction error. We will do this with the help of autoencoders in this article.

An Autoencoder uses normal data to train the model and all data to make predictions. In other words, abnormal data is not used during the training phase. For this reason, outliers are expected to have higher reconstruction errors because they differ from normal data.

In today’s article, we will use Tensorflow Keras to identify outliers using an Autoencoder.

If you’re ready, let’s start.

**1. Import Libraries**

```python
# For the synthetic dataset
from sklearn.datasets import make_classification

# For data preprocessing
import numpy as np
import pandas as pd
from collections import Counter

# For visualization
import seaborn as sns
import matplotlib.pyplot as plt

# For modeling and model performance
import tensorflow as tf
from tensorflow.keras import layers, losses
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
```

**2. Building Dataset With Abnormals**

We will not include any redundant or repeated features in this dataset. We will set the number of samples, the class weights, and the number of features.

```python
# Building the imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=32, n_informative=32,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)
```
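Since `Counter` is among the imports, we can use it to confirm the imbalance. This is a small sketch that repeats the dataset call above so it runs on its own; with `weights=[0.995, 0.005]`, only a small minority of the rows end up labeled 1:

```python
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000, n_features=32, n_informative=32,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Count the records per class: the vast majority are normal (0),
# a small fraction are anomalies (1)
print(Counter(y))
```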

**3. Train Test Data Split**

We will split the dataset into 80% training and 20% test data.

```python
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
```

**4. Autoencoder Algorithm**

The autoencoder model has six stages for anomaly detection. The first three stages are for model training and the last three stages are for model prediction.

- Stage 1 is the encoder stage. At this stage, the basic information is extracted by a neural network model.
- Stage 2 is the decoder stage. At this stage, the model reconstructs the data using the extracted information.
- Stage 3: Repeat stage 1 and stage 2 to adjust the model to minimize the difference between input and reconstructed output until you get good reconstruction results for the training dataset.
- Stage 4: Make predictions on a dataset with outliers.
- Stage 5: Set a threshold for outliers/anomalies by comparing the differences between the autoencoder's reconstructed values and the actual values.
- Stage 6: Identify data points whose difference is higher than the threshold as outliers or anomalies.
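Stages 4–6 boil down to a few lines of array math. Here is a minimal NumPy sketch, using made-up reconstructions in place of a trained model's output (the arrays and noise levels are illustrative assumptions, not part of the article's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 1000 "actual" samples and their "reconstructions"
actual = rng.normal(size=(1000, 32))
reconstructed = actual + rng.normal(scale=0.1, size=(1000, 32))
reconstructed[:20] += 3.0   # make the first 20 rows reconstruct badly

# Stage 5: per-sample reconstruction error (mean absolute error)
errors = np.mean(np.abs(actual - reconstructed), axis=1)

# Stage 6: flag everything above the 98th-percentile error as an outlier
threshold = np.percentile(errors, 98)
outliers = np.where(errors > threshold)[0]
print(outliers)  # indices of the 20 badly reconstructed rows
```

The same percentile-based thresholding is what the article applies later with the real autoencoder output.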

**5. Model Training**

The autoencoder trains the model on the normal dataset, so we will first separate the normal data from the anomaly data. Next, we will create the input layer, encoder layers, and decoder layers. In the input layer, we set the shape of the dataset; since the modeling dataset has 32 features, the input shape here is 32.

The encoder consists of 3 layers with 16, 8 and 4 neurons, respectively. Note that the encoder requires the number of neurons to decrease with the layers. The last layer in the encoder is the size of the encoded representation, also called the bottleneck.

The decoder consists of 3 layers with 8, 16 and 32 neurons, respectively. In contrast to the encoder, the decoder requires the number of neurons to increase with the layers. The output layer in the decoder has the same size as the input layer.

The ReLU activation function is used for each layer except the decoder output layer. ReLU is a popular activation function, but you can try other activation functions and compare model performance. This part is up to you.

After defining the input, encoder and decoder layers, we create the autoencoder model to combine the layers. Let’s move on to the code.

```python
# Keep only the normal data for the training dataset
X_train_normal = X_train[np.where(y_train == 0)]

# Input layer
input = tf.keras.layers.Input(shape=(32,))

# Encoder layers
encoder = tf.keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(4, activation='relu')])(input)

# Decoder layers
decoder = tf.keras.Sequential([
    layers.Dense(8, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(32, activation='sigmoid')])(encoder)

# Create the autoencoder
autoencoder = tf.keras.Model(inputs=input, outputs=decoder)
```

After creating the autoencoder model, we compile it with the adam optimizer and mae loss (Mean Absolute Error).

When fitting the autoencoder model, we can see that the input and output datasets are the same, which is only the dataset containing the normal data points. We talked about this topic at the beginning of the article.

The validation data is the test dataset containing both normal and abnormal data points.

20 epochs and a batch_size of 64 means the model uses 64 data points to update the weights at each iteration, and the model will go through the entire training dataset 20 times.

shuffle=True shuffles the dataset before each epoch.
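As a quick sanity check on those numbers, the arithmetic can be written out directly. The training-set size here is a hypothetical round figure, not the exact count from the split above:

```python
import math

n_samples = 80000    # hypothetical training-set size
batch_size = 64
epochs = 20

# Each epoch processes every sample once, 64 at a time
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)   # 1250 weight updates per epoch
print(total_updates)     # 25000 updates over the whole run
```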

```python
# Compile the autoencoder
autoencoder.compile(optimizer='adam', loss='mae')

# Fit the autoencoder
history = autoencoder.fit(X_train_normal, X_train_normal,
                          epochs=20,
                          batch_size=64,
                          validation_data=(X_test, X_test),
                          shuffle=True)
```
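The loss curves described below come from the `history` object returned by `fit`. A plotting sketch, using a stand-in dictionary with made-up values in place of the real `history.history` (which only exists after training):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, so the script runs without a display
import matplotlib.pyplot as plt

# Stand-in for history.history returned by autoencoder.fit
# (these loss values are illustrative, not real training output)
history_dict = {
    'loss':     [3.9, 3.2, 2.9, 2.7, 2.6],
    'val_loss': [3.5, 3.1, 2.9, 2.8, 2.7],
}

plt.plot(history_dict['loss'], label='Training loss')
plt.plot(history_dict['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('MAE loss')
plt.legend()
plt.savefig('loss_curves.png')
```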

The x-axis is the number of epochs and the y-axis represents the losses. We can see that both training and validation losses decrease with increasing epochs.

**6. Threshold**

We now have an autoencoder model. Now let’s use it to predict outliers.

First, we use `.predict` to get the reconstruction values for the test dataset, which contains both normal data points and outliers. Next, we calculate the loss between the actual values and the reconstructions using the mean absolute error. After that, a threshold is set to detect outliers. This threshold may be based on a percentile, the standard deviation, or other methods; this stage is problem specific, and different techniques apply depending on the case. In this example, we use the 98th percentile of the loss as the threshold to identify 2% of the data as outliers.

```python
# Predict anomalies/outliers in the test dataset
prediction = autoencoder.predict(X_test)

# Get the mean absolute error between actual and reconstruction/prediction
prediction_loss = tf.keras.losses.mae(prediction, X_test)

# Check the prediction loss threshold for 2% of outliers
loss_threshold = np.percentile(prediction_loss, 98)
print(f'The prediction loss threshold for 2% of outliers is {loss_threshold:.2f}')

# Visualize the threshold
sns.histplot(prediction_loss, bins=30, alpha=0.8)
plt.axvline(x=loss_threshold, color='orange')
```

The image below shows us that the prediction loss is close to a normal distribution, with an average of around 2.5. For 2% of outliers, the prediction loss threshold is approximately 3.5.

**7. Model Performance**

Based on the threshold we determined in the previous step, we predict a data point as normal if its prediction loss is less than the threshold. Otherwise, we predict it as an outlier or anomaly. We label 0 as the normal prediction and 1 as the outlier prediction to be consistent with the ground-truth labels.

```python
# Check the model performance at the 2% threshold
threshold_prediction = [0 if i < loss_threshold else 1 for i in prediction_loss]

# Check the prediction performance
print(classification_report(y_test, threshold_prediction))
```

A recall value of 0.01 indicates that approximately 1% of outliers are caught by the autoencoder. Let’s look at the table below.

```
              precision    recall  f1-score   support

           0       0.99      0.98      0.98     19803
           1       0.01      0.01      0.01       197

    accuracy                           0.97     20000
   macro avg       0.50      0.50      0.50     20000
weighted avg       0.98      0.97      0.98     20000
```
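As a sanity check on that recall value: with 197 true outliers in the test set and roughly 2 of them flagged (counts assumed here to be consistent with the report, not taken from actual output), the recall works out to about 0.01:

```python
# Assumed confusion counts, consistent with the classification report:
# 197 true outliers in the test set, of which roughly 2 were flagged
true_positives = 2      # outliers correctly flagged (assumption)
false_negatives = 195   # outliers missed (assumption)

# recall = TP / (TP + FN)
recall = true_positives / (true_positives + false_negatives)
print(f'{recall:.2f}')  # prints 0.01
```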

**Summary**

After determining our anomaly criteria, we built and split the dataset for modeling and set up the algorithm. After training with normal data only, we made predictions on the test data containing both normal and abnormal points. And finally, we examined the model performance.

**References**

TensorFlow. 2022. *Intro to Autoencoders | TensorFlow Core*.

Blog.keras.io. 2022. *Building Autoencoders in Keras*.

Anomagram. 2022. *Anomagram: Interactive Visualization of Autoencoders*.

Note: Images belong to code outputs.