Data characteristics of the SKAB dataset¶

This notebook is an appendix to our study. It demonstrates the data characteristics of the SKAB dataset. To extract this information, we perform Exploratory Data Analysis (EDA) with DataPrep.EDA [1], an easy-to-use tool that integrates well with Python and Jupyter Notebook and lets us view and understand data characteristics interactively.

[1] Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.

In [1]:
import pandas as pd
from dataprep.eda import create_report
In [2]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Dataset description¶

Skoltech Anomaly Benchmark (SKAB) is a multivariate dataset designed for evaluating anomaly detection algorithms. The dataset contains both point outliers and changepoints.

The data is collected from a water circulation system testbed that simulates a real industrial scenario with its control system. Anomalies are induced in the system by partially closing valves, temperature variations, reduction of motor power, drastic water level changes and scenarios leading to cavitation (the formation of small vapor-filled cavities in the liquid).

Exploratory Data Analysis¶

In [3]:
# A forward slash avoids the invalid '\d' escape sequence and works on every OS
ds = pd.read_csv('SKAB/ds.csv', index_col=0)

The analysis is run with the create_report(ds) call in cell [4]. The resulting report consists of the following sections:

  • The Overview section contains basic information and insights on the dataset. These statistics describe the dataset as a whole.
  • The Variables section shows statistical information about each feature individually. This section covers univariate analysis. More information and plots can be accessed by pressing the Show details button for the corresponding variable.
    • For numerical variables, the report shows quantile statistics, descriptive statistics, a KDE plot and a normal Q-Q plot, with a histogram on the right. These describe the data distribution. The histogram divides the data domain into intervals of equal length (bins) and counts how many values fall into each interval. The count is displayed as a bar for each bin, so a higher bar means that more values fall into the corresponding bin. The box plot shows the outliers of the feature. An insight in the upper right corner reports the number of anomalous values.
    • For categorical variables, the report shows text analysis, a bar chart, a pie chart, a word cloud, word frequencies and word lengths. Only the target variable y belongs to this category. Here only the Stats, Pie Chart and Word Frequency tabs carry information, as word length is not meaningful for this feature.
  • The Interactions and Correlations sections cover bivariate analysis. The correlation map shows the correlation between each pair of features, and several calculation methods are available. Darker shades of red mean that the two features are strongly correlated. The same relationship can be inspected on scatter plots in the Interactions section: the closer the points lie to a straight line, the stronger the linear correlation between the two variables. If two variables are highly correlated, it is worth considering omitting one of them.
  • The Missing Values section shows the channels that are not fully defined and therefore contain missing values. These columns need to be dropped or carefully imputed. Alternatively, the rows with missing data can be dropped.
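
The histogram and box-plot logic described above can be reproduced by hand. The following is a minimal sketch on synthetic data (not the SKAB data). The bin count of 10 and the conventional 1.5 × IQR fence are assumptions chosen for illustration; they are not confirmed to be the exact settings DataPrep uses internally.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one numerical channel (assumption: NOT SKAB data).
rng = np.random.default_rng(42)
channel = pd.Series(rng.normal(loc=0.5, scale=0.1, size=1000))
# Append a few extreme values so the box-plot rule has something to flag.
channel = pd.concat([channel, pd.Series([2.0, 2.5, -1.5])], ignore_index=True)

# Histogram: split [min, max] into equal-width bins and count values per bin.
counts, edges = np.histogram(channel, bins=10)

# Box-plot rule: values further than 1.5 * IQR from the quartiles are outliers.
q1, q3 = channel.quantile(0.25), channel.quantile(0.75)
iqr = q3 - q1
is_outlier = (channel < q1 - 1.5 * iqr) | (channel > q3 + 1.5 * iqr)

print("bin edges:", np.round(edges, 3))
print("counts:   ", counts)
print("outliers: ", int(is_outlier.sum()))
```

Every value lands in exactly one bin, so the counts sum to the number of rows. The outlier count depends on the fence multiplier, which is why a tool's reported number can differ from one computed by hand.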
In [4]:
create_report(ds)
Out[4]:
DataPrep Report

Overview

Dataset Statistics

Number of Variables 9
Number of Rows 45001
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 701
Duplicate Rows (%) 1.6%
Total Size in Memory 5.9 MB
Average Row Size in Memory 136.7 B
Variable Types
  • Numerical: 8
  • Categorical: 1

Dataset Insights

Accelerometer1RMS and Accelerometer2RMS have similar distributions Similar Distribution
Accelerometer1RMS is skewed Skewed
Accelerometer2RMS is skewed Skewed
Current is skewed Skewed
Pressure is skewed Skewed
Temperature is skewed Skewed
Voltage is skewed Skewed
Volume Flow RateRMS is skewed Skewed
Dataset has 701 (1.56%) duplicate rows Duplicates
y has constant length 3 Constant Length
Accelerometer1RMS has 24707 (54.9%) negatives Negatives
Accelerometer2RMS has 24702 (54.89%) negatives Negatives
Current has 8118 (18.04%) negatives Negatives
Temperature has 30534 (67.85%) negatives Negatives
Thermocouple has 24700 (54.89%) negatives Negatives
Volume Flow RateRMS has 25328 (56.28%) negatives Negatives

Variables


Accelerometer1RMS

numerical

Approximate Distinct Count 17200
Approximate Unique (%) 38.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 3270574
Mean -1.6485
Minimum -4
Maximum 5
Zeros 2
Zeros (%) 0.0%
Negatives 24707
Negatives (%) 54.9%
  • Accelerometer1RMS is skewed right (γ1 = 0.4885)

Quantile Statistics

Minimum -4
5-th Percentile -4
Q1 -4
Median -3.5471
Q3 0.6796
95-th Percentile 1.7435
Maximum 5
Range 9
IQR 4.6796

Descriptive Statistics

Mean -1.6485
Standard Deviation 2.628
Variance 6.9065
Sum -74185.9889
Skewness 0.4885
Kurtosis -1.1319
Coefficient of Variation -1.5941
  • Accelerometer1RMS is not normally distributed (p-value 4.0882174137600996e-22)

Accelerometer2RMS

numerical

Approximate Distinct Count 16459
Approximate Unique (%) 36.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 3270574
Mean -1.694
Minimum -4
Maximum 5
Zeros 1
Zeros (%) 0.0%
Negatives 24702
Negatives (%) 54.9%
  • Accelerometer2RMS is skewed right (γ1 = 0.4961)

Quantile Statistics

Minimum -4
5-th Percentile -4
Q1 -4
Median -3.6494
Q3 0.6437
95-th Percentile 1.5091
Maximum 5
Range 9
IQR 4.6437

Descriptive Statistics

Mean -1.694
Standard Deviation 2.5842
Variance 6.678
Sum -76232.4873
Skewness 0.4961
Kurtosis -1.0773
Coefficient of Variation -1.5255
  • Accelerometer2RMS is not normally distributed (p-value 4.602746112825159e-22)

Current

numerical

Approximate Distinct Count 40997
Approximate Unique (%) 91.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 3270574
Mean 0.004256
Minimum -0.002151
Maximum 1.0457
Zeros 1
Zeros (%) 0.0%
Negatives 8118
Negatives (%) 18.0%
  • Current is skewed right (γ1 = 35.0778)

Quantile Statistics

Minimum -0.002151
5-th Percentile -0.001036
Q1 0.00057529
Median 0.002301
Q3 0.007052
95-th Percentile 0.008857
Maximum 1.0457
Range 1.0479
IQR 0.006476

Descriptive Statistics

Mean 0.004256
Standard Deviation 0.02741
Variance 0.00075125
Sum 191.5148
Skewness 35.0778
Kurtosis 1250.1814
Coefficient of Variation 6.4404
  • Current is not normally distributed (p-value 4.226539550015572e-25)
  • Current has 35 outliers

Pressure

numerical

Approximate Distinct Count 10
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 3270574
Mean 0.5097
Minimum 0
Maximum 1.125
Zeros 7
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Pressure is skewed left (γ1 = -0.0754)

Quantile Statistics

Minimum 0
5-th Percentile 0.375
Q1 0.5
Median 0.5
Q3 0.625
95-th Percentile 0.625
Maximum 1.125
Range 1.125
IQR 0.125

Descriptive Statistics

Mean 0.5097
Standard Deviation 0.09883
Variance 0.009767
Sum 22938.3915
Skewness -0.07541
Kurtosis 0.7088
Coefficient of Variation 0.1939
  • Pressure is not normally distributed (p-value 1.0216686645040852e-20)
  • Pressure has 1209 outliers

Temperature

numerical

Approximate Distinct Count 19961
Approximate Unique (%) 44.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 3270574
Mean -1.8749
Minimum -4
Maximum 1.9248
Zeros 3
Zeros (%) 0.0%
Negatives 30534
Negatives (%) 67.8%
  • Temperature is skewed left (γ1 = -0.0081)

Quantile Statistics

Minimum -4
5-th Percentile -4
Q1 -4
Median -0.8796
Q3 0.1397
95-th Percentile 0.5672
Maximum 1.9248
Range 5.9248
IQR 4.1397

Descriptive Statistics

Mean -1.8749
Standard Deviation 2.0132
Variance 4.053
Sum -84371.4542
Skewness -0.008067
Kurtosis -1.844
Coefficient of Variation -1.0738
  • Temperature is not normally distributed (p-value 4.035521305327463e-23)

Thermocouple

numerical

Approximate Distinct Count 24336
Approximate Unique (%) 54.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 3270574
Mean -0.06129
Minimum -1.7123
Maximum 2.4124
Zeros 1
Zeros (%) 0.0%
Negatives 24700
Negatives (%) 54.9%
  • Thermocouple is skewed right (γ1 = 0.111)

Quantile Statistics

Minimum -1.7123
5-th Percentile -1.683
Q1 -0.7663
Median -0.402
Q3 0.8671
95-th Percentile 1.1343
Maximum 2.4124
Range 4.1247
IQR 1.6334

Descriptive Statistics

Mean -0.06129
Standard Deviation 0.9165
Variance 0.8399
Sum -2757.9979
Skewness 0.111
Kurtosis -1.08
Coefficient of Variation -14.9534
  • Thermocouple is not normally distributed (p-value 0.00025736891514853823)

Voltage

numerical

Approximate Distinct Count 26414
Approximate Unique (%) 58.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 3270574
Mean 0.907
Minimum -0.001631
Maximum 1.01
Zeros 1
Zeros (%) 0.0%
Negatives 6
Negatives (%) 0.0%
  • Voltage is skewed left (γ1 = -4.6654)

Quantile Statistics

Minimum -0.001631
5-th Percentile 0.8235
Q1 0.8837
Median 0.9087
Q3 0.9328
95-th Percentile 0.9854
Maximum 1.01
Range 1.0116
IQR 0.04908

Descriptive Statistics

Mean 0.907
Standard Deviation 0.05043
Variance 0.002543
Sum 40814.3653
Skewness -4.6654
Kurtosis 81.3505
Coefficient of Variation 0.0556
  • Voltage is not normally distributed (p-value 3.0761639773522516e-12)
  • Voltage has 753 outliers

Volume Flow RateRMS

numerical

Approximate Distinct Count 727
Approximate Unique (%) 1.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 3270574
Mean -1.9065
Minimum -4
Maximum 1.5153
Zeros 3
Zeros (%) 0.0%
Negatives 25328
Negatives (%) 56.3%
  • Volume Flow RateRMS is skewed right (γ1 = 0.2596)

Quantile Statistics

Minimum -4
5-th Percentile -4
Q1 -4
Median -4
Q3 0.7727
95-th Percentile 0.9659
Maximum 1.5153
Range 5.5153
IQR 4.7727

Descriptive Statistics

Mean -1.9065
Standard Deviation 2.3741
Variance 5.6365
Sum -85794.0342
Skewness 0.2596
Kurtosis -1.9227
Coefficient of Variation -1.2453
  • Volume Flow RateRMS is not normally distributed (p-value 8.736287819737804e-23)

y

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 5610626
  • The largest value (0.0) is over 2.44 times larger than the second largest value (1.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 90002
  • The top 2 categories (0.0, 1.0) take over 50.0%
  • The largest value (00) is over 2.44 times larger than the second largest value (10)
  • y has words of constant length

Interactions

Correlations

Missing Values

Report generated with DataPrep

Evaluating the results¶

The report shows that all 8 features are numerical (excluding the target variable y). The dataset has no missing values. The empirical distribution of the features appears to be a mixture of several normal distributions, with multiple bell curves visible on the histograms. This could be a sign of data drift or of possible anomalies (however, it may be explained by normal behavior too).
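
To illustrate why a mixture of bell curves produces statistics like those in the report, the sketch below builds a synthetic two-regime signal and computes its skewness and excess kurtosis with pandas. The regime means, spreads and sizes are invented for illustration and are not fitted to SKAB. pandas also uses bias-corrected estimators, so its values may differ slightly from those DataPrep reports.

```python
import numpy as np
import pandas as pd

# Synthetic two-regime signal (assumption: illustrative only, NOT SKAB data).
rng = np.random.default_rng(7)
low_regime = rng.normal(loc=-3.5, scale=0.2, size=600)
high_regime = rng.normal(loc=0.7, scale=0.3, size=400)
mixed = pd.Series(np.concatenate([low_regime, high_regime]))

# pandas returns sample skewness and *excess* kurtosis (0 for a normal law).
skewness = mixed.skew()
excess_kurtosis = mixed.kurt()

print(f"skewness:        {skewness:.3f}")
print(f"excess kurtosis: {excess_kurtosis:.3f}")
```

Two well-separated modes give a clearly negative excess kurtosis, similar in sign to the values reported for several channels above. This is only an informal indicator: it is consistent with a mixture but does not by itself show drift or anomalies.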

Looking at the Correlation Matrix we observe that about a third of the features are highly correlated while the rest of the pairings show very little correlation.
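
The highly correlated pairs can also be listed programmatically instead of being read off the correlation map. The sketch below uses pandas `DataFrame.corr` on a small synthetic frame. The helper `correlated_pairs`, the column names and the 0.8 threshold are hypothetical choices for illustration, not part of DataPrep or of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic frame (assumption: NOT SKAB data): "b" follows "a", "c" is noise.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "a": base,
    "b": 2.0 * base + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

def correlated_pairs(frame, method="pearson", threshold=0.8):
    """Return (col_i, col_j, r) for each column pair with |r| >= threshold."""
    corr = frame.corr(method=method)
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

pairs = correlated_pairs(demo)
print(pairs)
```

On the real data the call would be `correlated_pairs(ds.drop(columns="y"))`, assuming the target column is named y as in the report. Passing `method="spearman"` switches to a rank-based measure, which is less sensitive to the heavy skew seen in channels such as Current.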