Data characteristics of the SMD dataset

This notebook is an appendix to our study. It documents the data characteristics of the SMD dataset through Exploratory Data Analysis (EDA). We use DataPrep.EDA [1], an easy-to-use tool that integrates well with Python and Jupyter Notebook and lets us inspect data characteristics interactively.

[1] Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.

In [1]:
import pandas as pd
from dataprep.eda import create_report
In [2]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Dataset description

The Server Machine Dataset (SMD) [2] was collected by Su et al. over 5 weeks at a large Internet company and is sampled at one-minute intervals. Domain experts labeled the anomalies based on incident reports. The dataset has 38 channels in total.

SMD is a multi-entity dataset of 28 entities, each a different physical unit of the same type. All entities share the same dimensionality and the same feature types. For this characteristics report, we concatenate all entities into a single dataframe.
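The concatenation step itself is not shown in this notebook. A minimal sketch of how per-entity frames could be stacked into one dataframe follows. The entity names and the tiny toy frames are hypothetical stand-ins for the 28 real entities, and `concat_entities` is an illustrative helper, not part of the study's code.

```python
import pandas as pd


def concat_entities(frames):
    """Stack per-entity dataframes row-wise into one dataframe.

    `frames` maps a (hypothetical) entity name to its dataframe. All frames
    are expected to share the same columns, as the SMD entities do.
    """
    columns = None
    for name, frame in frames.items():
        if columns is None:
            columns = list(frame.columns)
        elif list(frame.columns) != columns:
            raise ValueError(f"entity {name!r} has different columns")
    # ignore_index=True gives the result a fresh 0..n-1 row index
    return pd.concat(list(frames.values()), ignore_index=True)


# Toy stand-ins for two entities with two channels and a label column
toy = {
    "entity_a": pd.DataFrame({"0": [0.1, 0.2], "1": [0.0, 0.3], "y": [0.0, 1.0]}),
    "entity_b": pd.DataFrame({"0": [0.4], "1": [0.5], "y": [0.0]}),
}
combined = concat_entities(toy)
print(combined.shape)
```

Stacking the frames this way drops the information about which entity a row came from unless an entity column is added before concatenating.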

Exploratory Data Analysis

In [3]:
# A forward slash avoids the invalid '\d' escape sequence and works on every OS
ds = pd.read_csv('SMD/ds.csv', index_col=0)

The analysis is run with the create_report(ds) call in the next cell. The resulting report consists of the following sections:

  • The Overview section contains basic information and insights on the dataset. These statistics describe the dataset as a whole.
  • The Variables section shows statistical information about each feature individually, i.e. univariate analysis. More information and plots can be accessed by pressing the Show details button for the corresponding variable.
    • For numerical variables, the report shows quantile statistics, descriptive statistics, a KDE plot and a normal Q-Q plot, with a histogram on the right. These describe the data distribution. The histogram divides the data domain into intervals of equal length (bins) and counts how many values fall into each interval; each count is displayed as a bar, so a higher bar means more values in that bin. The box plot shows the outliers of the feature. An insight in the upper right corner reports the number of outlying values.
    • For categorical variables, the report shows text analysis, a bar chart, a pie chart, a word cloud, word frequencies and word length. In this report, only the constant feature 7 and the label y are treated as categorical. Here only the Stats, Pie Chart and Word Frequency tabs carry information, as word length is not meaningful for these features.
  • The Interactions and Correlations sections cover bivariate analysis. The Correlation Map shows the correlation between each pair of features; several calculation methods are available. Darker shades of red mean that the two features are more strongly correlated. The same relationship can be viewed as Scatter Plots in the Interactions section: the more closely the points follow a straight line, the stronger the linear correlation between the two variables. If two variables are highly correlated, it is worth considering omitting one of them.
  • The Missing Values section shows the channels that are not fully defined, i.e. that contain missing values. Such columns need to be dropped or carefully imputed. Alternatively, the rows with missing data can be dropped.
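The histogram and box-plot descriptions above can be made concrete with a small sketch. It uses only the standard library and a made-up sample rather than the SMD data. The 1.5 × IQR fence is the conventional box-plot rule and is an assumption here; the exact rule DataPrep applies may differ.

```python
import statistics


def equal_width_histogram(values, bins):
    """Count how many values fall into each of `bins` equal-length intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one
        idx = min(int((v - lo) / width), bins - 1) if width > 0 else 0
        counts[idx] += 1
    return counts


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sample: mostly small values plus one extreme value
sample = [0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 5.0]
print(equal_width_histogram(sample, bins=5))
print(iqr_outliers(sample))
```

With a heavily right-skewed sample like this one, almost all values land in the first bin and the single extreme value is flagged as an outlier, which mirrors the pattern the report shows for many SMD features.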
In [4]:
create_report(ds)
Out[4]:
DataPrep Report

Overview

Dataset Statistics

Number of Variables 39
Number of Rows 1.4168e+06
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 432.4 MB
Average Row Size in Memory 320.0 B
Variable Types
  • Numerical: 37
  • Categorical: 2

Dataset Insights

16 and 17 have similar distributions Similar Distribution
16 and 26 have similar distributions Similar Distribution
16 and 28 have similar distributions Similar Distribution
16 and 37 have similar distributions Similar Distribution
17 and 26 have similar distributions Similar Distribution
17 and 28 have similar distributions Similar Distribution
17 and 37 have similar distributions Similar Distribution
20 and 21 have similar distributions Similar Distribution
20 and 27 have similar distributions Similar Distribution
20 and 30 have similar distributions Similar Distribution
21 and 27 have similar distributions Similar Distribution
21 and 30 have similar distributions Similar Distribution
26 and 28 have similar distributions Similar Distribution
26 and 37 have similar distributions Similar Distribution
27 and 30 have similar distributions Similar Distribution
28 and 37 have similar distributions Similar Distribution
34 and 35 have similar distributions Similar Distribution
0 is skewed Skewed
1 is skewed Skewed
2 is skewed Skewed
3 is skewed Skewed
4 is skewed Skewed
5 is skewed Skewed
6 is skewed Skewed
8 is skewed Skewed
9 is skewed Skewed
10 is skewed Skewed
11 is skewed Skewed
12 is skewed Skewed
13 is skewed Skewed
14 is skewed Skewed
15 is skewed Skewed
16 is skewed Skewed
17 is skewed Skewed
18 is skewed Skewed
19 is skewed Skewed
20 is skewed Skewed
21 is skewed Skewed
22 is skewed Skewed
23 is skewed Skewed
24 is skewed Skewed
25 is skewed Skewed
26 is skewed Skewed
27 is skewed Skewed
28 is skewed Skewed
29 is skewed Skewed
30 is skewed Skewed
31 is skewed Skewed
32 is skewed Skewed
33 is skewed Skewed
34 is skewed Skewed
35 is skewed Skewed
36 is skewed Skewed
37 is skewed Skewed
7 has constant value "0.0" Constant
7 has constant length 3 Constant Length
y has constant length 3 Constant Length
4 has 22454 (1.58%) negatives Negatives
5 has 17535 (1.24%) negatives Negatives
6 has 37258 (2.63%) negatives Negatives
25 has 31029 (2.19%) negatives Negatives
4 has 1024993 (72.34%) zeros Zeros
8 has 278472 (19.65%) zeros Zeros
9 has 1038811 (73.32%) zeros Zeros
10 has 560574 (39.57%) zeros Zeros
12 has 879758 (62.09%) zeros Zeros
16 has 1412781 (99.71%) zeros Zeros
17 has 1414488 (99.84%) zeros Zeros
24 has 264534 (18.67%) zeros Zeros
26 has 1416779 (100.0%) zeros Zeros
28 has 1416399 (99.97%) zeros Zeros
29 has 290219 (20.48%) zeros Zeros
31 has 349628 (24.68%) zeros Zeros
32 has 995593 (70.27%) zeros Zeros
33 has 98221 (6.93%) zeros Zeros
34 has 116955 (8.25%) zeros Zeros
35 has 110732 (7.82%) zeros Zeros
36 has 1313347 (92.7%) zeros Zeros
37 has 1369420 (96.65%) zeros Zeros

Variables


0

numerical

Approximate Distinct Count 1844
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.237
Minimum -0.1
Maximum 5
Zeros 70459
Zeros (%) 5.0%
Negatives 850
Negatives (%) 0.1%
  • 0 is skewed right (γ1 = 2.5409)

Quantile Statistics

Minimum -0.1
5-th Percentile 0.0101
Q1 0.05263
Median 0.1528
Q3 0.35
95-th Percentile 0.7368
Maximum 5
Range 5.1
IQR 0.2974

Descriptive Statistics

Mean 0.237
Standard Deviation 0.2633
Variance 0.06932
Sum 335809.3549
Skewness 2.5409
Kurtosis 12.8042
Coefficient of Variation 1.1108
  • 0 is not normally distributed (p-value 6.49479988923106e-17)
  • 0 has 56874 outliers

1

numerical

Approximate Distinct Count 36881
Approximate Unique (%) 2.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.1709
Minimum -0.02017
Maximum 5
Zeros 14547
Zeros (%) 1.0%
Negatives 153
Negatives (%) 0.0%
  • 1 is skewed right (γ1 = 4.3766)

Quantile Statistics

Minimum -0.02017
5-th Percentile 0.00075736
Q1 0.01439
Median 0.07727
Q3 0.2341
95-th Percentile 0.6179
Maximum 5
Range 5.0202
IQR 0.2197

Descriptive Statistics

Mean 0.1709
Standard Deviation 0.259
Variance 0.06706
Sum 242125.4835
Skewness 4.3766
Kurtosis 39.1504
Coefficient of Variation 1.5153
  • 1 is not normally distributed (p-value 1.5515317143908033e-21)
  • 1 has 83337 outliers

2

numerical

Approximate Distinct Count 35087
Approximate Unique (%) 2.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.218
Minimum -0.02836
Maximum 5
Zeros 2409
Zeros (%) 0.2%
Negatives 686
Negatives (%) 0.0%
  • 2 is skewed right (γ1 = 5.0615)

Quantile Statistics

Minimum -0.02836
5-th Percentile 0.001128
Q1 0.02235
Median 0.1173
Q3 0.3042
95-th Percentile 0.7431
Maximum 5
Range 5.0284
IQR 0.2819

Descriptive Statistics

Mean 0.218
Standard Deviation 0.3286
Variance 0.108
Sum 308916.6374
Skewness 5.0615
Kurtosis 45.7576
Coefficient of Variation 1.5072
  • 2 is not normally distributed (p-value 6.681022522028204e-19)
  • 2 has 67193 outliers

3

numerical

Approximate Distinct Count 34010
Approximate Unique (%) 2.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2595
Minimum -0.03027
Maximum 5
Zeros 31854
Zeros (%) 2.2%
Negatives 944
Negatives (%) 0.1%
  • 3 is skewed right (γ1 = 5.2495)

Quantile Statistics

Minimum -0.03027
5-th Percentile 0.00097495
Q1 0.02277
Median 0.1421
Q3 0.3893
95-th Percentile 0.8087
Maximum 5
Range 5.0303
IQR 0.3665

Descriptive Statistics

Mean 0.2595
Standard Deviation 0.391
Variance 0.1529
Sum 367714.7863
Skewness 5.2495
Kurtosis 46.9588
Coefficient of Variation 1.5064
  • 3 is not normally distributed (p-value 3.8558821304404804e-19)
  • 3 has 48169 outliers

4

numerical

Approximate Distinct Count 296
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.1194
Minimum -4
Maximum 1.2267
Zeros 1024993
Zeros (%) 72.3%
Negatives 22454
Negatives (%) 1.6%
  • 4 is skewed left (γ1 = -4.1754)

Quantile Statistics

Minimum -4
5-th Percentile 0
Q1 0
Median 0
Q3 0.1579
95-th Percentile 1
Maximum 1.2267
Range 5.2267
IQR 0.1579

Descriptive Statistics

Mean 0.1194
Standard Deviation 0.6249
Variance 0.3906
Sum 169122.2118
Skewness -4.1754
Kurtosis 27.3869
Coefficient of Variation 5.2355
  • 4 is not normally distributed (p-value 1.3358414760906714e-24)
  • 4 has 312538 outliers

5

numerical

Approximate Distinct Count 163678
Approximate Unique (%) 11.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.699
Minimum -4
Maximum 2.2833
Zeros 70
Zeros (%) 0.0%
Negatives 17535
Negatives (%) 1.2%
  • 5 is skewed left (γ1 = -5.6769)

Quantile Statistics

Minimum -4
5-th Percentile 0.15
Q1 0.5324
Median 0.7887
Q3 0.9713
95-th Percentile 1
Maximum 2.2833
Range 6.2833
IQR 0.4389

Descriptive Statistics

Mean 0.699
Standard Deviation 0.4761
Variance 0.2266
Sum 990381.451
Skewness -5.6769
Kurtosis 53.4048
Coefficient of Variation 0.6811
  • 5 is not normally distributed (p-value 5.19121333577517e-16)
  • 5 has 27147 outliers

6

numerical

Approximate Distinct Count 79900
Approximate Unique (%) 5.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.4705
Minimum -4
Maximum 5
Zeros 235
Zeros (%) 0.0%
Negatives 37258
Negatives (%) 2.6%
  • 6 is skewed left (γ1 = -2.4515)

Quantile Statistics

Minimum -4
5-th Percentile 0.07098
Q1 0.2921
Median 0.4885
Q3 0.7359
95-th Percentile 1.0286
Maximum 5
Range 9
IQR 0.4438

Descriptive Statistics

Mean 0.4705
Standard Deviation 0.7978
Variance 0.6365
Sum 666647.0109
Skewness -2.4515
Kurtosis 22.3142
Coefficient of Variation 1.6956
  • 6 is not normally distributed (p-value 3.721765674875193e-11)
  • 6 has 62918 outliers

7

categorical

Approximate Distinct Count 1
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 96344100

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 2833650
  • 7 has words of constant length

8

numerical

Approximate Distinct Count 29175
Approximate Unique (%) 2.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.04985
Minimum -0.002441
Maximum 5
Zeros 278472
Zeros (%) 19.7%
Negatives 375
Negatives (%) 0.0%
  • 8 is skewed right (γ1 = 8.5283)

Quantile Statistics

Minimum -0.002441
5-th Percentile 0
Q1 0.00022
Median 0.005878
Q3 0.04263
95-th Percentile 0.2648
Maximum 5
Range 5.0024
IQR 0.04241

Descriptive Statistics

Mean 0.04985
Standard Deviation 0.1141
Variance 0.01302
Sum 70622.6875
Skewness 8.5283
Kurtosis 230.0325
Coefficient of Variation 2.2895
  • 8 is not normally distributed (p-value 7.558619131899602e-25)
  • 8 has 212261 outliers

9

numerical

Approximate Distinct Count 39779
Approximate Unique (%) 2.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.007438
Minimum 0
Maximum 5
Zeros 1038811
Zeros (%) 73.3%
Negatives 0
Negatives (%) 0.0%
  • 9 is skewed right (γ1 = 36.8187)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 1.9e-05
95-th Percentile 0.00931
Maximum 5
Range 5
IQR 1.9e-05

Descriptive Statistics

Mean 0.007438
Standard Deviation 0.1072
Variance 0.01149
Sum 10538.7316
Skewness 36.8187
Kurtosis 1589.1999
Coefficient of Variation 14.4079
  • 9 is not normally distributed (p-value 4.233659647136274e-25)
  • 9 has 312219 outliers

10

numerical

Approximate Distinct Count 203316
Approximate Unique (%) 14.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.08062
Minimum -0.004525
Maximum 5
Zeros 560574
Zeros (%) 39.6%
Negatives 2
Negatives (%) 0.0%
  • 10 is skewed right (γ1 = 8.5608)

Quantile Statistics

Minimum -0.004525
5-th Percentile 0
Q1 0
Median 0.03075
Q3 0.1108
95-th Percentile 0.3329
Maximum 5
Range 5.0045
IQR 0.1108

Descriptive Statistics

Mean 0.08062
Standard Deviation 0.1451
Variance 0.02106
Sum 114228.4369
Skewness 8.5608
Kurtosis 191.2875
Coefficient of Variation 1.8
  • 10 is not normally distributed (p-value 1.2172389213128854e-23)
  • 10 has 102713 outliers

11

numerical

Approximate Distinct Count 4245
Approximate Unique (%) 0.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.09614
Minimum -0.02174
Maximum 5
Zeros 36028
Zeros (%) 2.5%
Negatives 10
Negatives (%) 0.0%
  • 11 is skewed right (γ1 = 3.8444)

Quantile Statistics

Minimum -0.02174
5-th Percentile 0.000222
Q1 0.02264
Median 0.06122
Q3 0.125
95-th Percentile 0.32
Maximum 5
Range 5.0217
IQR 0.1024

Descriptive Statistics

Mean 0.09614
Standard Deviation 0.1155
Variance 0.01333
Sum 136214.57
Skewness 3.8444
Kurtosis 55.2878
Coefficient of Variation 1.2011
  • 11 is not normally distributed (p-value 1.6286400323016454e-21)
  • 11 has 98480 outliers

12

numerical

Approximate Distinct Count 827
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.04265
Minimum -0.02941
Maximum 5
Zeros 879758
Zeros (%) 62.1%
Negatives 174
Negatives (%) 0.0%
  • 12 is skewed right (γ1 = 10.0452)

Quantile Statistics

Minimum -0.02941
5-th Percentile 0
Q1 0
Median 0
Q3 0.04494
95-th Percentile 0.25
Maximum 5
Range 5.0294
IQR 0.04494

Descriptive Statistics

Mean 0.04265
Standard Deviation 0.1106
Variance 0.01223
Sum 60433.3107
Skewness 10.0452
Kurtosis 269.9524
Coefficient of Variation 2.5924
  • 12 is not normally distributed (p-value 1.3275828986827148e-24)
  • 12 has 170446 outliers

13

numerical

Approximate Distinct Count 307718
Approximate Unique (%) 21.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.1448
Minimum -0.07564
Maximum 5
Zeros 37
Zeros (%) 0.0%
Negatives 5592
Negatives (%) 0.4%
  • 13 is skewed right (γ1 = 2.7447)

Quantile Statistics

Minimum -0.07564
5-th Percentile 0.004073
Q1 0.03103
Median 0.07188
Q3 0.1838
95-th Percentile 0.5656
Maximum 5
Range 5.0756
IQR 0.1528

Descriptive Statistics

Mean 0.1448
Standard Deviation 0.1855
Variance 0.03441
Sum 205213.1103
Skewness 2.7447
Kurtosis 21.6071
Coefficient of Variation 1.2807
  • 13 is not normally distributed (p-value 4.356695617026614e-20)
  • 13 has 141815 outliers

14

numerical

Approximate Distinct Count 98543
Approximate Unique (%) 7.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.07671
Minimum -0.01048
Maximum 3.5318
Zeros 28780
Zeros (%) 2.0%
Negatives 260
Negatives (%) 0.0%
  • 14 is skewed right (γ1 = 3.3068)

Quantile Statistics

Minimum -0.01048
5-th Percentile 0.000185
Q1 0.007271
Median 0.02802
Q3 0.1066
95-th Percentile 0.2883
Maximum 3.5318
Range 3.5423
IQR 0.0993

Descriptive Statistics

Mean 0.07671
Standard Deviation 0.1166
Variance 0.0136
Sum 108683.6301
Skewness 3.3068
Kurtosis 20.7888
Coefficient of Variation 1.5205
  • 14 is not normally distributed (p-value 2.364297239767447e-23)
  • 14 has 89183 outliers

15

numerical

Approximate Distinct Count 236974
Approximate Unique (%) 16.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.1095
Minimum -0.01854
Maximum 5
Zeros 122
Zeros (%) 0.0%
Negatives 297
Negatives (%) 0.0%
  • 15 is skewed right (γ1 = 9.0982)

Quantile Statistics

Minimum -0.01854
5-th Percentile 0.0007693
Q1 0.01572
Median 0.06394
Q3 0.1356
95-th Percentile 0.3563
Maximum 5
Range 5.0185
IQR 0.1198

Descriptive Statistics

Mean 0.1095
Standard Deviation 0.1873
Variance 0.03508
Sum 155120.7804
Skewness 9.0982
Kurtosis 163.1029
Coefficient of Variation 1.7108
  • 15 is not normally distributed (p-value 1.925690835580539e-21)
  • 15 has 90427 outliers

16

numerical

Approximate Distinct Count 286
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.00046563
Minimum 0
Maximum 5
Zeros 1412781
Zeros (%) 99.7%
Negatives 0
Negatives (%) 0.0%
  • 16 is skewed right (γ1 = 113.9587)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 0
Maximum 5
Range 5
IQR 0

Descriptive Statistics

Mean 0.00046563
Standard Deviation 0.04034
Variance 0.001627
Sum 659.7188
Skewness 113.9587
Kurtosis 13582.1409
Coefficient of Variation 86.6288
  • 16 is not normally distributed (p-value 4.226516058055491e-25)

17

numerical

Approximate Distinct Count 348
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.00058212
Minimum 0
Maximum 5
Zeros 1414488
Zeros (%) 99.8%
Negatives 0
Negatives (%) 0.0%
  • 17 is skewed right (γ1 = 108.1668)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 0
Maximum 5
Range 5
IQR 0

Descriptive Statistics

Mean 0.00058212
Standard Deviation 0.03762
Variance 0.001415
Sum 824.7551
Skewness 108.1668
Kurtosis 13168.8969
Coefficient of Variation 64.6238
  • 17 is not normally distributed (p-value 4.2265304105215105e-25)

18

numerical

Approximate Distinct Count 749827
Approximate Unique (%) 52.9%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2776
Minimum -0.1237
Maximum 5
Zeros 116
Zeros (%) 0.0%
Negatives 1579
Negatives (%) 0.1%
  • 18 is skewed right (γ1 = 2.1436)

Quantile Statistics

Minimum -0.1237
5-th Percentile 0.003805
Q1 0.04892
Median 0.1929
Q3 0.4463
95-th Percentile 0.816
Maximum 5
Range 5.1237
IQR 0.3974

Descriptive Statistics

Mean 0.2776
Standard Deviation 0.289
Variance 0.0835
Sum 393244.6778
Skewness 2.1436
Kurtosis 12.7118
Coefficient of Variation 1.0411
  • 18 is not normally distributed (p-value 5.662918806900734e-17)
  • 18 has 22201 outliers

19

numerical

Approximate Distinct Count 668912
Approximate Unique (%) 47.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2507
Minimum -0.1202
Maximum 5
Zeros 493
Zeros (%) 0.0%
Negatives 2089
Negatives (%) 0.1%
  • 19 is skewed right (γ1 = 1.7121)

Quantile Statistics

Minimum -0.1202
5-th Percentile 0.004766
Q1 0.03823
Median 0.1618
Q3 0.4182
95-th Percentile 0.7492
Maximum 5
Range 5.1202
IQR 0.38

Descriptive Statistics

Mean 0.2507
Standard Deviation 0.2637
Variance 0.06955
Sum 355254.2388
Skewness 1.7121
Kurtosis 7.9645
Coefficient of Variation 1.0518
  • 19 is not normally distributed (p-value 3.1424660210829224e-19)
  • 19 has 19570 outliers

20

numerical

Approximate Distinct Count 371989
Approximate Unique (%) 26.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2868
Minimum -0.1211
Maximum 5
Zeros 238
Zeros (%) 0.0%
Negatives 1471
Negatives (%) 0.1%
  • 20 is skewed right (γ1 = 2.0053)

Quantile Statistics

Minimum -0.1211
5-th Percentile 0.00908
Q1 0.06422
Median 0.2032
Q3 0.4674
95-th Percentile 0.7797
Maximum 5
Range 5.1211
IQR 0.4032

Descriptive Statistics

Mean 0.2868
Standard Deviation 0.281
Variance 0.07899
Sum 406290.3428
Skewness 2.0053
Kurtosis 11.6484
Coefficient of Variation 0.9801
  • 20 is not normally distributed (p-value 5.568767475325049e-15)
  • 20 has 19770 outliers

21

numerical

Approximate Distinct Count 372735
Approximate Unique (%) 26.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2851
Minimum -0.1333
Maximum 3.6841
Zeros 431
Zeros (%) 0.0%
Negatives 1819
Negatives (%) 0.1%
  • 21 is skewed right (γ1 = 1.3783)

Quantile Statistics

Minimum -0.1333
5-th Percentile 0.01154
Q1 0.06369
Median 0.203
Q3 0.4677
95-th Percentile 0.7955
Maximum 3.6841
Range 3.8174
IQR 0.404

Descriptive Statistics

Mean 0.2851
Standard Deviation 0.2698
Variance 0.07281
Sum 403886.5177
Skewness 1.3783
Kurtosis 3.6382
Coefficient of Variation 0.9466
  • 21 is not normally distributed (p-value 1.2779473966768583e-13)
  • 21 has 17798 outliers

22

numerical

Approximate Distinct Count 282912
Approximate Unique (%) 20.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2734
Minimum -0.1636
Maximum 5
Zeros 33562
Zeros (%) 2.4%
Negatives 11053
Negatives (%) 0.8%
  • 22 is skewed right (γ1 = 2.2549)

Quantile Statistics

Minimum -0.1636
5-th Percentile 0.001422
Q1 0.03847
Median 0.1667
Q3 0.4101
95-th Percentile 0.9357
Maximum 5
Range 5.1636
IQR 0.3716

Descriptive Statistics

Mean 0.2734
Standard Deviation 0.3113
Variance 0.09688
Sum 387331.2179
Skewness 2.2549
Kurtosis 15.884
Coefficient of Variation 1.1386
  • 22 is not normally distributed (p-value 4.942711451755045e-14)
  • 22 has 47714 outliers

23

numerical

Approximate Distinct Count 57646
Approximate Unique (%) 4.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.4545
Minimum -4
Maximum 5
Zeros 13307
Zeros (%) 0.9%
Negatives 13500
Negatives (%) 1.0%
  • 23 is skewed right (γ1 = 0.0941)

Quantile Statistics

Minimum -4
5-th Percentile 0.004011
Q1 0.1596
Median 0.4166
Q3 0.7825
95-th Percentile 0.9587
Maximum 5
Range 9
IQR 0.6229

Descriptive Statistics

Mean 0.4545
Standard Deviation 0.3717
Variance 0.1382
Sum 643989.2317
Skewness 0.09406
Kurtosis 5.6031
Coefficient of Variation 0.8178
  • 23 is not normally distributed (p-value 4.2476627764296936e-10)
  • 23 has 8204 outliers

24

numerical

Approximate Distinct Count 10911
Approximate Unique (%) 0.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.1858
Minimum -0.1243
Maximum 5
Zeros 264534
Zeros (%) 18.7%
Negatives 1418
Negatives (%) 0.1%
  • 24 is skewed right (γ1 = 2.4317)

Quantile Statistics

Minimum -0.1243
5-th Percentile 0
Q1 0.01967
Median 0.08812
Q3 0.2794
95-th Percentile 0.636
Maximum 5
Range 5.1243
IQR 0.2597

Descriptive Statistics

Mean 0.1858
Standard Deviation 0.2286
Variance 0.05226
Sum 263184.4349
Skewness 2.4317
Kurtosis 18.319
Coefficient of Variation 1.2307
  • 24 is not normally distributed (p-value 1.3173578119092558e-20)
  • 24 has 54914 outliers

25

numerical

Approximate Distinct Count 45952
Approximate Unique (%) 3.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.4243
Minimum -4
Maximum 5
Zeros 26584
Zeros (%) 1.9%
Negatives 31029
Negatives (%) 2.2%
  • 25 is skewed left (γ1 = -0.0247)

Quantile Statistics

Minimum -4
5-th Percentile 0.00047047
Q1 0.1443
Median 0.3867
Q3 0.6884
95-th Percentile 0.9843
Maximum 5
Range 9
IQR 0.5441

Descriptive Statistics

Mean 0.4243
Standard Deviation 0.3673
Variance 0.1349
Sum 601168.6073
Skewness -0.02471
Kurtosis 5.8152
Coefficient of Variation 0.8657
  • 25 is not normally distributed (p-value 1.8406710887445934e-10)
  • 25 has 8630 outliers

26

numerical

Approximate Distinct Count 16
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 2.4122e-05
Minimum 0
Maximum 2
Zeros 1416779
Zeros (%) 100.0%
Negatives 0
Negatives (%) 0.0%
  • 26 is skewed right (γ1 = 227.1958)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 0
Maximum 2
Range 2
IQR 0

Descriptive Statistics

Mean 2.4122e-05
Standard Deviation 0.004775
Variance 2.2801e-05
Sum 34.1765
Skewness 227.1958
Kurtosis 58723.3344
Coefficient of Variation 197.956
  • 26 is not normally distributed (p-value 4.226514112528877e-25)

27

numerical

Approximate Distinct Count 356059
Approximate Unique (%) 25.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2886
Minimum -0.147
Maximum 5
Zeros 170
Zeros (%) 0.0%
Negatives 1766
Negatives (%) 0.1%
  • 27 is skewed right (γ1 = 1.936)

Quantile Statistics

Minimum -0.147
5-th Percentile 0.01211
Q1 0.06682
Median 0.2045
Q3 0.4644
95-th Percentile 0.7957
Maximum 5
Range 5.147
IQR 0.3976

Descriptive Statistics

Mean 0.2886
Standard Deviation 0.2818
Variance 0.0794
Sum 408828.4148
Skewness 1.936
Kurtosis 10.4887
Coefficient of Variation 0.9765
  • 27 is not normally distributed (p-value 2.239830689018362e-12)
  • 27 has 20467 outliers

28

numerical

Approximate Distinct Count 380
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 4.9993e-05
Minimum 0
Maximum 1.2138
Zeros 1416399
Zeros (%) 100.0%
Negatives 0
Negatives (%) 0.0%
  • 28 is skewed right (γ1 = 138.0468)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 0
Maximum 1.2138
Range 1.2138
IQR 0

Descriptive Statistics

Mean 4.9993e-05
Standard Deviation 0.00565
Variance 3.192e-05
Sum 70.8318
Skewness 138.0468
Kurtosis 20984.151
Coefficient of Variation 113.0107
  • 28 is not normally distributed (p-value 4.226514232309892e-25)

29

numerical

Approximate Distinct Count 1420
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2019
Minimum -0.1013
Maximum 5
Zeros 290219
Zeros (%) 20.5%
Negatives 9
Negatives (%) 0.0%
  • 29 is skewed right (γ1 = 2.3134)

Quantile Statistics

Minimum -0.1013
5-th Percentile 0
Q1 0.008
Median 0.08621
Q3 0.25
95-th Percentile 1
Maximum 5
Range 5.1013
IQR 0.242

Descriptive Statistics

Mean 0.2019
Standard Deviation 0.2878
Variance 0.08281
Sum 286080.9242
Skewness 2.3134
Kurtosis 9.9679
Coefficient of Variation 1.4252
  • 29 is not normally distributed (p-value 8.568212375186991e-15)
  • 29 has 137111 outliers

30

numerical

Approximate Distinct Count 359684
Approximate Unique (%) 25.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2897
Minimum -0.1333
Maximum 3.8545
Zeros 291
Zeros (%) 0.0%
Negatives 2075
Negatives (%) 0.1%
  • 30 is skewed right (γ1 = 1.4425)

Quantile Statistics

Minimum -0.1333
5-th Percentile 0.01247
Q1 0.06882
Median 0.2179
Q3 0.4656
95-th Percentile 0.7968
Maximum 3.8545
Range 3.9879
IQR 0.3968

Descriptive Statistics

Mean 0.2897
Standard Deviation 0.2693
Variance 0.07252
Sum 410428.0661
Skewness 1.4425
Kurtosis 4.2629
Coefficient of Variation 0.9296
  • 30 is not normally distributed (p-value 5.674602313974337e-12)
  • 30 has 19826 outliers

31

numerical

Approximate Distinct Count 6847
Approximate Unique (%) 0.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.1928
Minimum -0.09112
Maximum 5
Zeros 349628
Zeros (%) 24.7%
Negatives 1024
Negatives (%) 0.1%
  • 31 is skewed right (γ1 = 3.013)

Quantile Statistics

Minimum -0.09112
5-th Percentile 0
Q1 0.006543
Median 0.1329
Q3 0.3086
95-th Percentile 0.6137
Maximum 5
Range 5.0911
IQR 0.3021

Descriptive Statistics

Mean 0.1928
Standard Deviation 0.2223
Variance 0.0494
Sum 273130.6066
Skewness 3.013
Kurtosis 37.9967
Coefficient of Variation 1.1529
  • 31 is not normally distributed (p-value 6.174044705349451e-13)
  • 31 has 20206 outliers

32

numerical

Approximate Distinct Count 238
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.05249
Minimum -0.25
Maximum 5
Zeros 995593
Zeros (%) 70.3%
Negatives 28
Negatives (%) 0.0%
  • 32 is skewed right (γ1 = 8.1927)

Quantile Statistics

Minimum -0.25
5-th Percentile 0
Q1 0
Median 0
Q3 0.02326
95-th Percentile 0.25
Maximum 5
Range 5.25
IQR 0.02326

Descriptive Statistics

Mean 0.05249
Standard Deviation 0.1239
Variance 0.01535
Sum 74362.7734
Skewness 8.1927
Kurtosis 235.8816
Coefficient of Variation 2.3609
  • 32 is not normally distributed (p-value 1.2010744230854874e-24)
  • 32 has 278867 outliers

33

numerical

Approximate Distinct Count 2242
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.08535
Minimum -0.3999
Maximum 5
Zeros 98221
Zeros (%) 6.9%
Negatives 1957
Negatives (%) 0.1%
  • 33 is skewed right (γ1 = 7.1375)

Quantile Statistics

Minimum -0.3999
5-th Percentile 0
Q1 0.01087
Median 0.0375
Q3 0.08591
95-th Percentile 0.3684
Maximum 5
Range 5.3999
IQR 0.07504

Descriptive Statistics

Mean 0.08535
Standard Deviation 0.1531
Variance 0.02345
Sum 120921.7883
Skewness 7.1375
Kurtosis 148.7047
Coefficient of Variation 1.7943
  • 33 is not normally distributed (p-value 1.813083449345787e-18)
  • 33 has 125058 outliers

34

numerical

Approximate Distinct Count 80798
Approximate Unique (%) 5.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.2932
Minimum -0.2485
Maximum 5
Zeros 116955
Zeros (%) 8.3%
Negatives 2009
Negatives (%) 0.1%
  • 34 is skewed right (γ1 = 1.6971)

Quantile Statistics

Minimum -0.2485
5-th Percentile 0
Q1 0.08751
Median 0.2123
Q3 0.4731
95-th Percentile 0.8137
Maximum 5
Range 5.2485
IQR 0.3856

Descriptive Statistics

Mean 0.2932
Standard Deviation 0.264
Variance 0.0697
Sum 415434.0529
Skewness 1.6971
Kurtosis 14.2395
Coefficient of Variation 0.9004
  • 34 is not normally distributed (p-value 2.27430307000152e-11)
  • 34 has 4716 outliers

35

numerical

Approximate Distinct Count 81247
Approximate Unique (%) 5.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.289
Minimum -0.2485
Maximum 5
Zeros 110732
Zeros (%) 7.8%
Negatives 2008
Negatives (%) 0.1%
  • 35 is skewed right (γ1 = 1.685)

Quantile Statistics

Minimum -0.2485
5-th Percentile 0
Q1 0.07296
Median 0.208
Q3 0.4719
95-th Percentile 0.8049
Maximum 5
Range 5.2485
IQR 0.3989

Descriptive Statistics

Mean 0.289
Standard Deviation 0.2652
Variance 0.07033
Sum 409511.5784
Skewness 1.685
Kurtosis 14.0013
Coefficient of Variation 0.9175
  • 35 is not normally distributed (p-value 4.7627745139792535e-12)
  • 35 has 4100 outliers

36

numerical

Approximate Distinct Count 51874
Approximate Unique (%) 3.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.02269
Minimum 0
Maximum 1.5474
Zeros 1313347
Zeros (%) 92.7%
Negatives 0
Negatives (%) 0.0%
  • 36 is skewed right (γ1 = 4.6221)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 0.2557
Maximum 1.5474
Range 1.5474
IQR 0

Descriptive Statistics

Mean 0.02269
Standard Deviation 0.09487
Variance 0.009
Sum 32151.211
Skewness 4.6221
Kurtosis 22.0397
Coefficient of Variation 4.1807
  • 36 is not normally distributed (p-value 4.268438247530034e-25)
  • 36 has 103478 outliers

37

numerical

Approximate Distinct Count 31728
Approximate Unique (%) 2.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22669200
Mean 0.01113
Minimum 0
Maximum 1.1255
Zeros 1369420
Zeros (%) 96.7%
Negatives 0
Negatives (%) 0.0%
  • 37 is skewed right (γ1 = 6.9768)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 0
Maximum 1.1255
Range 1.1255
IQR 0

Descriptive Statistics

Mean 0.01113
Standard Deviation 0.06904
Variance 0.004766
Sum 15764.503
Skewness 6.9768
Kurtosis 51.4064
Coefficient of Variation 6.2045
  • 37 is not normally distributed (p-value 4.231046471553706e-25)
  • 37 has 47405 outliers

y

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 96344100
  • The largest value (0.0) is over 47.12 times larger than the second largest value (1.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 2833650
  • The top 2 categories (0.0, 1.0) take over 50.0%
  • The largest value (00) is over 47.12 times larger than the second largest value (10)
  • y has words of constant length

Interactions

Correlations

Missing Values

Report generated with DataPrep

Evaluating the results

The report shows that 37 of the 38 input features are numerical; the one exception, feature 7, has a single constant value (0.0) across the whole dataset (for every timestep and every entity). According to the report, none of the numerical features is normally distributed and most are skewed to the right: for many of them only half of a bell curve is present, with values concentrated around 0 and few or no negative measurements. The box plots show a high number of potential anomalies. The dataset has no missing values.
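The shape of a feature's distribution can also be checked numerically through its skewness and its share of zeros, the two quantities the report flags for most SMD features. The sketch below uses the population form of the skewness coefficient (third central moment over the 1.5th power of the second); which estimator DataPrep uses is not stated in this notebook, so treat the formula choice as an assumption. The sample values are made up.

```python
import math


def skewness(values):
    """Population skewness g1 = m3 / m2**1.5 (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return math.nan  # undefined for a constant column
    return m3 / m2 ** 1.5


def zero_fraction(values):
    """Share of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Made-up sample: half zeros and a long right tail
toy = [0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.5, 2.0]
print(skewness(toy) > 0)      # a right tail gives positive skewness
print(zero_fraction(toy))
```

A positive coefficient corresponds to "skewed right" in the report and a negative one to "skewed left"; a constant column has no defined skewness.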

Looking at the Correlation Matrix, we observe that the vast majority of features are positively correlated, and a number of them are highly correlated.
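Highly correlated feature pairs, the candidates for omitting one feature, can also be listed programmatically. The following is a minimal sketch on a made-up dataframe; the helper name `highly_correlated_pairs` and the 0.9 threshold are illustrative assumptions, not values taken from the study.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9, method="pearson"):
    """List column pairs whose absolute correlation reaches `threshold`."""
    # Restrict to numeric columns so non-numeric ones do not break corr()
    corr = df.select_dtypes(include="number").corr(method=method)
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            # NaN (e.g. from a constant column) fails this comparison and is skipped
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Made-up data: b is close to 2 * a, c is only weakly related to either
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = highly_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

Passing `method="spearman"` or `method="kendall"` switches to a rank-based coefficient, which can be a reasonable choice for features as skewed as the ones in this report.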