Automate the exploratory data analysis (EDA) to understand the data faster and easier

Mochamad Kautzar Ichramsyah · Published in CodeX · Jul 11, 2023 · 11 min read


What is EDA?

EDA is one of the most important steps we take to understand a dataset better. Almost all data analytics or data science professionals go through this process before generating insights or building models. In practice, it takes a lot of time, depending on the complexity and completeness of the dataset: the more variables there are, the more exploration we need before moving on to the next steps.

That’s why, in R and Python, the most common programming languages for data analysis, several packages help us do that process faster and more easily, but not better. Why not better? Because they only show us a summary, before we dig deeper into whichever variables we find “interesting”.

The “80/20 rule” applies: 80 percent of a data analyst or scientist’s valuable time is spent simply finding, cleansing, and organizing data, leaving only 20 percent to perform analysis.
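
To make that concrete, here is a minimal Python sketch of the manual baseline these packages automate (the function name quick_eda is mine, purely illustrative): every report below essentially bundles commands like these together with plots.

# the routine commands the automated reports bundle together
import pandas as pd

def quick_eda(df: pd.DataFrame) -> None:
    print(df.shape)                    # number of observations and variables
    print(df.dtypes)                   # data type of each variable
    print(df.isnull().sum())           # missing values per variable
    print(df.describe(include="all"))  # central measures per variable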

Which libraries can we use?

In R, we can use these libraries:

  1. dataMaid
  2. DataExplorer
  3. SmartEDA

In Python, we can use these libraries:

  1. ydata-profiling
  2. dtale
  3. sweetviz
  4. autoviz

Let’s try each library listed above to see what it looks like and how it can help us do exploratory data analysis! In this post, I will use the iris dataset, which is commonly used when learning to code in R or Python.

In R, you can use this code to load the iris dataset:

# iris ships with base R, no need to load any packages
df <- iris
# use "head()" to show the first 6 rows
head(df)
Image 1. Load the `iris` dataset in R

In Python, you can use this code to load the iris dataset:

# need to import these first
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
# use load_iris
iris = load_iris()
# convert it into a pandas DataFrame
df = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['species']
)
# manually set the species column as a categorical variable
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# use ".head()" to show the first 5 rows
df.head()
Image 2. Load the `iris` dataset in Python
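
As a side note, if you would rather skip the conversion, seaborn also bundles a copy of the iris dataset, though its column names differ (sepal_length instead of sepal length (cm)):

# alternative: load iris as a ready-made DataFrame via seaborn
import seaborn as sns
df = sns.load_dataset("iris")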

R: dataMaid

First, we need to execute the simple code below:

# install the dataMaid library
install.packages("dataMaid")
# load the dataMaid library
library(dataMaid)
# use makeDataReport with HTML as output
makeDataReport(df, output = "html", replace = TRUE)

From the first snapshot (Image 3), we already get a lot of information about the iris dataset:

  1. The number of observations is 150.
  2. The number of variables is 5.
  3. Variable checks were performed depending on the data type of each variable, such as identifying miscoded missing values, levels with < 6 obs, and outliers (a manual sketch of checks in this spirit follows the image below).
Image 3. The first snapshot of the created report using `dataMaid` of the iris dataset
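
To build intuition for what such checks do, here is a rough Python sketch in the same spirit (dataMaid’s actual checks are more thorough; basic_checks and rare_level_threshold are just names I made up):

# a rough, hand-rolled version of per-variable checks
import pandas as pd

def basic_checks(df: pd.DataFrame, rare_level_threshold: int = 6) -> None:
    for col in df.columns:
        s = df[col]
        print(f"--- {col} ({s.dtype}) ---")
        print("missing:", s.isnull().sum())
        if s.dtype.kind in "if":  # numeric: flag outliers with the 1.5 * IQR rule
            q1, q3 = s.quantile([0.25, 0.75])
            iqr = q3 - q1
            outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
            print("possible outliers:", sorted(outliers.unique()))
        else:  # categorical: flag levels rarer than the threshold
            counts = s.value_counts()
            print("rare levels:", list(counts[counts < rare_level_threshold].index))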

From the second snapshot (Image 4):

  1. The summary table of the variables includes variable class, unique values, missing observations, and any problems detected. We can see that problems were detected for the Sepal.Width and Petal.Length variables.
  2. Central measures for Sepal.Length are provided, including a histogram giving us its univariate distribution.
  3. Sepal.Width has possible outlier values, as listed. That is why the summary table says problems were detected.
Image 4. The second snapshot of the created report using `dataMaid` of the iris dataset

From the third snapshot (Image 5):

  1. Petal.Length has possible outlier values, as listed.
  2. Central measures for Petal.Width are provided, including a histogram giving us its univariate distribution.
  3. Species, the target variable, is detected as a factor, and the counts are equal across the species: 50 each.
Image 5. The third snapshot of the created report using `dataMaid` of the iris dataset

Based on the data report above, created using dataMaid in R, we already get a lot of information about the iris dataset just by executing a single line of code. 😃

R: DataExplorer

First, we need to execute the simple code below:

# install the DataExplorer library
install.packages("DataExplorer")
# load the DataExplorer library
library(DataExplorer)
# use create_report
create_report(df)

From the first through the sixth snapshots (Images 6-11), the information we get is not much different from the previous package.

Image 6. The first snapshot of the created report using `DataExplorer` of the iris dataset
Image 7. The second snapshot of the created report using `DataExplorer` of the iris dataset
Image 8. The third snapshot of the created report using `DataExplorer` of the iris dataset
Image 9. The fourth snapshot of the created report using `DataExplorer` of the iris dataset
Image 10. The fifth snapshot of the created report using `DataExplorer` of the iris dataset
Image 11. The sixth snapshot of the created report using `DataExplorer` of the iris dataset

From the seventh snapshot (Image 12), we get a QQ plot for each numerical variable in the iris dataset.

Image 12. The seventh snapshot of the created report using `DataExplorer` of the iris dataset

From the eighth snapshot (Image 13), we get the correlation matrix of the variables in the iris dataset. We can see information such as:

  1. Petal.Width and Petal.Length have a strong positive correlation of 0.96, which means that in the iris dataset, the wider the petal, the longer it is.
  2. Species_setosa and Petal.Length have a strong negative correlation of -0.92, which means that in the iris dataset, the shorter the petal length, the more likely the species is setosa.
  3. Using the examples above as a guide, try drawing your own findings from this correlation matrix (a manual sketch of the computation follows the image below).
Image 13. The eighth snapshot of the created report using `DataExplorer` of the iris dataset
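
If you want to reproduce this matrix by hand, here is a minimal Python sketch (Python for continuity with the later sections; species has to be one-hot encoded first because correlation is only defined over numeric columns):

# compute the same correlation matrix manually
import pandas as pd

encoded = pd.get_dummies(df, columns=['species'], dtype=float)
print(encoded.corr().round(2))  # petal length vs. petal width should be ~0.96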

The ninth snapshot (Image 14) uses principal component analysis (PCA) to show the percentage of variance explained by the principal components; the labels indicate the cumulative percentage of explained variance, which reaches 62% here, and the higher, the better. A proper explanation of PCA deserves a post of its own. 😆

Image 14. The ninth snapshot of the created report using `DataExplorer` of the iris dataset
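
To see where that cumulative percentage comes from, here is a minimal scikit-learn sketch, again using the Python df from earlier (standardizing first, as is usual before PCA):

# cumulative explained variance, computed by hand
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df.drop(columns=['species']))
pca = PCA().fit(X)
print(pca.explained_variance_ratio_.cumsum())  # running share of variance explained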

The tenth snapshot (Image 15) shows the relative importance of each variable: Petal.Length has the highest importance at almost 0.5, followed by Petal.Width, and so on.

Image 15. The tenth snapshot of the created report using `DataExplorer` of the iris dataset
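
As an aside, one common way to produce this kind of ranking, though not necessarily the method the report uses internally, is the impurity-based importance from a random forest:

# illustrative only: variable importance via a random forest
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=['species'])
y = df['species']
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")  # the petal measurements typically dominate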

R: SmartEDA

First, we need to execute the simple code below:

# install the SmartEDA library
install.packages("SmartEDA")
# load the SmartEDA library
library(SmartEDA)
# use ExpReport
ExpReport(df, op_file = 'SmartEDA_df.html')

From Images 16, 17, 18, 23, and 24, the information we get is not much different from the previous packages.

Image 16. The first snapshot of the created report using `SmartEDA` of the iris dataset
Image 17. The second snapshot of the created report using `SmartEDA` of the iris dataset
Image 18. The third snapshot of the created report using `SmartEDA` of the iris dataset

Image 19 shows the density plot of each variable, including its skewness and kurtosis measurements, which tell us whether the data is normally distributed. Skewness and kurtosis also deserve a post of their own, I guess 😅

Image 19. The fourth snapshot of the created report using `SmartEDA` of the iris dataset
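
In the meantime, both statistics are one pandas call away; note that pandas reports excess kurtosis, so a normal distribution scores around 0, and other tools may define it slightly differently:

# skewness and kurtosis by hand
numeric = df.drop(columns=['species'])
print(numeric.skew())  # ~0 for a symmetric distribution
print(numeric.kurt())  # excess kurtosis: ~0 for a normal distribution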

Images 20, 21, and 22 show scatter plots between the numerical variables in the iris dataset, which tell us about the correlations visually. They carry information similar to the correlation matrix, only in graphical rather than numeric form (a one-line way to draw the whole grid yourself appears after the images below).

Image 20. The fifth snapshot of the created report using `SmartEDA` of the iris dataset
Image 21. The sixth snapshot of the created report using `SmartEDA` of the iris dataset
Image 22. The seventh snapshot of the created report using `SmartEDA` of the iris dataset
Image 23. The eighth snapshot of the created report using `SmartEDA` of the iris dataset
Image 24. The ninth snapshot of the created report using `SmartEDA` of the iris dataset
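
As promised, the whole grid of pairwise scatter plots is a single line in Python with seaborn:

# pairwise scatter plots, colored by species
import seaborn as sns
sns.pairplot(df, hue='species')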

R: Conclusion

Using the three packages above, we got a lot of information about the iris dataset, much faster than creating it manually. But it is not everything, which is why the title says “…faster and easier…”: the reports only give us a glimpse of the iris dataset. At least they show us what we can start working on rather than leaving us searching for a starting point, such as:

  1. There are no missing or miscoded values, so we can skip those steps.
  2. Outliers were detected in some variables, so we can start cleaning the data with a proper method for handling outlier values, rather than manually checking which variables have outliers one by one (see the sketch after this list).
  3. We can start handling variables that are not normally distributed, if needed.
  4. Based on the correlation matrix and scatter plots, we got a glimpse of which variables are strongly or weakly correlated.
  5. Principal component analysis (PCA) provides the percentage of variance explained by the principal components, with labels indicating the cumulative percentage of explained variance.
  6. The relative importance of each feature of the iris dataset is also shown in this automated EDA.
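
For point 2, here is a minimal sketch of one simple treatment, clipping a column to its 1.5 * IQR fences (in Python, using the sklearn column names; whether clipping is the right choice depends on your analysis):

# clip outliers in one column to the IQR fences
col = 'sepal width (cm)'
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)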

Python: ydata-profiling

First, we need to execute the simple code below:

# install the ydata-profiling package first (shell command):
#   pip install ydata-profiling
# load ProfileReport from the ydata_profiling package
from ydata_profiling import ProfileReport
# build the profile report
pr_df = ProfileReport(df)
# display pr_df (renders inline in a notebook)
pr_df
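
If you are working outside a notebook, the same report can be written to a standalone HTML file:

# save the report to an HTML file
pr_df.to_file('iris_profiling.html')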

Mostly, it shows similar information. I will mention some things that are quite different from the previous packages:

  1. In Image 26, we get a summary, in plain sentences, of which variables have a high correlation.
  2. Overall, the output is more interactive compared to the previous packages, because we can click to move to other tabs and select specific columns to display.
Image 25. The first snapshot of the created report using `ydata_profiling` of the iris dataset
Image 26. The second snapshot of the created report using `ydata_profiling` of the iris dataset
Image 27. The third snapshot of the created report using `ydata_profiling` of the iris dataset
Image 28. The fourth snapshot of the created report using `ydata_profiling` of the iris dataset
Image 29. The fifth snapshot of the created report using `ydata_profiling` of the iris dataset
Image 30. The sixth snapshot of the created report using `ydata_profiling` of the iris dataset
Image 31. The seventh snapshot of the created report using `ydata_profiling` of the iris dataset
Image 32. The eighth snapshot of the created report using `ydata_profiling` of the iris dataset
Image 33. The ninth snapshot of the created report using `ydata_profiling` of the iris dataset

Python: dtale

First, we need to execute the simple code below:

# install the dtale package first (shell command):
#   pip install dtale
# load dtale
import dtale
# use show
dtale.show(df)
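
A detail worth knowing: dtale.show returns an instance you can keep, which helps when running from a plain script instead of a notebook:

# keep the instance and open the grid in the default browser
d = dtale.show(df)
d.open_browser()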

The output of this package is very different from the previous packages in terms of how we use it. The content is quite similar, but it lets us explore more freely.

Image 34. The first snapshot of the created report using `dtale` of the iris dataset
Image 35. The second snapshot of the created report using `dtale` of the iris dataset
Image 36. The third snapshot of the created report using `dtale` of the iris dataset
Image 37. The fourth snapshot of the created report using `dtale` of the iris dataset

Python: sweetviz

First, we need to execute the simple code below:

# install the sweetviz package first (shell command):
#   pip install sweetviz
# load sweetviz
import sweetviz
# use analyze
analyze_df = sweetviz.analyze([df, "df"], target_feat = 'species')
# then show
analyze_df.show_html('analyze.html')
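
sweetviz’s distinctive feature is comparing two datasets in a single report; here is a minimal sketch using an arbitrary split of df (the 100/50 split is purely illustrative):

# compare two slices of the data side by side
first, second = df.iloc[:100], df.iloc[100:]
compare_df = sweetviz.compare([first, "First 100"], [second, "Last 50"])
compare_df.show_html('compare.html')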

Using this package, the UI and UX are very different. Please enjoy the show!

Image 38. The first snapshot of the created report using `sweetviz` of the iris dataset
Image 39. The second snapshot of the created report using `sweetviz` of the iris dataset

Human beings are visual creatures. It is often claimed that the human brain processes images 60,000 times faster than text and that 90 percent of the information transmitted to the brain is visual. Visual information makes it easier to collaborate and to generate new ideas that impact organizational performance. That is a big part of why data analysts spend so much of their time on data visualization.

Python: autoviz

First, we need to execute the simple code below:

# install the autoviz package first (shell command):
#   pip install autoviz
# load AutoViz_Class from autoviz
from autoviz import AutoViz_Class
# create an AutoViz_Class instance
av = AutoViz_Class()
# produce the AutoViz report of df
avt = av.AutoViz(
    "",
    sep=",",
    depVar="",
    dfte=df,
    header=0,
    verbose=1,
    lowess=False,
    chart_format="server",
    max_rows_analyzed=10000,
    max_cols_analyzed=10,
    save_plot_dir=None
)

Using the code above, several tabs are generated in the browser. The new things we can see using this package:

  1. The output is generated in multiple browser tabs, whereas the previous packages display all the output in one tab.
  2. A violin plot of each variable, a hybrid of the box plot and the kernel density plot. It still shows information similar to the previous packages (a quick manual sketch follows the images below).
Image 40. The first snapshot of the created report using `autoviz` of the iris dataset
Image 41. The second snapshot of the created report using `autoviz` of the iris dataset
Image 42. The third snapshot of the created report using `autoviz` of the iris dataset
Image 43. The fourth snapshot of the created report using `autoviz` of the iris dataset
Image 44. The fifth snapshot of the created report using `autoviz` of the iris dataset
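
For reference, drawing one of those violin plots by hand takes a single seaborn call:

# violin plot of petal length per species
import seaborn as sns
sns.violinplot(data=df, x='species', y='petal length (cm)')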

Python: Conclusion

Using the four packages above, we got a lot of information about the iris dataset, not too different from the R packages; still, having more perspectives is usually better than having fewer. Some notes:

  1. The output of the Python packages is mostly more interactive compared to the R packages.
  2. When installing the packages, some errors may occur. For dtale, the common one involves jinja and escape. You can find the solution by referring to this post.
  3. For some packages, the code is not as simple as in the R packages, but I don't think that's a major problem; as long as we are not too lazy to read the documentation, everything is fine.

Conclusion

Which one do I have to use? Which one is the best? Which one is the most compatible with my dataset?

It depends. I think anything that cuts the time we need for EDA is already a good thing. Let's explore each package explained above and use it wisely, not as the main solution. In my humble opinion, exploring the data should be the “fun” part of data analysis, so don't be afraid to get “dirty” by doing the EDA manually; sometimes the non-automated method is still the best. 👍

Thank you for reading!

Woah, I’ve just realized this post contains 44 images. If you’ve reached this point, I’m grateful that you wanted to read and learn about automating the EDA process through my post. I hope you enjoyed it and learned something you can use on your journey as a data analytics/science professional.

I am learning to write, mistakes are unavoidable, even when I try my best. If you find any problems/mistakes, please let me know!
