Data Visualization 101 — Part I

Data visualization using python and common libraries


In order to work with data effectively, it is crucial to understand its basics. With data visualization we can improve our understanding of the information by presenting it in a visual context (such as graphs, charts, etc.), which allows trends and patterns to be seen more easily. That way we get a clearer picture that helps us gain better insights and make better decisions.

Python offers many data visualization tools, but the most common ones are without a doubt Matplotlib and Seaborn. Seaborn is based on Matplotlib (a kind of “wrapper” library) and provides more visualization options; it is also highly useful for statistical visualization. Both libraries work extremely well with Pandas DataFrames and Series, and both are easy to use, intuitive, and very well documented.

For all of those reasons, throughout this article we will mainly use those two libraries, and our datasets will be presented as pandas DataFrames. The datasets we will be using are mostly well-known ones, namely the famous Titanic dataset, Tips dataset, Boston Housing dataset, and Iris dataset, but also others such as the Heart Disease UCI dataset and the Airbnb Amsterdam Listing dataset. All datasets can be found on Kaggle.

We will be focusing on data visualization of the given data, and not visualization of the model results. Here we will cover the basic graph types and will go over their qualities, advantages and disadvantages.


If you would like to try it all yourself and see the code for additional functions and plots, check out this GitHub link.

First, we’ll import the necessary libraries:

# importing all the libraries we need:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1. Choosing the right graph

When we first come across a dataset, we should classify each feature's type: categorical (nominal or ordinal) or numerical (discrete or continuous). For each type of feature, different kinds of graphs can be helpful.

For example, with the iris dataset, the target feature is “species”, which is a nominal categorical feature. Another feature of the iris dataset is “petal length (cm)”, a continuous numerical feature.
In order to examine each of those features' distributions, we should use different graphs. But if we would like to see how one feature affects the other, we will need a different kind of graph altogether.

  • For the distribution of a numerical feature it is common to use a Histogram. A histogram is a frequency distribution plot where the x-axis covers the total range of the variable's values and each bin (“bar”) covers a specified sub-range of values. The y-axis represents the frequency count for each bin. Histograms are widely used because they are very easy to understand.
  • For the distribution of a categorical feature, it is recommended to use a Count Plot. A count plot shows the total count of observations in each categorical bin using bars. In a way, it is a histogram for categorical features, where the x-axis contains all the values of the categorical feature and the y-axis shows the number of times each value appears.
  • When we wish to evaluate how one feature might affect another, we can use a Bar Plot. That holds, for the most part, when at least one feature is categorical.
count_bar_hist(iris, "species", "petal length (cm)", iris_target_names.values(),
               ["Count Plot\nof the target feature",
                "Bar Plot\nof the target to another feature",
                "Histogram\nof a numeric feature"],
               major_title="3 Types of basic distribution plots")
Different feature types, different graph kinds: count plot, bar plot & histogram

This function organizes the 3 plot types neatly in a row, but we can also use just part of the function's code to plot only one of these graphs.

2. Distribution of a single numeric feature

To showcase the distribution of a numerical feature we can use, as mentioned earlier, a Histogram, or a KDE (Kernel Density Estimate) plot.
Similar to a histogram, the x-axis of a KDE plot covers the range of the variable's values, but the y-axis represents the probability density, so a KDE plot presents a probability density curve of the data. One of the advantages of a KDE plot is that, unlike a histogram, we don't lose information due to binning.
Histograms can also show density on their y-axis (simply set stat="density" in seaborn.histplot, as shown in the code below).

A seaborn.distplot presents a histogram (with density on the y-axis) with a KDE curve on top of it. That way you get the benefits of both a histogram and a KDE plot. (Note that in recent Seaborn versions distplot is deprecated in favor of histplot and displot.)

hist_kde_dist(iris, "petal length (cm)",
              ["Histogram (Density)", "KDE plot", "Distplot"],
              "3 Basic Types of Numeric Feature Distribution Plots")
Histogram & KDE

3. Distribution of one categorical feature

When it comes to categorical data distribution, besides the count plot, a good old Pie Chart can do the trick. It is very common to annotate the pie chart with the percentage of each value. Additionally, we can upgrade the classic appearance of the pie chart in different ways, such as giving it a ring shape (aka a donut chart), as in the following example:

We can even create “gaps” between the different parts of the pie chart by adding the explode parameter to the matplotlib pie function.
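A minimal sketch of both tricks at once: explode pushes every wedge away from the center, and a wedge width below 1 turns the pie into a ring. The counts here are illustrative stand-ins for the value counts of the tips dataset's day column:

```python
# Sketch: ring-shaped (donut) pie chart with gaps between the wedges.
import matplotlib.pyplot as plt

counts = [62, 19, 87, 76]            # illustrative counts (stand-in for tips["day"].value_counts())
labels = ["Thur", "Fri", "Sat", "Sun"]
explode = [0.05] * len(counts)       # offset every wedge slightly from the center -> "gaps"

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    counts, labels=labels, explode=explode,
    autopct="%1.1f%%",               # annotate the percentage of each value
    wedgeprops={"width": 0.4},       # width < 1 hollows the pie into a ring (donut)
)
ax.set_title("Ring-shaped Pie Chart With Gaps")
```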

round_pie_chart(tips, "day",
                title="Ring-shape Pie Chart With Gapping")
Donut chart

Research has shown that people tend to read data better when it is presented in a count plot (or bar plots in general) than in a pie chart. The reason is that people struggle to judge which of a pie's parts is greater and by how much, while in a bar plot it is very easy to distinguish each group's size and how much the groups differ.

We can conclude that a bar chart is beneficial for precise evaluation of the size and percentage of each part. A pie chart, on the other hand, is usually used when the sum of all same-value cases adds up to a meaningful whole, as it is mainly built to visualize the contribution of each part to the whole.

palette, style = "Set3", "white"
df, cat_col = tips, "day"
figure, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4))
figure.suptitle("Different ways to visualize categorical data", y=1.08, fontsize=20)
ax1 = sns.countplot(x=cat_col, data=df, palette=palette, ax=ax1)
ax1.set_title("Classic Bar Plot", y=1.08, fontsize=15)
# Hide the right and top spines
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
values, counts = zip(*dict(df[cat_col].value_counts().sort_index()).items())
colors = list(sns.color_palette(palette))[:len(values)]
ax2.pie(counts, colors=colors, labels=values)
ax2.set_title("Classic Pie Chart", y=1.08, fontsize=15)
round_pie_chart(df, cat_col, ax=ax3, title="Ring-shaped Pie Chart")
plt.tight_layout();
single-categorical-feature graph

In the end, which of the two (or three) to use is your choice to make.

4. Plots for only two categorical features

Sometimes we wish to demonstrate the distribution of two categorical features at the same time. In this case we have a few options:

  • The Side-by-Side Bar Chart (also known as a grouped or double bar chart) shows how the data is distributed across the values of two categorical features, unlike the classic bar chart, which handles only one categorical feature. This kind of graph exhibits a primary and a secondary distribution of the data, so we can see how the second categorical feature changes within each value of the first.
  • The Stacked Bar Chart is very similar to the double bar chart, only instead of drawing a secondary-level group of bars for each value of the first feature, each bar is broken into colored sub-sections that represent the proportions of the second feature's values. In a certain way, it is a combination of a bar chart on the primary level and a pie chart on the secondary level. On the one hand, this kind of chart is easier on the eyes because it has fewer bars on the horizontal axis; on the other hand, as in a pie chart, the second-level distribution becomes more difficult to understand and compare.
  • The Nested Pie Chart (aka a double or multi-level pie chart) is an advanced version of the classic pie chart. It contains a set of concentric rings, where the sizes of the secondary feature's values are proportional to the total size of each value of the primary feature. Similar to the nested pie chart there is the Nested Donut Chart (also called a multi-level doughnut chart), which in this article we will still refer to as a double pie chart (because that is basically what it is). This kind of chart is not limited to two features as presented here, but the more layers it has, the harder it is for the reader to understand. Therefore I recommend comparing no more than two features at once.
figure, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
figure.suptitle("Different ways to visualize multi-featured categorical data", y=1.08, fontsize=20)
ax1 = sns.countplot(x="day", hue="sex",
                    data=tips, palette="Pastel1", ax=ax1)
ax1.set_title("Side-by-side Bar plot", y=1.1, fontsize=15)
ax1.spines['right'].set_visible(False)  # Hide right spine
ax1.spines['top'].set_visible(False)    # Hide top spine
ax2 = sns.histplot(data=tips, x="day", hue="sex", multiple="stack",
                   palette="Pastel1", edgecolor="w", lw=20, ax=ax2)
ax2.set_title("Stacked Bar plot", y=1.1, fontsize=15)
double_pie_chart(tips, "day", "sex", ax=ax3,
                 title="Double Pie chart")

Knowing the strengths and weaknesses of each plot type will help us determine which plot is better suited to our needs.

multi-categorical-feature graphs

If you want to know how to generate the double pie chart presented above, the code for the function below is what you are looking for:

double_pie_chart(tips, "day", "sex", title= "A Double Pie Chart")
Nested pie chart
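If you just want the core idea without the full helper, a nested donut can be sketched by drawing two pies of different radii on the same axes. The counts below are illustrative stand-ins for a day-by-sex breakdown of the tips data:

```python
# Sketch: nested (double) pie chart with plain matplotlib.
# Outer ring: primary feature; inner ring: its secondary breakdown.
import numpy as np
import matplotlib.pyplot as plt

outer_labels = ["Sat", "Sun"]          # primary categorical feature (e.g. day)
inner_counts = np.array([[59, 28],     # illustrative secondary breakdown (e.g. sex) per day
                         [58, 18]])
outer_counts = inner_counts.sum(axis=1)

fig, ax = plt.subplots()
ax.pie(outer_counts, labels=outer_labels, radius=1.0,
       wedgeprops={"width": 0.3, "edgecolor": "w"})   # outer ring
ax.pie(inner_counts.flatten(), radius=0.7,
       wedgeprops={"width": 0.3, "edgecolor": "w"})   # inner ring, aligned with the outer wedges
ax.set_title("Nested Donut Chart")
```

Keeping the inner counts in the same order as the outer ones is what keeps the two rings aligned.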

5. Bar plot comparison for different DataFrames

In many cases we would like to split our data into train and test sets, or into train, validation, and test sets. The split can be done in many ways, such as a time-series split, a random split, a stratified split, and so on.

Depending on the split technique we choose, the data may not be evenly distributed across all the features, which may affect our model's performance. For that reason, we should check for even distribution across all data sets.

Let's take the following example: we will create two train-validation-test splits from the Titanic dataset, each by a different method, one by a random split and the other by a split stratified on the sex feature.

So we get the following DataFrames:

from sklearn.model_selection import train_test_split

# randomly split titanic datasets:
rand_set = [[X_train_r, X_val_r, X_test_r],
            [y_train_r, y_val_r, y_test_r]]
# titanic datasets split with stratification by sex:
strat_set = [[X_train_s, X_val_s, X_test_s],
             [y_train_s, y_val_s, y_test_s]]
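A sketch of how these six sets could be produced with two chained train_test_split calls; the tiny DataFrame below is a stand-in for the Titanic features, and the 60/20/20 proportions are an assumption for illustration:

```python
# Sketch: train/validation/test splits, once random and once stratified on "sex".
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the Titanic features and target.
X = pd.DataFrame({"sex": ["male", "female"] * 50,
                  "embarked": ["S", "C", "Q", "S", "S"] * 20})
y = pd.Series([0, 1] * 50, name="survived")

# Random split: 60% train, then the remaining 40% halved into validation and test.
X_train_r, X_tmp, y_train_r, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val_r, X_test_r, y_val_r, y_test_r = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Stratified split on sex: every set keeps the same male/female proportion.
X_train_s, X_tmp, y_train_s, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=X["sex"])
X_val_s, X_test_s, y_val_s, y_test_s = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=X_tmp["sex"])

rand_set = [[X_train_r, X_val_r, X_test_r], [y_train_r, y_val_r, y_test_r]]
strat_set = [[X_train_s, X_val_s, X_test_s], [y_train_s, y_val_s, y_test_s]]
```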

To see whether the data was evenly distributed between all data sets of both splits, we can use the following function, which presents the results as a bar plot.

For the purpose of this article, I chose to present the results for the “Embarked” column.

plot_bar_compare(rand_set[0], 'embarked',
                 title='Embarked distribution by random splitting')
plot_bar_compare(strat_set[0], 'embarked',
                 title='Embarked distribution by sex-based splitting')
comparison via simple bar charts

So in our example, we can see that the data sets from the random split were divided evenly when looking at the “Embarked” feature. This is not the case for the sex-based split data sets, which were evenly divided at the “Sex” feature but not so much at the “Embarked” feature.

This kind of analysis can help us a lot in choosing the right model to work with, or better yet, in making an educated decision about how to split or handle our data.

In the last example we compared the train, validation, and test sets of both splits. In the next example, we will check the data distribution between the datasets of the same split, and then set the two splits against each other. Hence we will use a grouped bar plot.

To make our analysis easier, we can merge all the same-split data sets into one whole DataFrame (as it was originally), only this time with an added column (“group”) that assigns each sample to its previous dataset, like this:
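A sketch of how such a merged DataFrame could be built with pandas; the three tiny frames below are stand-ins for the train, validation, and test sets of one split:

```python
# Sketch: re-assembling same-split sets into one DataFrame with a "group" column
# recording which set each sample came from.
import pandas as pd

# Tiny stand-ins for the train/validation/test sets of one split.
X_train = pd.DataFrame({"embarked": ["S", "C", "S"]})
X_val = pd.DataFrame({"embarked": ["Q", "S"]})
X_test = pd.DataFrame({"embarked": ["C", "S"]})

split_df = pd.concat(
    [X_train.assign(group="train"),
     X_val.assign(group="validation"),
     X_test.assign(group="test")],
    ignore_index=True)
```

With the "group" column in place, a grouped bar plot of any feature against "group" compares the sets directly.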

bar_comparison_same_data([strat_split_df, random_split_df],
                         'embarked', 'group',
                         ["Sex-based split", 'Random split'],
                         "Different split types comparison");
comparison via grouped bar charts

Here we can see slightly better that the two splits gave similar results for the “Embarked” feature, but there are a few variations between the datasets, and the variations in the sex-based split datasets are more notable.

6. Plots for two numeric features

When we have a pair of numerical features, and we want to see if they are related or to examine how one numerical feature influences the other, there are numerous plotting options.
That being the case, we can use one of the following plots:

  • A Scatter Plot is a common way to display the relationship between two numerical features. This kind of plot can be extremely helpful when we want to determine whether there is any kind of correlation or pattern between the two features. Each sample is presented as an (x, y) point: the x-axis value of each point is determined by the value of the first numerical feature for that sample, and the y-axis value by the second numerical feature accordingly. Hence the name “scatter”: all data points look as if they are scattered across the graph.
  • A Regression Line shows the overall trend of the data. It is based on a statistical method that models the relation between two numerical features, and it is formed on the basis of scatter-plot data (several (x, y) data points). In general, the regression line helps to estimate points on the graph where part of the data is missing.
    Its main advantage is that it smooths out the noise we get in a scatter plot, so we can see a clear linear trend in the data.
  • Regression lines are often added on top of another plot (as a kind of annotation). The most familiar format is the combination of a scatter plot and a regression line, which Seaborn calls a regplot. Thereby we get the general trend of the data without losing any information.

Let’s see an example for all three using the tips dataset:

scatter_line_regplot(tips, "total_bill", "tip",
                     ["Simple Scatter plot",
                      "Simple Regression line",
                      "Regplot: Scatter + Regression line"],
                     "3 Basic Types of 2 Numeric Features Plots")
scatter plot & regression line
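The same three panels can be drawn with plain Seaborn as well. The data below is a synthetic stand-in for the tips columns, with the tip generated as a noisy linear function of the bill:

```python
# Sketch: scatter plot, regression line alone, and regplot (both combined).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
total_bill = rng.uniform(5, 50, size=60)
df = pd.DataFrame({"total_bill": total_bill,
                   "tip": 0.15 * total_bill + rng.normal(0, 1, size=60)})  # noisy linear relation

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
sns.scatterplot(x="total_bill", y="tip", data=df, ax=ax1)             # points only
ax1.set_title("Scatter plot")
sns.regplot(x="total_bill", y="tip", data=df, scatter=False, ax=ax2)  # fitted line only
ax2.set_title("Regression line")
sns.regplot(x="total_bill", y="tip", data=df, ax=ax3)                 # points + fitted line
ax3.set_title("Regplot")
fig.tight_layout()
```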

7. Multi numeric features scatter plot

The following function creates an advanced scatter plot, in which we can see the relationships between several numerical features.

As before, the x and y axes represent two numerical features, but now the color and the size of each dot represent two additional numeric features, resulting in a more complex and fuller picture of how the data behaves.

multi_feature_scatter(iris, "sepal length (cm)", "sepal width (cm)",
                      "petal length (cm)", "petal width (cm)",
                      title="4 numeric features scatter plot")
multi-numeric-features scatter plot
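Without the helper function, Seaborn's scatterplot can encode the two extra numeric features directly via its hue and size parameters. The DataFrame below is a random stand-in for the iris columns:

```python
# Sketch: four numeric features in one scatter plot via x, y, hue, and size.
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "sepal length (cm)": rng.uniform(4, 8, 50),
    "sepal width (cm)": rng.uniform(2, 4.5, 50),
    "petal length (cm)": rng.uniform(1, 7, 50),
    "petal width (cm)": rng.uniform(0.1, 2.5, 50),
})

ax = sns.scatterplot(
    data=df,
    x="sepal length (cm)", y="sepal width (cm)",
    hue="petal length (cm)",    # 3rd feature -> point color
    size="petal width (cm)",    # 4th feature -> point size
)
```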

If we wish to include an additional categorical feature to this graph, we could do it as follows:

multi_feature_scatter(tips, "total_bill", "tip", "size",
                      cat_col="sex",
                      title="4 features scatter plot:\n3 numerical features and 1 categorical")
multi-features scatter plot

Now the categorical feature values are represented as different markers (O's and X's), so the scatter plot no longer contains only numerical features, but a mixture of numerical and categorical ones.

8. Heatmaps: A cross-numeric features plot

Heatmaps are an easy-to-read visual tool that helps determine the correlation between several features at the same time. The strength and direction of each feature pair's correlation is portrayed by a color from a colorbar. This 2D correlation matrix is an important tool for data analysis, especially when looking at multiple numeric features at once.

Checking the correlation between each feature and the target feature, or simply the correlation between all features, can help us with feature-importance tasks and give us a sense of which features are the most relevant to our case.

Let's use the Boston housing dataset to see an example of a feature-correlation heatmap:

# using only a small portion of the Boston housing dataset
heatmap_corr(boston.iloc[:, :5], y_ticks_rotation=0,
             title=' spearman correlation\n matrix'.upper());
a correlation heatmap
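A minimal version of the same heatmap with plain Seaborn: compute a Spearman correlation matrix with pandas and pass it to sns.heatmap. The DataFrame below is random stand-in data with a few illustrative Boston-style column names:

```python
# Sketch: Spearman correlation heatmap with pandas + Seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["CRIM", "ZN", "INDUS", "NOX"])  # stand-in columns

corr = df.corr(method="spearman")            # pairwise Spearman correlations
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                 vmin=-1, vmax=1)            # fix the color scale to the full [-1, 1] range
ax.set_title("SPEARMAN CORRELATION MATRIX")
```

Fixing vmin and vmax keeps the colors comparable across different heatmaps.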

To sum it all up,

It is important to choose the right plot for our data: one that emphasizes the point we want to make while also suiting the data itself. Therefore we need to choose our graphs carefully so they do just that in the best way possible.

And just remember —

“Visualization gives you answers to questions you didn’t know you had.”

— Ben Shneiderman

For more information about the code above (plus some extra functions), please check out the following notebook.