Data Visualization — with use cases

Subham Kumar Sahoo
9 min read · Apr 10, 2021

Here we will be covering 6 popular statistical plots along with their use cases in statistics, data science and machine learning.

A data scientist spends around 70% of their time on EDA (Exploratory Data Analysis) and data pre-processing. In the EDA part we rely heavily on visualization to build intuition about the data.

Here we will work with some basic and popular plots and try to draw some conclusions out of them regarding the data. We will also see where we can use those plots during data pre-processing.

Bar Plot

What kind of data is shown in a bar plot?

  • Categorical data: Both Ordinal (which have order like grades A>B>C) and Nominal (do not have a particular order like gender).
  • Numerical data: Discrete (as bins).

In the first two bar plots we have categorical data.

  • The first plot is about “no. of people of a locality who prefer vanilla or strawberry flavor more”. This is nominal data.
  • The second plot is about “no. of students in a class who score grade A, B or C in exam”. This is ordinal data.

The third plot has numerical data: the heights of the students in a class. As height is continuous in nature, we have to bin the data into ranges such as 150–160 cm, 160–170 cm, etc.

✔️✔️NOTE: Technically the third plot is a histogram, as it shows the distribution of numeric values, but the two terms are often used interchangeably.

Now let’s go through an example of a count plot (similar to a bar plot and a histogram) to examine a case of missing data.

Count Plot

It shows the counts of observations in each categorical bin using bars. It’s like a histogram for categorical variables.

I’ll be using a dummy data set to demonstrate its use.

The data was collected using a Google Form with Gender as a mandatory field and Age as a non-mandatory one.

There are 5 missing values in the Gender feature and 14 missing values in the Age feature.

Let’s try to draw some insights using count plot.

We will count the number of males and females in the Gender feature (ignoring missing values). We use the hue parameter to further divide the counts based on another feature/column.

Here we can see that more Age values are missing for females than for males. We might infer that many women did not want to mention their age.

Here the probability of an observation (Age) being missing depends on available information (Gender), so the values are not Missing Completely At Random (MCAR). So, when replacing missing values in Gender, rather than imputing them with the most frequent value, we can look at the respective Age values: if the Age value is missing, there is a higher chance that the person is female.
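A minimal sketch of such a count plot, assuming pandas and seaborn are installed. The data frame below is a small hypothetical stand-in, not the data set used in this post, and the Age_status column is a helper introduced here only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical form responses (illustrative only, not the article's data).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female",
               "Male", "Female", None, "Female", "Male"],
    "Age": [25, np.nan, np.nan, 31, np.nan, np.nan, 28, 40, np.nan, 22],
})

# Helper column: is Age present or missing for each respondent?
df["Age_status"] = np.where(df["Age"].isna(), "missing", "present")

# Ignore rows where Gender itself is missing, then split each bar by Age_status.
known_gender = df.dropna(subset=["Gender"])
ax = sns.countplot(data=known_gender, x="Gender", hue="Age_status")
ax.set_title("Age missing vs. present, by Gender")

# The same counts as a table, handy for checking what the plot shows.
missing_counts = pd.crosstab(known_gender["Gender"], known_gender["Age_status"])
print(missing_counts)
```

If the "missing" bar is clearly taller for one gender than the other, that is a hint that the missingness in Age depends on Gender.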

Histogram

Represents the distribution of numerical data. It gives us the number or proportion of data points that fall in a particular numerical range, called a bin. In the plots below, 150–160 is a bin.

Types

  • Histogram of counts (plot A): Shows the number of data points in a particular range. Plot A shows that there are 30, 20 and 10 students in the 150–160, 160–170 and 170–180 cm height ranges respectively.
  • Histogram of proportion (plot B): Shows the percentage of data points in a particular range, here 50%, 33.3% and 16.7% of students in the respective ranges.

We can get the bar heights of the proportion histogram by dividing each bar height of the counts histogram by the sum of all bars and multiplying by 100 (for %).
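The conversion can be sketched in a few lines of plain Python, using the three counts from plot A (the variable names are just illustrative):

```python
# Counts per height bin, taken from plot A in the text.
counts = [30, 20, 10]

total = sum(counts)
# Share of each bin as a percentage of all students.
proportions = [count / total * 100 for count in counts]

for count, pct in zip(counts, proportions):
    print(f"{count} students -> {pct:.1f}%")
```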

Difference between bar plot and histogram

  • Generally, a bar plot has categorical values on the X-axis, as in our first 2 plots (flavors and grades), whereas a histogram has numerical ranges or bins, as in our 3rd plot about the height of students.
  • In a bar plot we cannot change the category values on the X-axis, but in a histogram we can decide the number of bins and the bin size.
  • In a bar plot the order of categories (on the X-axis) and the shape of the plot do not matter, but in a histogram the bin values should be in increasing order (left to right) and the shape of the plot also provides valuable insights.

✔️✔️NOTE: The number of bins we should go for depends on the quality and features of the data. We will explore this in an upcoming blog.

🚀Practical time

You might have heard about skewness. Skewness measures the asymmetry of a distribution, i.e. how far it departs from a symmetric shape such as the normal distribution.

Effect: In skewed data, the tail region may act as outliers for a statistical model, and outliers adversely affect model performance, especially for regression-based models.

So we should track the skewness of the data and try to reduce it, bringing the data closer to a normal distribution for better model performance.

There are two types of skewness: left and right.

From: https://www.sigmamagic.com/blogs/how-to-interpret-skewness-and-kurtosis/

Let’s visualize the LoanAmount feature of Loan prediction dataset.

Here we can see the distribution is not normal and it is right skewed. This indicates that more people have applied for loans in the 0–250K range than for higher amounts.

Similarly, we can find other skewed features and apply transformations to them to bring them closer to a normal distribution.
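One common choice of transformation for right-skewed data is the logarithm. Below is a small sketch on synthetic values, assuming NumPy and pandas are available. The loan_amount series is a hypothetical stand-in generated here, not the LoanAmount column of the Loan prediction dataset, and the log transform is only one of several possible options.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a column such as LoanAmount.
rng = np.random.default_rng(42)
loan_amount = pd.Series(rng.lognormal(mean=5.0, sigma=0.6, size=2000))

# Sample skewness: positive values indicate a longer right tail.
skew_before = loan_amount.skew()

# log1p computes log(1 + x), which stays defined even if a value is 0.
log_amount = np.log1p(loan_amount)
skew_after = log_amount.skew()

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after log transform: {skew_after:.2f}")
```

A skewness close to 0 after the transform suggests that the feature is now roughly symmetric.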

We can use the Seaborn library to overlay a distribution curve on this histogram for better interpretation.

Box plot

Go-to plot for viewing outliers and data distribution.

A box plot is sometimes called a box-and-whisker plot. The box shows the quartiles of the data set, while the whiskers extend to show the rest of the distribution.

Box plots are very useful when you want to compare data between two groups. We can also easily visualize the outliers in a data set with them. An outlier is an observation that is distant from other observations or that does not follow the same trend as the other data points.

Let me start with a bit of basics.

  • Q1: 1st quartile, i.e. the 25th percentile of the data.
  • Q2: 2nd quartile, i.e. the 50th percentile of the data. Also called the median.
  • Q3: 3rd quartile, i.e. the 75th percentile of the data.
  • Whiskers: The limits outside of which data points are considered outliers.

But how do we calculate these values?

For Quartiles we can use the respective percentiles.

Ex: Percentiles for the values in a given data set can be calculated using the formula: n = (P/100) x N

where N = number of values in the data set, P = percentile, and n = ordinal rank of a given value (with the values in the data set sorted from smallest to largest). For example, take a class of 20 students who earned the following scores on their most recent test: 75, 77, 78, 78, 80, 81, 81, 82, 83, 84, 84, 84, 85, 87, 87, 88, 88, 88, 89, 90. These scores can be represented as a data set with 20 values in increasing order: {75, 77, 78, 78, 80, 81, 81, 82, 83, 84, 84, 84, 85, 87, 87, 88, 88, 88, 89, 90}.

We can find the score that marks the 25th percentile by plugging in known values into the formula and solving for n:

n = (25/100) x 20

n = 5

The fifth value in the data set is the score 80. This means that 80 marks the 25th percentile; of the students in the class, 25 percent earned a score of 80 or lower.
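The rank formula can be sketched in plain Python as follows. The helper name percentile_value is hypothetical, and rounding n up when it is not a whole number is an assumption made here; the text only covers the case where n comes out as an integer.

```python
import math

scores = [75, 77, 78, 78, 80, 81, 81, 82, 83, 84,
          84, 84, 85, 87, 87, 88, 88, 88, 89, 90]

def percentile_value(sorted_values, p):
    """Return the value at ordinal rank n = (P/100) x N."""
    n = math.ceil(p / 100 * len(sorted_values))  # assumption: round up
    n = max(n, 1)                                # guard for very small P
    return sorted_values[n - 1]                  # ranks are 1-based, indices 0-based

print(percentile_value(sorted(scores), 25))
```

Note that libraries such as NumPy interpolate between neighbouring values by default, so their percentiles can differ slightly from this rank-based version.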

For whiskers the formulas are:

  • left/bottom whisker: Q1 - 1.5*IQR
  • right/top whisker: Q3 + 1.5*IQR

where IQR (inter-quartile range) = Q3 - Q1

So, data points greater than the right whisker value or less than the left whisker value are outliers.
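These limits can be computed with the standard library alone. The sketch below reuses the 20 test scores from the percentile example and adds one hypothetical low score (55) so that there is an outlier to find. It assumes Python 3.8+ for statistics.quantiles; the "inclusive" method interpolates between values, so its quartiles may differ slightly from the rank formula above.

```python
import statistics

# The 20 test scores from the text, plus one hypothetical low score (55).
data = sorted([75, 77, 78, 78, 80, 81, 81, 82, 83, 84,
               84, 84, 85, 87, 87, 88, 88, 88, 89, 90, 55])

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr   # left/bottom whisker limit
upper_limit = q3 + 1.5 * iqr   # right/top whisker limit

outliers = [x for x in data if x < lower_limit or x > upper_limit]
print(q1, q3, iqr, lower_limit, upper_limit, outliers)
```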

🚀Practical time

Here we will plot a box plot of the sepal width feature of the Iris data set, categorized by Species.

So, the black dots in the Virginica box plot are the outliers. In this way we can easily figure out which features have outliers and treat them accordingly.
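A sketch of how such a plot could be produced, assuming seaborn and scikit-learn are installed. Loading Iris through scikit-learn is a choice made here; the post does not say how the data was loaded, and the species column is added only for labelling.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook
import seaborn as sns
from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
iris = bunch.frame.copy()

# Map the numeric target (0, 1, 2) to the species names.
iris["species"] = iris["target"].map(dict(enumerate(bunch.target_names)))

# One box per species; points beyond the whiskers are drawn as individual markers.
ax = sns.boxplot(data=iris, x="species", y="sepal width (cm)")
ax.set_title("Sepal width by species")
```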

✔️✔️NOTE: Before outlier detection and treatment it’s better to normalize the data.

Pie chart

Represents the fraction of unique values in a categorical field.

We will now use this to determine whether a data set is balanced and whether a feature is quasi-constant.

🚀Practical time

Case 1

Let’s visualize the target feature of the Iris data set to see if the data set is balanced or not.

So, what is a balanced data set?
A balanced data set is one that contains an equal or almost equal number of samples from each class (here setosa, versicolor, virginica).

If the data set is not balanced, the final predictions might favor a particular class (species of flower).

Here we can see the data set is perfectly balanced, as the target variable is equally distributed across the classes.

Case 2

We can also use a pie chart to see how a feature is divided among its categories, and so check whether that feature is quasi-constant. If it is, we can drop it to reduce dimensionality, as it will contribute negligibly to the prediction.

Quasi-constant feature: A feature that has the same value across most of the observations. The threshold is generally set at 98–99% identical values.

For that, let’s build a dummy data set with a single feature.

We can see here that one value, F, covers more than 99% of the feature. So the feature is quasi-constant and can be dropped.
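A minimal sketch of this check, assuming pandas and matplotlib. The feature below is a hypothetical dummy column (995 "F" values and 5 "M" values), not the one used for the chart in this post, and the 0.99 threshold is just one value from the range mentioned above.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical single-feature data set: 995 "F" values and 5 "M" values.
feature = pd.Series(["F"] * 995 + ["M"] * 5, name="category")

# Fraction of rows taken by each unique value, largest first.
shares = feature.value_counts(normalize=True)

fig, ax = plt.subplots()
ax.pie(shares.values, labels=shares.index, autopct="%1.1f%%")

THRESHOLD = 0.99  # assumed cut-off for "quasi-constant"
is_quasi_constant = bool(shares.iloc[0] >= THRESHOLD)
print(shares, is_quasi_constant)
```

The same value_counts(normalize=True) call on a target column gives the class shares needed to judge whether a data set is balanced.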

Scatter Plot

This plots the y values with respective x values on a 2-D plot. This helps in tracking the variation or spread of one feature, compared to another.

🚀Practical time

(reducing number of features of a data set)

Let’s visualize the scatter plot with Iris data set.

There are actually 4 features/columns on which we classify a flower into a particular species, but let’s see if we can get a significant separation between the species using any two features. If so, we can use only 2 features/columns and still get good enough performance from classification or clustering algorithms.

So, we are not able to get a significant separation between the flower species based on sepal_length and sepal_width.

We can see the separation by cluster formation between the species by plotting (sepal length vs. petal length) and (petal length vs. petal width) etc.

So, we can take these combinations of features in classification as well as clustering algorithms which will result in fairly good prediction. This helps in reducing the number of features and hence model complexity.
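The comparison between feature pairs could be drawn as below, assuming matplotlib and scikit-learn are available. The column names are those used by scikit-learn's copy of Iris, which may differ from the names in the data set used for the plots in this post.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
iris = bunch.frame

# Two feature pairs to compare side by side.
pairs = [("sepal length (cm)", "sepal width (cm)"),
         ("petal length (cm)", "petal width (cm)")]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (x_col, y_col) in zip(axes, pairs):
    # One scatter call per species so that each gets its own color and label.
    for code, name in enumerate(bunch.target_names):
        subset = iris[iris["target"] == code]
        ax.scatter(subset[x_col], subset[y_col], label=name, s=15)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.legend()
```

Pairs in which the three colors form distinct clusters are the candidates for a reduced feature set.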

Heat map

The heat map is a way of representing the data in a 2-dimensional form. The data values are represented as colors in the graph. The goal of the heat map is to provide a colored visual summary of information.

It is a good way to visualize matrix data. One of its popular uses is to visualize a correlation matrix. When there are a lot of matrix elements, it provides an easy way of comparing them by associating colors with their magnitudes.

Correlation measures the relationship between quantitative variables (numerical fields). If the correlation between two variables is positive, then as the value of one increases, so does the value of the other. If it is negative, then one increases as the other decreases, and vice versa. The higher the magnitude of the correlation, the stronger the relation. The correlation value ranges from -1 to 1.

🚀Practical time

(reducing number of features of a data set)

Let’s make the correlation matrix of Iris data set features and plot a heat map.

Here we can see, for example, that petal length and petal width have a correlation of 0.96, i.e. one increases as the other increases and decreases as the other decreases.

But reading and interpreting these decimal numbers can be a bit tedious. So let’s add some colors to it.

Here the brighter blocks show positive correlation while darker blocks show low or negative correlation.
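A sketch of the correlation matrix and its heat map, assuming seaborn and scikit-learn are installed. The "viridis" color map is a choice made here because it runs from dark (low) to bright (high); the post does not say which color map was used.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook
import seaborn as sns
from sklearn.datasets import load_iris

# The four numeric Iris features as a DataFrame (no target column).
features = load_iris(as_frame=True).data

# Pairwise Pearson correlation between the features.
corr = features.corr()

# annot=True writes the correlation value inside each colored cell.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="viridis", vmin=-1, vmax=1)
ax.set_title("Iris feature correlations")
```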

✔️✔️NOTE: Along with the color of heat map also keep an eye on correlation value associated with it.

But how do we use this correlation in data science?

  • If two variables/columns are highly correlated, with a magnitude of nearly 1, they contribute more or less the same information to the model. So keeping only one of them helps in reducing the number of features/columns/dimensions of the data set.
  • Features/columns that are more correlated with the target value (in magnitude) contribute more towards deciding the target value. So we should keep those features in our data set.
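The first point can be sketched as follows, assuming NumPy and pandas. The columns a, b and c are synthetic and the 0.95 cut-off is an arbitrary choice for illustration, not a rule from this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                      # unrelated to "a"
})

# Absolute correlations, since -0.98 is as redundant as +0.98.
corr = df.corr().abs()

# Keep only the upper triangle so that each pair is checked once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

THRESHOLD = 0.95  # assumed cut-off for "very much correlated"
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]

reduced = df.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

For the second point, the same corr() call on a frame that includes a numeric target column lets us rank features by the magnitude of their correlation with the target.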

We will discuss more about visualization and statistics in upcoming blogs. Stay tuned.

Feel free to post your questions and feedback. Happy Data Science!! 😇
