Box Plots Using Python’s Seaborn Library: A Powerful Visualization for EDA
“The box what?” some folks might say. If you don’t use statistics in your everyday job, or it has been a while since that introductory statistics class you took way back when, box plots may be difficult to comprehend at first. The box plot was introduced in the 1970s by statistician John Tukey and has since become a mainstay of exploratory data analysis (EDA). In this post, I will walk you through what a box plot is and how to draw one using Python’s seaborn library in a Jupyter Notebook.
First, let’s import seaborn, which will be our visualization library, and the random module, which lets us ask the computer for a list of random numbers within a range. Here, random.sample draws 10 unique integers between 1 and 1000.
import random
import seaborn as sns
random_numbers = random.sample(range(1, 1001), 10)
Now let’s generate a box plot with a single line of seaborn code:
sns.boxplot(x=random_numbers)
Voilà! There is our box plot. As you can see, there is a lot of information packed into this one graphic, and that ability to fit so much into a single visualization is what makes the box plot such a powerful tool. Now let’s break the box plot down into its parts.
As you can see from the graphic, the 25th percentile (first quartile, Q1), the 50th percentile (median, Q2), and the 75th percentile (third quartile, Q3) are labeled along with the range. Each percentile marks a distinct point in the data. The first quartile is the value below which 25% of the observations fall; you can think of it as the midpoint of the lower half of the data, between the minimum and the median. The median, the second quartile, and the 50th percentile all mean the same thing: the middle of the data set. Finally, the third quartile is the value below which 75% of the observations fall, the midpoint of the upper half between the median and the maximum.
The box itself spans Q1 to Q3, and its length is the interquartile range (IQR = Q3 - Q1), which covers the middle 50% of the data. When the minimum or maximum is extreme compared with the rest of the values, the IQR may describe the spread of the data better than the range, because it ignores those extremes. Together these values give us valuable insight into what our data looks like at first glance. Now that we have a good understanding of what a box plot represents, let’s look at a few ways to customize box plots using seaborn to fit our needs.
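Before moving on, the quartiles, IQR, and range described above can be checked by hand. The sketch below is only an illustration: the values are made up, the helper name summarize is hypothetical, and it uses Python’s built-in statistics module with the "inclusive" method, which is one of several quartile conventions and may differ slightly from the numbers a plotting library reports.

```python
import statistics

# Hypothetical values, chosen only to illustrate the arithmetic
data = [12, 15, 18, 21, 24, 27, 30, 33, 36, 39]
# Same values, but with the maximum replaced by an extreme value
with_extreme = data[:-1] + [900]


def summarize(values):
    """Return Q1, median, Q3, IQR, and range for a list of numbers."""
    # method="inclusive" interpolates linearly between the data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }


original = summarize(data)
stretched = summarize(with_extreme)
print("original: ", original)
print("stretched:", stretched)
```

Comparing the two printed summaries, the range grows enormously once the extreme value is added, while the IQR does not move at all, which is why the IQR is often the more trustworthy description of spread.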
Change box plot color, orientation, and background:
sns.set_style("darkgrid")
sns.boxplot(y=random_numbers, color="tomato")
sns.set_style("whitegrid")
sns.boxplot(y=random_numbers, color="tomato")
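To see what the orientation change does, here is a minimal sketch that draws the same numbers horizontally and vertically in one figure. It assumes matplotlib is available (seaborn depends on it); the fixed seed, the figure size, and the axis variable names are my own choices for illustration.

```python
import random

import matplotlib.pyplot as plt
import seaborn as sns

random.seed(0)  # assumption: fixed seed so the figure is reproducible
random_numbers = random.sample(range(1, 1001), 10)

sns.set_style("whitegrid")  # set the style before the axes are created

fig, (ax_horizontal, ax_vertical) = plt.subplots(1, 2, figsize=(10, 4))

# Passing the data as x draws a horizontal box
sns.boxplot(x=random_numbers, color="tomato", ax=ax_horizontal)
ax_horizontal.set_title("x=random_numbers")

# Passing the same data as y draws a vertical box
sns.boxplot(y=random_numbers, color="tomato", ax=ax_vertical)
ax_vertical.set_title("y=random_numbers")

fig.tight_layout()
```

In a Jupyter Notebook the figure appears below the cell; in a plain script you would add plt.show() at the end.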
In order to plot multiple box plots that represent various columns in a data set, we will need a different data set like the commonly used iris data set.
iris = sns.load_dataset("iris")
Now that we have a data set with multiple columns, let’s look at how we can visualize several box plots at once. The notch parameter draws a notch in each box at the median, which is useful for drawing the eye to it, while seaborn’s set_context function scales the graphic for a particular use: paper, notebook, talk, or poster.
sns.set_style("whitegrid")
sns.set_context("talk")
sns.boxplot(x=iris["species"], y=iris["petal_width"], notch=True)
Now that you know what a box plot is and how to tailor the visualization to your particular needs, have a go on your own data set and have fun changing the parameters to suit your needs.