In the era of big data and artificial intelligence, data science and machine learning have become essential parts of many areas of science and technology. A necessary skill for working with data is the ability to describe, summarize, and represent it visually. Python's statistics libraries are comprehensive, popular, and widely used tools that make working with data much easier. This guide covers the following topics:
- What numerical quantities you can use to describe and summarize your datasets
- How to calculate descriptive statistics in pure Python
- How to get descriptive statistics with available Python libraries
- How to visualize your datasets
Understanding Descriptive Statistics
Descriptive statistics is about describing and summarizing data. It uses two main approaches:
- The quantitative approach describes and summarizes data numerically.
- The visual approach illustrates data with charts, plots, histograms, and other graphs.
You can apply descriptive statistics to one or many datasets or variables. When you describe and summarize a single variable, you're performing univariate analysis. When you search for statistical relationships between a pair of variables, you're doing a bivariate analysis. Similarly, a multivariate analysis considers multiple variables at once.
Types of Measures
In this tutorial, you'll learn about the following types of measures in descriptive statistics:
- Central tendency tells you about the centers of the data. Useful measures include the mean, median, and mode.
- Variability tells you about the spread of the data. Useful measures include variance and standard deviation.
- Correlation or joint variability tells you about the relation between a pair of variables in a dataset. Useful measures include covariance and the correlation coefficient.
You'll learn how to understand and calculate each of these measures with Python.
Population and Samples
In statistics, the population is the set of all elements or items that you're interested in. Populations are often vast, which makes them inappropriate for collecting and analyzing data in full. That's why statisticians usually try to make conclusions about a population by choosing and examining a representative subset of that population.
This subset of a population is called a sample. Ideally, the sample should preserve the essential statistical features of the population to a satisfactory extent. That way, you can use the sample to draw conclusions about the population.
Outliers
An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers, but here are a few to start you off:
- Natural variation in data
- Change in the behavior of the observed system
- Errors in data collection
Data collection errors are a particularly prominent cause of outliers. For example, the limitations of measurement instruments or procedures can mean that the correct data is simply not obtainable. Other errors can be caused by miscalculations, data contamination, human error, and more.
There isn't a precise mathematical definition of outliers. You have to rely on experience, knowledge of the subject of interest, and common sense to determine whether a data point is an outlier and how to handle it.
Choosing Python Statistics Libraries
There are many Python statistics libraries out there for you to work with, but in this tutorial, you'll learn about some of the most popular and widely used ones:
- Python's statistics is a built-in Python library for descriptive statistics. You can use it if your datasets are not too large or if you can't rely on importing other libraries.
- NumPy is a third-party library for numerical computing, optimized for working with single- and multi-dimensional arrays. Its primary type is the array type called ndarray. This library contains many routines for statistical analysis.
- SciPy is a third-party library for scientific computing based on NumPy. It offers additional functionality compared to NumPy, including scipy.stats for statistical analysis.
- pandas is a third-party library for numerical computing based on NumPy. It excels in handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.
- Matplotlib is a third-party library for data visualization. It works well in combination with NumPy, SciPy, and pandas.
Note that, in many cases, Series and DataFrame objects can be used in place of NumPy arrays. Often, you can just pass them to a NumPy or SciPy statistical function. In addition, you can get the unlabeled data from a Series or DataFrame as a np.ndarray object by calling .values or .to_numpy().
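For example, here's a minimal sketch of that conversion; the Series s is just a made-up example, while .values and .to_numpy() are the standard pandas attribute and method:
>>>
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1.0, 2.5, 4.0])
>>> s.values  # unlabeled data as a NumPy array
array([1. , 2.5, 4. ])
>>> s.to_numpy()  # equivalent, newer spelling
array([1. , 2.5, 4. ])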
Getting Started With Python Statistics Libraries
The built-in Python statistics library has a relatively small number of the most important statistics functions. The official documentation is a valuable resource to find the details. If you're limited to pure Python, then the Python statistics library might be the right choice.
A good place to start learning about NumPy is the official User Guide, especially the quickstart and basics sections. The official reference can help you refresh your memory on specific NumPy concepts. While you read this tutorial, you might also want to check out the statistics section and the official scipy.stats reference.
Note: If you want to learn more about NumPy, then check out these resources:
- Look Ma, No For-Loops: Array Programming With NumPy
- Pythonic Data Cleaning With pandas and NumPy
- NumPy arange(): How to Use np.arange()
If you want to learn pandas, then the official Getting Started page is an excellent place to begin. The introduction to data structures can help you learn about its fundamental data types, Series and DataFrame. Likewise, the excellent official introductory tutorial aims to give you enough information to start effectively using pandas in practice.
Note: If you want to learn more about pandas, then check out these resources:
- Using pandas and Python to Explore Your Dataset
- pandas DataFrames 101
- Idiomatic pandas: Tricks & Features You May Not Know
- Fast, Flexible, Easy and Intuitive: How to Speed Up Your pandas Projects
matplotlib has a comprehensive official User's Guide that you can use to dive into the details of using the library. Anatomy of Matplotlib is an excellent resource for beginners who want to start working with matplotlib and its related libraries.
Note: If you want to learn more about data visualization, then check out these resources:
- Python Plotting With Matplotlib (Guide)
- Python Histogram Plotting: NumPy, Matplotlib, pandas & Seaborn
- Interactive Data Visualization in Python With Bokeh
- Plot With pandas: Python Data Visualization for Beginners
Let's start using these Python statistics libraries!
Calculating Descriptive Statistics
The first step is to import all the packages you'll need:
>>>
>>> import math
>>> import statistics
>>> import numpy as np
>>> import scipy.stats
>>> import pandas as pd
These are all the packages you'll need for statistics calculations in Python. You usually won't use Python's built-in math module much, but it'll be useful in this tutorial. Later, you'll import matplotlib.pyplot for data visualization.
Let's create some data to work with. You'll start with Python lists that contain some arbitrary numeric data:
>>>
>>> x = [8.0, 1, 2.5, 4, 28.0]
>>> x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
>>> x
[8.0, 1, 2.5, 4, 28.0]
>>> x_with_nan
[8.0, 1, 2.5, nan, 4, 28.0]
Now you have the lists x and x_with_nan. They're almost the same, with the difference that x_with_nan contains a nan value. It's important to understand the behavior of the Python statistics routines when they come across a not-a-number value (nan). In data science, missing values are common, and you'll often replace them with nan.
Note: How do you get a nan value? In Python, you can use float('nan'), math.nan, or np.nan, and you can check for nan with math.isnan() or np.isnan(). You can use all of these interchangeably:
>>>
>>> math.isnan(np.nan), np.isnan(math.nan)
(True, True)
>>> math.isnan(x_with_nan[3]), np.isnan(x_with_nan[3])
(True, True)
As you can see, the functions are all equivalent. However, if you compare two nan values for equality, the result is False. In other words, math.nan == math.nan is False!
Now, create np.ndarray and pd.Series objects that correspond to x and x_with_nan:
>>>
>>> y, y_with_nan = np.array(x), np.array(x_with_nan)
>>> z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
>>> y
array([ 8. , 1. , 2.5, 4. , 28. ])
>>> y_with_nan
array([ 8. , 1. , 2.5, nan, 4. , 28. ])
>>> z
0 8.0
1 1.0
2 2.5
3 4.0
4 28.0
dtype: float64
>>> z_with_nan
0 8.0
1 1.0
2 2.5
3 NaN
4 4.0
5 28.0
dtype: float64
You now have two NumPy arrays (y and y_with_nan) and two pandas Series (z and z_with_nan). All of these are one-dimensional sequences of values.
Note: Although this tutorial uses lists, keep in mind that, in most cases, tuples can be used in the same way.
You can optionally specify a label for each value in z and z_with_nan.
Measures of Central Tendency
The measures of central tendency show the central or middle values of datasets. There are several definitions of what's considered to be the center of a dataset. In this tutorial, you'll learn how to identify and calculate these measures of central tendency:
- Mean
- Weighted mean
- Geometric mean
- Harmonic mean
- Median
- Mode
Mean
The sample mean, also called the sample arithmetic mean or simply the average, is the arithmetic average of all the items in a dataset. The mean of a dataset x is mathematically expressed as Σᵢxᵢ / n, where i = 1, 2, …, n. In other words, it's the sum of all the elements of x divided by the number of items in the dataset.
The figure below illustrates the mean of a sample with five data points:
The green dots represent the data points 1, 2.5, 4, 8, and 28. The dashed red line is their mean, or (1 + 2.5 + 4 + 8 + 28) / 5 = 8.7.
You can calculate the mean with pure Python using sum() and len(), without importing libraries:
>>>
>>> mean_ = sum(x) / len(x)
>>> mean_
8.7
Although this solution is clean and elegant, you can also apply built-in Python statistics functions:
>>>
>>> mean_ = statistics.mean(x)
>>> mean_
8.7
>>> mean_ = statistics.fmean(x)
>>> mean_
8.7
You've called the functions mean() and fmean() from the built-in Python statistics library and got the same result as with pure Python. fmean() was introduced in Python 3.8 as a faster alternative to mean(). It always returns a floating-point number.
However, if there are nan values among your data, then statistics.mean() and statistics.fmean() will return nan as the output:
>>>
>>> mean_ = statistics.mean(x_with_nan)
>>> mean_
nan
>>> mean_ = statistics.fmean(x_with_nan)
>>> mean_
nan
This result is consistent with the behavior of sum(), since sum(x_with_nan) also returns nan.
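If you'd rather skip the missing values at this stage, one plain-Python option (just a sketch, not a function from the statistics module) is to filter them out before calling mean():
>>>
>>> clean = [item for item in x_with_nan if not math.isnan(item)]
>>> clean
[8.0, 1, 2.5, 4, 28.0]
>>> statistics.mean(clean)
8.7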
If you use NumPy, then you can get the mean with np.mean():
>>>
>>> mean_ = np.mean(y)
>>> mean_
8.7
In the example above, mean() is a function, but you can use the corresponding method .mean() as well:
>>>
>>> mean_ = y.mean()
>>> mean_
8.7
The function mean() and the method .mean() from NumPy return the same result as statistics.mean(). This is also the case when there are nan values among your data:
>>>
>>> np.mean(y_with_nan)
nan
>>> y_with_nan.mean()
nan
You often don't need to get a nan value as a result. If you prefer to ignore nan values, then you can use np.nanmean():
>>>
>>> np.nanmean(y_with_nan)
8.7
nanmean() simply ignores all nan values. It returns the same value as mean() would if you applied it to the dataset without the nan values.
pd.Series objects also have the method .mean():
>>>
>>> mean_ = z.mean()
>>> mean_
8.7
As you can see, it's used similarly to the case of NumPy. However, .mean() from pandas ignores nan values by default:
>>>
>>> z_with_nan.mean()
8.7
This behavior is the result of the default value of the optional parameter skipna. You can change this parameter to modify the behavior.
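For example, here's a quick sketch with skipna disabled; skipna is the documented pandas parameter, and turning it off lets the nan value propagate into the result:
>>>
>>> z_with_nan.mean(skipna=False)
nan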
Weighted Mean
The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that enables you to define the relative contribution of each data point to the result.
You define one weight wᵢ for each data point xᵢ of the dataset x, where i = 1, 2, …, n and n is the number of items in x. Then you multiply each data point by the corresponding weight, sum all the products, and divide the obtained sum by the sum of the weights: Σᵢ(wᵢxᵢ) / Σᵢwᵢ.
Note: It's convenient, and usually the case, that all weights are nonnegative, wᵢ ≥ 0, and that their sum is equal to one: Σᵢwᵢ = 1.
The weighted mean is very handy when you need the mean of a dataset containing items that occur with given relative frequencies. For example, say that you have a set in which 20% of all items are equal to 2, 50% of the items are equal to 4, and the remaining 30% of the items are equal to 8. You can calculate the mean of such a set like this:
>>>
>>> 0.2 * 2 + 0.5 * 4 + 0.3 * 8
4.8
Here, you take the frequencies into account with the weights. With this method, you don't need to know the total number of items.
You can implement the weighted mean in pure Python by combining sum() with either range() or zip():
>>>
>>> x = [8.0, 1, 2.5, 4, 28.0]
>>> w = [0.1, 0.2, 0.3, 0.25, 0.15]
>>> wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
>>> wmean
6.95
>>> wmean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
>>> wmean
6.95
Again, this implementation is clean and elegant, and you don't need to import any libraries. However, if you work with large datasets, then NumPy is likely to provide a better solution. You can use np.average() to get the weighted mean of NumPy arrays or pandas Series:
>>>
>>> y, z, w = np.array(x), pd.Series(x), np.array(w)
>>> wmean = np.average(y, weights=w)
>>> wmean
6.95
>>> wmean = np.average(z, weights=w)
>>> wmean
6.95
The result is the same as in the case of the pure Python implementation. You can also use this method with ordinary lists and tuples.
Another solution is to use the element-wise product w * y with np.sum() or .sum():
>>>
>>> (w * y).sum() / w.sum()
6.95
That's it! You've calculated the weighted mean.
However, be careful if your dataset contains nan values:
>>>
>>> w = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])
>>> (w * y_with_nan).sum() / w.sum()
nan
>>> np.average(y_with_nan, weights=w)
nan
>>> np.average(z_with_nan, weights=w)
nan
In this case, average() returns nan, which is consistent with np.mean().
Harmonic Mean
The harmonic mean is the reciprocal of the mean of the reciprocals of all items in the dataset: n / Σᵢ(1/xᵢ), where i = 1, 2, …, n and n is the number of items in the dataset x. One variant of the pure Python implementation of the harmonic mean looks like this:
>>>
>>> hmean = len(x) / sum(1 / item for item in x)
>>> hmean
2.7613412228796843
This is quite different from the value of the arithmetic mean for the same data x, which you calculated to be 8.7.
You can also calculate this measure with statistics.harmonic_mean():
>>>
>>> hmean = statistics.harmonic_mean(x)
>>> hmean
2.7613412228796843
The example above shows one use of statistics.harmonic_mean(). If you have a nan value in a dataset, then it'll return nan. If there's at least one 0, then it'll return 0. If you provide at least one negative number, then you'll get statistics.StatisticsError:
>>>
>>> statistics.harmonic_mean(x_with_nan)
nan
>>> statistics.harmonic_mean([1, 0, 2])
0
>>> statistics.harmonic_mean([1, 2, -2]) # Raises StatisticsError
Keep these three scenarios in mind when you're using this method!
A third way to calculate the harmonic mean is to use scipy.stats.hmean():
>>>
>>> scipy.stats.hmean(y)
2.7613412228796843
>>> scipy.stats.hmean(z)
2.7613412228796843
Again, this is a pretty straightforward implementation. However, if your dataset contains nan, 0, a negative number, or anything but positive numbers, then you'll get a ValueError!
Geometric Mean
The geometric mean is the n-th root of the product of all n elements xᵢ in a dataset: ⁿ√(Πᵢxᵢ), where i = 1, 2, …, n. The following figure illustrates the arithmetic, harmonic, and geometric means of a dataset:
Again, the green dots represent the data points 1, 2.5, 4, 8, and 28. The dashed red line is the mean. The blue dashed line is the harmonic mean, and the yellow dashed line is the geometric mean.
You can implement the geometric mean in pure Python like this:
>>>
>>> gmean = 1
>>> for item in x:
... gmean *= item
...
>>> gmean **= 1 / len(x)
>>> gmean
4.677885674856041
As you can see, the value of the geometric mean in this case differs noticeably from the values of the arithmetic mean (8.7) and the harmonic mean (2.76) for the same dataset x.
Python 3.8 introduced statistics.geometric_mean(), which converts all values to floating-point numbers and returns their geometric mean:
>>>
>>> gmean = statistics.geometric_mean(x)
>>> gmean
4.67788567485604
You've got the same result as in the previous example, but with a minimal rounding error.
If you pass data with nan values, then statistics.geometric_mean() will behave like most similar functions and return nan:
>>>
>>> gmean = statistics.geometric_mean(x_with_nan)
>>> gmean
nan
Indeed, this is consistent with the behavior of statistics.mean(), statistics.fmean(), and statistics.harmonic_mean(). If there's a zero or negative number among your data, then statistics.geometric_mean() will raise statistics.StatisticsError.
You can also get the geometric mean with scipy.stats.gmean():
>>>
>>> scipy.stats.gmean(y)
4.67788567485604
>>> scipy.stats.gmean(z)
4.67788567485604
You obtained the same result as with the pure Python implementation.
If you have nan values in a dataset, then gmean() will return nan. If there's at least one 0, then it'll return 0.0 and give a warning. If you provide at least one negative number, then you'll get nan and a warning.
Median
The sample median is the middle element of a sorted dataset. The dataset can be sorted in increasing or decreasing order. If the number of elements n in the dataset is odd, then the median is the value at the middle position. If n is even, then the median is the arithmetic mean of the two values in the middle, that is, the items at the positions 0.5n and 0.5n + 1.
For example, if you have the data points 2, 4, 1, 8, and 9, then the median value is 4, which is in the middle of the sorted dataset (1, 2, 4, 8, 9). If the data points are 2, 4, 1, and 8, then the median is 3, which is the average of the two middle elements of the sorted sequence (2 and 4). The following figure illustrates this:
The data points are the green dots, and the purple lines show the median for each dataset. The median value of the upper dataset (1, 2.5, 4, 8, and 28) is 4. If you remove the outlier 28 from the lower dataset, then the median becomes the arithmetic average between 2.5 and 4, which is 3.25.
The following diagram depicts the mean as well as the median for the data points 1, 2.5, 4, 8, and 28:
To reiterate, the mean is represented by the dashed red line, while the median is represented by the solid purple line.
The main difference between the behavior of the mean and the median is related to dataset outliers or extremes. The mean is heavily affected by outliers, but the median only depends on outliers either slightly or not at all. Consider the following figure:
The upper dataset again has the items 1, 2.5, 4, 8, and 28. Its mean is 8.7, and the median is 4, as you saw earlier. The lower dataset shows what happens when you move the rightmost point with the value 28:
- If you increase its value (move it to the right), then the mean will rise, but the median value won't ever change.
- If you decrease its value (move it to the left), then the mean will drop, but the median will remain the same until the value of the moving point is greater than or equal to 4, as the quick check after this list shows.
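Here's one quick way to confirm this in the interpreter, reusing statistics.mean() and statistics.median(); the replacement value 280.0 is an arbitrary choice for illustration:
>>>
>>> statistics.mean([1, 2.5, 4, 8.0, 28.0])
8.7
>>> statistics.median([1, 2.5, 4, 8.0, 28.0])
4
>>> statistics.mean([1, 2.5, 4, 8.0, 280.0])  # move the rightmost point far to the right
59.1
>>> statistics.median([1, 2.5, 4, 8.0, 280.0])  # the median doesn't change
4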
Comparing the mean and the median is one way to detect outliers and asymmetry in your data. Whether the mean or the median is more useful to you depends on the context of your particular problem.
Here is one of many possible pure Python implementations of the median:
>>>
>>> n = len(x)
>>> if n % 2:
... median_ = sorted(x)[round(0.5*(n-1))]
... else:
... x_ord, index = sorted(x), round(0.5 * n)
... median_ = 0.5 * (x_ord[index-1] + x_ord[index])
...
>>> median_
4
The two most important steps of this implementation are as follows:
- Sorting the elements of the dataset
- Finding the middle element(s) in the sorted dataset
You can get the median with statistics.median():
>>>
>>> median_ = statistics.median(x)
>>> median_
4
>>> median_ = statistics.median(x[:-1])
>>> median_
3.25
The sorted version of x is [1, 2.5, 4, 8.0, 28.0], so the element in the middle is 4. The sorted version of x[:-1], which is x without the last item 28.0, is [1, 2.5, 4, 8.0]. Now there are two middle elements, 2.5 and 4. Their average is 3.25.
median_low() and median_high() are two more functions related to the median in the Python statistics library. They always return an element from the dataset:
- If the number of elements is odd, then there's a single middle value, so these functions behave just like median().
- If the number of elements is even, then there are two middle values. In this case, median_low() returns the lower and median_high() the higher middle value.
You can use these functions just as you'd use median():
>>>
>>> statistics.median_low(x[:-1])
2.5
>>> statistics.median_high(x[:-1])
4
Again, the sorted version of x[:-1] is [1, 2.5, 4, 8.0]. The two elements in the middle are 2.5 (low) and 4 (high).
Unlike most other functions from the Python statistics library, median(), median_low(), and median_high() don't return nan when there are nan values among the data points:
>>>
>>> statistics.median(x_with_nan)
6.0
>>> statistics.median_low(x_with_nan)
4
>>> statistics.median_high(x_with_nan)
8.0
Be careful with this behavior because it might not give you the results you want!
You can also get the median with np.median():
>>>
>>> median_ = np.median(y)
>>> median_
4.0
>>> median_ = np.median(y[:-1])
>>> median_
3.25
You've obtained the same values with statistics.median() and np.median().
However, if there's a nan value in your dataset, then np.median() issues a RuntimeWarning and returns nan. If this behavior is not what you want, then you can use nanmedian() to ignore all nan values:
>>>
>>> np.nanmedian(y_with_nan)
4.0
>>> np.nanmedian(y_with_nan[:-1])
3.25
The obtained results are the same as with statistics.median() and np.median() applied to the datasets x and y.
pandas Series objects have the method .median() that ignores nan values by default:
>>>
>>> z.median()
4.0
>>> z_with_nan.median()
4.0
The behavior of .median() is consistent with .mean() in pandas. You can change this behavior with the optional parameter skipna.
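As a small sketch of that option (skipna is the documented pandas parameter, applied to the Series defined above), disabling it makes the nan value propagate again:
>>>
>>> z_with_nan.median(skipna=False)
nan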
Mode
The sample mode is the value in the dataset that occurs most frequently. If there isn't a single such value, then the set is multimodal since it has multiple modal values. For example, in the set that contains the points 2, 3, 2, 8, and 12, the number 2 is the mode because it occurs twice, unlike the other items that occur only once.
This is how you can get the mode with pure Python:
>>>
>>> u = [2, 3, 2, 8, 12]
>>> mode_ = max((u.count(item), item) for item in set(u))[1]
>>> mode_
2
You use u.count() to get the number of occurrences of each item in u. The item with the maximal number of occurrences is the mode. Note that you don't have to use set(u). Instead, you might replace it with just u and iterate over the entire list.
Note: set(u) returns a Python set with all unique items in u. You can use this trick to optimize your work with larger data, especially when you expect to see a lot of duplicates.
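If you like, the standard library's collections.Counter expresses the same pure-Python idea more directly; this is only an alternative sketch and isn't used elsewhere in this guide:
>>>
>>> from collections import Counter
>>> counter = Counter(u)
>>> counter.most_common(1)  # list of (value, count) pairs, most frequent first
[(2, 2)]
>>> counter.most_common(1)[0][0]  # the modal value itself
2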
You can obtain the mode with statistics.mode() and statistics.multimode():
>>>
>>> mode_ = statistics.mode(u)
>>> mode_
2
>>> mode_ = statistics.multimode(u)
>>> mode_
[2]
As you can see, mode() returned a single value, while multimode() returned the list that contains the result. This isn't the only difference between the two functions, though. If there's more than one modal value, then mode() raises StatisticsError, while multimode() returns the list with all the modes:
>>>
>>> v = [12, 15, 12, 15, 21, 15, 12]
>>> statistics.mode(v) # Raises StatisticsError
>>> statistics.multimode(v)
[12, 15]
You should pay special attention to this scenario and be careful when you're choosing between these two functions.
statistics.mode() and statistics.multimode() handle nan values as regular values and can return nan as the modal value:
>>>
>>> statistics.mode([2, math.nan, 2])
2
>>> statistics.multimode([2, math.nan, 2])
[2]
>>> statistics.mode([2, math.nan, 0, math.nan, 5])
nan
>>> statistics.multimode([2, math.nan, 0, math.nan, 5])
[nan]
In the first example above, the number 2 occurs twice and is the modal value. In the second example, nan is the modal value since it occurs twice, while the other values occur only once.
Note: statistics.multimode() was introduced in Python 3.8.
You can also get the mode with scipy.stats.mode():
>>>
>>> u, v = np.array(u), np.array(v)
>>> mode_ = scipy.stats.mode(u)
>>> mode_
ModeResult(mode=array([2]), count=array([2]))
>>> mode_ = scipy.stats.mode(v)
>>> mode_
ModeResult(mode=array([12]), count=array([3]))
This function returns the object with the modal value and the number of times it occurs. If there are multiple modal values in the dataset, then only the smallest value is returned.
You can get the mode and its number of occurrences as NumPy arrays with dot notation:
>>>
>>> mode_.mode
array([12])
>>> mode_.count
array([3])
This code uses .mode to return the smallest mode (12) in the array v and .count to return the number of times it occurs (3). scipy.stats.mode() is also flexible with nan values. It lets you define the desired behavior with the optional parameter nan_policy, which can take the value 'propagate', 'raise' (an error), or 'omit'.
pandas Series objects have the method .mode() that handles multimodal values well and ignores nan values by default:
>>>
>>> u, v, w = pd.Series(u), pd.Series(v), pd.Series([2, 2, math.nan])
>>> u.mode()
0 2
dtype: int64
>>> v.mode()
0 12
1 15
dtype: int64
>>> w.mode()
0 2.0
dtype: float64
As you can see, .mode() returns a new pd.Series that holds all modal values. If you want .mode() to take nan values into account, then just pass the optional argument dropna=False.
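As a brief sketch of that argument, consider a made-up Series in which nan is the most frequent value; dropna is the documented pandas parameter:
>>>
>>> nan_heavy = pd.Series([2, math.nan, math.nan])
>>> nan_heavy.mode()  # nan values are ignored by default
0    2.0
dtype: float64
>>> nan_heavy.mode(dropna=False)  # nan is now counted and is the most frequent value
0   NaN
dtype: float64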
Measures of Variability
The metrics of central tendency are insufficient to provide an adequate description of the data. You are also going to require the measures of variability, which characterize the range of the data points. You are going to learn how to recognize and compute the following variability as you go through this part.
measures:
- Variance
- Standard deviation
- Skewness
- Percentiles
- Ranges
Variance
The sample variance quantifies the spread of the data. It shows numerically how far the data points are from the mean. You can express the sample variance of the dataset x with n elements mathematically as s² = Σᵢ(xᵢ − mean(x))² / (n − 1), where i = 1, 2, …, n and mean(x) is the sample mean of x. If you want to understand in depth why you divide the sum by n − 1 instead of n, then you can look into Bessel's correction.
The following figure shows why it's important to consider the variance when describing datasets:
There are two datasets in this figure:
- Green dots: This dataset has a smaller variance or a smaller average difference from the mean. It also has a smaller range or a smaller difference between the largest and smallest item.
- White dots: This dataset has a larger variance or a larger average difference from the mean. It also has a bigger range or a bigger difference between the largest and smallest item.
Note that these two datasets have the same mean and median, even though they appear to differ significantly. Neither the mean nor the median can describe this difference. That's why you need the measures of variability.
Here's how you can calculate the sample variance with pure Python:
>>>
>>> n = len(x)
>>> mean_ = sum(x) / n
>>> var_ = sum((item - mean_)**2 for item in x) / (n - 1)
>>> var_
123.19999999999999
This approach is sufficient and calculates the sample variance correctly. However, the shorter and more elegant solution is to call the existing function statistics.variance():
>>>
>>> var_ = statistics.variance(x)
>>> var_
123.2
You've obtained the same result for the variance as before. variance() can avoid calculating the mean if you provide the mean explicitly as the second argument: statistics.variance(x, mean_).
If you have nan values among your data, then statistics.variance() will return nan:
>>>
>>> statistics.variance(x_with_nan)
nan
This behavior is consistent with mean() and most other functions from the Python statistics library.
You can also calculate the sample variance with NumPy. You should use the function np.var() or the corresponding method .var():
>>>
>>> var_ = np.var(y, ddof=1)
>>> var_
123.19999999999999
>>> var_ = y.var(ddof=1)
>>> var_
123.19999999999999
It's very important to specify the parameter ddof=1. That's how you set the delta degrees of freedom to 1. This parameter allows the proper calculation of s², with (n − 1) in the denominator instead of n.
If you have nan values in the dataset, then np.var() and .var() will return nan:
>>>
>>> np.var(y_with_nan, ddof=1)
nan
>>> y_with_nan.var(ddof=1)
nan
This is consistent with np.mean() and np.average(). If you want to skip nan values, then you should use np.nanvar():
>>>
>>> np.nanvar(y_with_nan, ddof=1)
123.19999999999999
np.nanvar() ignores nan values. In addition to that, it requires that you provide ddof=1.
pd.Series objects have the method .var() that skips nan values by default:
>>>
>>> z.var(ddof=1)
123.19999999999999
>>> z_with_nan.var(ddof=1)
123.19999999999999
It also has the parameter ddof, but its default value is 1, so you can omit it. If you want a different behavior related to nan values, then use the optional parameter skipna.
You calculate the population variance similarly to the sample variance. However, you have to use n in the denominator instead of n − 1: Σᵢ(xᵢ − mean(x))² / n. In this case, n is the number of items in the entire population. You can get the population variance similarly to the sample variance, with the following differences (a short sketch follows the list below):
- Replace (n - 1) with n in the pure Python implementation.
- Use statistics.pvariance() instead of statistics.variance().
- Specify the parameter ddof=0 if you use NumPy or pandas. In NumPy, you can omit ddof because its default value is 0.
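Here's a minimal sketch of the first two options applied to the dataset x from above, whose population variance is 492.8 / 5 = 98.56; the results are rounded so that the last floating-point digits don't get in the way:
>>>
>>> pvar_ = sum((item - mean_)**2 for item in x) / n  # divide by n instead of (n - 1)
>>> round(pvar_, 3)
98.56
>>> round(statistics.pvariance(x), 3)
98.56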
Keep in mind that you should always be aware of whether you're working with a sample or the entire population whenever you're calculating the variance!
Standard Deviation
The sample standard deviation is another measure of data spread. It's connected to the sample variance, since the standard deviation, s, is the positive square root of the sample variance. The standard deviation is often more convenient than the variance because it has the same unit as the data points. Once you get the variance, you can calculate the standard deviation with pure Python:
>>>
>>> std_ = var_ ** 0.5
>>> std_
11.099549540409285
Although this solution works, you can also use statistics.stdev():
>>>
>>> std_ = statistics.stdev(x)
>>> std_
11.099549540409287
Of course, the result is the same as before. Like variance(), stdev() doesn't calculate the mean if you provide it explicitly as the second argument: statistics.stdev(x, mean_).
You can get the standard deviation with NumPy in almost the same way. You can use the function np.std() and the corresponding method .std() to calculate the standard deviation. If there are nan values in the dataset, then they'll return nan. To ignore nan values, you should use np.nanstd(). You use std(), .std(), and nanstd() from NumPy just as you would use var(), .var(), and nanvar():
>>>
>>> np.std(y, ddof=1)
11.099549540409285
>>> y.std(ddof=1)
11.099549540409285
>>> np.std(y_with_nan, ddof=1)
nan
>>> y_with_nan.std(ddof=1)
nan
>>> np.nanstd(y_with_nan, ddof=1)
11.099549540409285
Don't forget to set the delta degrees of freedom to 1!
pd.Series objects also have the method .std() that skips nan values by default:
>>>
>>> z.std(ddof=1)
11.099549540409285
>>> z_with_nan.std(ddof=1)
11.099549540409285
The parameter ddof defaults to 1, so you can omit it. Again, if you want to treat nan values differently, then apply the parameter skipna.
The population standard deviation refers to the entire population. It's the positive square root of the population variance. You can calculate it just like the sample standard deviation, with the following differences (a short sketch follows the list):
- Find the square root of the population variance in the pure Python implementation.
- Use statistics.pstdev() instead of statistics.stdev().
- Specify the parameter ddof=0 if you use NumPy or pandas. In NumPy, you can omit ddof because its default value is 0.
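Continuing with x and y from above, here's a quick sketch of the second and third options; the values are rounded for the same reason as before:
>>>
>>> round(statistics.pstdev(x), 3)  # square root of the population variance
9.928
>>> round(float(np.std(y, ddof=0)), 3)  # ddof=0 is the NumPy default
9.928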
As you can see, you can determine the standard deviation in Python, NumPy, and pandas in almost the same way as you determine the variance. You use different but analogous functions and methods with the same arguments.
Skewness
The sample skewness measures the asymmetry of a data sample.
There are several mathematical definitions of skewness. One common expression to calculate the skewness of the dataset x with n elements is (n² / ((n − 1)(n − 2))) (Σᵢ(xᵢ − mean(x))³ / (n s³)). A simpler expression is Σᵢ(xᵢ − mean(x))³ n / ((n − 1)(n − 2) s³), where i = 1, 2, …, n and mean(x) is the sample mean of x. The skewness defined like this is called the adjusted Fisher-Pearson standardized moment coefficient.
The previous figure showed two datasets that were quite symmetrical. In other words, their points had similar distances from the mean. In contrast, the following image illustrates two asymmetrical sets:
The first set is represented by the green dots and the second by the white ones. Usually, negative skewness values indicate that there's a dominant tail on the left side, which you can see with the first set. Positive skewness values correspond to a longer or fatter tail on the right side, which you can see in the second set. If the skewness is close to 0 (for example, between −0.5 and 0.5), then the dataset is considered quite symmetrical.
Once you've calculated the size of your dataset n, the sample mean mean_, and the standard deviation std_, you can get the sample skewness with pure Python:
>>>
>>> x = [8.0, 1, 2.5, 4, 28.0]
>>> n = len(x)
>>> mean_ = sum(x) / n
>>> var_ = sum((item - mean_)**2 for item in x) / (n - 1)
>>> std_ = var_ ** 0.5
>>> skew_ = (sum((item - mean_)**3 for item in x)
... * n / ((n - 1) * (n - 2) * std_**3))
>>> skew_
1.9470432273905929
The skewness is positive, so x has a right-side tail.
You can also calculate the sample skewness with scipy.stats.skew():
>>>
>>> y, y_with_nan = np.array(x), np.array(x_with_nan)
>>> scipy.stats.skew(y, bias=False)
1.9470432273905927
>>> scipy.stats.skew(y_with_nan, bias=False)
nan
The result is the same as the pure Python implementation. The parameter bias is set to False to enable the corrections for statistical bias. The optional parameter nan_policy can take the values 'propagate', 'raise', or 'omit'. It allows you to control how you'll handle nan values.
pandas Series objects have the method .skew() that also returns the skewness of a dataset:
>>>
>>> z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
>>> z.skew()
1.9470432273905924
>>> z_with_nan.skew()
1.9470432273905924
Like other methods, .skew() ignores nan values by default, because of the default value of the optional parameter skipna.
Percentiles
The sample p percentile is the element in the dataset such that p% of the elements in the dataset are less than or equal to that value. Also, (100 − p)% of the elements are greater than or equal to that value. If there are two such elements in the dataset, then the sample p percentile is their arithmetic mean. Each dataset has three quartiles, which are the percentiles that divide the dataset into four parts:
- The first quartile is the sample 25th percentile. It divides roughly 25% of the smallest items from the rest of the dataset.
- The second quartile is the sample 50th percentile or the median. Approximately 25% of the items lie between the first and second quartiles and another 25% between the second and third quartiles.
- The third quartile is the sample 75th percentile. It divides roughly 25% of the largest items from the rest of the dataset.
Each part has approximately the same number of items. If you want to divide your data into several intervals, then you can use statistics.quantiles():
>>>
>>> x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
>>> statistics.quantiles(x, n=2)
[8.0]
>>> statistics.quantiles(x, n=4, method='inclusive')
[0.1, 8.0, 21.0]
In this example, 8.0 is the median of x, while 0.1 and 21.0 are the sample 25th and 75th percentiles, respectively. The parameter n defines the number of resulting equal-probability percentiles, and method determines how to calculate them.
Note: statistics.quantiles() was introduced in Python 3.8.
You can also use np.percentile() to determine any sample percentile in your dataset. For example, this is how you can find the 5th and 95th percentiles:
>>>
>>> y = np.array(x)
>>> np.percentile(y, 5)
-3.44
>>> np.percentile(y, 95)
34.919999999999995
percentile() takes several arguments. You have to provide the dataset as the first argument and the percentile value as the second. The dataset can be a NumPy array, list, tuple, or similar data structure. The percentile can be a number between 0 and 100 like in the example above, but it can also be a sequence of numbers:
>>>
>>> np.percentile(y, [25, 50, 75])
array([ 0.1, 8. , 21. ])
>>> np.median(y)
8.0
This code calculates the 25th, 50th, and 75th percentiles all at once. If the percentile value is a sequence, then percentile() returns a NumPy array with the results. The first statement returns the array of quartiles. The second statement returns the median, so you can confirm that it's equal to the 50th percentile, which is 8.0.
If you want to ignore nan values, then use np.nanpercentile() instead:
>>>
>>> y_with_nan = np.insert(y, 2, np.nan)
>>> y_with_nan
array([-5. , -1.1, nan, 0.1, 2. , 8. , 12.8, 21. , 25.8, 41. ])
>>> np.nanpercentile(y_with_nan, [25, 50, 75])
array([ 0.1, 8. , 21. ])
That's how you can avoid nan values.
NumPy also offers very similar functionality in quantile() and nanquantile(). If you use them, then you'll need to provide the quantile values as numbers between 0 and 1 instead of percentiles:
>>>
>>> np.quantile(y, 0.05)
-3.44
>>> np.quantile(y, 0.95)
34.919999999999995
>>> np.quantile(y, [0.25, 0.5, 0.75])
array([ 0.1, 8. , 21. ])
>>> np.nanquantile(y_with_nan, [0.25, 0.5, 0.75])
array([ 0.1, 8. , 21. ])
The results are the same as in the previous examples, but here your arguments are between 0 and 1. In other words, you passed 0.05 instead of 5 and 0.95 instead of 95.
pd.Series objects have the method .quantile():
>>>
>>> z, z_with_nan = pd.Series(y), pd.Series(y_with_nan)
>>> z.quantile(0.05)
-3.44
>>> z.quantile(0.95)
34.919999999999995
>>> z.quantile([0.25, 0.5, 0.75])
0.25 0.1
0.50 8.0
0.75 21.0
dtype: float64
>>> z_with_nan.quantile([0.25, 0.5, 0.75])
0.25 0.1
0.50 8.0
0.75 21.0
dtype: float64
.quantile() also needs you to provide the quantile value as the argument. This value can be a number between 0 and 1 or a sequence of numbers. In the first case, .quantile() returns a scalar. In the second case, it returns a new Series holding the results.
Ranges
The range of data is the difference between the maximum and minimum element in the dataset. You can get it with the function np.ptp():
>>>
>>> np.ptp(y)
46.0
>>> np.ptp(z)
46.0
>>> np.ptp(y_with_nan)
nan
>>> np.ptp(z_with_nan)
46.0
This function returns nan if there are nan values in your NumPy array. If you use a pandas Series object, then it will return a number.
Alternatively, you can use built-in Python, NumPy, or pandas functions and methods to calculate the maxima and minima of sequences:
- max() and min() from the Python standard library
- amax() and amin() from NumPy
- nanmax() and nanmin() from NumPy to ignore nan values
- .max() and .min() from NumPy
- .max() and .min() from pandas to ignore nan values by default
Here are some examples of how you would use these routines:
>>>
>>> np.amax(y) - np.amin(y)
46.0
>>> np.nanmax(y_with_nan) - np.nanmin(y_with_nan)
46.0
>>> y.max() - y.min()
46.0
>>> z.max() - z.min()
46.0
>>> z_with_nan.max() - z_with_nan.min()
46.0
That's how you get the range of data.
The interquartile range is the difference between the first and third quartiles. Once you calculate the quartiles, you can take their difference:
>>>
>>> quartiles = np.quantile(y, [0.25, 0.75])
>>> quartiles[1] - quartiles[0]
20.9
>>> quartiles = z.quantile([0.25, 0.75])
>>> quartiles[0.75] - quartiles[0.25]
20.9
Note that you access the values in a pandas Series object with the labels 0.75 and 0.25.
Summary of Descriptive Statistics
SciPy and pandas offer useful routines to quickly get descriptive statistics with a single function or method call. You can use scipy.stats.describe() like this:
>>>
>>> result = scipy.stats.describe(y, ddof=1, bias=False)
>>> result
DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)
You have to provide the dataset as the first argument. The argument can be a NumPy array, list, tuple, or similar data structure. You can omit ddof=1 since it's the default and only matters when you're calculating the variance. You can pass bias=False to force correcting the skewness and kurtosis for statistical bias.
Note: The optional parameter nan_policy can take the values 'propagate' (default), 'raise' (an error), or 'omit'. This parameter allows you to control what happens when there are nan values.
describe() returns an object that holds the following descriptive statistics:
- nobs: the number of observations or elements in your dataset
- minmax: the tuple with the minimum and maximum values of your dataset
- mean: the mean of your dataset
- variance: the variance of your dataset
- skewness: the skewness of your dataset
- kurtosis: the kurtosis of your dataset
You can access particular values with dot notation:
>>>
>>> result.nobs
9
>>> result.minmax[0] # Min
-5.0
>>> result.minmax[1] # Max
41.0
>>> result.mean
11.622222222222222
>>> result.variance
228.75194444444446
>>> result.skewness
0.9249043136685094
>>> result.kurtosis
0.14770623629658886
With SciPy, you're just one function call away from a descriptive statistics summary for your dataset.
pandas has similar functionality. Series objects have the method .describe():
>>>
>>> result = z.describe()
>>> result
count 9.000000
mean 11.622222
std 15.124548
min -5.000000
25% 0.100000
50% 8.000000
75% 21.000000
max 41.000000
dtype: float64
It returns a new Series that holds the following:
- count: the number of elements in your dataset
- mean: the mean of your dataset
- std: the standard deviation of your dataset
- min and max: the minimum and maximum values of your dataset
- 25%, 50%, and 75%: the quartiles of your dataset
If you want the resulting Series object to contain other percentiles, then you should specify the value of the optional parameter percentiles. You can access each item of the result with its label:
>>>
>>> result['mean']
11.622222222222222
>>> result['std']
15.12454774346805
>>> result['min']
-5.0
>>> result['max']
41.0
>>> result['25%']
0.1
>>> result['50%']
8.0
>>> result['75%']
21.0
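Returning to the percentiles parameter mentioned above, here's a quick sketch that requests the 5th and 95th percentiles; the specific values 0.05 and 0.95 are an arbitrary choice, and pandas always includes the 50% row:
>>>
>>> z.describe(percentiles=[0.05, 0.95])
count     9.000000
mean     11.622222
std      15.124548
min      -5.000000
5%       -3.440000
50%       8.000000
95%      34.920000
max      41.000000
dtype: float64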
That's how you get the descriptive statistics of a Series object with a single pandas method call.
Measures of Correlation Between Pairs of Data
You'll often need to examine the relationship between the corresponding elements of two variables in a dataset. Say there are two variables, x and y, with an equal number of elements, n. Let x₁ from x correspond to y₁ from y, x₂ from x to y₂ from y, and so on. You can then say that there are n pairs of corresponding elements: (x₁, y₁), (x₂, y₂), and so on.
You'll see the following measures of correlation between pairs of data:
- Positive correlation exists when larger values of 𝑥 correspond to larger values of 𝑦 and vice versa.
- Negative correlation exists when larger values of 𝑥 correspond to smaller values of 𝑦 and vice versa.
- Weak or no correlation exists if there is no such apparent relationship.
The following figure shows examples of negative, weak, and positive correlation:
The plot on the left with the red dots shows negative correlation. The plot in the middle with the green dots shows weak correlation. Finally, the plot on the right with the blue dots shows positive correlation. One very important thing to keep in mind when working with correlation between a pair of variables is that correlation is not a measure or indicator of causation; it only shows a relationship between two variables.
The two statistics that measure the correlation between datasets are covariance and the correlation coefficient. Let's define some data to work with these measures. You'll create two Python lists and use them to get corresponding NumPy arrays and pandas Series:
>>>
>>> x = list(range(-10, 11))
>>> y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
>>> x_, y_ = np.array(x), np.array(y)
>>> x__, y__ = pd.Series(x_), pd.Series(y_)
Now that you have the two variables, you can start exploring the relationship between them.
Covariance
The sample covariance is a measure that quantifies the strength and direction of a relationship between a pair of variables:
- If the correlation is positive, then the covariance is positive, as well. A stronger relationship corresponds to a higher value of the covariance.
- If the correlation is negative, then the covariance is negative, as well. A stronger relationship corresponds to a lower (or higher absolute) value of the covariance.
- If the correlation is weak, then the covariance is close to zero.
The mathematical formula for calculating the covariance of the variables x and y is as follows: sxy = I (xi mean(x)) (yi mean(y)) / (n 1), where I = 1, 2,…, n, mean(x) is the sample mean of x, and mean(y) is the sample mean of y. The I values range from 1 to n. It follows that the variance of two variables that are identical is the same as their covariance: sxx = i(xi mean(x))2 / (n 1) = (sx)2 and syy = i(yi mean(y))2 / (n 1) = (sy)2.
Here is how you can calculate the covariance in pure
Python:
>>>
>>> n = len(x)
>>> mean_x, mean_y = sum(x) / n, sum(y) / n
>>> cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))
... / (n - 1))
>>> cov_xy
19.95
First, you find the mean of x and of y. Then you apply the mathematical formula for the covariance.
NumPy has the function
cov() that returns the
covariance matrix
:
>>>
>>> cov_matrix = np.cov(x_, y_)
>>> cov_matrix
array([[38.5 , 19.95 ],
[19.95 , 13.91428571]])
Note that cov() has the optional parameters bias, which defaults to False, and ddof, which defaults to None. Their default values are suitable for getting the sample covariance matrix. The upper-left element of the covariance matrix is the covariance of x and x, or the variance of x. Similarly, the lower-right element is the covariance of y and y, or the variance of y. You can check that this is
true:
>>>
>>> x_.var(ddof=1)
38.5
>>> y_.var(ddof=1)
13.914285714285711
As you can see, the variances of x and y are equal to cov_matrix[0, 0] and cov_matrix[1, 1], respectively.
The other two elements of the covariance matrix are equal and represent the actual covariance between x and y:
>>>
>>> cov_xy = cov_matrix[0, 1]
>>> cov_xy
19.95
>>> cov_xy = cov_matrix[1, 0]
>>> cov_xy
19.95
As you can see, np.cov() gives the same value of the covariance as pure Python.
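You can also pass bias or ddof explicitly if you want to be specific about the divisor. The following is a small sketch, not part of the original example, that uses np.allclose() and np.isclose() only as convenience checks; the population covariance divides by n instead of n − 1:
>>>
>>> np.allclose(np.cov(x_, y_), np.cov(x_, y_, ddof=1))  # the defaults already give the sample covariance
True
>>> sample_cov = np.cov(x_, y_)[0, 1]
>>> population_cov = np.cov(x_, y_, bias=True)[0, 1]  # bias=True divides by n instead of n - 1
>>> np.isclose(population_cov, sample_cov * (n - 1) / n)
True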
pandas Series objects have the method
.cov() that you can use to calculate the
covariance:
>>>
>>> cov_xy = x__.cov(y__)
>>> cov_xy
19.95
>>> cov_xy = y__.cov(x__)
>>> cov_xy
19.95
Here, you call .cov() on one Series object and pass the other object as the first
argument.
Correlation Coefficient
The
correlation coefficient
, or Pearson product-moment correlation coefficient, is denoted by the symbol 𝑟. The coefficient is another measure of the correlation between data. You can think of it as a standardized covariance. Here are some important facts about
it:
-
The value 𝑟 > 0
indicates positive correlation. -
The value 𝑟 < 0
indicates negative correlation. -
The value r = 1
is the maximum possible value of 𝑟. It corresponds to a perfect positive linear relationship between variables. -
The value r = −1
is the minimum possible value of 𝑟. It corresponds to a perfect negative linear relationship between variables. -
The value r ≈ 0
, or when 𝑟 is around zero, means that the correlation between variables is weak.
The mathematical formula for the correlation coefficient is r = sxy / (sx · sy), where sx and sy are the standard deviations of x and y, respectively. If you have the means (mean_x and mean_y) and standard deviations (std_x and std_y) for the datasets x and y, as well as their covariance (cov_xy), then you can calculate the correlation coefficient in pure
Python:
>>>
>>> var_x = sum((item - mean_x)**2 for item in x) / (n - 1)
>>> var_y = sum((item - mean_y)**2 for item in y) / (n - 1)
>>> std_x, std_y = var_x ** 0.5, var_y ** 0.5
>>> r = cov_xy / (std_x * std_y)
>>> r
0.861950005631606
You’ve got the variable r that represents the correlation coefficient.
scipy.stats has the routine
pearsonr() that calculates the correlation coefficient and the
𝑝-value
:
>>>
>>> r, p = scipy.stats.pearsonr(x_, y_)
>>> r
0.861950005631606
>>> p
5.122760847201171e-07
The output of the pearsonr() function is a tuple containing two numbers. The r-value is the first one, and the p-value is the second one.
Similar to the case of the covariance matrix, you can apply
np.corrcoef() with x_ and y_ as the arguments and get the
correlation coefficient matrix
:
>>>
>>> corr_matrix = np.corrcoef(x_, y_)
>>> corr_matrix
array([[1. , 0.86195001],
[0.86195001, 1. ]])
The upper-left element of the matrix is the correlation coefficient between x_ and x_. The lower-right element is the correlation coefficient between y_ and y_. Both are equal to 1.0. The other two elements are equal and represent the actual correlation coefficient between x_ and y_
:
>>>
>>> r = corr_matrix[0, 1]
>>> r
0.8619500056316061
>>> r = corr_matrix[1, 0]
>>> r
0.861950005631606
Naturally, the outcome is the same as when using only Python and the pearsonr() function.
It is possible to calculate the correlation coefficient using
scipy.stats.linregress()
:
>>>
>>> scipy.stats.linregress(x_, y_)
LinregressResult(slope=0.5181818181818181, intercept=5.714285714285714, rvalue=0.861950005631606, pvalue=5.122760847201164e-07, stderr=0.06992387660074979)
linregress() takes x_ and y_ as arguments, performs linear regression, and returns the results. slope and intercept define the equation of the regression line, while rvalue is the correlation coefficient. To access particular values from the result of linregress(), including the correlation coefficient, use dot
notation:
>>>
>>> result = scipy.stats.linregress(x_, y_)
>>> r = result.rvalue
>>> r
0.861950005631606
That’s how you can get the correlation coefficient with linear regression.
pandas Series objects have the method
.corr() for calculating the correlation
coefficient:
>>>
>>> r = x__.corr(y__)
>>> r
0.8619500056316061
>>> r = y__.corr(x__)
>>> r
0.861950005631606
You call .corr() on one Series object and pass the other object as the first
argument.
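Since the correlation coefficient is essentially a standardized covariance, you can cross-check this result with the Series methods shown above. This is just a quick sanity check, not part of the original example, and np.isclose() is used only to compare the floating-point values:
>>>
>>> r_check = x__.cov(y__) / (x__.std() * y__.std())  # .std() uses ddof=1, matching .cov()
>>> np.isclose(r_check, x__.corr(y__))
True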
Working With 2D Data
Statisticians often work with two-dimensional (2D) data. Here are some examples of 2D data
formats:
-
Database
tables -
CSV files
-
Excel
, Calc, and Google
spreadsheets
NumPy and SciPy offer a comprehensive set of routines for working with 2D data. pandas has the class DataFrame, which was designed specifically to handle 2D labeled
data.
Axes
Start by creating a 2D NumPy
array:
>>>
>>> a = np.array([[1, 1, 1],
... [2, 3, 1],
... [4, 9, 2],
... [8, 27, 4],
... [16, 1, 1]])
>>> a
array([[ 1, 1, 1],
[ 2, 3, 1],
[ 4, 9, 2],
[ 8, 27, 4],
[16, 1, 1]])
Now you have a 2D dataset, which you’ll use in this section. You can apply Python statistics functions and methods to it just as you would to 1D
data:
>>>
>>> np.mean(a)
5.4
>>> a.mean()
5.4
>>> np.median(a)
2.0
>>> a.var(ddof=1)
53.40000000000001
As you can see, you get statistics (like the mean, the median, or the variance) across all the data in the array a. Sometimes this behavior is what you want, but in other cases, you’ll want these quantities calculated for each row or column of your 2D array.
All the functions and methods you’ve used so far have the optional parameter axis, which is essential for handling 2D data. axis can take on any of the following
values:
-
axis=None
says to calculate the statistics across all data in the array. The examples above work like this. This behavior is often the default in NumPy. -
axis=0
says to calculate the statistics across all rows, that is, for each column of the array. This behavior is often the default for SciPy statistical functions. -
axis=1
says to calculate the statistics across all columns, that is, for each row of the array.
Let’s see axis=0 in action with np.mean()
:
>>>
>>> np.mean(a, axis=0)
array([6.2, 8.2, 1.8])
>>> a.mean(axis=0)
array([6.2, 8.2, 1.8])
The two statements above return new NumPy arrays with the mean value for each column of a. In this example, the mean of the first column is 6.2, the mean of the second column is 8.2, and the mean of the third column is 1.8.
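If you want to double-check one of these values, you can slice out a single column and average it yourself. This quick check isn’t part of the original example:
>>>
>>> a[:, 0]  # the first column of a
array([ 1,  2,  4,  8, 16])
>>> a[:, 0].mean()  # matches the first element of a.mean(axis=0)
6.2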
If you provide axis=1 to mean(), then you’ll get the results for each
row:
>>>
>>> np.mean(a, axis=1)
array([ 1., 2., 5., 13., 6.])
>>> a.mean(axis=1)
array([ 1., 2., 5., 13., 6.])
As you can see, the mean of the first row of a is 1.0, the mean of the second row is 2.0, and so on. Note that you can extend these rules to multi-dimensional arrays, but that’s beyond the scope of this tutorial. Feel free to explore it on your own.
The parameter axis behaves the same way with all other NumPy functions, and it also works with
methods:
>>>
>>> np.median(a, axis=0)
array([4., 3., 1.])
>>> np.median(a, axis=1)
array([1., 2., 4., 8., 1.])
>>> a.var(axis=0, ddof=1)
array([ 37.2, 121.2, 1.7])
>>> a.var(axis=1, ddof=1)
array([ 0., 1., 13., 151., 75.])
Now you have the medians and sample variances for all columns (axis=0) and rows (axis=1) of the array a.
The parameter axis works very similarly with SciPy statistics functions. However, keep in mind that in this case the default value for axis is 0
:
>>>
>>> scipy.stats.gmean(a) # Default: axis=0
array([4. , 3.73719282, 1.51571657])
>>> scipy.stats.gmean(a, axis=0)
array([4. , 3.73719282, 1.51571657])
If you omit axis or provide axis=0, then you’ll get the result across all rows, that is, for each column. For example, the geometric mean of the first column of a is 4.0, and similarly for the remaining columns.
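As a quick sanity check that isn’t part of the original example, you can recompute the geometric mean of the first column by hand, using the fact that the geometric mean is the exponential of the mean of the logarithms:
>>>
>>> col = a[:, 0]  # the first column of a: 1, 2, 4, 8, 16
>>> 2 ** np.log2(col).mean()  # geometric mean via the mean of the base-2 logs
4.0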
If you specify axis=1, then you’ll get the calculations across all columns, that is, for each
row:
>>>
>>> scipy.stats.gmean(a, axis=1)
array([1. , 1.81712059, 4.16016765, 9.52440631, 2.5198421 ])
In this example, the geometric mean of the first row of a is 1.0. For the second row, it’s approximately 1.82, and so on.
If you want statistics for the entire dataset, then you have to provide axis=None
:
>>>
>>> scipy.stats.gmean(a, axis=None)
2.829705017016332
The geometric mean of all the items in the array a is approximately 2.83.
You can get a statistics summary for 2D data with a single call to scipy.stats.describe(). It works similarly to 1D arrays, but you have to be careful with the parameter axis
:
>>>
>>> scipy.stats.describe(a, axis=None, ddof=1, bias=False)
DescribeResult(nobs=15, minmax=(1, 27), mean=5.4, variance=53.40000000000001, skewness=2.264965290423389, kurtosis=5.212690982795767)
>>> scipy.stats.describe(a, ddof=1, bias=False) # Default: axis=0
DescribeResult(nobs=5, minmax=(array([1, 1, 1]), array([16, 27, 4])), mean=array([6.2, 8.2, 1.8]), variance=array([ 37.2, 121.2, 1.7]), skewness=array([1.32531471, 1.79809454, 1.71439233]), kurtosis=array([1.30376344, 3.14969121, 2.66435986]))
>>> scipy.stats.describe(a, axis=1, ddof=1, bias=False)
DescribeResult(nobs=3, minmax=(array([1, 1, 2, 4, 1]), array([ 1, 3, 9, 27, 16])), mean=array([ 1., 2., 5., 13., 6.]), variance=array([ 0., 1., 13., 151., 75.]), skewness=array([0. , 0. , 1.15206964, 1.52787436, 1.73205081]), kurtosis=array([-3. , -1.5, -1.5, -1.5, -1.5]))
When you provide axis=None, you get the summary across all the data, so most results are scalars. If you set axis=0 or omit it, then the return value is the summary for each column, so most results are arrays with the same number of items as the number of columns. If you set axis=1, then describe() returns the summary for all rows.
You can extract a particular value from the summary with dot
notation:
>>>
>>> result = scipy.stats.describe(a, axis=1, ddof=1, bias=False)
>>> result.mean
array([ 1., 2., 5., 13., 6.])
That’s how you can see a statistics summary for a 2D array with a single function
call.
DataFrames
The DataFrame class is one of the fundamental pandas data types. It’s very comfortable to work with because it has labels attached to the rows and columns. Use the array a to create a DataFrame
:
>>>
>>> row_names = ['first', 'second', 'third', 'fourth', 'fifth']
>>> col_names = ['A', 'B', 'C']
>>> df = pd.DataFrame(a, index=row_names, columns=col_names)
>>> df
         A   B  C
first    1   1  1
second   2   3  1
third    4   9  2
fourth   8  27  4
fifth   16   1  1
In practice, the names of the columns matter and should be descriptive of their contents. Row names are sometimes specified automatically as 0, 1, and so on. You can specify them explicitly with the parameter index, though you’re free to omit index if you like.
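For comparison, here’s a small sketch, not part of the original example, of what you get if you omit index and let pandas fall back to automatic integer labels:
>>>
>>> pd.DataFrame(a, columns=col_names)
    A   B  C
0   1   1  1
1   2   3  1
2   4   9  2
3   8  27  4
4  16   1  1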
DataFrame methods are very similar to Series methods, though the behavior is different. If you call Python statistics methods without arguments, then the DataFrame will return the results for each
column:
>>>
>>> df.mean()
A 6.2
B 8.2
C 1.8
dtype: float64
>>> df.var()
A 37.2
B 121.2
C 1.7
dtype: float64
What you get is a new Series that holds the results. In this case, the Series holds the mean and variance for each column. If you want the results for each row, then just specify the parameter axis=1
:
>>>
>>> df.mean(axis=1)
first 1.0
second 2.0
third 5.0
fourth 13.0
fifth 6.0
dtype: float64
>>> df.var(axis=1)
first 0.0
second 1.0
third 13.0
fourth 151.0
fifth 75.0
dtype: float64
The final product is a Series that contains the required quantity for each row. The rows are denoted by their respective labels, which include “first,” “second,” and so on.
You can isolate each column of a DataFrame like
this:
>>>
>>> df['A']
first 1
second 2
third 4
fourth 8
fifth 16
Name: A, dtype: int64
Now that column ‘A’ is available to you in the form of a Series object, you can apply the appropriate
methods:
>>>
>>> df['A'].mean()
6.2
>>> df['A'].var()
37.20000000000001
That’s how you obtain the statistics for a single column.
Sometimes, you might want to use a DataFrame as a NumPy array and apply some function to it. You can get all the data from a DataFrame with .values or .to_numpy()
:
>>>
>>> df.values
array([[ 1, 1, 1],
[ 2, 3, 1],
[ 4, 9, 2],
[ 8, 27, 4],
[16, 1, 1]])
>>> df.to_numpy()
array([[ 1, 1, 1],
[ 2, 3, 1],
[ 4, 9, 2],
[ 8, 27, 4],
[16, 1, 1]])
df.values and df.to_numpy() give you a NumPy array with all the items of the DataFrame, without the row and column labels. Note that df.to_numpy() is more flexible because you can specify the data type of the items and whether you want to use the existing data or copy it.
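For example, assuming you want floating-point items and an independent copy of the data, a call might look like this small sketch of those optional parameters:
>>>
>>> df.to_numpy(dtype=float, copy=True)
array([[ 1.,  1.,  1.],
       [ 2.,  3.,  1.],
       [ 4.,  9.,  2.],
       [ 8., 27.,  4.],
       [16.,  1.,  1.]])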
Like Series, DataFrame objects have the method .describe() that returns another DataFrame with the statistics summary of all
columns:
>>>
>>> df.describe()
              A          B        C
count   5.00000   5.000000  5.00000
mean    6.20000   8.200000  1.80000
std     6.09918  11.009087  1.30384
min     1.00000   1.000000  1.00000
25%     2.00000   1.000000  1.00000
50%     4.00000   3.000000  1.00000
75%     8.00000   9.000000  2.00000
max    16.00000  27.000000  4.00000
The summary contains the following
results:
-
count
:
the number of items in each column -
mean
:
the mean of each column -
std
:
the standard deviation -
min
and
max
:
the minimum and maximum values -
25%
,
50%
, and
75%
:
the percentiles
If you want the resulting DataFrame object to contain other percentiles, then you should specify the value of the optional parameter percentiles.
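For instance, here’s a small sketch, not in the original example, that requests the 10th and 90th percentiles in addition to the median; the row labels of the resulting DataFrame change accordingly:
>>>
>>> df.describe(percentiles=[0.1, 0.5, 0.9]).index
Index(['count', 'mean', 'std', 'min', '10%', '50%', '90%', 'max'], dtype='object')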
You can access each item of the summary like
this:
>>>
>>> df.describe().at['mean', 'A']
6.2
>>> df.describe().at['50%', 'B']
3.0
That’s how you can obtain descriptive statistics of a DataFrame in Python with a single pandas method
call.
Visualizing Data
You can use visual methods to present, describe, and summarize data in addition to calculating numerical quantities such as the mean, median, or variance. In this section, you’ll learn how to present your data visually using the following
graphs:
- Box plots
- Histograms
- Pie charts
- Bar charts
- X-Y plots
- Heatmaps
Although it’s not the only Python library you can use for this purpose, matplotlib.pyplot is very practical and widely used. You can import it like
this:
>>>
>>> import matplotlib.pyplot as plt
>>> plt.style.use('ggplot')
Now you have matplotlib.pyplot imported and ready for use. The second statement sets the style of your plots by choosing colors, line widths, and other stylistic elements. You’re free to skip it if you’re satisfied with the default style settings. Note: This section focuses on representing the data and keeps stylistic settings to a minimum. You can consult the official documentation of the matplotlib.pyplot routines used here to explore the options not covered in this guide.
You’ll use pseudo-random numbers to get data to work with. You don’t need knowledge of random numbers to understand this section; you just need some arbitrary numbers, and pseudo-random generators are a convenient way to get them. The module np.random generates arrays of pseudo-random
numbers:
-
Normally distributed numbers
are generated with
np.random.randn()
. -
Uniformly distributed integers
are generated with
np.random.randint()
.
NumPy 1.17 introduced another module for pseudo-random number generation. Check the official documentation to learn more about it.
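If you have NumPy 1.17 or later, you can use that newer generator API instead of the legacy np.random functions. The following is a minimal sketch, and the variable names are only illustrative:
>>>
>>> rng = np.random.default_rng(seed=0)  # new-style generator (NumPy >= 1.17)
>>> sample_normal = rng.normal(size=1000)  # normally distributed numbers
>>> sample_integers = rng.integers(low=0, high=21, size=21)  # uniformly distributed integers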
Box Plots
The box plot is an excellent tool to visually represent descriptive statistics of a given dataset. It can show the range, the interquartile range, the median, the mean, and the outliers. First, create some data to represent with a box
plot:
>>>
>>> np.random.seed(seed=0)
>>> x = np.random.randn(1000)
>>> y = np.random.randn(100)
>>> z = np.random.randn(10)
In the first statement, the seed of the NumPy random number generator is set using the seed() function. This ensures that you will always receive the same results when you execute the code. You are not required to set the seed, but if you do not specify this value, you will get different results each time you run the program.
The other three statements generate three NumPy arrays with normally distributed pseudo-random numbers. x refers to the array with 1000 items, y has 100, and z contains 10 items. Now that you have the data to work with, you can apply .boxplot() to get the box
plot:
fig, ax = plt.subplots()
ax.boxplot((x, y, z), vert=False, showmeans=True, meanline=True,
           labels=('x', 'y', 'z'), patch_artist=True,
           medianprops={'linewidth': 2, 'color': 'purple'},
           meanprops={'linewidth': 2, 'color': 'red'})
plt.show()
The parameters of .boxplot() define the
following:
-
x
is your data. -
vert
sets the plot orientation to horizontal when
False
. The default orientation is vertical. -
showmeans
shows the mean of your data when
True
. -
meanline
represents the mean as a line when
True
. The default representation is a point. -
labels
:
the labels of your data. -
patch_artist
determines how to draw the graph. -
medianprops
denotes the properties of the line representing the median. -
meanprops
indicates the properties of the line or dot representing the mean.
There are also other parameters, but analyzing them is outside the scope of what will be covered in this tutorial.
The above code results in an image that looks like this:
There are three box plots displayed here. Each of them represents a unique dataset (x, y, or z) and displays the corresponding information.
following:
-
The mean
is the red dashed line. -
The median
is the purple line. -
The first quartile
is the left edge of the blue rectangle. -
The third quartile
is the right edge of the blue rectangle. -
The interquartile range
is the length of the blue rectangle. -
The range
contains everything from left to right. -
The outliers
are the dots to the left and right.
There’s a lot of information that can be displayed in a single box plot.
figure!
Histograms
Histograms are particularly useful when a dataset contains a large number of unique values. The histogram divides the values from a sorted dataset into intervals, also called bins. Often, all bins are of equal width, though this isn’t required. The values of the lower and upper bounds of a bin are called the bin edges.
The frequency is a single value that corresponds to each bin. It’s the number of elements of the dataset with values between the edges of that bin. By convention, all bins except the rightmost one are half-open: they include values equal to the lower bound but exclude values equal to the upper bound. The rightmost bin is closed because it includes both bounds. If you divide a dataset with the bin edges 0, 5, 10, and 15 (see the short np.histogram() sketch after the list below), then there are three
bins:
-
The first and leftmost bin
contains the values greater than or equal to 0 and less than 5. -
The second bin
contains the values greater than or equal to 5 and less than 10. -
The third and rightmost bin
contains the values greater than or equal to 10 and less than or equal to 15.
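Here’s the small np.histogram() sketch mentioned above, with made-up values, that shows the half-open convention for the edges 0, 5, 10, and 15. Note that the value 5 lands in the second bin, while 15 lands in the last, closed bin:
>>>
>>> values = np.array([0, 4.9, 5, 9.9, 10, 15])
>>> np.histogram(values, bins=[0, 5, 10, 15])
(array([2, 2, 2]), array([ 0,  5, 10, 15]))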
np.histogram() is a convenient way to get data for
histograms:
>>>
>>> hist, bin_edges = np.histogram(x, bins=10)
>>> hist
array([ 9, 20, 70, 146, 217, 239, 160, 86, 38, 15])
>>> bin_edges
array([-3.04614305, -2.46559324, -1.88504342, -1.3044936 , -0.72394379,
-0.14339397, 0.43715585, 1.01770566, 1.59825548, 2.1788053 ,
2.75935511])
It takes the array with your data and the number (or edges) of bins and returns two NumPy
arrays:
-
hist
contains the frequency or the number of items corresponding to each bin. -
bin_edges
contains the edges or bounds of the bin.
.hist() can show the values calculated with np.histogram()
graphically:
fig, ax = plt.subplots()
ax.hist(x, bin_edges, cumulative=False)
ax.set_xlabel('x')
ax.set_ylabel('Frequency')
plt.show()
The first argument of .hist() is the sequence with your data. The second argument defines the edges of the bins. The third disables the option to create a histogram with cumulative values. The code above produces a figure like this:
On the horizontal axis, you can see the edges of the bins, and on the vertical axis, you can see the frequencies.
If you pass the argument cumulative=True to .hist(), then you’ll get a histogram with the cumulative number of items:
fig, ax = plt.subplots()
ax.hist(x, bin_edges, cumulative=True)
ax.set_xlabel('x')
ax.set_ylabel('Frequency')
plt.show()
The following figure can be obtained using this code:
This histogram shows cumulative values. The frequency of the first and leftmost bin is the number of items in that bin. The frequency of the second bin is the sum of the numbers of items in the first and second bins. The other bins follow the same pattern. Finally, the frequency of the last and rightmost bin is the total number of items in the dataset (in this case, 1000). You can also draw a histogram directly with pd.Series.hist(), which uses matplotlib in the
background.
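Here’s a minimal sketch of that shortcut, reusing x and bin_edges from above; the figure setup is only illustrative:
fig, ax = plt.subplots()
pd.Series(x).hist(ax=ax, bins=bin_edges)  # pandas delegates the drawing to matplotlib
ax.set_xlabel('x')
ax.set_ylabel('Frequency')
plt.show()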
Pie Charts
Pie charts represent data with a small number of labels and given relative frequencies. They work well even with labels that can’t be ordered (like nominal data). A pie chart is a circle divided into multiple slices. Each slice corresponds to a single distinct label from the dataset and has an area proportional to the relative frequency associated with that label.
Let’s define data associated with three
labels:
>>>
>>> x, y, z = 128, 256, 1024
Now, create a pie chart with .pie()
:
fig, ax = plt.subplots()
ax.pie((x, y, z), labels=('x', 'y', 'z'), autopct='%1.1f%%')
plt.show()
The first argument of .pie() is your data, and the second is the sequence of the corresponding labels. autopct defines the format of the relative frequencies shown on the figure. You’ll get a figure that looks like this:
The pie chart shows x as the smallest part of the circle, y as the next largest, and then z as the largest part. The percentages denote the relative size of each value compared to their
sum.
Bar Charts
Bar charts also illustrate data that correspond to given labels or discrete numeric values. They can show the pairs of data from two datasets. Items of one set are the labels, while the corresponding items of the other are their frequencies. Optionally, they can show the errors related to the frequencies as well.
The bar chart shows parallel rectangles called bars. Each bar corresponds to a single label and has a height proportional to the frequency or relative frequency of its label. Let’s generate three datasets, each with 21
items:
>>>
>>> x = np.arange(21)
>>> y = np.random.randint(21, size=21)
>>> err = np.random.randn(21)
You use np.arange() to get x, or the array of consecutive integers from 0 to 20. You’ll use this to represent the labels. y is an array of uniformly distributed random integers, also between 0 and 20. This array will represent the frequencies. err contains normally distributed floating-point numbers, which are the errors. These values are optional.
You can create a bar chart with .bar() if you want vertical bars or .barh() if you’d like horizontal
bars:
fig, ax = plt.subplots()
ax.bar(x, y, yerr=err)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
This code should produce the following figure:
The heights of the red bars correspond to the frequencies y, while the lengths of the black lines show the errors err. If you don’t want to include the errors, then omit the parameter yerr of .bar().
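If you’d rather have horizontal bars, as mentioned above, the sketch below swaps .bar() for .barh() and moves the error bars to xerr:
fig, ax = plt.subplots()
ax.barh(x, y, xerr=err)  # horizontal bars: x gives the positions, y the lengths
ax.set_xlabel('y')
ax.set_ylabel('x')
plt.show()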
X-Y Plots
The x-y plot or scatter plot represents the pairs of data from two datasets. The horizontal x-axis shows the values from the set x, while the vertical y-axis shows the corresponding values from the set y. You can optionally include the regression line and the correlation coefficient. Let’s generate two datasets and perform linear regression with scipy.stats.linregress()
:
>>>
>>> x = np.arange(21)
>>> y = 5 + 2 * x + 2 * np.random.randn(21)
>>> slope, intercept, r, *__ = scipy.stats.linregress(x, y)
>>> line = f'Regression line: y={intercept:.2f}+{slope:.2f}x, r={r:.2f}'
The dataset x is again the array with the integers from 0 to 20. y is calculated as a linear function of x distorted with some random noise.
linregress() returns several values. You’ll need the slope and intercept of the regression line, as well as the correlation coefficient r. Then you can apply .plot() to get the x-y
plot:
fig, ax = plt.subplots()
ax.plot(x, y, linewidth=0, marker='s', label='Data points')
ax.plot(x, intercept + slope * x, label=line)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend(facecolor='white')
plt.show()
The result of the code above is this figure:
You can see the data points (x-y pairs) as red squares, as well as the blue regression
line.
Heatmaps
A heatmap can be used to visually show a matrix. The colors represent the numbers or elements of the matrix. Heatmaps are particularly useful for illustrating the covariance and correlation matrices. You can create the heatmap for a covariance matrix with .imshow()
:
matrix = np.cov(x, y).round(decimals=2)
fig, ax = plt.subplots()
ax.imshow(matrix)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, matrix[i, j], ha='center', va='center', color='w')
plt.show()
Here, the heatmap contains the labels ‘x’ and ‘y’ as well as the numbers from the covariance matrix. You’ll get a figure like this:
The yellow field represents the largest element of the matrix, 130.34, while the purple one corresponds to the smallest element, 38.5. The blue squares in between are associated with the value 69.9.
You can obtain the heatmap for the correlation coefficient matrix following the same
logic:
matrix = np.corrcoef(x, y).round(decimals=2)
fig, ax = plt.subplots()
ax.imshow(matrix)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, matrix[i, j], ha='center', va='center', color='w')
plt.show()
The result is the figure below:
The yellow color represents the value 1.0, and the purple color shows 0.99.