Descriptive Statistics¶
What is Statistics?
Statistics is the practice of collecting, summarizing, and analyzing numerical data. It helps us understand how the data is distributed.
What is Descriptive Statistics?
Descriptive statistics describes data in a structured way. Typical techniques include grouping similar data, counting how often each value of a variable occurs, plotting the data, and summarizing its distribution with measures of central tendency. These are the most commonly used techniques for describing data in a meaningful way. This step is important when training an ML model: we should not train a model on a dataset without first understanding the data. The techniques below are used to analyze the data and prepare it accordingly.
# Importing the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
1. Measures of Frequency¶
What are Measures of Frequency?
Measures of frequency count the number of times each value of a variable occurs in the given data.
Example – the count of oranges and apples in a fruits dataset.
1.1 Grouping Data¶
In measures of frequency, grouping is one way to summarize a variable. In simple words, all rows that belong to the same category are collected into one group.
Example – group the fruits by name and total their prices. Let us now see a practical example.
Import Dataset.¶
The dataset used for grouping is Fruits.csv, which contains the name of each fruit and the price of one fruit.
# Import dataset
Fruitsdf = pd.read_csv("Fruits.csv")
# head() displays the first 5 rows by default. If we pass a number of rows, it displays that many rows instead.
Fruitsdf.head(25)
 | name | price |
---|---|---|
0 | Orange | 40 |
1 | Apple | 30 |
2 | Orange | 40 |
3 | Apple | 30 |
4 | Orange | 40 |
5 | Apple | 30 |
6 | Orange | 40 |
7 | Grapes | 40 |
8 | Grapes | 40 |
9 | Orange | 40 |
10 | Grapes | 40 |
11 | Apple | 30 |
12 | Orange | 40 |
13 | Grapes | 40 |
14 | Watermelon | 50 |
15 | Watermelon | 50 |
16 | Grapes | 40 |
17 | Watermelon | 50 |
18 | Watermelon | 50 |
19 | Grapes | 40 |
20 | Orange | 40 |
21 | Apple | 30 |
22 | Apple | 30 |
# Use pandas groupby() to group the rows by fruit name and sum the price column for each group
groupvalue = Fruitsdf.groupby('name').sum()
groupvalue
price | |
---|---|
name | |
Apple | 180 |
Grapes | 240 |
Orange | 280 |
Watermelon | 200 |
- In the result above, rows for the same fruit are grouped together and their prices are summed.
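Summing prices is one aggregation; for a measure of frequency, the size of each group is usually what we want. The sketch below rebuilds the 23 rows shown above inline, which is an assumption made so that it runs without Fruits.csv, and counts the rows per group with groupby().size(). The names `names`, `price_of`, and `fruits` are illustrative only.

```python
import pandas as pd

# Rebuild the 23 rows listed above instead of reading Fruits.csv
# (assumption, so the example is self-contained)
names = ["Orange", "Apple", "Orange", "Apple", "Orange", "Apple", "Orange",
         "Grapes", "Grapes", "Orange", "Grapes", "Apple", "Orange", "Grapes",
         "Watermelon", "Watermelon", "Grapes", "Watermelon", "Watermelon",
         "Grapes", "Orange", "Apple", "Apple"]
price_of = {"Orange": 40, "Apple": 30, "Grapes": 40, "Watermelon": 50}
fruits = pd.DataFrame({"name": names, "price": [price_of[n] for n in names]})

# size() counts the rows in each group; sum() adds up the prices in each group
group_counts = fruits.groupby("name").size()
group_totals = fruits.groupby("name")["price"].sum()
print(group_counts)
print(group_totals)
```

Selecting the price column before calling sum() keeps the aggregation limited to numeric data, which matters once text columns are added to the DataFrame.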
1.2 Univariate Analysis using Measures of Frequency¶
What is Univariate Analysis?
Univariate analysis examines one variable at a time, which means describing only a single attribute.
Example: if the analysis is on fruits, analyze only one attribute, such as the fruit name.
What is Univariate Analysis using Measures of Frequency?
Univariate analysis using measures of frequency means describing a single attribute by counting how often each of its values occurs.
To describe a single variable, we will use:
Bar Graph
Frequency Distribution
Pie Chart
The above three methods are involved in Univariate Analysis using Measures of Frequency.
1.2.1 Bar Graph¶
A bar graph represents groups of data as bars, which makes the groups easy to compare.
# Plot a bar for each fruit. By default, sns.barplot() shows the mean price of each fruit.
BAR_GRAPH = sns.barplot(x=Fruitsdf['name'], y=Fruitsdf['price'])
- The above bar graph shows the mean price of each fruit as a bar.
1.2.2 Frequency Table¶
A frequency distribution shows the number of times each item occurs in a set of data. It organizes the data in a meaningful manner for better understanding.
- The pandas crosstab() function computes the frequency of each value in a given set of data.
# Using pandas crosstab(), find the frequency count of each fruit in the data
freq_fruits = pd.crosstab(index=Fruitsdf["name"],columns="count")
freq_fruits
col_0 | count |
---|---|
name | |
Apple | 6 |
Grapes | 6 |
Orange | 7 |
Watermelon | 4 |
- The above result shows how many times each fruit name occurs in the data.
As you can see, Apple occurs 6 times in the dataset. In the same way, the table gives the frequency count for every fruit in the dataset.
Freq_Fruits = freq_fruits.reset_index()
Freq_Fruits
col_0 | name | count |
---|---|---|
0 | Apple | 6 |
1 | Grapes | 6 |
2 | Orange | 7 |
3 | Watermelon | 4 |
- reset_index() turns the index into a regular column, which makes the table clearer to read.
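The same frequency table can also be obtained with value_counts(), which additionally gives relative frequencies. This is a minimal sketch, assuming a Series of fruit names with the same counts as the table above; the variable names are illustrative.

```python
import pandas as pd

# Fruit names with the same counts as the frequency table above
# (row order does not matter for counting)
names = pd.Series(["Apple"] * 6 + ["Grapes"] * 6 + ["Orange"] * 7 + ["Watermelon"] * 4,
                  name="name")

counts = names.value_counts()                # absolute frequency, most common first
shares = names.value_counts(normalize=True)  # relative frequency, fractions that sum to 1
print(counts)
print((shares * 100).round(1))               # percentages
```

The relative frequencies are the shares of the whole that each category takes up: Orange accounts for 7 of the 23 rows, or about 30.4%.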
1.2.3 Pie Chart¶
A pie chart represents data as slices of a circle. Each slice shows the share of one category in the whole data.
- The data used for plotting this pie chart is the frequency table computed above.
Pie_Chart = freq_fruits.plot(kind="pie",y='count',autopct='%1.1f%%',title='Pie Chart',fontsize=14,figsize=(9,9))
- The above pie chart shows the proportion of each category in the whole.
The fruit with the highest count in the frequency table takes the largest slice. In our case, Orange has the highest count.
1.3 Bivariate Analysis using Measures of Frequency¶
Bivariate analysis is used to find the relationship between two variables, ‘X’ and ‘Y’.
1.3.1 Frequency Table¶
# Adding a new column weight to the existing fruits dataset
weight = ['1kg', '2kg', '1kg', '1.5kg','1kg','1.5kg','2kg','2.5kg','3kg','1.5kg','1kg','2.5kg','1kg','1kg','1.5kg','2kg','1kg','2kg','1kg','2kg','1kg','2kg','1kg']
# fruitdata refers to the same DataFrame as Fruitsdf (it is not a copy), so the new column is also visible through Fruitsdf
fruitdata = Fruitsdf
fruitdata['weight'] = weight
FreqData = pd.crosstab(index=fruitdata["name"],columns=fruitdata["weight"])
FreqData.reset_index()
weight | name | 1.5kg | 1kg | 2.5kg | 2kg | 3kg |
---|---|---|---|---|---|---|
0 | Apple | 2 | 1 | 1 | 2 | 0 |
1 | Grapes | 0 | 3 | 1 | 1 | 1 |
2 | Orange | 1 | 5 | 0 | 1 | 0 |
3 | Watermelon | 1 | 1 | 0 | 2 | 0 |
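A two-way frequency table is often easier to read with totals or with row proportions. The sketch below rebuilds the name and weight columns inline (an assumption, so that it runs without Fruits.csv) and uses the `margins` and `normalize` parameters of crosstab(); the variable names `with_totals` and `row_shares` are illustrative.

```python
import pandas as pd

# Names and weights in the same row order as the dataset above
names = ["Orange", "Apple", "Orange", "Apple", "Orange", "Apple", "Orange",
         "Grapes", "Grapes", "Orange", "Grapes", "Apple", "Orange", "Grapes",
         "Watermelon", "Watermelon", "Grapes", "Watermelon", "Watermelon",
         "Grapes", "Orange", "Apple", "Apple"]
weight = ['1kg', '2kg', '1kg', '1.5kg', '1kg', '1.5kg', '2kg', '2.5kg', '3kg',
          '1.5kg', '1kg', '2.5kg', '1kg', '1kg', '1.5kg', '2kg', '1kg', '2kg',
          '1kg', '2kg', '1kg', '2kg', '1kg']
fruitdata = pd.DataFrame({"name": names, "weight": weight})

# margins=True appends an "All" row and column that hold the totals
with_totals = pd.crosstab(index=fruitdata["name"], columns=fruitdata["weight"], margins=True)

# normalize="index" turns each row into proportions that sum to 1
row_shares = pd.crosstab(index=fruitdata["name"], columns=fruitdata["weight"], normalize="index")

print(with_totals)
print(row_shares.round(2))
```

Row proportions make fruits with different counts comparable: 5 of the 7 oranges weigh 1kg, while only 1 of the 6 apples does.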
1.3.2 Stacked bar chart¶
A stacked bar chart shows how each category of one variable is split across the categories of a second variable. Use it to compare both the totals and their composition.
- The frequency table computed above is used to plot the stacked bar chart.
Stackedbar = FreqData.plot(kind="bar",figsize=(7,7),stacked=True,title='Stacked Bar Chart',fontsize=12)
Stackedbar.set_ylabel("Count",fontsize=12)
Stackedbar.set_xlabel("Fruit names",fontsize=10)
Text(0.5, 0, 'Fruit names')
- The above stacked bar chart shows, for each fruit name, how many fruits there are of each weight.
2. Measures of Central Tendency¶
Measures of central tendency describe a whole set of data with a single value that represents the middle or center of its distribution.
The three main measures of central tendency are the mean, median, and mode.
2.1 Measures of Central Tendency: Mean, Median, Mode¶
The dataset we are using to find the mean, median, and mode is Fruits.csv.
2.1.1 Mean¶
The mean is the average value of the given set of data. It is also known as the arithmetic average, and it is the most commonly used measure of central tendency.
How the mean is calculated
mean = sum of the elements / number of elements
array = [20, 10, 40, 5]
The above array has four elements
- Add the 4 elements: 20+10+40+5 = 75
- Divide the sum by the number of elements, i.e. 4
mean = sum of the elements / number of elements = 75/4 = 18.75
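The calculation above can be checked in plain Python. This is a minimal sketch using the four elements from the worked example; the name `values` is illustrative.

```python
values = [20, 10, 40, 5]

# mean = sum of the elements / number of elements
mean = sum(values) / len(values)
print(mean)  # 18.75
```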
# We are using the price column from the fruits dataset to calculate the mean.
Fruitsdf['price'].mean()
39.130434782608695
The mean value is 39.13 for the price column. Try to calculate the mean value manually and compare it with the result returned.
2.1.2 Median¶
The median is the middle value in a sorted set of data. The data must be sorted in ascending or descending order before finding the median.
How the median is calculated for an odd number of elements
Median for an odd number of elements in an array:
array = [2,3,5,7,8]
- Divide the number of elements by 2.
- Round the quotient up to the next whole number.
- The rounded value is the position (counting from 1) of the median in the sorted array.
The element at that position is the median.
position = number of elements / 2
= 5/2
Quotient = 2.5
position value = 3
median = 5.0
In the above array, the element in the third position is 5, so the median is 5.
How the median is calculated for an even number of elements
Median for an even number of elements in an array:
array = [3,7,4,5,10,4]
- Sort the array: [3,4,4,5,7,10]
- Take the middle pair of values from the sorted array, i.e. [4,5]
- Add the middle pair: 4+5 = 9
- Divide the sum by 2
- The quotient is the median of the array.
Middle pair = 4,5
Sum of middle pair = 4+5 = 9
Median = sum of middle pair / 2
= 9/2
= 4.5
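Both cases can be sketched as one small function. The function name `median` is hypothetical and is not part of the notebook; it sorts the values first and then picks the middle element or averages the middle pair.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # the median is defined on sorted data
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # odd count: the single middle element
        return float(ordered[middle])
    # even count: the average of the two middle elements
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([2, 3, 5, 7, 8]))      # 5.0
print(median([3, 7, 4, 5, 10, 4]))  # 4.5
```

Python lists are indexed from 0, so position 3 in the odd example corresponds to `ordered[2]`, which is what `n // 2` gives for n = 5.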
# We are using the price column from fruits dataset to calculate the median
Fruitsdf['price'].median()
40.0
The median value is 40 for the price column. Try to calculate the median value manually and compare it with the result returned.
2.1.3 Mode¶
The mode is the most frequently occurring value in a distribution. If every value in the data appears only once, there is no mode.
# We are using the price column from fruits dataset to calculate the mode
Fruitsdf['price'].mode()
0 40 Name: price, dtype: int64
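The result of mode() is a Series rather than a single number because data can have more than one mode. The sketch below uses two small made-up Series (not taken from Fruits.csv) to show how pandas handles ties and data where every value is unique.

```python
import pandas as pd

tied = pd.Series([30, 30, 40, 40, 50])   # 30 and 40 both occur twice
all_unique = pd.Series([10, 20, 30])     # every value occurs once

print(tied.mode().tolist())        # [30, 40]
print(all_unique.mode().tolist())  # [10, 20, 30]
```

When every value occurs once, pandas returns all of them, so a result as long as the input is a sign that the data has no meaningful mode.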
2.2 Effects of Outlier on Measures of Central Tendency¶
An outlier is a value that differs greatly from the other values in the data. Outliers can distort measures of central tendency and lead to a wrong analysis of the dataset. An outlier mostly affects the mean; the median and mode are far less sensitive to it.
Fruitsdf.head()
 | name | price | weight |
---|---|---|---|
0 | Orange | 40 | 1kg |
1 | Apple | 30 | 2kg |
2 | Orange | 40 | 1kg |
3 | Apple | 30 | 1.5kg |
4 | Orange | 40 | 1kg |
# appending a new row to the fruits DataFrame (the Fruits.csv file itself is not changed)
Fruitsdf.loc[len(Fruitsdf.index)] = ['Apple', 100,'2kg']
Fruitsdf.head(30)
 | name | price | weight |
---|---|---|---|
0 | Orange | 40 | 1kg |
1 | Apple | 30 | 2kg |
2 | Orange | 40 | 1kg |
3 | Apple | 30 | 1.5kg |
4 | Orange | 40 | 1kg |
5 | Apple | 30 | 1.5kg |
6 | Orange | 40 | 2kg |
7 | Grapes | 40 | 2.5kg |
8 | Grapes | 40 | 3kg |
9 | Orange | 40 | 1.5kg |
10 | Grapes | 40 | 1kg |
11 | Apple | 30 | 2.5kg |
12 | Orange | 40 | 1kg |
13 | Grapes | 40 | 1kg |
14 | Watermelon | 50 | 1.5kg |
15 | Watermelon | 50 | 2kg |
16 | Grapes | 40 | 1kg |
17 | Watermelon | 50 | 2kg |
18 | Watermelon | 50 | 1kg |
19 | Grapes | 40 | 2kg |
20 | Orange | 40 | 1kg |
21 | Apple | 30 | 2kg |
22 | Apple | 30 | 1kg |
23 | Apple | 100 | 2kg |
# Computing mean on the price column from fruits.csv
Fruitsdf['price'].mean()
41.666666666666664
- After appending the new row, the mean changed from 39.13 to 41.67. This happens because the new price (100) is far from the existing values.
Fruitsdf['price'].median()
40.0
Fruitsdf['price'].mode()
0 40 Name: price, dtype: int64
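The three measures can be put side by side to see which ones react to the new row. This is a sketch that rebuilds the price column inline from the counts in the tables above (six prices of 30, thirteen of 40, four of 50), an assumption made so that it runs without Fruits.csv; the names `before`, `after`, and `comparison` are illustrative.

```python
import pandas as pd

# Prices of the 23 original rows, then the same prices plus the appended 100
before = pd.Series([30] * 6 + [40] * 13 + [50] * 4)
after = pd.concat([before, pd.Series([100])], ignore_index=True)

comparison = pd.DataFrame(
    {
        "mean": [before.mean(), after.mean()],
        "median": [before.median(), after.median()],
        "mode": [before.mode()[0], after.mode()[0]],
    },
    index=["before", "after"],
)
print(comparison.round(2))
```

Only the mean moves (39.13 to 41.67); the median and mode stay at 40, which is why they are preferred when outliers are expected.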
Below are two techniques to display an outlier in the dataset.
# Plotting jointplot() to find the outliers
sns.jointplot(x="name", y="price", data=Fruitsdf)
<seaborn.axisgrid.JointGrid at 0x1fb4af8a520>
# Plotting boxplot() to find the outliers
Fruitsdf.boxplot(column="price",vert=False)
<AxesSubplot:>
3. Measures of Variability¶
Measures of variability show the amount of dispersion in a set of data. A dataset whose values are spread far apart has high variability. The frequently used measures of variability are described below.
3.1 Range¶
The range is the difference between the maximum and minimum values in a dataset. It shows the spread of the data.
minimum value¶
# Finding the minimum value from the dataset. The minimum value is the smallest value from the data.
Fruitsdf['price'].min()
30
maximum value¶
# Finding the maximum value from the dataset. The maximum value is the highest value from the data.
Fruitsdf['price'].max()
100
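The minimum and maximum are the two inputs; the range itself is their difference. A minimal sketch, assuming the price column is rebuilt inline from the 24 rows shown earlier (the name `prices` is illustrative):

```python
import pandas as pd

# The 24 prices after the outlier row was appended
prices = pd.Series([30] * 6 + [40] * 13 + [50] * 4 + [100], name="price")

# range = maximum value - minimum value
price_range = prices.max() - prices.min()
print(price_range)  # 70
```

Because the range depends only on the two extreme values, the single price of 100 raises it from 20 to 70.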
3.2 Quartiles¶
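Quartiles divide the sorted data into four equal parts at the lower quartile (25%), the median (50%), and the upper quartile (75%). They are used to find the interquartile range, which measures variability around the median. Note that the call below passes .9, so it returns the 90th percentile rather than a quartile.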
Fruitsdf['price'].quantile(.9)
50.0
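To get the quartiles themselves, quantile() can be given a list of fractions. The sketch below computes the three quartiles and the interquartile range; it assumes the price column is rebuilt inline from the 24 rows shown earlier, and the names `quartiles` and `iqr` are illustrative.

```python
import pandas as pd

# The 24 prices after the outlier row was appended
prices = pd.Series([30] * 6 + [40] * 13 + [50] * 4 + [100], name="price")

# 0.25, 0.5 and 0.75 are the lower quartile, the median and the upper quartile
quartiles = prices.quantile([0.25, 0.5, 0.75])

# interquartile range = upper quartile - lower quartile
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)  # 2.5
```

The interquartile range ignores the lowest and highest quarters of the data, so it is barely affected by the single price of 100.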
3.3 Variance¶
Variance measures the degree of dispersion around the mean of the given data. It tells us how far individual values lie from the mean on average.
Fruitsdf['price'].var()
197.1014492753623
3.4 Mean v/s Variance¶
The mean is the average of the given data, while the variance is the average of the squared differences from the mean. Note that pandas var() divides by n - 1 (sample variance) by default.
Fruitsdf['price'].mean()
41.666666666666664
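The link between the mean and the variance can be made explicit by computing the variance from its definition. This is a sketch, assuming the price column is rebuilt inline from the 24 rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

# The 24 prices after the outlier row was appended
prices = pd.Series([30] * 6 + [40] * 13 + [50] * 4 + [100], name="price")

mean = prices.mean()
squared_diffs = (prices - mean) ** 2      # squared difference of each value from the mean

population_variance = squared_diffs.sum() / len(prices)    # divide by n
sample_variance = squared_diffs.sum() / (len(prices) - 1)  # divide by n - 1

print(round(population_variance, 4))
print(round(sample_variance, 4))   # 197.1014
```

The sample variance matches the value returned by `Fruitsdf['price'].var()`; passing `ddof=0` to var() gives the population variance instead.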
3.5 Standard Deviation¶
Standard deviation is the square root of the variance. It measures the dispersion of the dataset around its mean, in the same units as the data. It helps to compare datasets that have the same mean but a different spread.
How to compute Standard Deviation
A = [2, 6, 3, 5]
- Compute the mean of the given array.
- mean(A) = (2+6+3+5)/4 = 16/4 = 4
- Find the standard deviation.
- Subtract the mean from each array element
- Square the differences and add them
- Divide the sum by the number of elements
- Take the square root of the result
Standard Deviation = sqrt(((2-4)^2 + (6-4)^2 + (3-4)^2 + (5-4)^2)/4)
= sqrt(((-2)^2 + (2)^2 + (-1)^2 + (1)^2)/4)
= sqrt((4+4+1+1)/4)
= sqrt(10/4)
= sqrt(2.5)
= 1.58113883
This is the population standard deviation, which divides by the number of elements n. The pandas std() used below divides by n - 1 by default (sample standard deviation), so it would return about 1.826 for this array.
Fruitsdf['price'].std()
14.039282363260677
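The worked example with A = [2, 6, 3, 5] divides by the number of elements, while pandas std() divides by one less by default. The sketch below computes both versions by hand and compares them with pandas; the variable names are illustrative.

```python
import math

import pandas as pd

A = [2, 6, 3, 5]
mean = sum(A) / len(A)                          # 4.0
sum_of_squares = sum((x - mean) ** 2 for x in A)  # 10.0

population_std = math.sqrt(sum_of_squares / len(A))    # divides by n, as in the worked example
sample_std = math.sqrt(sum_of_squares / (len(A) - 1))  # divides by n - 1

print(round(population_std, 8))  # 1.58113883
print(round(sample_std, 4))      # 1.8257

# pandas: ddof=0 gives the population value, the default (ddof=1) gives the sample value
print(pd.Series(A).std(ddof=0), pd.Series(A).std())
```

The difference between the two shrinks as the number of elements grows, but it is noticeable for small arrays like this one.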
3.6 Summary Statistics¶
Summary statistics summarize the information in a dataset. They give us a quick and simple description of the data.
The pandas describe() function is helpful for analysing the statistics of the whole dataset.
# describe() reports summary statistics for each numerical column.
Fruitsdf.describe()
 | price |
---|---|
count | 24.000000 |
mean | 41.666667 |
std | 14.039282 |
min | 30.000000 |
25% | 37.500000 |
50% | 40.000000 |
75% | 40.000000 |
max | 100.000000 |
4. Measures of shape¶
Measures of shape describe how the values of a dataset are distributed, and visualization helps to see that shape more clearly. They are used to explore the data and check whether it is skewed and how heavy its tails are (kurtosis).
4.1 Calculating Skewness¶
What is Skewness?
Skewness measures the asymmetry of a distribution around its mean.
A symmetric distribution, such as the normal distribution, has a skewness of 0. A positive value means a longer right tail, and a negative value means a longer left tail.
# numeric_only=True restricts the calculation to numeric columns (name and weight are text)
Fruitsdf.skew(numeric_only=True)
price 3.277663 dtype: float64
4.2 Calculating Kurtosis¶
What is Kurtosis?
Kurtosis is a measure of the tailedness of a distribution. It is usually compared against the bell-shaped normal distribution: pandas kurt() returns excess kurtosis, which is 0 for a normal distribution and positive for heavier tails.
# numeric_only=True restricts the calculation to numeric columns (name and weight are text)
Fruitsdf.kurt(numeric_only=True)
price 13.584094 dtype: float64
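Both values are large mainly because of the single price of 100. The sketch below compares skewness and kurtosis with and without that row; it assumes the price column is rebuilt inline from the tables shown earlier, and the names `without_outlier`, `with_outlier`, and `shape` are illustrative.

```python
import pandas as pd

# The 23 original prices, then the same prices plus the appended 100
without_outlier = pd.Series([30] * 6 + [40] * 13 + [50] * 4)
with_outlier = pd.concat([without_outlier, pd.Series([100])], ignore_index=True)

shape = pd.DataFrame(
    {
        "skew": [without_outlier.skew(), with_outlier.skew()],
        "kurt": [without_outlier.kurt(), with_outlier.kurt()],
    },
    index=["without outlier", "with outlier"],
)
print(shape.round(3))
```

With the outlier, the values match the results above (about 3.278 and 13.584); without it, both are much closer to 0, which shows how sensitive these measures are to one extreme value.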
4.3 Normal Distribution¶
A normal distribution is symmetric and bell shaped, which means it has a skewness of 0 and an excess kurtosis of 0.
# Importing Studentscore dataset to understand the distribution of the data.
students = pd.read_csv("Studentscore.csv")
students.head(25)
 | Student | Score |
---|---|---|
0 | c1 | 56.0 |
1 | c2 | 62.0 |
2 | c3 | 63.0 |
3 | c4 | 66.0 |
4 | c5 | 67.0 |
5 | c6 | 70.0 |
6 | c7 | 75.0 |
7 | c8 | 72.5 |
8 | c9 | 72.5 |
9 | c10 | 71.0 |
10 | c11 | 76.0 |
11 | c12 | 78.0 |
12 | c13 | 80.0 |
13 | c14 | 82.0 |
14 | c15 | 83.0 |
15 | c16 | 86.0 |
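Before plotting, a quick numeric check of symmetry is to compare the mean with the median and to look at the skewness. This is a sketch that copies the 16 scores from the table above into a Series, so that it runs without Studentscore.csv; the name `scores` is illustrative.

```python
import pandas as pd

# The 16 scores listed in the table above
scores = pd.Series(
    [56, 62, 63, 66, 67, 70, 75, 72.5, 72.5, 71, 76, 78, 80, 82, 83, 86],
    name="Score",
)

print(scores.mean(), scores.median())  # 72.5 72.5
print(scores.skew())                   # near 0 for roughly symmetric data
print(scores.kurt())                   # excess kurtosis, 0 for a normal distribution
```

The mean and median are equal and the skewness is small, which is consistent with a roughly symmetric distribution. With only 16 values this is a rough indication, not proof that the scores are normally distributed.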
# The hist() plot is used for representing the histogram analysis of the data
students.hist(column="Score",figsize=(5,5),color="blue",bins=5,range=(50,90))
array([[<AxesSubplot:title={'center':'Score'}>]], dtype=object)
# A density plot shows the shape of the data, which is bell shaped for normally distributed data.
students.plot(kind='density',figsize=(8,8))
<AxesSubplot:ylabel='Density'>