Descriptive statistics describes data in a structured way: for example, grouping similar records together, counting how often each value of a variable occurs, plotting the data, and summarizing its distribution with measures of central tendency. These are among the most commonly used techniques for describing data in a meaningful way. This step is important when training an ML model, because a model should not be trained on a dataset that has not been understood first. The techniques described below are used to analyze the data and adjust it accordingly.

In [35]:

# Importing the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

1. Measures of Frequency¶

What are Measures of Frequency?

Measures of frequency count the number of times each value of a variable occurs in the given data.

Example – the count of oranges and apples in a fruits dataset.

1.1 Grouping Data¶

Grouping is one way to find the frequency of a variable: all the rows that belong to the same category are collected into one group.

Example – group all the fruits by name and look at their prices. Let us now see a practical example.

Import Dataset.¶

The dataset used for grouping is Fruits.csv, which contains the name of each fruit and the price of one fruit.

In [2]:

# Import dataset
Fruitsdf = pd.read_csv("Fruits.csv")

In [3]:

# head() displays the first 5 rows by default; pass a number to display that many rows instead (the dataset has only 23 rows, so head(25) shows all of them)

Fruitsdf.head(25)

Out[3]:

	name	price
0	Orange	40
1	Apple	30
2	Orange	40
3	Apple	30
4	Orange	40
5	Apple	30
6	Orange	40
7	Grapes	40
8	Grapes	40
9	Orange	40
10	Grapes	40
11	Apple	30
12	Orange	40
13	Grapes	40
14	Watermelon	50
15	Watermelon	50
16	Grapes	40
17	Watermelon	50
18	Watermelon	50
19	Grapes	40
20	Orange	40
21	Apple	30
22	Apple	30

In [4]:

# Using groupby() from pandas: group the rows by fruit name and sum the price column within each group
groupvalue = Fruitsdf.groupby('name').sum()
groupvalue

Out[4]:

	price
name
Apple	180
Grapes	240
Orange	280
Watermelon	200

In the result above, the rows for each fruit are grouped together and their prices are added up.
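Since this section is about frequency, it may help to see the group counts next to the summed prices. The sketch below is a minimal illustration and not part of the original notebook: it rebuilds the fruit data inline instead of reading Fruits.csv (so the row order differs from the file), and the names `fruits`, `price_of`, and `summary` are made up for the example.

```python
import pandas as pd

# Stand-in for Fruits.csv: the same fruits and prices as in the table above,
# built inline so the example runs without the file (assumption).
price_of = {"Apple": 30, "Grapes": 40, "Orange": 40, "Watermelon": 50}
names = ["Apple"] * 6 + ["Grapes"] * 6 + ["Orange"] * 7 + ["Watermelon"] * 4
fruits = pd.DataFrame({"name": names, "price": [price_of[n] for n in names]})

# For each fruit name: how many rows there are, and the sum of their prices.
summary = fruits.groupby("name").agg(
    count=("price", "count"),
    total_price=("price", "sum"),
)
print(summary)
```

The `total_price` column reproduces the sums shown above, while `count` is the frequency of each fruit.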

1.2 Univariate Analysis using Measures of Frequency¶

What is Univariate Analysis?

Univariate analysis examines one variable at a time, which means describing only a single attribute at a time.

Example: in the fruits dataset, analyze only the fruit name.

What is Univariate Analysis using Measures of Frequency?

Univariate analysis using measures of frequency means describing a single attribute at a time by counting how often each of its values occurs.

To describe univariate data, we will use

Bar Graph
Frequency Distribution
Pie Chart

These three methods are used in univariate analysis with measures of frequency.

1.2.1 Bar Graph¶

A bar graph represents groups of data as bars, which makes the groups easy to compare.

In [5]:

# Plot one bar per fruit; the height of each bar is the mean price of that fruit.

bar_graph = sns.barplot(x=Fruitsdf['name'], y=Fruitsdf['price'])

The bar graph above shows the price of each fruit as a bar.

1.2.2 Frequency Table¶

A frequency distribution shows the number of times each value occurs in a set of data. It organizes the data in a meaningful way for better understanding.

The pandas crosstab() function computes the frequency of each value in an array or in a column of a dataset.

In [6]:

# Using pandas crosstab() we will find the frequency count of fruit in the data

freq_fruits = pd.crosstab(index=Fruitsdf["name"],columns="count")
freq_fruits

Out[6]:

col_0	count
name
Apple	6
Grapes	6
Orange	7
Watermelon	4

The result above gives the count of each value, that is, how many times it occurs in the data.

As you can see, Apple occurs 6 times in the dataset. In the same way, the table gives the frequency count of every fruit in the name column.

In [7]:

Freq_Fruits = freq_fruits.reset_index()
Freq_Fruits

Out[7]:

col_0	name	count
0	Apple	6
1	Grapes	6
2	Orange	7
3	Watermelon	4

reset_index() turns the name index into a regular column, which makes the table easier to read.
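For a single column, `value_counts()` is a common alternative to `crosstab()`. The sketch below is an illustration under the assumption that the fruit names are rebuilt inline rather than read from Fruits.csv; the variable names are hypothetical.

```python
import pandas as pd

# Fruit names with the same counts as the frequency table above (built inline).
names = pd.Series(
    ["Apple"] * 6 + ["Grapes"] * 6 + ["Orange"] * 7 + ["Watermelon"] * 4,
    name="name",
)

# Absolute frequencies, sorted from most to least frequent.
counts = names.value_counts()

# Relative frequencies: each count divided by the total number of rows.
shares = names.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

The relative frequencies are the proportions that a pie chart of the same data would display.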

1.2.3 Pie Chart¶

A pie chart represents data as slices of a circle. Each slice shows the share of one category in the whole.

The data plotted in this pie chart is the output of the frequency table.

In [8]:

Pie_Chart = freq_fruits.plot(kind="pie",y='count',autopct='%1.1f%%',title='Pie Chart',fontsize=14,figsize=(9,9))

The pie chart above shows the proportion of each category in the whole.

The chart is drawn from the frequency table, so the fruit with the highest count takes up the largest slice. In our case, Orange has the highest count.

1.3 Bivariate Analysis using Measures of Frequency¶

Bivariate analysis is used to find the relationship between two variables, ‘X’ and ‘Y’, in a dataset.

1.3.1 Frequency Table¶

In [9]:

# Adding a new column weight to the existing fruits dataset
weight = ['1kg', '2kg', '1kg', '1.5kg','1kg','1.5kg','2kg','2.5kg','3kg','1.5kg','1kg','2.5kg','1kg','1kg','1.5kg','2kg','1kg','2kg','1kg','2kg','1kg','2kg','1kg']

# fruitdata is a second name for the same DataFrame, not a copy,
# so the new weight column also appears in Fruitsdf
fruitdata = Fruitsdf

fruitdata['weight'] = weight

In [10]:

FreqData = pd.crosstab(index=fruitdata["name"],columns=fruitdata["weight"])
FreqData.reset_index()

Out[10]:

weight	name	1.5kg	1kg	2.5kg	2kg	3kg
0	Apple	2	1	1	2	0
1	Grapes	0	3	1	1	1
2	Orange	1	5	0	1	0
3	Watermelon	1	1	0	2	0
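A two-way frequency table is often easier to read with totals or row proportions added. The sketch below is a hedged illustration: it uses only the first eight rows of the name and weight data shown above, typed inline, and the variable names are made up. `margins=True` and `normalize="index"` are standard `crosstab()` options.

```python
import pandas as pd

# First eight rows of the fruit data (name and weight), typed inline.
sample = pd.DataFrame({
    "name": ["Orange", "Apple", "Orange", "Apple",
             "Orange", "Apple", "Orange", "Grapes"],
    "weight": ["1kg", "2kg", "1kg", "1.5kg",
               "1kg", "1.5kg", "2kg", "2.5kg"],
})

# Counts per (name, weight) pair, with an extra "All" row and column of totals.
table = pd.crosstab(index=sample["name"], columns=sample["weight"], margins=True)

# Row proportions: within each fruit, the share of each weight.
row_share = pd.crosstab(index=sample["name"], columns=sample["weight"],
                        normalize="index")

print(table)
print(row_share.round(2))
```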

1.3.2 Stacked bar chart¶

A stacked bar chart splits each bar into segments, one for each category of a second variable, so both the totals and their composition can be compared.

The stacked bar chart below is drawn from the two-way frequency table computed above.

In [11]:

Stackedbar = FreqData.plot(kind="bar",figsize=(7,7),stacked=True,title='Stacked Bar Chart',fontsize=12)
Stackedbar.set_ylabel("Count",fontsize=12)
Stackedbar.set_xlabel("Fruit names",fontsize=10)

Out[11]:

Text(0.5, 0, 'Fruit names')

The stacked bar chart above shows how many fruits of each weight there are for each fruit name.

2. Measures of Central Tendency¶

Measures of central tendency describe a whole set of data with a single value that represents the middle or center of its distribution.

The three main measures of central tendency are the mean, the median, and the mode.

2.1 Measures of Central Tendency: Mean, Median, Mode¶

The dataset we are using to find the mean, median, and mode is Fruits.csv.

2.1.1 Mean¶

The mean is the average value of a set of data. It is also known as the arithmetic average and is the most commonly used measure of central tendency.

How the mean is calculated

mean = sum of the elements / number of elements

array = [20, 10, 40, 5]

The array has four elements.

Add the 4 elements: 20+10+40+5 = 75
Divide the sum by the number of elements, i.e. 4

mean = sum of the elements / number of elements = 75/4 = 18.75
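The same arithmetic can be written in a few lines of plain Python. This is a minimal sketch; the variable names are chosen for the example only.

```python
import statistics

values = [20, 10, 40, 5]

# Sum of the elements divided by the number of elements.
manual_mean = sum(values) / len(values)

# The standard library gives the same result.
library_mean = statistics.mean(values)

print(manual_mean)   # 18.75
print(library_mean)  # 18.75
```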

In [12]:

# We are using the price column from the fruits dataset to calculate the mean.
Fruitsdf['price'].mean()

Out[12]:

39.130434782608695

The mean value is 39.13 for the price column. Try to calculate the mean value manually and compare it with the result returned.

2.1.2 Median¶

The median is the exact middle number in a set of data. The data should be sorted in ascending or descending order before finding the median value.

How the median is calculated for an odd number of elements

array = [2,3,5,7,8]

Divide the number of elements by 2.
Round the quotient up to the next whole number.
The rounded value is the position of the median in the sorted array (counting from 1).
The element at that position is the median.

Number of elements / 2 = 5/2 = 2.5

position = 3

median = 5

The element in the third position of the array is 5, so the median of the array is 5.

How the median is calculated for an even number of elements

array = [3,7,4,5,10,4]

Sort the array: [3,4,4,5,7,10]
Take the middle pair of values from the sorted array, i.e. [4,5]
Add the two middle values: 4+5 = 9
Divide the sum by 2. The quotient is the median of the array.

Middle pair = 4,5

Sum of the middle pair = 4+5 = 9

Median = sum of the middle pair / 2
       = 9/2
       = 4.5
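Both cases can be combined into one small function. The sketch below is an illustration, not the way pandas computes the median internally; the function name `median` is chosen for the example, and it assumes a non-empty list.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)   # the data must be sorted first
    n = len(ordered)
    mid = n // 2               # 0-based index of the (upper) middle element
    if n % 2 == 1:
        # Odd count: the single middle element is the median.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 3, 5, 7, 8]))      # 5
print(median([3, 7, 4, 5, 10, 4]))  # 4.5
```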

In [13]:

# We are using the price column from fruits dataset to calculate the median
Fruitsdf['price'].median()

Out[13]:

40.0

The median value is 40 for the price column. Try to calculate the median value manually and compare it with the result returned.

2.1.3 Mode¶

The mode is the most frequently occurring value in a distribution. If every value in the data appears only once, no value occurs more often than the others, so the data has no single mode.

In [14]:

# We are using the price column from fruits dataset to calculate the mode

Fruitsdf['price'].mode()

Out[14]:

0    40
Name: price, dtype: int64
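Counting by hand shows where the mode comes from. The sketch below is an illustration with a hypothetical helper named `modes`, which returns every value tied for the highest count. pandas `Series.mode()` behaves similarly: it returns all tied values, so for data in which every value is unique it returns all of them.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Prices with the same frequencies as the price column: 6 x 30, 13 x 40, 4 x 50.
prices = [30] * 6 + [40] * 13 + [50] * 4

print(modes(prices))           # [40]
print(modes([1, 2, 2, 3, 3]))  # [2, 3]
```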

2.2 Effects of Outlier on Measures of Central Tendency¶

An outlier is a value that differs markedly from the other values in the data. Outliers can distort measures of central tendency and lead to a wrong analysis of the dataset. They affect the mean much more than the median or the mode.

In [15]:

Fruitsdf.head()

Out[15]:

	name	price	weight
0	Orange	40	1kg
1	Apple	30	2kg
2	Orange	40	1kg
3	Apple	30	1.5kg
4	Orange	40	1kg

In [16]:

# Appending a new row with an unusually high price (an outlier) to the DataFrame
Fruitsdf.loc[len(Fruitsdf.index)] = ['Apple', 100,'2kg'] 

In [17]:

Fruitsdf.head(30)

Out[17]:

	name	price	weight
0	Orange	40	1kg
1	Apple	30	2kg
2	Orange	40	1kg
3	Apple	30	1.5kg
4	Orange	40	1kg
5	Apple	30	1.5kg
6	Orange	40	2kg
7	Grapes	40	2.5kg
8	Grapes	40	3kg
9	Orange	40	1.5kg
10	Grapes	40	1kg
11	Apple	30	2.5kg
12	Orange	40	1kg
13	Grapes	40	1kg
14	Watermelon	50	1.5kg
15	Watermelon	50	2kg
16	Grapes	40	1kg
17	Watermelon	50	2kg
18	Watermelon	50	1kg
19	Grapes	40	2kg
20	Orange	40	1kg
21	Apple	30	2kg
22	Apple	30	1kg
23	Apple	100	2kg

In [18]:

# Computing mean on the price column from fruits.csv

Fruitsdf['price'].mean()

Out[18]:

41.666666666666664

After appending the new row, the mean has changed from 39.13 to 41.67. This happens because the appended value (100) is far from the existing values. The median and mode computed below are both still 40.

In [19]:

Fruitsdf['price'].median()

Out[19]:

40.0

In [20]:

Fruitsdf['price'].mode()

Out[20]:

0    40
Name: price, dtype: int64
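The effect can be checked side by side with the standard library. In this sketch the prices are typed inline with the same frequencies as the price column (an assumption made so the example runs without Fruits.csv), and the helper `summarize` is hypothetical.

```python
import statistics

# 6 x 30, 13 x 40, 4 x 50, as in the price column before the new row.
prices = [30] * 6 + [40] * 13 + [50] * 4
with_outlier = prices + [100]


def summarize(data):
    """Return (mean rounded to 2 places, median, mode) for a list of numbers."""
    return (round(statistics.mean(data), 2),
            statistics.median(data),
            statistics.mode(data))


before = summarize(prices)
after = summarize(with_outlier)
print("before:", before)
print("after: ", after)
```

Only the first entry of each tuple (the mean) changes; the median and the mode stay at 40.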

Below are two plots that make the outlier in the dataset visible.

In [21]:

# Plotting jointplot() to find the outliers
sns.jointplot(x="name", y="price", data=Fruitsdf)

Out[21]:

<seaborn.axisgrid.JointGrid at 0x1fb4af8a520>

In [22]:

# Plotting boxplot() to find the outliers

Fruitsdf.boxplot(column="price",vert=False)

Out[22]:

<AxesSubplot:>

3. Measures of Variability¶

Measures of variability show the amount of dispersion in a dataset. A dataset whose values are spread far apart has high variability. The most frequently used measures of variability are described below.

3.1 Range¶

The range is the difference between the maximum and minimum values in a dataset. It shows the spread of the data.

minimum value¶

In [23]:

# Finding the minimum value from the dataset. The minimum value is the smallest value from the data.

Fruitsdf['price'].min()

Out[23]:

30

maximum value¶

In [24]:

# Finding the maximum value from the dataset. The maximum value is the highest value from the data.  

Fruitsdf['price'].max()

Out[24]:

100
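The range itself is the difference between those two values. A minimal sketch, assuming a price list typed inline with the same values as the price column after the extra row was appended:

```python
# 6 x 30, 13 x 40, 4 x 50 and the appended 100.
prices = [30] * 6 + [40] * 13 + [50] * 4 + [100]

# Range = maximum value - minimum value.
price_range = max(prices) - min(prices)
print(price_range)  # 70
```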

3.2 Quartiles¶

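Quartiles divide the sorted data into four equal parts: the lower quartile (25%), the median (50%), and the upper quartile (75%). The interquartile range, the difference between the upper and lower quartiles, measures variability around the median.

In [25]:

# quantile(q) returns the value below which a fraction q of the data lies; .9 is the 90th percentile. Use .25, .5 and .75 for the quartiles.
Fruitsdf['price'].quantile(.9)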

Out[25]:

50.0
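The three quartiles and the interquartile range can be computed with the same `quantile()` method. The sketch below is an illustration with inline data. The 1.5 × IQR limits are the convention that matplotlib box plots use by default for their whiskers; they are added here as an assumption about how one might flag outliers, not as a step from the original notebook.

```python
import pandas as pd

# Same values as the price column after the extra row was appended.
prices = pd.Series([30] * 6 + [40] * 13 + [50] * 4 + [100])

# Lower quartile, median and upper quartile.
q1, q2, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(q1, q2, q3, iqr)
print(sorted(set(outliers.tolist())))
```

Because most prices are exactly 40, the IQR is only 2.5, so this rule flags 30 and 50 as well as 100. On tightly clustered data like this the rule is strict and should be read with care.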

3.3 Variance¶

Variance measures how far the values are spread around the mean of the data. A larger variance means the individual values lie farther from the mean and from each other.

In [26]:

Fruitsdf['price'].var()

Out[26]:

197.1014492753623

3.4 Mean vs. Variance¶

The mean is the average of the given data, while the variance is the average of the squared differences from the mean.

In [27]:

Fruitsdf['price'].mean()

Out[27]:

41.666666666666664
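The relationship between the two can be made concrete by computing the variance from the mean by hand. The sketch below uses inline data and made-up variable names. It computes both the population variance (divide by n) and the sample variance (divide by n - 1); pandas `var()` uses the sample form by default.

```python
import statistics

# Same values as the price column after the extra row was appended.
prices = [30] * 6 + [40] * 13 + [50] * 4 + [100]
n = len(prices)

mean = sum(prices) / n
squared_diffs = [(p - mean) ** 2 for p in prices]

population_var = sum(squared_diffs) / n    # average squared difference
sample_var = sum(squared_diffs) / (n - 1)  # what pandas var() reports by default

print(round(mean, 4), round(population_var, 4), round(sample_var, 4))
```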

3.5 Standard Deviation¶

The standard deviation measures how far the values in a dataset are spread out from their mean. It helps to compare datasets that have the same mean but a different spread.

How to compute Standard Deviation

Compute the mean of the given array.

A = [2, 6, 3, 5]

mean(A) = (2+6+3+5)/4 = 16/4 = 4

Find the standard deviation.

Subtract the mean from each element of the array
Square the differences and add them up
Divide the sum by the number of elements
Take the square root of the result

Standard Deviation = sqrt(((2-4)^2 + (6-4)^2 + (3-4)^2 + (5-4)^2)/4)

               = sqrt(((-2)^2 + (2)^2 + (-1)^2 + (1)^2)/4)

               = sqrt((4+4+1+1)/4)

               = sqrt(10/4)

               = sqrt(2.5)

               = 1.58113883

These steps divide by the number of elements n, which gives the population standard deviation. The pandas std() used below divides by n - 1 by default, which gives the sample standard deviation.

In [28]:

Fruitsdf['price'].std()

Out[28]:

14.039282363260677
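The difference between the two divisors shows up when the hand-worked array is passed to pandas. This is a minimal sketch; `ddof` ("delta degrees of freedom") is the `std()` parameter that selects the divisor n - ddof.

```python
import pandas as pd

a = pd.Series([2, 6, 3, 5])

# ddof=0 divides by n: the population standard deviation from the worked steps.
population_sd = a.std(ddof=0)

# The default ddof=1 divides by n - 1: the sample standard deviation.
sample_sd = a.std()

print(round(population_sd, 8))  # 1.58113883
print(round(sample_sd, 8))
```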

3.6 Summary Statistics¶

Summary statistics summarize the information in a dataset and give a quick, simple description of the data.

The pandas describe() method is helpful for viewing these statistics for the whole dataset at once.

In [29]:

# describe() will tell us about the statistical information of each Numerical column.

Fruitsdf.describe()

Out[29]:

	price
count	24.000000
mean	41.666667
std	14.039282
min	30.000000
25%	37.500000
50%	40.000000
75%	40.000000
max	100.000000

4. Measures of shape¶

Measures of shape describe how the values of a dataset are distributed and are often examined together with plots. They are used to check whether the data is skewed and how heavy its tails are (kurtosis).

4.1 Calculating Skewness¶

What is Skewness?

Skewness measures how asymmetric a distribution is around its mean.

A symmetric distribution, such as the normal distribution, has a skewness of 0. A positive value means the right tail is longer, and a negative value means the left tail is longer.

In [30]:

# numeric_only=True restricts the calculation to the numeric columns (here, price)
Fruitsdf.skew(numeric_only=True)

Out[30]:

price    3.277663
dtype: float64
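To see what this number is made of, the sketch below recomputes it by hand from the deviations around the mean. It assumes the bias-adjusted sample skewness formula, which is the one pandas `skew()` is understood to use, and checks the manual result against pandas. The data is typed inline and the variable names are made up.

```python
import pandas as pd

# Same values as the price column after the extra row was appended.
prices = pd.Series([30] * 6 + [40] * 13 + [50] * 4 + [100], dtype=float)

n = len(prices)
dev = prices - prices.mean()  # deviations from the mean
m2 = (dev ** 2).sum()         # sum of squared deviations
m3 = (dev ** 3).sum()         # sum of cubed deviations

# Bias-adjusted sample skewness.
manual_skew = n * (n - 1) ** 0.5 / (n - 2) * m3 / m2 ** 1.5

print(round(manual_skew, 6))
print(round(prices.skew(), 6))
```

The cubed deviation of the single value 100 dominates `m3`, which is why the skewness is strongly positive.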

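4.2 Calculating Kurtosis¶

What is Kurtosis?

Kurtosis is a measure of the tailedness of a distribution, that is, how heavy its tails are compared with the bell-shaped normal distribution. The pandas kurt() method reports excess kurtosis, for which a normal distribution scores 0.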

In [31]:

# numeric_only=True restricts the calculation to the numeric columns (here, price)
Fruitsdf.kurt(numeric_only=True)

Out[31]:

price    13.584094
dtype: float64

4.3 Normal Distribution¶

A normal distribution is symmetric and bell-shaped: it has no skewness and no excess kurtosis.

In [32]:

# Importing Studentscore dataset to understand the distribution of the data.
students = pd.read_csv("Studentscore.csv")
students.head(25)

Out[32]:

	Student	Score
0	c1	56.0
1	c2	62.0
2	c3	63.0
3	c4	66.0
4	c5	67.0
5	c6	70.0
6	c7	75.0
7	c8	72.5
8	c9	72.5
9	c10	71.0
10	c11	76.0
11	c12	78.0
12	c13	80.0
13	c14	82.0
14	c15	83.0
15	c16	86.0

In [33]:

# The hist() plot is used for representing the histogram analysis of the data
students.hist(column="Score",figsize=(5,5),color="blue",bins=5,range=(50,90))

Out[33]:

array([[<AxesSubplot:title={'center':'Score'}>]], dtype=object)

In [34]:

# A density plot shows the shape of the data; for normally distributed data it is bell-shaped.
students.plot(kind='density',figsize=(8,8))

Out[34]:

<AxesSubplot:ylabel='Density'>
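The plots can be complemented with numbers by comparing the shape measures of the two datasets. The sketch below types both columns inline (an assumption, so it runs without the CSV files) and uses made-up variable names. With only 16 scores, a skewness near 0 suggests a roughly symmetric sample but does not prove that the data is normally distributed.

```python
import pandas as pd

# Score column of Studentscore.csv and the price column with the outlier.
scores = pd.Series([56, 62, 63, 66, 67, 70, 75, 72.5,
                    72.5, 71, 76, 78, 80, 82, 83, 86])
prices = pd.Series([30] * 6 + [40] * 13 + [50] * 4 + [100])

# Skewness and excess kurtosis of each column, side by side.
shape = pd.DataFrame(
    {"skew": [scores.skew(), prices.skew()],
     "kurtosis": [scores.kurt(), prices.kurt()]},
    index=["scores", "prices"],
)
print(shape.round(3))
```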

Descriptive Statistics¶

What is Statistics?

Statistics is a numerical way of analyzing data, which helps us to understand the distribution of data. It includes various numerical calculations.


