Inferential Statistics¶
Inferential statistics is used for drawing inferences and making predictions about a population from a given sample of data. It uses probability to reach conclusions.
There are several methods for performing inferential statistics on data. In this blog we will discuss the Z-Score, Z-Test, T-Test, F-Test, Correlation Coefficients and the Chi-Square Test for analysing data and drawing a probable conclusion from it.
When do we use Inferential Statistics?
Inferential statistics is mainly used for drawing conclusions about data, where the data can be a sample or a set of features. When we build a model on a large amount of data, inferential statistics comes in handy for reasoning from such samples.
Contents¶
1. Z Scores, Z-Test
1.1 Z Value
1.2 Z-Test
1.3 Two-Sided One-Sample T-Test
1.4 Independent T-Test
1.5 Paired T-Test
2. F-Test
3. Correlation Coefficients
4. Chi-Square Test
1. Z scores, Z-Test¶
1.1 Z Value¶
The Z-Value (or Z-Score) tells how many standard deviations a value (x) lies below or above the population mean. If the Z value is positive, the value (x) is higher than the mean; if the Z value is negative, the value is lower than the mean.
Z-Score can be calculated as follows
z = (X – μ) / σ
where,
X : Single data value
μ : Mean value
σ : Standard Deviation
The z-score in Python can be calculated by using scipy.stats.zscore, such as scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')
where,
a : array_like
An array like object containing the sample data.
axis : int or None, optional
Axis along which to operate; default is 0. If None, compute over the whole array a.
ddof : int, optional
Degrees of freedom correction in the calculation of the standard deviation. Default value is 0 (zero).
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
This field defines how to handle input containing nan values.
The default value is 'propagate', which returns nan
The value 'raise' throws an error
The value 'omit' ignores nan values and performs the calculation
Note: When nan_policy is 'omit', the nan values in the input propagate to the output, but these nan values
do not affect the z-scores computed for the non-nan values
Ex: a = [0.8976, 0.9989, 0.5678, 0.1234, 0.7765, 1, 1.675, 1.456]
==> Mean (μ) = Sum of all the elements / N, where N = total number of elements
mean (μ) = (0.8976+0.9989+0.5678+0.1234+0.7765+1+1.675+1.456)/8 = 0.9369
==> Standard deviation (σ) = sqrt(Σ(X-μ)²/N), where X = element
standard deviation (σ) = sqrt(((0.8976-0.9369)²+(0.9989-0.9369)²+(0.5678-0.9369)²+(0.1234-0.9369)²+(0.7765-0.9369)²+(1-0.9369)²+(1.675-0.9369)²+(1.456-0.9369)²)/8) = 0.45378
==> Z-score (z) = (X-μ)/σ
z = [(0.8976-0.9369)/0.45378, (0.9989-0.9369)/0.45378, (0.5678-0.9369)/0.45378, (0.1234-0.9369)/0.45378, (0.7765-0.9369)/0.45378, (1-0.9369)/0.45378, (1.675-0.9369)/0.45378, (1.456-0.9369)/0.45378]
Result ==> z = [-0.0866, 0.1366, -0.8134, -1.7927, -0.3535, 0.1391, 1.6265, 1.1439]
Computing z-score using default values¶
import numpy as np
import scipy.stats as stats
a = np.array([0.8976,0.9989,0.5678,0.1234,0.7765,1,1.675,1.456])
stats.zscore(a)
array([-0.08660476, 0.13662837, -0.81337952, -1.79269639, -0.35347081, 0.13905242, 1.62653867, 1.14393202])
Computing z-score along specified axis using degrees of freedom¶
a = np.array([[0.1234,0.4567,0.7890,0.9876],
[0.6789,0.7890,0.9987,0.6657],
[0.2234,0.9987,0.3345,0.5567]])
stats.zscore(a,axis=1,ddof=1)
array([[-1.22576827, -0.3486311 ,  0.52587439,  1.04852498],
       [-0.67641081,  0.03847117,  1.40005837, -0.76211873],
       [-0.88942498,  1.37202025, -0.56536131,  0.08276604]])
Computing z-score using nan_policy¶
a = np.array([[0.1234,np.nan,0.7890,0.9876],
[0.6789,0.7890,0.9987,0.6657],
[np.nan,0.9987,0.3345,np.nan]])
stats.zscore(a,axis=1) # default value of nan_policy is propagate, which returns nan
array([[        nan,         nan,         nan,         nan],
       [-0.78105192,  0.04442268,  1.61664815, -0.88001891],
       [        nan,         nan,         nan,         nan]])
a = np.array([[0.1234,np.nan,0.7890,0.9876],
[0.6789,0.7890,0.9987,0.6657],
[np.nan,0.9987,0.3345,np.nan]])
# nan_policy='raise', throws error
stats.zscore(a,axis=1,nan_policy='raise')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-7d7bc298bb30> in <module>
      3                 [np.nan,0.9987,0.3345,np.nan]])
      4
----> 5 stats.zscore(a,axis=1,nan_policy='raise') # nan_policy='raise', throws error

~\anaconda3\lib\site-packages\scipy\stats\stats.py in zscore(a, axis, ddof, nan_policy)
   2545         return np.empty(a.shape)
   2546
-> 2547     contains_nan, nan_policy = _contains_nan(a, nan_policy)
   2548
   2549     if contains_nan and nan_policy == 'omit':

~\anaconda3\lib\site-packages\scipy\stats\stats.py in _contains_nan(a, nan_policy)
    237
    238     if contains_nan and nan_policy == 'raise':
--> 239         raise ValueError("The input contains nan values")
    240
    241     return contains_nan, nan_policy

ValueError: The input contains nan values
a = np.array([[0.1234,np.nan,0.7890,0.9876],
[0.6789,0.7890,0.9987,0.6657],
[np.nan,0.9987,0.3345,np.nan]])
stats.zscore(a,axis=1,nan_policy='omit') # nan_policy='omit', computes the z-score, ignoring all the nans
array([[-1.37976297,         nan,  0.4211984 ,  0.95856458],
       [-0.78105192,  0.04442268,  1.61664815, -0.88001891],
       [        nan,  1.        , -1.        ,         nan]])
1.2 Z-Test¶
The Z-Test is used to test a population mean or proportion. It can be used to test a given mean when the sample is large, meaning the length of the data is more than 30, and when the population standard deviation (and hence the variance) is known. This test is performed to check whether 2 sample means are approximately equal or not.
To perform the z-test, the samples should be taken at random from the population and the data should be normally distributed. If the sample size is larger than 30, the data is assumed to be approximately normally distributed; if the sample size is less than 30, the t-test is considered instead.
We check whether the obtained value is approximately equal to the hypothesized one by considering two hypotheses:
Null Hypothesis (H0) : the values are equal; if so, this hypothesis is accepted
Alternate Hypothesis (HA) : the values are not equal; if so, this hypothesis is accepted
This Z-test is calculated by using the formula
z = (x̄ - μ) / (σ / √n)
where x̄ is the sample mean, μ is the hypothesized population mean, σ is the population standard deviation and n is the sample size.
After performing the z-test, the pvalue obtained should be compared with the alpha value, which is assumed to be 0.05.
- If the pvalue is less than the alpha value, the Null Hypothesis is rejected, which means the Alternate Hypothesis is considered. In other words, the means of the two samples are not equal
- If the pvalue is greater than the alpha value, the Null Hypothesis is accepted. In other words, the means or averages of the two samples are equal
In Python, we calculate the z-test by using the ztest method from statsmodels.stats.weightstats
statsmodels.stats.weightstats.ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1.0)
where,
x1, x2 : arrays
value : float
In the one sample case, value is the mean of x1 under the Null hypothesis. In the two sample case, value is the difference between the mean of x1 and the mean of x2 under the Null hypothesis. The test statistic is x1_mean - x2_mean - value.
alternative : str
The alternative hypothesis, H1, has to be one of the following (default is 'two-sided'):
'two-sided' : H1: difference in means not equal to value
'larger' : H1: difference in means larger than value
'smaller' : H1: difference in means smaller than value
usevar : str, 'pooled'
Currently, only 'pooled' is implemented. If pooled, the standard deviation of the samples is assumed to be the same. See CompareMeans.ztest_ind for different options.
ddof : int
Degrees of freedom used in the calculation of the variance of the mean estimate. In the case of comparing means this is one; however, it can be adjusted for testing other statistics (proportion, correlation).
Returns,
tstat : float
test statistic
pvalue : float
pvalue of the t-test
Example¶
import numpy as np
import pandas as pd
from numpy.random import randn
from statsmodels.stats.weightstats import ztest
x1 = [20, 30, 40, 50, 10, 20]
z = ztest(x1, value=25) # where value is the hypothesized mean under the Null hypothesis
z
(0.5547001962252289, 0.5790997419539189)
The first value in the above result is the test statistic and the other value is the pvalue. From the output we can see that the pvalue is about 0.58, which is greater than the alpha value of 0.05. Hence we conclude that the Null Hypothesis is accepted, which means that the mean of the given data and the assumed mean are approximately equal.
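As a sanity check, the statistic above can be reproduced by hand, since ztest computes z = (x̄ - μ) / (s/√n) using the sample standard deviation (its default ddof=1.0) and a two-sided p-value from the normal distribution. A minimal sketch, assuming only numpy and scipy:
import numpy as np
from scipy.stats import norm
x1 = np.array([20, 30, 40, 50, 10, 20])
z = (x1.mean() - 25) / (x1.std(ddof=1) / np.sqrt(len(x1)))  # (sample mean - hypothesized mean) / standard error
p = 2 * norm.sf(abs(z))  # two-sided p-value from the normal distribution
print(z, p)  # ~0.5547 and ~0.5791, matching ztest(x1, value=25) above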
x1 = [20, 30, 40, 50, 10, 20]
x2 = [11, 12, 13, 14, 15, 16]
z=ztest(x1, x2, value= 0, alternative = 'larger')
z
(2.448717008689441, 0.007168301924196878)
From the above output, we can understand that the Null Hypothesis is rejected and Alternate Hypothesis is accepted as the pvalue is less than the alpha value
T-Test¶
T-test, also known as Student’s T-test, is used to determine the difference between two groups of variables by comparing their mean values or averages. The T-test not only determines the difference but also the significance of that difference. In other words, this test explains whether the difference between the variable groups occurred by chance or is relevant to the data taken.
The 3 types of T-test are
1. Independent T-test
2. Paired Sample T-test
3. One-Sample T-test
One-Sample T-test : A one-sample T-test is a t-test where one group’s mean or average is compared with one significant value, which is the hypothesized mean of the population
Types of One-Sample T-test are
1. Two tailed One-Sample T-test
2. One tailed One-Sample T-test, which can be either
- Upper tailed One-Sample T-test, or
- Lower tailed One-Sample T-test
1.3 Two Sided One-Sample T-test¶
A Two Sided One-Sample T-test, or Two Tailed One-Sample T-test, checks whether the mean of a single sample differs from a hypothesized population mean in either direction: the Null Hypothesis (H0) is that the sample mean equals the hypothesized mean, while the Alternate Hypothesis (HA) is that they differ. A sketch of this test is shown below.
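In Python this test can be performed with scipy.stats.ttest_1samp, which is two-sided by default. A minimal sketch, using hypothetical sample values and a hypothesized population mean of 25:
import numpy as np
import scipy.stats as stats
a = np.array([20, 30, 40, 50, 10, 20])  # hypothetical sample data
stats.ttest_1samp(a, popmean=25)  # returns the t statistic and the two-sided pvalue
As with the other tests, if the returned pvalue is less than the alpha value (0.05) the Null Hypothesis is rejected, otherwise it is accepted.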
1.4 Independent t-Test¶
Independent T-test, also known as the Two Sample T-test, is used to test whether the means of the 2 given groups are equal or not. By default, this Independent T-test assumes that the two populations have equal variances.
In Python, we can perform this test
- Using the scipy library
- Using statsmodels
Using scipy library¶
scipy.stats.ttest_ind(a, b, axis=0, equal_var=True)
where,
a, b : the two arrays of the 2 groups
axis : int or None, optional
Axis along which to compute the test. If None, compute over the whole arrays, a and b.
equal_var : bool, optional
If True (default), perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variance.
Returns,
statistic : float or array
The calculated t-statistic.
pvalue : float or array
The p-value.
Example¶
For example, we are given 2 different groups of data: Bag-A has a bunch of apples and Bag-B has a bunch of mangoes. We need to check whether both bags have the same averages or means.
For this we assume 2 hypotheses, the null hypothesis and the alternate hypothesis:
H0 -> The means of the two bags are equal
HA -> The means of the two bags are not equal
To find which hypothesis holds, we compare the alpha value, which is assumed to be 0.05, with the pvalue obtained after performing the t-test. If the pvalue is less than the alpha value, then HA is considered to be true. If the pvalue is greater than the alpha value, then H0 is considered to be true.
Let's test the above example with a t-test using the scipy library
import scipy.stats as stats
a = np.array([5,6,7,8,2,3,4,5])
b = np.array([12,13,14,15,16,2,3,4])
stats.ttest_ind(a, b, equal_var=True) # Assuming that the 2 groups have equal variance
Ttest_indResult(statistic=-2.2331335038240865, pvalue=0.042379219768910015)
stats.ttest_ind(a, b, equal_var=False) # Assuming that the 2 groups don't have equal variance (Welch's t-test)
Ttest_indResult(statistic=-2.2331335038240865, pvalue=0.05369587840008499)
Interpreting the results above with an alpha value of 0.05: the equal-variance test returns a pvalue (about 0.042) that is less than alpha, so the means of the 2 bags (groups) are not supposed to be equal. Welch's test, however, returns a pvalue (about 0.054) slightly greater than alpha, under which the Null Hypothesis would narrowly be accepted. The conclusion here thus depends on the equal-variance assumption.
Using statsmodels¶
statsmodels.stats.weightstats.ttest_ind(x1, x2)
where,
x1 and x2 are two array groups
Returns:
tstat : float
test statistic
pvalue : float
pvalue of the t-test
df : int or float
degrees of freedom used in the t-test
from statsmodels.stats.weightstats import ttest_ind
a = np.array([12,14,16,4,5,11,12,11])
b = np.array([12,13,14,15,16,2,3,4])
ttest_ind(a,b)
(0.2963188789948769, 0.7713367820262194, 14.0)
Interpreting the result from the above test, assuming that the alpha value is 0.05, the resulting pvalue is about 0.77, which is greater than the assumed alpha value. Hence it can be said that the assumption H0 is true, i.e., the means of the 2 groups are equal
1.5 Paired t-Test¶
A Paired t-Test explains the difference between two variables for the same subject: it compares one set of measurements with a second set from the same sample. This test is also known as the Dependent Sample T-test.
In simple words, this T-test measures the difference between the averages or means of two related groups. Like the other tests, it assumes 2 hypotheses:
Null Hypothesis (H0) : The difference between the two means of the two groups is zero
Alternate Hypothesis (HA) : The difference between the two means of the two groups is not equal to zero
In Python, this Paired T-test can be calculated by using the ttest_rel() method defined in the scipy.stats library
scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate', alternative='two-sided')
where,
a, b : array_like
axis : int or None, optional
Axis along which to compute test. If None, compute over the whole arrays, a, and b.
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
Defines how to handle when input contains nan. The following options are available (default is ‘propagate’):
‘propagate’: returns nan
‘raise’: throws an error
‘omit’: performs the calculations ignoring nan values
alternative : {‘two-sided’, ‘less’, ‘greater’}, optional
Defines the alternative hypothesis. The following options are available (default is ‘two-sided’):
‘two-sided’: the means of the distributions underlying the samples are unequal.
‘less’: the mean of the distribution underlying the first sample is less than the mean of the distribution underlying the second sample.
‘greater’: the mean of the distribution underlying the first sample is greater than the mean of the distribution underlying the second sample.
Returns
statistic : float or array
t-statistic.
pvalue : float or array
The p-value.
import scipy.stats as stats
a = np.array([12, 14, 16, 4, 5, 11, 12, 11])
b = np.array([12, 13, 14, 15, 16, 2, 3, 4])
stats.ttest_rel(a,b)
Ttest_relResult(statistic=0.26355219111613715, pvalue=0.7997147761519707)
From the above result, the assumed alpha value of 0.05 is less than the obtained pvalue. Hence we can accept the Null Hypothesis H0, saying that the difference between the two means of the two groups is zero
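A paired t-test is equivalent to a one-sample t-test on the per-subject differences against a mean of zero; a small sketch verifying this on the same arrays (reusing a, b and scipy.stats as imported above):
stats.ttest_1samp(a - b, 0)  # same statistic and pvalue as stats.ttest_rel(a, b) above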
2. F-Test¶
F-Test can be applied to test for a significant difference between the variances of two populations, based on small samples drawn from those populations. The test based on this statistic is known as the F-Test.
Simply said, the F-test compares the variances of 2 samples by performing division. The result of the f-test is always positive, because variances are always positive. Assuming the two sample standard deviations are s1 and s2, the formula is F = s1^2/s2^2.
The Hypotheses for the F-test are defined as:
Null Hypothesis (H0) : The variances of the two variables are equal and there is no significant difference
Alternate Hypothesis (HA) : The variances of the two variables are not equal
The F-Statistic, also known as the F-Value, is used in Analysis of Variance (ANOVA) and in regression models to find the significance between the means of populations by comparing variances. The F-Statistic is what an F-test evaluates. The F-test is similar to the T-test, except that in an F-test we check for significance among a group of variables, whereas in a T-test we check for significance between 2 variables. In this form, the F-test is used to check the similarity among the means of different groups.
For an F-test to be conducted we need to assume
- that the data taken is normally distributed
- that the larger sample variance is placed in the numerator and the smaller in the denominator (see the sketch after this list)
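scipy does not expose this classical two-sample variance F-test as a single function (scipy.stats.bartlett and scipy.stats.levene are related equal-variance tests), so the ratio can be computed directly from the formula above. A minimal sketch, assuming normally distributed samples and hypothetical array values:
import numpy as np
import scipy.stats as stats
a = np.array([12, 14, 16, 4, 5, 11, 12, 11])  # hypothetical sample 1
b = np.array([12, 13, 14, 15, 16, 2, 3, 4])  # hypothetical sample 2
F = np.var(a, ddof=1) / np.var(b, ddof=1)  # F = s1^2 / s2^2, the ratio of sample variances
dfn, dfd = len(a) - 1, len(b) - 1  # numerator and denominator degrees of freedom
p = 2 * min(stats.f.sf(F, dfn, dfd), stats.f.cdf(F, dfn, dfd))  # two-sided p-value from the F distribution
print(F, p)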
There are many statistics in which F-Statistic is used, but mostly used F-test is Analysis Of Variance (ANOVA)
ANOVA : Analysis Of Variance, called ANOVA for short, is used to test for differences among the means of two or more groups. ANOVA uses the F-Statistic to compare the means of 2 or more groups
The Hypothesis here is taken as,
Null Hypothesis (H0) : The group means are equal
Alternate Hypothesis (H1) : The group means are not equal
There are different types of ANOVA such as,
- One-way ANOVA
- Two-way ANOVA
- Factorial ANOVA
- Repeated Measures ANOVA
- MANOVA etc.,
The most used ANOVA is the One-way ANOVA, which compares the means of groups formed by a single independent variable, to check whether the groups are alike or not
This test is performed by using a method f_oneway from scipy.stats as
scipy.stats.f_oneway(*samples, axis=0)
where,
samples can be any number of groups or array like variables
axis defines the axis along which the test is to be performed; by default it is set to zero, and it is optional
returns,
statistic : float
The computed F statistic of the test.
pvalue : float
The associated p-value from the F distribution.
As per this, if the pvalue is less than the alpha value (0.05) then the Null Hypothesis is rejected, and if the pvalue is higher than the alpha value then the Null Hypothesis is accepted.
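Before the larger simulated example below, here is a minimal sketch of f_oneway on three small hypothetical groups, just to show the call and the shape of its result:
import scipy.stats as stats
g1 = [4, 5, 6, 5, 4]  # hypothetical measurements for group 1
g2 = [6, 7, 8, 7, 6]  # group 2
g3 = [5, 6, 7, 6, 5]  # group 3
stats.f_oneway(g1, g2, g3)  # returns F_onewayResult(statistic=..., pvalue=...)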
# Importing Libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
# Creating a dataset
cities = ["punjab","delhi","hyderabad","bangalore","mumbai"]
people_of_spec_city = np.random.choice(a= cities, p = [0.05, 0.15 ,0.25, 0.05, 0.5], size=1000)
# np.random.choice, returns some random values from the given value a, with the probabilities mentioned in p of size given
people_of_spec_city
array(['mumbai', 'hyderabad', 'mumbai', 'hyderabad', 'mumbai', 'mumbai',
       'delhi', 'hyderabad', 'mumbai', 'mumbai', 'mumbai', 'hyderabad',
       'bangalore', 'hyderabad', 'delhi', 'hyderabad', 'mumbai', 'mumbai',
       ...,
       'mumbai', 'punjab', 'hyderabad', 'mumbai', 'mumbai', 'mumbai',
       'mumbai', 'mumbai', 'delhi', 'hyderabad', 'delhi', 'mumbai'],
      dtype='<U9')
population_of_spec_city = stats.poisson.rvs(loc=18, mu=30, size= 1000)
# stats.poisson.rvs generates random numbers from a Poisson distribution, where mu is the distribution's
# mean (its shape parameter), loc shifts the generated values, and size gives the number of values
population_of_spec_city
array([61, 43, 54, 46, 55, 52, 43, 48, 42, 52, 55, 50, 39, 50, 54, 41, 42,
       51, 48, 58, 50, 49, 43, 48, 44, 49, 48, 49, 44, 47, 48, 58, 51, 39,
       ...,
       54, 48, 45, 60, 43, 46, 57, 54, 48, 45, 49, 56, 44], dtype=int64)
# Forming the DataFrame from the obtained values
population_frame = pd.DataFrame({"city":people_of_spec_city,"population":population_of_spec_city})
# Grouping the row indices by the categorical city variable
groups = population_frame.groupby("city").groups
groups
{'bangalore': [12, 33, 45, 75, 96, 103, 105, 134, 147, 148, 180, 182, 184, 187, 202, 209, 222, 227, 302, 338, 344, 375, 387, 418, 422, 428, 438, 472, 486, 488, 563, 566, 570, 573, 574, 588, 605, 663, 699, 707, 711, 716, 741, 753, 758, 783, 787, 827, 897, 903, 931, 978, 979], 'delhi': [6, 14, 29, 32, 57, 58, 59, 61, 65, 93, 110, 114, 118, 119, 121, 122, 126, 129, 131, 155, 159, 163, 186, 189, 193, 196, 197, 205, 220, 225, 229, 230, 231, 237, 240, 249, 254, 268, 271, 274, 287, 290, 297, 303, 310, 318, 323, 326, 349, 361, 368, 378, 383, 391, 403, 404, 410, 411, 412, 413, 416, 423, 426, 431, 439, 441, 446, 447, 450, 463, 485, 492, 493, 495, 496, 515, 518, 530, 537, 539, 541, 542, 544, 553, 554, 562, 571, 581, 584, 596, 612, 623, 626, 630, 634, 637, 638, 659, 667, 668, ...], 'hyderabad': [1, 3, 7, 11, 13, 15, 25, 31, 34, 36, 38, 43, 46, 47, 50, 60, 67, 72, 73, 77, 83, 86, 87, 90, 101, 106, 109, 116, 117, 127, 136, 138, 139, 142, 150, 151, 153, 158, 161, 166, 168, 171, 172, 176, 178, 179, 183, 192, 194, 195, 200, 210, 211, 217, 219, 223, 226, 233, 238, 244, 247, 252, 255, 258, 259, 267, 269, 272, 273, 276, 277, 282, 283, 284, 294, 298, 299, 301, 305, 307, 311, 312, 313, 314, 315, 316, 322, 329, 331, 334, 335, 336, 337, 340, 342, 347, 351, 352, 357, 364, ...], 'mumbai': [0, 2, 4, 5, 8, 9, 10, 16, 17, 18, 19, 20, 21, 22, 23, 26, 27, 28, 35, 37, 39, 40, 41, 42, 44, 48, 49, 51, 52, 53, 54, 55, 56, 62, 63, 64, 66, 68, 69, 70, 71, 74, 76, 78, 79, 80, 81, 82, 84, 85, 89, 91, 92, 94, 95, 97, 98, 99, 100, 102, 104, 107, 108, 111, 112, 113, 115, 120, 123, 124, 125, 130, 132, 133, 135, 137, 140, 141, 143, 144, 149, 152, 154, 156, 157, 160, 164, 165, 167, 169, 170, 173, 174, 177, 185, 188, 190, 191, 198, 199, ...], 'punjab': [24, 30, 88, 128, 145, 146, 162, 175, 181, 243, 261, 292, 343, 346, 350, 360, 386, 393, 432, 464, 465, 514, 529, 625, 628, 635, 641, 643, 680, 687, 730, 845, 880, 884, 919, 933, 950, 957, 961, 964, 976, 989]}
# Extract individual groups into respective variables
punjab = population_of_spec_city [groups["punjab"]]
bangalore = population_of_spec_city[groups["bangalore"]]
delhi = population_of_spec_city[groups["delhi"]]
hyderabad = population_of_spec_city[groups["hyderabad"]]
mumbai = population_of_spec_city[groups["mumbai"]]
# Now calculate the one-way anova test for the obtained individual groups
stats.f_oneway(punjab, bangalore, delhi, hyderabad, mumbai)
F_onewayResult(statistic=0.9110431706569894, pvalue=0.45674036540270235)
From the above result we can see that the pvalue (about 0.46) is greater than the alpha value (0.05). Hence we can say that there is no significant difference among the means of the different groups, they are almost equal, and the Null Hypothesis is accepted
3. Correlation coefficients¶
The correlation coefficient is a statistical measure that shows the degree to which changes in the value of one variable predict changes in the value of another. The letter r is used to represent the correlation coefficient, and r is a unit-free value between -1 and 1.
The correlation coefficient measures the relatedness of the data; the strength of that relationship is obtained by using the correlation coefficient formulas. The value ranges from -1 to 1, where -1 represents a perfect negative relationship, 1 represents a perfect positive relationship, and 0 (zero) represents no relationship
Let’s suppose, there are two variables x and y, for which the correlation coefficient need to be found
If the value of y goes up when the value of x goes up, which means x is directly proportional to y, then the correlation coefficient between x and y comes out positive, approaching 1
If the value of y goes down whenever the value of x goes up, or vice versa, which means x is inversely proportional to y, then the correlation coefficient between x and y comes out negative, approaching -1
If changes in y are unrelated to changes in the other variable, in our case x, then the correlation coefficient between these 2 variables comes out to be 0 (zero)
For Example,
Positive Correlation : If the quantity of milk increases, the price also increases Negative Correlation : If the price of a stock goes down, then the buying of that stock increases Zero Correlation : There is no relationship between score in video games and grades of an examination
Before we dig into how the correlation among 2 or more variables is calculated, it is necessary to understand a term called covariance.
So, what is covariance? Covariance is a term used to describe the linear relationship between 2 variables. If the covariance is positive, the variables tend to change in the same direction. If the covariance is negative, the variables tend to move in opposite directions
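The link between covariance and the correlation coefficient can be seen numerically: Pearson's r (discussed next) is just the covariance rescaled by the two standard deviations. A small sketch with hypothetical arrays:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
cov_xy = np.cov(x, y)[0, 1]  # sample covariance of x and y
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # rescaling the covariance gives Pearson's r
print(cov_xy, r)
print(np.corrcoef(x, y)[0, 1])  # the same r, computed directly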
There are many types of correlation coefficients. But here we will discuss some of the important correlation coefficients that are widely being used
1. Pearson’s r
Pearson's r, also known as Pearson's product moment correlation coefficient, is used for describing the strength of the linear relationship between 2 variables
This correlation coefficient is used when the data follows a normal distribution, has no outliers, is not skewed, and when you expect a linear relationship between the 2 variables.
The Pearson's r formula is as follows
r = Σ((x - x̄)(y - ȳ)) / sqrt(Σ(x - x̄)² * Σ(y - ȳ)²)
where x̄ and ȳ are the means of the x and y values.
The strength of the correlation is commonly read as
- weak positive correlation for 0<r<0.3, weak negative correlation for -0.3<r<0
- moderate correlation for 0.3<|r|<0.5 (positive or negative)
- strong positive correlation for 0.5<r<1, strong negative correlation for -1<r<-0.5
- no correlation for r=0
2. Spearman’s rho
Spearman’s rho, also known as Spearman’s rank correlation coefficient, is used as an alternative to Pearson’s correlation coefficient. It is a rank correlation coefficient, as it uses the rankings of each variable (say, lowest to highest) rather than the raw values to determine the strength of the relationship. Unlike Pearson’s r, Spearman’s rho captures monotonic relationships, which may be non-linear. Spearman’s rho formula (for distinct ranks) is as follows
ρ = 1 - (6 * Σd²) / (n(n² - 1))
where d is the difference between the two ranks of each observation and n is the number of observations.
3. Kendall’s tau
Kendall’s tau is used to calculate the correlation coefficient when the 2 variables may be continuous variables with outliers, or ordinal, but exhibit a monotonic relationship. Spearman’s rho and Kendall’s tau are broadly similar, but Kendall’s tau is often preferred when the sample is small or contains many tied ranks.
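All three coefficients are also available as standalone functions in scipy.stats, each returning the coefficient together with a pvalue; a quick sketch on hypothetical paired data:
import numpy as np
import scipy.stats as stats
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 1, 4, 3, 7, 8])
print(stats.pearsonr(x, y))  # Pearson's r and its pvalue
print(stats.spearmanr(x, y))  # Spearman's rho and its pvalue
print(stats.kendalltau(x, y))  # Kendall's tau and its pvalue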
These 3 correlation coefficients can be calculated in Python by using the pandas function
DataFrame.corr(method='pearson', min_periods=1)
Parameters:
method : {'pearson', 'kendall', 'spearman'}, default 'pearson'
Calculating correlation coefficient using pandas dataframe¶
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/leena.ganta/Desktop/DataVedas/happyscore_income.csv',index_col=0)
df.head()
country | adjusted_satisfaction | avg_satisfaction | std_satisfaction | avg_income | median_income | income_inequality | region | happyScore | GDP | country.1
---|---|---|---|---|---|---|---|---|---|---
Armenia | 37 | 4.9 | 2.42 | 2096.76 | 1731.506667 | 31.445556 | ‘Central and Eastern Europe’ | 4.350 | 0.76821 | Armenia |
Angola | 26 | 4.3 | 3.19 | 1448.88 | 1044.240000 | 42.720000 | ‘Sub-Saharan Africa’ | 4.033 | 0.75778 | Angola |
Argentina | 60 | 7.1 | 1.91 | 7101.12 | 5109.400000 | 45.475556 | ‘Latin America and Caribbean’ | 6.574 | 1.05351 | Argentina |
Austria | 59 | 7.2 | 2.11 | 19457.04 | 16879.620000 | 30.296250 | ‘Western Europe’ | 7.200 | 1.33723 | Austria |
Australia | 65 | 7.6 | 1.80 | 19917.00 | 15846.060000 | 35.285000 | ‘Australia and New Zealand’ | 7.284 | 1.33358 | Australia |
df.corr(method='pearson') # or df.corr(), which gives the same result since 'pearson' is the default method
 | adjusted_satisfaction | avg_satisfaction | std_satisfaction | avg_income | median_income | income_inequality | happyScore | GDP
---|---|---|---|---|---|---|---|---
adjusted_satisfaction | 1.000000 | 0.978067 | -0.527553 | 0.728006 | 0.704383 | -0.123835 | 0.901213 | 0.755578 |
avg_satisfaction | 0.978067 | 1.000000 | -0.341201 | 0.689043 | 0.661883 | -0.082471 | 0.885988 | 0.776679 |
std_satisfaction | -0.527553 | -0.341201 | 1.000000 | -0.478206 | -0.481429 | 0.221831 | -0.457896 | -0.242038 |
avg_income | 0.728006 | 0.689043 | -0.478206 | 1.000000 | 0.995605 | -0.382587 | 0.782122 | 0.814024 |
median_income | 0.704383 | 0.661883 | -0.481429 | 0.995605 | 1.000000 | -0.449053 | 0.760328 | 0.797905 |
income_inequality | -0.123835 | -0.082471 | 0.221831 | -0.382587 | -0.449053 | 1.000000 | -0.187222 | -0.303204 |
happyScore | 0.901213 | 0.885988 | -0.457896 | 0.782122 | 0.760328 | -0.187222 | 1.000000 | 0.790061 |
GDP | 0.755578 | 0.776679 | -0.242038 | 0.814024 | 0.797905 | -0.303204 | 0.790061 | 1.000000 |
df.corr(method='spearman') # correlation coefficient using the spearman method
 | adjusted_satisfaction | avg_satisfaction | std_satisfaction | avg_income | median_income | income_inequality | happyScore | GDP
---|---|---|---|---|---|---|---|---
adjusted_satisfaction | 1.000000 | 0.981629 | -0.497192 | 0.803010 | 0.779671 | -0.168049 | 0.900697 | 0.766098 |
avg_satisfaction | 0.981629 | 1.000000 | -0.354810 | 0.808310 | 0.782479 | -0.137139 | 0.893395 | 0.773521 |
std_satisfaction | -0.497192 | -0.354810 | 1.000000 | -0.317653 | -0.309697 | 0.182610 | -0.421175 | -0.275832 |
avg_income | 0.803010 | 0.808310 | -0.317653 | 1.000000 | 0.990839 | -0.356069 | 0.819542 | 0.960969 |
median_income | 0.779671 | 0.782479 | -0.309697 | 0.990839 | 1.000000 | -0.448926 | 0.806704 | 0.961583 |
income_inequality | -0.168049 | -0.137139 | 0.182610 | -0.356069 | -0.448926 | 1.000000 | -0.242107 | -0.409767 |
happyScore | 0.900697 | 0.893395 | -0.421175 | 0.819542 | 0.806704 | -0.242107 | 1.000000 | 0.793673 |
GDP | 0.766098 | 0.773521 | -0.275832 | 0.960969 | 0.961583 | -0.409767 | 0.793673 | 1.000000 |
df.corr(method='kendall') # correlation coefficient using kendall method
 | adjusted_satisfaction | avg_satisfaction | std_satisfaction | avg_income | median_income | income_inequality | happyScore | GDP
---|---|---|---|---|---|---|---|---
adjusted_satisfaction | 1.000000 | 0.905145 | -0.378239 | 0.614896 | 0.593379 | -0.124810 | 0.741682 | 0.581131 |
avg_satisfaction | 0.905145 | 1.000000 | -0.266347 | 0.618270 | 0.593810 | -0.104128 | 0.732966 | 0.591166 |
std_satisfaction | -0.378239 | -0.266347 | 1.000000 | -0.237797 | -0.233515 | 0.124672 | -0.320795 | -0.205190 |
avg_income | 0.614896 | 0.618270 | -0.237797 | 1.000000 | 0.929566 | -0.229994 | 0.622277 | 0.841441 |
median_income | 0.593379 | 0.593810 | -0.233515 | 0.929566 | 1.000000 | -0.299779 | 0.614087 | 0.847011 |
income_inequality | -0.124810 | -0.104128 | 0.124672 | -0.229994 | -0.299779 | 1.000000 | -0.166762 | -0.264067 |
happyScore | 0.741682 | 0.732966 | -0.320795 | 0.622277 | 0.614087 | -0.166762 | 1.000000 | 0.601310 |
GDP | 0.581131 | 0.591166 | -0.205190 | 0.841441 | 0.847011 | -0.264067 | 0.601310 | 1.000000 |
4. Chi-Square Test¶
The Chi-Square Test is a non-parametric test which tests the significance of the difference between observed frequencies and theoretical frequencies of a distribution, without any assumption about the distribution of the population. In simple words, the chi-square test is used to determine the difference between the expected data and the observed data
The chi-square test is also used to determine whether a built regression model is a good fit or not by assessing the train and test datasets. This test is used on categorical variables
The two Chi-Square tests that are mostly used are
- Independence : As the name suggests, it tests the dependence between two sets of variables
- Goodness of Fit : This chi-square test depicts whether a taken sample of data is a representative sample, i.e., one that fits the outcome expected from the population the data was taken from
The formula for the chi-square test is as follows
χ² = Σ ( (Oi - Ei)² / Ei )
where Oi is the observed frequency and Ei is the expected frequency of category i.
In the Chi-Square Test we assume two hypotheses:
Null Hypothesis (H0) : The 2 variables are independent
Alternate Hypothesis (HA) : The 2 variables are not independent
We perform the chi-square test and settle on one reasonable hypothesis as the solution. This is done by comparing the pvalue of the statistic obtained from the test (via the chi-square table) with the alpha value, i.e., 0.05.
If the pvalue is less than alpha, we reject the Null Hypothesis and accept the Alternate Hypothesis (HA). If the pvalue is greater than alpha, we accept the Null Hypothesis (H0).
The expected value for each cell of a contingency table is calculated as Ei = (Row Total * Column Total) / Total number of observations
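For the independence test on a contingency table, scipy.stats.chi2_contingency computes exactly these expected values and the resulting statistic. A minimal sketch on a hypothetical 2x2 table:
import numpy as np
from scipy.stats import chi2_contingency
observed = np.array([[20, 30], [25, 25]])  # hypothetical contingency table: rows are groups, columns are categories
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)
print(expected)  # each cell is (row total * column total) / total number of observations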
In Python, to perform the chi-square test we use the method chisquare, imported from scipy.stats.
scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0)
where,
f_obs - an array, with observed frequencies in each category
f_exp - an array, optional; the expected frequencies in each category. If no array is given, the categories are assumed to be equally likely
ddof - int, optional; stands for delta degrees of freedom, an adjustment to the degrees of freedom used for obtaining the p-value. The p-value is determined using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. By default the ddof value is 0
axis - int or None, optional; the axis along which to apply the test. If axis is None, all values in f_obs are treated as a single data set. The default value is 0
Returns,
chisq - float or ndarray; the value is a float if axis is None or if f_obs and f_exp are 1-dimensional
p-value - float or ndarray; the value is a float if ddof and the return value chisq are scalars
Importing Library¶
from scipy.stats import chisquare
With f_obs values¶
f_obs=[16, 12, 16, 18, 14, 12]
chisq, pvalue = chisquare(f_obs)  # the result unpacks into (statistic, pvalue)
print("chisquare statistic :",chisq)
print("p_value :",pvalue)
chisquare statistic : 2.0
p_value : 0.8491450360846096
# Since from the obtained result with pvalue (0.8) > alpha (0.05), we accept the Null hypothesis
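The statistic of 2.0 can also be verified by hand from the chi-square formula above: when f_exp is not given, every expected frequency is the mean of the observed ones. A small sketch:
import numpy as np
f_obs = np.array([16, 12, 16, 18, 14, 12])
f_exp = np.full_like(f_obs, f_obs.mean(), dtype=float)  # equally likely categories: each Ei = 14.666...
print(((f_obs - f_exp) ** 2 / f_exp).sum())  # 2.0, matching the chisquare() result above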
With f_exp and f_obs values¶
f_obs=[16, 12, 16, 18, 14, 12]
f_exp=[16, 8, 16, 16, 16, 16]
chisquare(f_obs,f_exp)
Power_divergenceResult(statistic=3.5, pvalue=0.6233876277495822)
With f_obs as 2d¶
f_obs=[[16, 12, 16, 18, 14, 12],[24, 12, 32, 16, 32, 12]] # The test is automatically applied to each column
chisquare(f_obs)
Power_divergenceResult(statistic=array([1.6 , 0. , 5.33333333, 0.11764706, 7.04347826, 0. ]), pvalue=array([0.20590321, 1. , 0.02092134, 0.73160059, 0.00795544, 1. ]))
# Here the pvalues below the alpha value (the 3rd and 5th columns) lead us to reject the Null hypothesis and accept the Alternate hypothesis (HA) for those columns; for the rest the Null hypothesis is accepted
With axis as None¶
f_obs=[[16, 12, 16, 18, 14, 12],[24, 12, 32, 16, 32, 12]] # The test is applied to the whole data
chisquare(f_obs, axis=None)
Power_divergenceResult(statistic=33.33333333333333, pvalue=0.0004645423926184954)
With axis as 1¶
f_obs=[16, 12, 16, 18, 14, 12]
f_exp=[[16, 12, 16, 18, 14, 12],[16, 8, 16, 16, 16, 16]]
chisquare(f_obs,f_exp,axis=1)
Power_divergenceResult(statistic=array([0. , 3.5]), pvalue=array([1. , 0.62338763]))
With ddof specified¶
f_obs=[16, 12, 16, 18, 14, 12]
chisquare(f_obs, ddof=1)
Power_divergenceResult(statistic=2.0, pvalue=0.7357588823428847)
chisquare(f_obs,ddof=[0,1])
Power_divergenceResult(statistic=2.0, pvalue=array([0.84914504, 0.73575888]))
chisquare(f_obs,ddof=[0,1,2])
Power_divergenceResult(statistic=2.0, pvalue=array([0.84914504, 0.73575888, 0.5724067 ]))
The above results with the different degrees of freedom generate pvalues higher than the alpha value, so we accept the Null Hypothesis (H0)