Inferential Statistics¶
Inferential statistics is used for drawing inferences and making predictions about a population from a given sample of data. It uses probability to reach conclusions.
There are several methods for performing inferential statistics on data. In this blog we will discuss the Z-Score, Z-Test, T-Test, F-Test, Correlation Coefficients and the Chi-Square Test for analysing data and drawing a probable conclusion from it.
When do we use Inferential Statistics?
Inferential statistics is mainly used for drawing conclusions about data, where the data can be a sample or a set of features. When we build a model on a large amount of data, inferential statistics comes in handy for reasoning from such samples.
Contents¶
1. Z Scores, Z-Test
1.1 Z Value
1.2 Z-Test
1.3 Two-Sided One-Sample T-Test
1.4 Independent T-Test
1.5 Paired T-Test
2. F-Test
3. Correlation Coefficients
4. Chi-Square Test
1. Z scores, Z-Test¶
1.1 Z Value¶
The Z-Value (or Z-Score) tells how many standard deviations a value (x) lies below or above the population mean. If the Z value is positive, the value (x) is higher than the mean; if the Z value is negative, the value is lower than the mean.
Z-Score can be calculated as follows
z = (X – μ) / σ
where,
X : Single data value
μ : Mean value
σ : Standard Deviation
The z-score in Python can be calculated by using scipy.stats.zscore, such as scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')
where,
a : array_like
An array like object containing the sample data.
axis : int or None, optional
Axis along which to operate; default is 0. If None, compute over the whole array a.
ddof : int, optional
Degrees of freedom correction in the calculation of the standard deviation. Default value is 0 (zero).
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
This field defines how to handle input containing nan values.
The default value is 'propagate', which returns nan
The value 'raise' throws an error
The value 'omit' ignores nan values and performs the calculation
Note: When nan_policy is 'omit', the nan values in the input propagate to the output, but these nan values
do not affect the z-scores computed for the non-nan values
Ex: a = [0.8976, 0.9989, 0.5678, 0.1234, 0.7765, 1, 1.675, 1.456]
==> Mean (μ) = Sum of all the elements / N, where N = total number of elements
mean (μ) = (0.8976+0.9989+0.5678+0.1234+0.7765+1+1.675+1.456)/8 = 0.9369
==> Standard deviation (σ) = sqrt(Σ(X-μ)²/N), where X = element
standard deviation (σ) = sqrt(((0.8976-0.9369)²+(0.9989-0.9369)²+(0.5678-0.9369)²+(0.1234-0.9369)²+(0.7765-0.9369)²+(1-0.9369)²+(1.675-0.9369)²+(1.456-0.9369)²)/8) = 0.45378
==> Z-score (z) = (X-μ)/σ
z = [(0.8976-0.9369)/0.45378, (0.9989-0.9369)/0.45378, (0.5678-0.9369)/0.45378, (0.1234-0.9369)/0.45378, (0.7765-0.9369)/0.45378, (1-0.9369)/0.45378, (1.675-0.9369)/0.45378, (1.456-0.9369)/0.45378]
Result ==> z = [-0.0866, 0.1366, -0.8134, -1.7927, -0.3535, 0.1391, 1.6265, 1.1439]
Computing z-score using default values¶
import numpy as np
import scipy.stats as stats
a = np.array([0.8976,0.9989,0.5678,0.1234,0.7765,1,1.675,1.456])
stats.zscore(a)
array([-0.08660476, 0.13662837, -0.81337952, -1.79269639, -0.35347081, 0.13905242, 1.62653867, 1.14393202])
Computing z-score along specified axis using degrees of freedom¶
a = np.array([[0.1234,0.4567,0.7890,0.9876],
[0.6789,0.7890,0.9987,0.6657],
[0.2234,0.9987,0.3345,0.5567]])
stats.zscore(a,axis=1,ddof=1)
array([[-1.22576827, -0.3486311 ,  0.52587439,  1.04852498],
       [-0.67641081,  0.03847117,  1.40005837, -0.76211873],
       [-0.88942498,  1.37202025, -0.56536131,  0.08276604]])
Computing z-score using nan_policy¶
a = np.array([[0.1234,np.nan,0.7890,0.9876],
[0.6789,0.7890,0.9987,0.6657],
[np.nan,0.9987,0.3345,np.nan]])
stats.zscore(a,axis=1) # default value of nan_policy is propagate, which returns nan
array([[        nan,         nan,         nan,         nan],
       [-0.78105192,  0.04442268,  1.61664815, -0.88001891],
       [        nan,         nan,         nan,         nan]])
a = np.array([[0.1234,np.nan,0.7890,0.9876],
[0.6789,0.7890,0.9987,0.6657],
[np.nan,0.9987,0.3345,np.nan]])
# nan_policy='raise', throws error
stats.zscore(a,axis=1,nan_policy='raise')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-7d7bc298bb30> in <module>
      3                 [np.nan,0.9987,0.3345,np.nan]])
      4
----> 5 stats.zscore(a,axis=1,nan_policy='raise') # nan_policy='raise', throws error

~\anaconda3\lib\site-packages\scipy\stats\stats.py in zscore(a, axis, ddof, nan_policy)
   2545         return np.empty(a.shape)
   2546
-> 2547     contains_nan, nan_policy = _contains_nan(a, nan_policy)
   2548
   2549     if contains_nan and nan_policy == 'omit':

~\anaconda3\lib\site-packages\scipy\stats\stats.py in _contains_nan(a, nan_policy)
    237
    238     if contains_nan and nan_policy == 'raise':
--> 239         raise ValueError("The input contains nan values")
    240
    241     return contains_nan, nan_policy

ValueError: The input contains nan values
a = np.array([[0.1234,np.nan,0.7890,0.9876],
[0.6789,0.7890,0.9987,0.6657],
[np.nan,0.9987,0.3345,np.nan]])
stats.zscore(a,axis=1,nan_policy='omit') # nan_policy='omit', computes the z-score, ignoring all the nans
array([[-1.37976297,         nan,  0.4211984 ,  0.95856458],
       [-0.78105192,  0.04442268,  1.61664815, -0.88001891],
       [        nan,  1.        , -1.        ,         nan]])
1.2 Z-Test¶
The Z-Test is used to test a population mean or proportion. It can be used to test a given mean when the sample is large, meaning the length of the data is more than 30, and when the population standard deviation (and hence the variance) is known. This test is performed to check whether 2 sample means are approximately equal or not.
To perform the z-test, the samples should be taken at random from the population and the data should be normally distributed. If the sample size is larger than 30, the data is assumed to be approximately normally distributed; if the sample size is less than 30, the t-test is considered instead.
We check whether the obtained value is approximately equal to the hypothesized one by considering two hypotheses:
Null Hypothesis (H0) : the values are equal; if so, this hypothesis is accepted
Alternate Hypothesis (HA) : the values are not equal; if so, this hypothesis is accepted
This Z-test is calculated by using the formula
z = (x̄ - μ) / (σ / √n)
where x̄ is the sample mean, μ is the hypothesized population mean, σ is the population standard deviation and n is the sample size.
After performing the z-test, the pvalue obtained should be compared with the alpha value, which is assumed to be 0.05.
- If the pvalue is less than the alpha value, the Null Hypothesis is rejected, which means the Alternate Hypothesis is considered. In other words, the means of the two samples are not equal
- If the pvalue is greater than the alpha value, the Null Hypothesis is accepted. In other words, the means or averages of the two samples are equal
In Python, we calculate the z-test by using the ztest method from statsmodels.stats.weightstats
statsmodels.stats.weightstats.ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1.0)
where,
x1, x2 : arrays
value : float
In the one sample case, value is the mean of x1 under the Null hypothesis. In the two sample case, value is the difference between the mean of x1 and the mean of x2 under the Null hypothesis. The test statistic is x1_mean - x2_mean - value.
alternative : str
The alternative hypothesis, H1, has to be one of the following (default is 'two-sided'):
'two-sided' : H1: difference in means not equal to value
'larger' : H1: difference in means larger than value
'smaller' : H1: difference in means smaller than value
usevar : str, 'pooled'
Currently, only 'pooled' is implemented. If pooled, the standard deviation of the samples is assumed to be the same. See CompareMeans.ztest_ind for different options.
ddof : int
Degrees of freedom used in the calculation of the variance of the mean estimate. In the case of comparing means this is one; however, it can be adjusted for testing other statistics (proportion, correlation).
Returns,
tstat : float
test statistic
pvalue : float
pvalue of the t-test
Example¶
import numpy as np
import pandas as pd
from numpy.random import randn
from statsmodels.stats.weightstats import ztest
x1 = [20, 30, 40, 50, 10, 20]
z = ztest(x1, value=25) # where value is the hypothesized mean under the Null hypothesis
z
(0.5547001962252289, 0.5790997419539189)
The first value in the above result is the test statistic and the other value is the pvalue. From the output we can see that the pvalue is about 0.58, which is greater than the alpha value of 0.05. Hence we conclude that the Null Hypothesis is accepted, which means that the mean of the given data and the assumed mean are approximately equal.
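As a sanity check, the statistic above can be reproduced by hand, since ztest computes z = (x̄ - μ) / (s/√n) using the sample standard deviation (its default ddof=1.0) and a two-sided p-value from the normal distribution. A minimal sketch, assuming only numpy and scipy:
import numpy as np
from scipy.stats import norm
x1 = np.array([20, 30, 40, 50, 10, 20])
z = (x1.mean() - 25) / (x1.std(ddof=1) / np.sqrt(len(x1)))  # (sample mean - hypothesized mean) / standard error
p = 2 * norm.sf(abs(z))  # two-sided p-value from the normal distribution
print(z, p)  # ~0.5547 and ~0.5791, matching ztest(x1, value=25) above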
x1 = [20, 30, 40, 50, 10, 20]
x2 = [11, 12, 13, 14, 15, 16]
z=ztest(x1, x2, value= 0, alternative = 'larger')
z
(2.448717008689441, 0.007168301924196878)
From the above output, we can understand that the Null Hypothesis is rejected and Alternate Hypothesis is accepted as the pvalue is less than the alpha value
T-Test¶
T-test, also known as Student’s T-test, is used to determine the difference between two groups of variables by comparing their mean values or averages. The T-test not only determines the difference but also the significance of that difference. In other words, this test explains whether the difference between the variable groups occurred by chance or is relevant to the data taken.
The 3 types of T-test are
1. Independent T-test
2. Paired Sample T-test
3. One-Sample T-test
One-Sample T-test : A one-sample T-test is a t-test where one group’s mean or average is compared with one significant value, which is the hypothesized mean of the population
Types of One-Sample T-test are
1. Two tailed One-Sample T-test
2. One tailed One-Sample T-test, which can be either
- Upper tailed One-Sample T-test, or
- Lower tailed One-Sample T-test
1.3 Two Sided One-Sample T-test¶
A Two Sided One-Sample T-test, or Two Tailed One-Sample T-test, checks whether the mean of a single sample differs from a hypothesized population mean in either direction: the Null Hypothesis (H0) is that the sample mean equals the hypothesized mean, while the Alternate Hypothesis (HA) is that they differ. A sketch of this test is shown below.
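In Python this test can be performed with scipy.stats.ttest_1samp, which is two-sided by default. A minimal sketch, using hypothetical sample values and a hypothesized population mean of 25:
import numpy as np
import scipy.stats as stats
a = np.array([20, 30, 40, 50, 10, 20])  # hypothetical sample data
stats.ttest_1samp(a, popmean=25)  # returns the t statistic and the two-sided pvalue
As with the other tests, if the returned pvalue is less than the alpha value (0.05) the Null Hypothesis is rejected, otherwise it is accepted.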
1.4 Independent t-Test¶
Independent T-test, also known as the Two Sample T-test, is used to test whether the means of the 2 given groups are equal or not. By default, this Independent T-test assumes that the two populations have equal variances.
In Python, we can perform this test
- Using the scipy library
- Using statsmodels
Using scipy library¶
scipy.stats.ttest_ind(a, b, axis=0, equal_var=True)
where,
a, b : the two arrays of the 2 groups
axis : int or None, optional
Axis along which to compute the test. If None, compute over the whole arrays, a and b.
equal_var : bool, optional
If True (default), perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variance.
Returns,
statistic : float or array
The calculated t-statistic.
pvalue : float or array
The p-value.
Example¶
For example, we are given 2 different groups of data: Bag-A has a bunch of apples and Bag-B has a bunch of mangoes. We need to check whether both bags have the same averages or means.
For this we assume 2 hypotheses, the null hypothesis and the alternate hypothesis:
H0 -> The means of the two bags are equal
HA -> The means of the two bags are not equal
To find which hypothesis holds, we compare the alpha value, which is assumed to be 0.05, with the pvalue obtained after performing the t-test. If the pvalue is less than the alpha value, then HA is considered to be true. If the pvalue is greater than the alpha value, then H0 is considered to be true.
Let's test the above example with a t-test using the scipy library
import scipy.stats as stats
a = np.array([5,6,7,8,2,3,4,5])
b = np.array([12,13,14,15,16,2,3,4])
stats.ttest_ind(a, b, equal_var=True) # Assuming that the 2 groups have equal variance
Ttest_indResult(statistic=-2.2331335038240865, pvalue=0.042379219768910015)
stats.ttest_ind(a, b, equal_var=False) # Assuming that the 2 groups don't have equal variance (Welch's t-test)
Ttest_indResult(statistic=-2.2331335038240865, pvalue=0.05369587840008499)
Interpreting the results above with an alpha value of 0.05: the equal-variance test returns a pvalue (about 0.042) that is less than alpha, so the means of the 2 bags (groups) are not supposed to be equal. Welch's test, however, returns a pvalue (about 0.054) slightly greater than alpha, under which the Null Hypothesis would narrowly be accepted. The conclusion here thus depends on the equal-variance assumption.
Using statsmodels¶
statsmodels.stats.weightstats.ttest_ind(x1, x2)
where,
x1 and x2 are two array groups
Returns:
tstat : float
test statistic
pvalue : float
pvalue of the t-test
df : int or float
degrees of freedom used in the t-test
from statsmodels.stats.weightstats import ttest_ind
a = np.array([12,14,16,4,5,11,12,11])
b = np.array([12,13,14,15,16,2,3,4])
ttest_ind(a,b)
(0.2963188789948769, 0.7713367820262194, 14.0)
Interpreting the result from the above test, assuming that the alpha value is 0.05, the resulting pvalue is about 0.77, which is greater than the assumed alpha value. Hence it can be said that the assumption H0 is true, i.e., the means of the 2 groups are equal
1.5 Paired t-Test¶
A Paired t-Test explains the difference between two variables for the same subject: it compares one set of measurements with a second set from the same sample. This test is also known as the Dependent Sample T-test.
In simple words, this T-test measures the difference between the averages or means of two related groups. Like the other tests, it assumes 2 hypotheses:
Null Hypothesis (H0) : The difference between the two means of the two groups is zero
Alternate Hypothesis (HA) : The difference between the two means of the two groups is not equal to zero
In Python, this Paired T-test can be calculated by using the ttest_rel() method defined in the scipy.stats library
scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate', alternative='two-sided')
where,
a, b : array_like
axis : int or None, optional
Axis along which to compute test. If None, compute over the whole arrays, a, and b.
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
Defines how to handle when input contains nan. The following options are available (default is ‘propagate’):
‘propagate’: returns nan
‘raise’: throws an error
‘omit’: performs the calculations ignoring nan values
alternative : {‘two-sided’, ‘less’, ‘greater’}, optional
Defines the alternative hypothesis. The following options are available (default is ‘two-sided’):
‘two-sided’: the means of the distributions underlying the samples are unequal.
‘less’: the mean of the distribution underlying the first sample is less than the mean of the distribution underlying the second sample.
‘greater’: the mean of the distribution underlying the first sample is greater than the mean of the distribution underlying the second sample.
Returns
statistic : float or array
t-statistic.
pvalue : float or array
The p-value.
import scipy.stats as stats
a = np.array([12, 14, 16, 4, 5, 11, 12, 11])
b = np.array([12, 13, 14, 15, 16, 2, 3, 4])
stats.ttest_rel(a,b)
Ttest_relResult(statistic=0.26355219111613715, pvalue=0.7997147761519707)
From the above result, the assumed alpha value of 0.05 is less than the obtained pvalue. Hence we can accept the Null Hypothesis H0, saying that the difference between the two means of the two groups is zero
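A paired t-test is equivalent to a one-sample t-test on the per-subject differences against a mean of zero; a small sketch verifying this on the same arrays (reusing a, b and scipy.stats as imported above):
stats.ttest_1samp(a - b, 0)  # same statistic and pvalue as stats.ttest_rel(a, b) above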
2. F-Test¶
F-Test can be applied to test for a significant difference between the variances of two populations, based on small samples drawn from those populations. The test based on this statistic is known as the F-Test.
Simply said, the F-test compares the variances of 2 samples by performing division. The result of the f-test is always positive, because variances are always positive. Assuming the two sample standard deviations are s1 and s2, the formula is F = s1^2/s2^2.
The Hypotheses for the F-test are defined as:
Null Hypothesis (H0) : The variances of the two variables are equal and there is no significant difference
Alternate Hypothesis (HA) : The variances of the two variables are not equal
The F-Statistic, also known as the F-Value, is used in Analysis of Variance (ANOVA) and in regression models to find the significance between the means of populations by comparing variances. The F-Statistic is what an F-test evaluates. The F-test is similar to the T-test, except that in an F-test we check for significance among a group of variables, whereas in a T-test we check for significance between 2 variables. In this form, the F-test is used to check the similarity among the means of different groups.
For an F-test to be conducted we need to assume
- that the data taken is normally distributed
- that the larger sample variance is placed in the numerator and the smaller in the denominator (see the sketch after this list)
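scipy does not expose this classical two-sample variance F-test as a single function (scipy.stats.bartlett and scipy.stats.levene are related equal-variance tests), so the ratio can be computed directly from the formula above. A minimal sketch, assuming normally distributed samples and hypothetical array values:
import numpy as np
import scipy.stats as stats
a = np.array([12, 14, 16, 4, 5, 11, 12, 11])  # hypothetical sample 1
b = np.array([12, 13, 14, 15, 16, 2, 3, 4])  # hypothetical sample 2
F = np.var(a, ddof=1) / np.var(b, ddof=1)  # F = s1^2 / s2^2, the ratio of sample variances
dfn, dfd = len(a) - 1, len(b) - 1  # numerator and denominator degrees of freedom
p = 2 * min(stats.f.sf(F, dfn, dfd), stats.f.cdf(F, dfn, dfd))  # two-sided p-value from the F distribution
print(F, p)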
There are many statistics in which F-Statistic is used, but mostly used F-test is Analysis Of Variance (ANOVA)
ANOVA : Analysis Of Variance, called ANOVA for short, is used to test for differences among the means of two or more groups. ANOVA uses the F-Statistic to compare the means of 2 or more groups
The Hypothesis here is taken as,
Null Hypothesis (H0) : The group means are equal
Alternate Hypothesis (H1) : The group means are not equal
There are different types of ANOVA such as,
- One-way ANOVA
- Two-way ANOVA
- Factorial ANOVA
- Repeated Measures ANOVA
- MANOVA etc.,
The most used ANOVA is the One-way ANOVA, which compares the means of groups formed by a single independent variable, to check whether the groups are alike or not
This test is performed by using a method f_oneway from scipy.stats as
scipy.stats.f_oneway(*samples, axis=0)
where,
samples can be any number of groups or array like variables
axis defines the axis along which the test is to be performed; by default it is set to zero, and it is optional
returns,
statistic : float
The computed F statistic of the test.
pvalue : float
The associated p-value from the F distribution.
As per this, if the pvalue is less than the alpha value (0.05) then the Null Hypothesis is rejected, and if the pvalue is higher than the alpha value then the Null Hypothesis is accepted.
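Before the larger simulated example below, here is a minimal sketch of f_oneway on three small hypothetical groups, just to show the call and the shape of its result:
import scipy.stats as stats
g1 = [4, 5, 6, 5, 4]  # hypothetical measurements for group 1
g2 = [6, 7, 8, 7, 6]  # group 2
g3 = [5, 6, 7, 6, 5]  # group 3
stats.f_oneway(g1, g2, g3)  # returns F_onewayResult(statistic=..., pvalue=...)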
# Importing Libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
# Creating a dataset
cities = ["punjab","delhi","hyderabad","bangalore","mumbai"]
people_of_spec_city = np.random.choice(a= cities, p = [0.05, 0.15 ,0.25, 0.05, 0.5], size=1000)
# np.random.choice, returns some random values from the given value a, with the probabilities mentioned in p of size given
people_of_spec_city
array(['mumbai', 'hyderabad', 'mumbai', 'hyderabad', 'mumbai', 'mumbai',
       'delhi', 'hyderabad', 'mumbai', 'mumbai', 'mumbai', 'hyderabad',
       'bangalore', 'hyderabad', 'delhi', 'hyderabad', 'mumbai', 'mumbai',
       ...,
       'mumbai', 'punjab', 'hyderabad', 'mumbai', 'mumbai', 'mumbai',
       'mumbai', 'mumbai', 'delhi', 'hyderabad', 'delhi', 'mumbai'],
      dtype='<U9')
population_of_spec_city = stats.poisson.rvs(loc=18, mu=30, size= 1000)
# stats.poisson.rvs generates random numbers from a Poisson distribution, where mu is the distribution's
# mean (its shape parameter), loc shifts the generated values, and size gives the number of values
population_of_spec_city
array([61, 43, 54, 46, 55, 52, 43, 48, 42, 52, 55, 50, 39, 50, 54, 41, 42,
       51, 48, 58, 50, 49, 43, 48, 44, 49, 48, 49, 44, 47, 48, 58, 51, 39,
       ...,
       54, 48, 45, 60, 43, 46, 57, 54, 48, 45, 49, 56, 44], dtype=int64)
# Forming the DataFrame from the obtained values
population_frame = pd.DataFrame({"city":people_of_spec_city,"population":population_of_spec_city})
# Grouping the row indices by the categorical city variable
groups = population_frame.groupby("city").groups
groups
{'bangalore': [12, 33, 45, 75, 96, 103, 105, 134, 147, 148, 180, 182, 184, 187, 202, 209, 222, 227, 302, 338, 344, 375, 387, 418, 422, 428, 438, 472, 486, 488, 563, 566, 570, 573, 574, 588, 605, 663, 699, 707, 711, 716, 741, 753, 758, 783, 787, 827, 897, 903, 931, 978, 979], 'delhi': [6, 14, 29, 32, 57, 58, 59, 61, 65, 93, 110, 114, 118, 119, 121, 122, 126, 129, 131, 155, 159, 163, 186, 189, 193, 196, 197, 205, 220, 225, 229, 230, 231, 237, 240, 249, 254, 268, 271, 274, 287, 290, 297, 303, 310, 318, 323, 326, 349, 361, 368, 378, 383, 391, 403, 404, 410, 411, 412, 413, 416, 423, 426, 431, 439, 441, 446, 447, 450, 463, 485, 492, 493, 495, 496, 515, 518, 530, 537, 539, 541, 542, 544, 553, 554, 562, 571, 581, 584, 596, 612, 623, 626, 630, 634, 637, 638, 659, 667, 668, ...], 'hyderabad': [1, 3, 7, 11, 13, 15, 25, 31, 34, 36, 38, 43, 46, 47, 50, 60, 67, 72, 73, 77, 83, 86, 87, 90, 101, 106, 109, 116, 117, 127, 136, 138, 139, 142, 150, 151, 153, 158, 161, 166, 168, 171, 172, 176, 178, 179, 183, 192, 194, 195, 200, 210, 211, 217, 219, 223, 226, 233, 238, 244, 247, 252, 255, 258, 259, 267, 269, 272, 273, 276, 277, 282, 283, 284, 294, 298, 299, 301, 305, 307, 311, 312, 313, 314, 315, 316, 322, 329, 331, 334, 335, 336, 337, 340, 342, 347, 351, 352, 357, 364, ...], 'mumbai': [0, 2, 4, 5, 8, 9, 10, 16, 17, 18, 19, 20, 21, 22, 23, 26, 27, 28, 35, 37, 39, 40, 41, 42, 44, 48, 49, 51, 52, 53, 54, 55, 56, 62, 63, 64, 66, 68, 69, 70, 71, 74, 76, 78, 79, 80, 81, 82, 84, 85, 89, 91, 92, 94, 95, 97, 98, 99, 100, 102, 104, 107, 108, 111, 112, 113, 115, 120, 123, 124, 125, 130, 132, 133, 135, 137, 140, 141, 143, 144, 149, 152, 154, 156, 157, 160, 164, 165, 167, 169, 170, 173, 174, 177, 185, 188, 190, 191, 198, 199, ...], 'punjab': [24, 30, 88, 128, 145, 146, 162, 175, 181, 243, 261, 292, 343, 346, 350, 360, 386, 393, 432, 464, 465, 514, 529, 625, 628, 635, 641, 643, 680, 687, 730, 845, 880, 884, 919, 933, 950, 957, 961, 964, 976, 989]}
# Extract individual groups into respective variables
punjab = population_of_spec_city [groups["punjab"]]
bangalore = population_of_spec_city[groups["bangalore"]]
delhi = population_of_spec_city[groups["delhi"]]
hyderabad = population_of_spec_city[groups["hyderabad"]]
mumbai = population_of_spec_city[groups["mumbai"]]
# Now calculate the one-way anova test for the obtained individual groups
stats.f_oneway(punjab, bangalore, delhi, hyderabad, mumbai)
F_onewayResult(statistic=0.9110431706569894, pvalue=0.45674036540270235)
From the above result we can see that the pvalue (about 0.46) is greater than the alpha value (0.05). Hence we can say that there is no significant difference among the means of the different groups, they are almost equal, and the Null Hypothesis is accepted
3. Correlation coefficients¶
The correlation coefficient is a statistical measure that shows the degree to which changes in the value of one variable predict changes in the value of another. The letter r is used to represent the correlation coefficient, and r is a unit-free value between -1 and 1.
The correlation coefficient measures the relatedness of the data; the strength of that relationship is obtained by using the correlation coefficient formulas. The value ranges from -1 to 1, where -1 represents a perfect negative relationship, 1 represents a perfect positive relationship, and 0 (zero) represents no relationship
Let’s suppose, there are two variables x and y, for which the correlation coefficient need to be found
If the value of y goes up when the value of x goes up, which means x is directly proportional to y, then the correlation coefficient between x and y comes out positive, approaching 1
If the value of y goes down whenever the value of x goes up, or vice versa, which means x is inversely proportional to y, then the correlation coefficient between x and y comes out negative, approaching -1
If changes in y are unrelated to changes in the other variable, in our case x, then the correlation coefficient between these 2 variables comes out to be 0 (zero)
For Example,
Positive Correlation : If the quantity of milk increases, the price also increases Negative Correlation : If the price of a stock goes down, then the buying of that stock increases Zero Correlation : There is no relationship between score in video games and grades of an examination
Before we dig into how the correlation among 2 or more variables is calculated, it is necessary to understand a term called covariance.
So, what is covariance? Covariance is a term used to describe the linear relationship between 2 variables. If the covariance is positive, the variables tend to change in the same direction. If the covariance is negative, the variables tend to move in opposite directions
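The link between covariance and the correlation coefficient can be seen numerically: Pearson's r (discussed next) is just the covariance rescaled by the two standard deviations. A small sketch with hypothetical arrays:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
cov_xy = np.cov(x, y)[0, 1]  # sample covariance of x and y
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # rescaling the covariance gives Pearson's r
print(cov_xy, r)
print(np.corrcoef(x, y)[0, 1])  # the same r, computed directly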
There are many types of correlation coefficients. But here we will discuss some of the important correlation coefficients that are widely being used
1. Pearson’s r
Pearson's r, also known as Pearson's product moment correlation coefficient, is used for describing the strength of the linear relationship between 2 variables
This correlation coefficient is used when the data follows a normal distribution, has no outliers, is not skewed, and when you expect a linear relationship between the 2 variables.
The Pearson's r formula is as follows
r = Σ((x - x̄)(y - ȳ)) / sqrt(Σ(x - x̄)² * Σ(y - ȳ)²)
where x̄ and ȳ are the means of the x and y values.
The strength of the correlation is commonly read as
- weak positive correlation for 0<r<0.3, weak negative correlation for -0.3<r<0
- moderate correlation for 0.3<|r|<0.5 (positive or negative)
- strong positive correlation for 0.5<r<1, strong negative correlation for -1<r<-0.5
- no correlation for r=0
2. Spearman’s rho
Spearman’s rho, also known as Spearman’s rank correlation coefficient, is used as an alternative to Pearson’s correlation coefficient. It is a rank correlation coefficient, as it uses the rankings of each variable (say, lowest to highest) rather than the raw values to determine the strength of the relationship. Unlike Pearson’s r, Spearman’s rho captures monotonic relationships, which may be non-linear. Spearman’s rho formula (for distinct ranks) is as follows
ρ = 1 - (6 * Σd²) / (n(n² - 1))
where d is the difference between the two ranks of each observation and n is the number of observations.
3. Kendall’s tau
Kendall’s tau is used to calculate the correlation coefficient when the 2 variables may be continuous variables with outliers, or ordinal, but exhibit a monotonic relationship. Spearman’s rho and Kendall’s tau are broadly similar, but Kendall’s tau is often preferred when the sample is small or contains many tied ranks.
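All three coefficients are also available as standalone functions in scipy.stats, each returning the coefficient together with a pvalue; a quick sketch on hypothetical paired data:
import numpy as np
import scipy.stats as stats
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 1, 4, 3, 7, 8])
print(stats.pearsonr(x, y))  # Pearson's r and its pvalue
print(stats.spearmanr(x, y))  # Spearman's rho and its pvalue
print(stats.kendalltau(x, y))  # Kendall's tau and its pvalue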
These 3 correlation coefficients can be calculated in Python by using the pandas function
DataFrame.corr(method='pearson', min_periods=1)
Parameters:
method : {'pearson', 'kendall', 'spearman'}, default 'pearson'
Calculating correlation coefficient using pandas dataframe¶
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/leena.ganta/Desktop/DataVedas/happyscore_income.csv',index_col=0)
df.head()
country | adjusted_satisfaction | avg_satisfaction | std_satisfaction | avg_income | median_income | income_inequality | region | happyScore | GDP | country.1
---|---|---|---|---|---|---|---|---|---|---
Armenia | 37 | 4.9 | 2.42 | 2096.76 | 1731.506667 | 31.445556 | ‘Central and Eastern Europe’ | 4.350 | 0.76821 | Armenia |
Angola | 26 | 4.3 | 3.19 | 1448.88 | 1044.240000 | 42.720000 | ‘Sub-Saharan Africa’ | 4.033 | 0.75778 | Angola |
Argentina | 60 | 7.1 | 1.91 | 7101.12 | 5109.400000 | 45.475556 | ‘Latin America and Caribbean’ | 6.574 | 1.05351 | Argentina |
Austria | 59 | 7.2 | 2.11 | 19457.04 | 16879.620000 | 30.296250 | ‘Western Europe’ | 7.200 | 1.33723 | Austria |
Australia | 65 | 7.6 | 1.80 | 19917.00 | 15846.060000 | 35.285000 | ‘Australia and New Zealand’ | 7.284 | 1.33358 | Australia |
df.corr(method='pearson') # or df.corr(), which gives the same result since 'pearson' is the default method
 | adjusted_satisfaction | avg_satisfaction | std_satisfaction | avg_income | median_income | income_inequality | happyScore | GDP
---|---|---|---|---|---|---|---|---
adjusted_satisfaction | 1.000000 | 0.978067 | -0.527553 | 0.728006 | 0.704383 | -0.123835 | 0.901213 | 0.755578 |
avg_satisfaction | 0.978067 | 1.000000 | -0.341201 | 0.689043 | 0.661883 | -0.082471 | 0.885988 | 0.776679 |
std_satisfaction | -0.527553 | -0.341201 | 1.000000 | -0.478206 | -0.481429 | 0.221831 | -0.457896 | -0.242038 |
avg_income | 0.728006 | 0.689043 | -0.478206 | 1.000000 | 0.995605 | -0.382587 | 0.782122 | 0.814024 |
median_income | 0.704383 | 0.661883 | -0.481429 | 0.995605 | 1.000000 | -0.449053 | 0.760328 | 0.797905 |
income_inequality | -0.123835 | -0.082471 | 0.221831 | -0.382587 | -0.449053 | 1.000000 | -0.187222 | -0.303204 |
happyScore | 0.901213 | 0.885988 | -0.457896 | 0.782122 | 0.760328 | -0.187222 | 1.000000 | 0.790061 |
GDP | 0.755578 | 0.776679 | -0.242038 | 0.814024 | 0.797905 | -0.303204 | 0.790061 | 1.000000 |
df.corr(method='spearman') # correlation coefficient using the spearman method
 | adjusted_satisfaction | avg_satisfaction | std_satisfaction | avg_income | median_income | income_inequality | happyScore | GDP
---|---|---|---|---|---|---|---|---
adjusted_satisfaction | 1.000000 | 0.981629 | -0.497192 | 0.803010 | 0.779671 | -0.168049 | 0.900697 | 0.766098 |
avg_satisfaction | 0.981629 | 1.000000 | -0.354810 | 0.808310 | 0.782479 | -0.137139 | 0.893395 | 0.773521 |
std_satisfaction | -0.497192 | -0.354810 | 1.000000 | -0.317653 | -0.309697 | 0.182610 | -0.421175 | -0.275832 |
avg_income | 0.803010 | 0.808310 | -0.317653 | 1.000000 | 0.990839 | -0.356069 | 0.819542 | 0.960969 |
median_income | 0.779671 | 0.782479 | -0.309697 | 0.990839 | 1.000000 | -0.448926 | 0.806704 | 0.961583 |
income_inequality | -0.168049 | -0.137139 | 0.182610 | -0.356069 | -0.448926 | 1.000000 | -0.242107 | -0.409767 |
happyScore | 0.900697 | 0.893395 | -0.421175 | 0.819542 | 0.806704 | -0.242107 | 1.000000 | 0.793673 |
GDP | 0.766098 | 0.773521 | -0.275832 | 0.960969 | 0.961583 | -0.409767 | 0.793673 | 1.000000 |
df.corr(method='kendall') # correlation coefficient using kendall method
 | adjusted_satisfaction | avg_satisfaction | std_satisfaction | avg_income | median_income | income_inequality | happyScore | GDP
---|---|---|---|---|---|---|---|---
adjusted_satisfaction | 1.000000 | 0.905145 | -0.378239 | 0.614896 | 0.593379 | -0.124810 | 0.741682 | 0.581131 |
avg_satisfaction | 0.905145 | 1.000000 | -0.266347 | 0.618270 | 0.593810 | -0.104128 | 0.732966 | 0.591166 |
std_satisfaction | -0.378239 | -0.266347 | 1.000000 | -0.237797 | -0.233515 | 0.124672 | -0.320795 | -0.205190 |
avg_income | 0.614896 | 0.618270 | -0.237797 | 1.000000 | 0.929566 | -0.229994 | 0.622277 | 0.841441 |
median_income | 0.593379 | 0.593810 | -0.233515 | 0.929566 | 1.000000 | -0.299779 | 0.614087 | 0.847011 |
income_inequality | -0.124810 | -0.104128 | 0.124672 | -0.229994 | -0.299779 | 1.000000 | -0.166762 | -0.264067 |
happyScore | 0.741682 | 0.732966 | -0.320795 | 0.622277 | 0.614087 | -0.166762 | 1.000000 | 0.601310 |
GDP | 0.581131 | 0.591166 | -0.205190 | 0.841441 | 0.847011 | -0.264067 | 0.601310 | 1.000000 |
4. Chi-Square Test¶
The Chi-Square Test is a non-parametric test which tests the significance of the difference between observed frequencies and theoretical frequencies of a distribution, without any assumption about the distribution of the population. In simple words, the chi-square test is used to determine the difference between the expected data and the observed data
The chi-square test is also used to determine whether a built regression model is a good fit or not by assessing the train and test datasets. This test is used on categorical variables
The two Chi-Square tests that are mostly used are
- Independence : As the name suggests, it tests the dependence between two sets of variables
- Goodness of Fit : This chi-square test depicts whether a taken sample of data is a representative sample, i.e., one that fits the outcome expected from the population the data was taken from
The formula for the chi-square test is as follows
χ² = Σ ( (Oi - Ei)² / Ei )
where Oi is the observed frequency and Ei is the expected frequency of category i.
In the Chi-Square Test we assume two hypotheses:
Null Hypothesis (H0) : The 2 variables are independent
Alternate Hypothesis (HA) : The 2 variables are not independent
We perform the chi-square test and settle on one reasonable hypothesis as the solution. This is done by comparing the pvalue of the statistic obtained from the test (via the chi-square table) with the alpha value, i.e., 0.05.
If the pvalue is less than alpha, we reject the Null Hypothesis and accept the Alternate Hypothesis (HA). If the pvalue is greater than alpha, we accept the Null Hypothesis (H0).
The expected value for each cell of a contingency table is calculated as Ei = (Row Total * Column Total) / Total number of observations
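For the independence test on a contingency table, scipy.stats.chi2_contingency computes exactly these expected values and the resulting statistic. A minimal sketch on a hypothetical 2x2 table:
import numpy as np
from scipy.stats import chi2_contingency
observed = np.array([[20, 30], [25, 25]])  # hypothetical contingency table: rows are groups, columns are categories
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)
print(expected)  # each cell is (row total * column total) / total number of observations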
In Python, to perform the chi-square test we use the method chisquare, imported from scipy.stats.
scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0)
where,
f_obs - an array, with observed frequencies in each category
f_exp - an array, optional; the expected frequencies in each category. If no array is given, the categories are assumed to be equally likely
ddof - int, optional; stands for delta degrees of freedom, an adjustment to the degrees of freedom used for obtaining the p-value. The p-value is determined using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. By default the ddof value is 0
axis - int or None, optional; the axis along which to apply the test. If axis is None, all values in f_obs are treated as a single data set. The default value is 0
Returns,
chisq - float or ndarray; the value is a float if axis is None or if f_obs and f_exp are 1-dimensional
p-value - float or ndarray; the value is a float if ddof and the return value chisq are scalars
Importing Library¶
from scipy.stats import chisquare
With f_obs values¶
f_obs=[16, 12, 16, 18, 14, 12]
chisq, pvalue = chisquare(f_obs)  # the result unpacks into (statistic, pvalue)
print("chisquare statistic :",chisq)
print("p_value :",pvalue)
chisquare statistic : 2.0
p_value : 0.8491450360846096
# Since from the obtained result with pvalue (0.8) > alpha (0.05), we accept the Null hypothesis
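The statistic of 2.0 can also be verified by hand from the chi-square formula above: when f_exp is not given, every expected frequency is the mean of the observed ones. A small sketch:
import numpy as np
f_obs = np.array([16, 12, 16, 18, 14, 12])
f_exp = np.full_like(f_obs, f_obs.mean(), dtype=float)  # equally likely categories: each Ei = 14.666...
print(((f_obs - f_exp) ** 2 / f_exp).sum())  # 2.0, matching the chisquare() result above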
With f_exp and f_obs values¶
f_obs=[16, 12, 16, 18, 14, 12]
f_exp=[16, 8, 16, 16, 16, 16]
chisquare(f_obs,f_exp)
Power_divergenceResult(statistic=3.5, pvalue=0.6233876277495822)
With f_obs as 2d¶
f_obs=[[16, 12, 16, 18, 14, 12],[24, 12, 32, 16, 32, 12]] # The test is automatically applied to each column
chisquare(f_obs)
Power_divergenceResult(statistic=array([1.6 , 0. , 5.33333333, 0.11764706, 7.04347826, 0. ]), pvalue=array([0.20590321, 1. , 0.02092134, 0.73160059, 0.00795544, 1. ]))
# Here the pvalues below the alpha value (the 3rd and 5th columns) lead us to reject the Null hypothesis and accept the Alternate hypothesis (HA) for those columns; for the rest the Null hypothesis is accepted
With axis as None¶
f_obs=[[16, 12, 16, 18, 14, 12],[24, 12, 32, 16, 32, 12]] # The test is applied to the whole data
chisquare(f_obs, axis=None)
Power_divergenceResult(statistic=33.33333333333333, pvalue=0.0004645423926184954)
With axis as 1¶
f_obs=[16, 12, 16, 18, 14, 12]
f_exp=[[16, 12, 16, 18, 14, 12],[16, 8, 16, 16, 16, 16]]
chisquare(f_obs,f_exp,axis=1)
Power_divergenceResult(statistic=array([0. , 3.5]), pvalue=array([1. , 0.62338763]))
With ddof specified¶
f_obs=[16, 12, 16, 18, 14, 12]
chisquare(f_obs, ddof=1)
Power_divergenceResult(statistic=2.0, pvalue=0.7357588823428847)
chisquare(f_obs,ddof=[0,1])
Power_divergenceResult(statistic=2.0, pvalue=array([0.84914504, 0.73575888]))
chisquare(f_obs,ddof=[0,1,2])
Power_divergenceResult(statistic=2.0, pvalue=array([0.84914504, 0.73575888, 0.5724067 ]))
The above results with the different degrees of freedom generate pvalues higher than the alpha value, so we accept the Null Hypothesis (H0)