DATA EXPLORATION & PREPARATION IN PYTHON (MISCELLANEOUS METHODS)¶
Miscellaneous methods.¶
What is Miscellaneous?
Miscellaneous means combining or adding different types of values into a set. Something miscellaneous is defined by an odd set of things — things you might not expect to go together. A breakfast bar, a DVD, and a credit card bill are miscellaneous items in your backpack.
What is Data Exploration and Preparation in Data science?
Data Exploration and Preparation is a technique used for exploring data and cleaning the dataset. There are some standards for dataset how data should be before training the dataset into a machine learning algorithm. The three methods that are usually performed to treat the data are Consolidation of datasets, Missing value treatment, and Outlier treatment. We don’t really clean data every time before performing these techniques we should explore the data.
- For treating missing values we should first check whether a dataset contains missing values and then impute the missed value with mean, median, or mode, else remove missing values from the dataset the function to check missing values is dataset.isnull()
- For treating outliers in the dataset we should first plot a boxplot on the dataset to check the distribution of data and then treat the outliers in the dataset using
We will discuss more methods for cleaning the data in this blog.
1. Consolidation of datasets.¶
Consolidation of datasets is a preprocessing phase of data. where we perform operations like appending new data to the dataset also merging the dataset with other datasets. we perform this operation only when the data requires consolidation.
1.1 Appending¶
Appending is one of the easiest method to use when you want to add new values to the existing dataset. We can append data row-wise or column-wise based on the requirement.
1.1.1 Appending Datasets Row-wise¶
Concatenating two datasets row-wise where both datasets should have same variable names.
import pandas as pd
import numpy as np
import seaborn as sns
# creating a dataset mobile brands with price.
dict = {'brandName':['lenovo', 'apple', 'oneplus'],
'price':[10000, 50000, 40000]
}
# dataset1
mobiledata1 = pd.DataFrame(dict)
display(mobiledata1)
brandName | price | |
---|---|---|
0 | lenovo | 10000 |
1 | apple | 50000 |
2 | oneplus | 40000 |
# creating one more dataset similar to mobiles dataset
dict = {'brandName':['samsung', 'Vivo'],
'price':[30000, 20000],
}
# dataset2
mobiledata2 = pd.DataFrame(dict)
display(mobiledata2)
brandName | price | |
---|---|---|
0 | samsung | 30000 |
1 | Vivo | 20000 |
# Using pandas concat() we can append the two mobile datasets row-wise
appended_row = pd.concat([mobiledata1, mobiledata2], ignore_index = True)
appended_row.reset_index()
display(appended_row)
brandName | price | |
---|---|---|
0 | lenovo | 10000 |
1 | apple | 50000 |
2 | oneplus | 40000 |
3 | samsung | 30000 |
4 | Vivo | 20000 |
The above table displays the output after concating the two datasets row-wise.
1.1.2 Appending Datasets Column-wise¶
Concatenating two datasets column-wise.
# Using pandas concat() we can append the two mobile datasets column-wise, here we have to set the axis=1 to perform appending column-wise.
appended_column = pd.concat([mobiledata1, mobiledata2],axis=1)
appended_column.reset_index()
display(appended_column)
brandName | price | brandName | price | |
---|---|---|---|---|
0 | lenovo | 10000 | samsung | 30000.0 |
1 | apple | 50000 | Vivo | 20000.0 |
2 | oneplus | 40000 | NaN | NaN |
The above table displays the output after concating the two datasets column-wise.
1.2 Merging¶
Merging is a most used method when there is a need of combining similar records into a single record. This method also comes under joining of tables.
# Creating mobiles dataset with two columns brandName and Color.
dict = {'brandName':['samsung', 'Vivo'],
'color':['blue','yellow']
}
# dataset
mobiledata3 = pd.DataFrame(dict)
display(mobiledata3)
brandName | color | |
---|---|---|
0 | samsung | blue |
1 | Vivo | yellow |
# The pandas Merge() is used to merge two datasets using Mobiledata2 and Mobiledata3 datasets to merge using "brandName" column
Merge = pd.merge(mobiledata2,mobiledata3,on='brandName',sort=False)
Merge
brandName | price | color | |
---|---|---|---|
0 | samsung | 30000 | blue |
1 | Vivo | 20000 | yellow |
- we are combining the two datasets using “brandName” column to merge based on that column
1.3.1 Inner join¶
Inner join works in such a way that we will take two tables, it will test the data then match the values in both tables, and give the output that both tables have in common. Inner join is used as an operation in any FROM clause.
For example, with the departments and employees table to select all the employees in each department.
# Creating a Customer data and food items ordered.
dict = {'customerid':[100, 101,102,103,104,105],
'fooditem':['biryani','burger','momos','Dosa','Paratha','Lassi']
}
# dataset
CustFood = pd.DataFrame(dict)
display(CustFood)
customerid | fooditem | |
---|---|---|
0 | 100 | biryani |
1 | 101 | burger |
2 | 102 | momos |
3 | 103 | Dosa |
4 | 104 | Paratha |
5 | 105 | Lassi |
# Creating a Customer data and no.of food items ordered.
dict = {'customerid':[100, 101,102,109,111],
'no.of orders':[50,80,25,56,39]
}
# dataset
Custorders = pd.DataFrame(dict)
display(Custorders)
customerid | no.of orders | |
---|---|---|
0 | 100 | 50 |
1 | 101 | 80 |
2 | 102 | 25 |
3 | 109 | 56 |
4 | 111 | 39 |
# Using pandas merge() here we are extracting the values that satisfies inner join functionality
Innerjoin = pd.merge(CustFood,Custorders,how='inner',on='customerid')
Innerjoin
customerid | fooditem | no.of orders | |
---|---|---|---|
0 | 100 | biryani | 50 |
1 | 101 | burger | 80 |
2 | 102 | momos | 25 |
1.3.2 Outer Join¶
Outer joins result in the union of two tables which is used. The result of the outer join will contain the entirety of both two tables.
For example, A table with the name and IDs of students and another table with names and IDs of books which they are reading. Using the outer join, we can combine the name and IDs of the students with the names and IDs of the books they are reading.
# Using pandas merge() here we are extracting the values that satisfies outer join functionality
outerjoin = pd.merge(CustFood,Custorders,how='outer',on='customerid')
outerjoin
customerid | fooditem | no.of orders | |
---|---|---|---|
0 | 100 | biryani | 50.0 |
1 | 101 | burger | 80.0 |
2 | 102 | momos | 25.0 |
3 | 103 | Dosa | NaN |
4 | 104 | Paratha | NaN |
5 | 105 | Lassi | NaN |
6 | 109 | NaN | 56.0 |
7 | 111 | NaN | 39.0 |
Right Join¶
Right join considers righter part of the table and joins the merges those values which are primarily present in right table.
# Using pandas merge() here we are extracting the values that satisfies rght join functionality
right = pd.merge(CustFood,Custorders,how='right',on='customerid')
right
customerid | fooditem | no.of orders | |
---|---|---|---|
0 | 100 | biryani | 50 |
1 | 101 | burger | 80 |
2 | 102 | momos | 25 |
3 | 109 | NaN | 56 |
4 | 111 | NaN | 39 |
Left¶
left join considers left part of the table and joins the merges those values which are primarily present in left table.
# Using pandas merge() here we are extracting the values that satisfies left join functionality
left = pd.merge(CustFood,Custorders,how='left',on='customerid')
left
customerid | fooditem | no.of orders | |
---|---|---|---|
0 | 100 | biryani | 50.0 |
1 | 101 | burger | 80.0 |
2 | 102 | momos | 25.0 |
3 | 103 | Dosa | NaN |
4 | 104 | Paratha | NaN |
5 | 105 | Lassi | NaN |
2. Missing value treatment¶
In Data preprocessing missing value treatment is one of the most important step, we do this only when the data has missing values init. So, how to treat missing values either we impute the missing values with the mean,median or mode of that variable or else we will remove the missing values from the data.
# Creating mobile dataset.
mobiledata = pd.read_csv("Mobileorders.csv")
mobiledata.head()
Customerid | Mobile | no.of orders | |
---|---|---|---|
0 | 101 | Apple | 15.0 |
1 | 102 | Oneplus | 20.0 |
2 | 103 | Samsung | 15.0 |
3 | 104 | Vivo | NaN |
4 | 105 | oppo | 19.0 |
2.1 Checking Missing Values¶
Checking for Missing values is one of the most mandatory step while pre-procesing the data. To check this we will use isnull() function.
# Using isnull().sum() for finding the missing values. amd total no.of missing values present.
mobiledata.isnull().sum()
Customerid 0 Mobile 0 no.of orders 2 dtype: int64
- The above output displays no.of missing values present in the dataset. Only one column i.e “no.of orders” has two missing values.
2.2 Methods of Treating Missing Values¶
As we mentioned above treating of Outliers can be either of these two.
Removing missing values
Imputing missing values with Mean / Median / Mode of that variable.
2.2.1 Removing the missing values¶
It’s our decision to drop or remove the data, we drop the data only if the dataset is very large and when we have very less count of missing values. so, that will not effect our output.
- Using dropna() from pandas dropping the missing values from the data
# Dropping the null values / missing values
remove_values = mobiledata.dropna()
# Checking if the dataset has missing values
remove_values.isnull().sum( )
Customerid 0 Mobile 0 no.of orders 0 dtype: int64
2.2.2 Mean, Median,mode imputation¶
This is an other way to treat missing values by imputing the missing values with mean,median,mode values of that variable
Mean¶
It is the average value of the given set of data. It is known as the arithmetic average also it is the most used method in measure of central tendency
# Importing Mobileorders dataset
meanimpute = pd.read_csv("Mobileorders.csv")
# Checking for missing values
meanimpute.isnull().sum()
Customerid 0 Mobile 0 no.of orders 2 dtype: int64
- “no.of orders” feature has 2 missing values
meanimpute.head(25)
Customerid | Mobile | no.of orders | |
---|---|---|---|
0 | 101 | Apple | 15.0 |
1 | 102 | Oneplus | 20.0 |
2 | 103 | Samsung | 15.0 |
3 | 104 | Vivo | NaN |
4 | 105 | oppo | 19.0 |
5 | 106 | MI | 59.0 |
6 | 107 | Lenovo | NaN |
7 | 108 | Motorola | 23.0 |
8 | 109 | Apple | 34.0 |
9 | 110 | Oneplus | 57.0 |
10 | 111 | Samsung | 20.0 |
11 | 112 | oppo | 12.0 |
12 | 113 | vivo | 12.0 |
13 | 114 | MI | 35.0 |
14 | 115 | Lenovo | 76.0 |
15 | 116 | Motorola | 54.0 |
16 | 117 | Samsung | 87.0 |
17 | 118 | vivo | 76.0 |
18 | 119 | Oneplus | 45.0 |
19 | 120 | Apple | 89.0 |
20 | 121 | Vivo | 96.0 |
21 | 122 | Oneplus | 67.0 |
22 | 123 | oppo | 45.0 |
23 | 124 | Oneplus | 55.0 |
# Finding the mean of the "no.of orders" feature using mean()
meanimpute['no.of orders'].mean()
45.95454545454545
# Imputing the missing values with mean of the "no.of orders" column
meanimpute['no.of orders'].fillna(int(meanimpute['no.of orders'].mean()), inplace=True)
meanimpute.head(25)
Customerid | Mobile | no.of orders | |
---|---|---|---|
0 | 101 | Apple | 15.0 |
1 | 102 | Oneplus | 20.0 |
2 | 103 | Samsung | 15.0 |
3 | 104 | Vivo | 45.0 |
4 | 105 | oppo | 19.0 |
5 | 106 | MI | 59.0 |
6 | 107 | Lenovo | 45.0 |
7 | 108 | Motorola | 23.0 |
8 | 109 | Apple | 34.0 |
9 | 110 | Oneplus | 57.0 |
10 | 111 | Samsung | 20.0 |
11 | 112 | oppo | 12.0 |
12 | 113 | vivo | 12.0 |
13 | 114 | MI | 35.0 |
14 | 115 | Lenovo | 76.0 |
15 | 116 | Motorola | 54.0 |
16 | 117 | Samsung | 87.0 |
17 | 118 | vivo | 76.0 |
18 | 119 | Oneplus | 45.0 |
19 | 120 | Apple | 89.0 |
20 | 121 | Vivo | 96.0 |
21 | 122 | Oneplus | 67.0 |
22 | 123 | oppo | 45.0 |
23 | 124 | Oneplus | 55.0 |
# Checking for missing values
meanimpute.isnull().sum()
Customerid 0 Mobile 0 no.of orders 0 dtype: int64
Median¶
The median is the exact middle number in a set of data. The data should be sorted in ascending or descending order before finding the median value.
# Importing Mobileorders dataset
medianimpute = pd.read_csv("Mobileorders.csv")
# Finding the median of the "no.of orders" feature using median()
medianimpute['no.of orders'].median()
45.0
# Imputing the missing values with median of the "no.of orders" column
medianimpute['no.of orders'].fillna(int(medianimpute['no.of orders'].median()), inplace=True)
# Checking for missing values
medianimpute.isnull().sum()
Customerid 0 Mobile 0 no.of orders 0 dtype: int64
Mode¶
Mode is the most frequently repeated observation in a distribution. If all the numbers in the given data appear a single time, then there is no Mode.
# Importing Mobileproducts.csv dataset
modeimpute = pd.read_csv("Mobileproducts.csv")
# Checking for missing values
modeimpute.isnull().sum()
Customerid 0 Mobile 2 no.of orders 0 dtype: int64
- There are 2 missing values in the “Mobile” feature
# Finding the mode of the "no.of orders" feature using ode()
modeimpute['Mobile'].mode()
0 Oneplus Name: Mobile, dtype: object
# Imputing the missing values with median of the "no.of orders" column
mode_impute=modeimpute['Mobile'].fillna(modeimpute['Mobile'].mode)
# Checking for missing values
mode_impute.isnull().sum()
0
Outlier Treatment¶
It is important to check outliers in a dataset before training the data into model. This step comes under data preprocessing.
Outlier is the variation in the data, which means when some variable value differs from others in a particular data. The effects of an outlier on measures of central tendency are in such a way that it will cause a wrong analysis of the dataset. Outlier mostly, effects on mean other than the median and mode.
# Importing outliers dataset
outliers = pd.read_csv('LV_OUTLIERS.csv')
# Using boxplot() from seaborn finding the outliers
outliers.boxplot(column="Score",vert=False)
<AxesSubplot:>
Quartiles¶
Quartile method used to find out the interquartile range, which is measures variability around the median. It divides the data into lower quartiles, middle quartiles, and upper quartiles.
- Finding the 75 percentile and 25 percentile value from the datastet and storing it to the variables
# creating q75 and q25 variables to store the output
q75, q25 = np.percentile(outliers.Score, [75 ,25])
# Output of the 75 percentile value from the dataset
q75
30.0
# Output of the 25 percentile value from the dataset
q25
19.0
# Sbtracting the 75 perentile value with the 25 percentile value and storing the output into the iqr vaiable
iqr = q75-q25
iqr
11.0
# creating a variable called bench and multiplying the iqr value with 1.5
bench = q75 + (iqr*1.5)
bench
46.5
# Replacing the outliers with the bench value
outliers.loc[outliers.Score > 28, 'Score'] = bench
outliers
Studentid | Score | |
---|---|---|
0 | 101 | 10.0 |
1 | 102 | 20.0 |
2 | 103 | 19.0 |
3 | 104 | 25.0 |
4 | 105 | 46.5 |
5 | 106 | 46.5 |
6 | 107 | 46.5 |
7 | 108 | 9.0 |
8 | 109 | 46.5 |
# Plotiing boxplot() after treating the outliers
outliers.boxplot(column="Score",vert=False)
<AxesSubplot:>
Standard Deviation¶
Standard deviation in descriptive statistics is the degree of dispersion of the dataset related to its mean. It helps to compare the data with the same mean but a different range.
# Using standard deviation values we can cap the values and treat the outliers
bench = outliers['Score'].mean() + 3*(outliers['Score'].std())
bench
79.36489317379451