Miscellaneous means combining or adding different types of values into a set. Something miscellaneous is defined by an odd set of things — things you might not expect to go together. A breakfast bar, a DVD, and a credit card bill are miscellaneous items in your backpack.

What is Data Exploration and Preparation in Data science?

Data Exploration and Preparation is a technique used for exploring data and cleaning the dataset. There are some standards for dataset how data should be before training the dataset into a machine learning algorithm. The three methods that are usually performed to treat the data are Consolidation of datasets, Missing value treatment, and Outlier treatment. We don’t really clean data every time before performing these techniques we should explore the data.

For treating missing values we should first check whether a dataset contains missing values and then impute the missed value with mean, median, or mode, else remove missing values from the dataset the function to check missing values is dataset.isnull()

For treating outliers in the dataset we should first plot a boxplot on the dataset to check the distribution of data and then treat the outliers in the dataset using

We will discuss more methods for cleaning the data in this blog.

1. Consolidation of datasets.¶

Consolidation of datasets is a preprocessing phase of data. where we perform operations like appending new data to the dataset also merging the dataset with other datasets. we perform this operation only when the data requires consolidation.

1.1 Appending¶

Appending is one of the easiest method to use when you want to add new values to the existing dataset. We can append data row-wise or column-wise based on the requirement.

1.1.1 Appending Datasets Row-wise¶

Concatenating two datasets row-wise where both datasets should have same variable names.

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns

In [2]:

# creating a dataset mobile brands with price.

dict = {'brandName':['lenovo', 'apple', 'oneplus'],
        'price':[10000, 50000, 40000]
       }
  
# dataset1
mobiledata1 = pd.DataFrame(dict)
display(mobiledata1)

	brandName	price
0	lenovo	10000
1	apple	50000
2	oneplus	40000

In [3]:

# creating one more dataset similar to mobiles dataset 
dict = {'brandName':['samsung', 'Vivo'],
        'price':[30000, 20000],
   }
  
# dataset2
mobiledata2 = pd.DataFrame(dict)
display(mobiledata2)

	brandName	price
0	samsung	30000
1	Vivo	20000

In [4]:

# Using pandas concat() we can append the two mobile datasets row-wise 
appended_row = pd.concat([mobiledata1, mobiledata2], ignore_index = True)
appended_row.reset_index()
  
display(appended_row)

	brandName	price
0	lenovo	10000
1	apple	50000
2	oneplus	40000
3	samsung	30000
4	Vivo	20000

The above table displays the output after concating the two datasets row-wise.

1.1.2 Appending Datasets Column-wise¶

Concatenating two datasets column-wise.

In [5]:

# Using pandas concat() we can append the two mobile datasets column-wise, here we have to set the axis=1 to perform appending column-wise. 

appended_column = pd.concat([mobiledata1, mobiledata2],axis=1)
appended_column.reset_index()
  
display(appended_column)

	brandName	price	brandName	price
0	lenovo	10000	samsung	30000.0
1	apple	50000	Vivo	20000.0
2	oneplus	40000	NaN	NaN

The above table displays the output after concating the two datasets column-wise.

1.2 Merging¶

Merging is a most used method when there is a need of combining similar records into a single record. This method also comes under joining of tables.

In [6]:

# Creating mobiles dataset with two columns brandName and Color.

dict = {'brandName':['samsung', 'Vivo'],
        'color':['blue','yellow']
   }
  
# dataset
mobiledata3 = pd.DataFrame(dict)
display(mobiledata3)

	brandName	color
0	samsung	blue
1	Vivo	yellow

In [7]:

# The pandas Merge() is used to merge two datasets using Mobiledata2 and Mobiledata3 datasets to merge using "brandName" column
Merge = pd.merge(mobiledata2,mobiledata3,on='brandName',sort=False)
Merge

Out[7]:

	brandName	price	color
0	samsung	30000	blue
1	Vivo	20000	yellow

we are combining the two datasets using “brandName” column to merge based on that column

1.3.1 Inner join¶

Inner join works in such a way that we will take two tables, it will test the data then match the values in both tables, and give the output that both tables have in common. Inner join is used as an operation in any FROM clause.

For example, with the departments and employees table to select all the employees in each department.

In [8]:

# Creating a Customer data and food items ordered.
dict = {'customerid':[100, 101,102,103,104,105],
        'fooditem':['biryani','burger','momos','Dosa','Paratha','Lassi']
   }
  
# dataset
CustFood = pd.DataFrame(dict)
display(CustFood)

	customerid	fooditem
0	100	biryani
1	101	burger
2	102	momos
3	103	Dosa
4	104	Paratha
5	105	Lassi

In [9]:

# Creating a Customer data and no.of food items ordered.

dict = {'customerid':[100, 101,102,109,111],
        'no.of orders':[50,80,25,56,39]
   }
  
# dataset
Custorders = pd.DataFrame(dict)
display(Custorders)

	customerid	no.of orders
0	100	50
1	101	80
2	102	25
3	109	56
4	111	39

In [10]:

# Using pandas merge() here we are extracting the values that satisfies inner join functionality
Innerjoin = pd.merge(CustFood,Custorders,how='inner',on='customerid')
Innerjoin

Out[10]:

	customerid	fooditem	no.of orders
0	100	biryani	50
1	101	burger	80
2	102	momos	25

1.3.2 Outer Join¶

Outer joins result in the union of two tables which is used. The result of the outer join will contain the entirety of both two tables.

For example, A table with the name and IDs of students and another table with names and IDs of books which they are reading. Using the outer join, we can combine the name and IDs of the students with the names and IDs of the books they are reading.

In [11]:

# Using pandas merge() here we are extracting the values that satisfies outer join functionality


outerjoin = pd.merge(CustFood,Custorders,how='outer',on='customerid')
outerjoin

Out[11]:

	customerid	fooditem	no.of orders
0	100	biryani	50.0
1	101	burger	80.0
2	102	momos	25.0
3	103	Dosa	NaN
4	104	Paratha	NaN
5	105	Lassi	NaN
6	109	NaN	56.0
7	111	NaN	39.0

Right Join¶

Right join considers righter part of the table and joins the merges those values which are primarily present in right table.

In [12]:

# Using pandas merge() here we are extracting the values that satisfies rght join functionality

right = pd.merge(CustFood,Custorders,how='right',on='customerid')
right

Out[12]:

	customerid	fooditem	no.of orders
0	100	biryani	50
1	101	burger	80
2	102	momos	25
3	109	NaN	56
4	111	NaN	39

Left¶

left join considers left part of the table and joins the merges those values which are primarily present in left table.

In [13]:

# Using pandas merge() here we are extracting the values that satisfies left join functionality

left = pd.merge(CustFood,Custorders,how='left',on='customerid')
left

Out[13]:

	customerid	fooditem	no.of orders
0	100	biryani	50.0
1	101	burger	80.0
2	102	momos	25.0
3	103	Dosa	NaN
4	104	Paratha	NaN
5	105	Lassi	NaN

2. Missing value treatment¶

In Data preprocessing missing value treatment is one of the most important step, we do this only when the data has missing values init. So, how to treat missing values either we impute the missing values with the mean,median or mode of that variable or else we will remove the missing values from the data.

In [14]:

# Creating mobile dataset.

mobiledata = pd.read_csv("Mobileorders.csv")

In [15]:

mobiledata.head()

Out[15]:

	Customerid	Mobile	no.of orders
0	101	Apple	15.0
1	102	Oneplus	20.0
2	103	Samsung	15.0
3	104	Vivo	NaN
4	105	oppo	19.0

2.1 Checking Missing Values¶

Checking for Missing values is one of the most mandatory step while pre-procesing the data. To check this we will use isnull() function.

In [16]:

# Using isnull().sum() for finding the missing values. amd total no.of missing values present.

mobiledata.isnull().sum()

Out[16]:

Customerid      0
Mobile          0
no.of orders    2
dtype: int64

The above output displays no.of missing values present in the dataset. Only one column i.e “no.of orders” has two missing values.

2.2 Methods of Treating Missing Values¶

As we mentioned above treating of Outliers can be either of these two.

Removing missing values
Imputing missing values with Mean / Median / Mode of that variable.

2.2.1 Removing the missing values¶

It’s our decision to drop or remove the data, we drop the data only if the dataset is very large and when we have very less count of missing values. so, that will not effect our output.

Using dropna() from pandas dropping the missing values from the data

In [17]:

# Dropping the null values / missing values

remove_values = mobiledata.dropna()

In [18]:

# Checking if the dataset has missing values
remove_values.isnull().sum( )

Out[18]:

Customerid      0
Mobile          0
no.of orders    0
dtype: int64

2.2.2 Mean, Median,mode imputation¶

This is an other way to treat missing values by imputing the missing values with mean,median,mode values of that variable

Mean¶

It is the average value of the given set of data. It is known as the arithmetic average also it is the most used method in measure of central tendency

In [19]:

# Importing Mobileorders dataset
meanimpute =  pd.read_csv("Mobileorders.csv")

In [20]:

# Checking for missing values

meanimpute.isnull().sum()

Out[20]:

Customerid      0
Mobile          0
no.of orders    2
dtype: int64

“no.of orders” feature has 2 missing values

In [21]:

meanimpute.head(25)

Out[21]:

	Customerid	Mobile	no.of orders
0	101	Apple	15.0
1	102	Oneplus	20.0
2	103	Samsung	15.0
3	104	Vivo	NaN
4	105	oppo	19.0
5	106	MI	59.0
6	107	Lenovo	NaN
7	108	Motorola	23.0
8	109	Apple	34.0
9	110	Oneplus	57.0
10	111	Samsung	20.0
11	112	oppo	12.0
12	113	vivo	12.0
13	114	MI	35.0
14	115	Lenovo	76.0
15	116	Motorola	54.0
16	117	Samsung	87.0
17	118	vivo	76.0
18	119	Oneplus	45.0
19	120	Apple	89.0
20	121	Vivo	96.0
21	122	Oneplus	67.0
22	123	oppo	45.0
23	124	Oneplus	55.0

In [22]:

# Finding the mean of the "no.of orders" feature using mean()
meanimpute['no.of orders'].mean()

Out[22]:

45.95454545454545

In [23]:

# Imputing the missing values with mean of the "no.of orders" column

meanimpute['no.of orders'].fillna(int(meanimpute['no.of orders'].mean()), inplace=True)

In [24]:

meanimpute.head(25)

Out[24]:

	Customerid	Mobile	no.of orders
0	101	Apple	15.0
1	102	Oneplus	20.0
2	103	Samsung	15.0
3	104	Vivo	45.0
4	105	oppo	19.0
5	106	MI	59.0
6	107	Lenovo	45.0
7	108	Motorola	23.0
8	109	Apple	34.0
9	110	Oneplus	57.0
10	111	Samsung	20.0
11	112	oppo	12.0
12	113	vivo	12.0
13	114	MI	35.0
14	115	Lenovo	76.0
15	116	Motorola	54.0
16	117	Samsung	87.0
17	118	vivo	76.0
18	119	Oneplus	45.0
19	120	Apple	89.0
20	121	Vivo	96.0
21	122	Oneplus	67.0
22	123	oppo	45.0
23	124	Oneplus	55.0

In [25]:

# Checking for missing values

meanimpute.isnull().sum()

Out[25]:

Customerid      0
Mobile          0
no.of orders    0
dtype: int64

Median¶

The median is the exact middle number in a set of data. The data should be sorted in ascending or descending order before finding the median value.

In [26]:

# Importing Mobileorders dataset

medianimpute =  pd.read_csv("Mobileorders.csv")

In [27]:

# Finding the median of the "no.of orders" feature using median()

medianimpute['no.of orders'].median()

Out[27]:

45.0

In [28]:

# Imputing the missing values with median of the "no.of orders" column

medianimpute['no.of orders'].fillna(int(medianimpute['no.of orders'].median()), inplace=True)

In [29]:

# Checking for missing values

medianimpute.isnull().sum()

Out[29]:

Customerid      0
Mobile          0
no.of orders    0
dtype: int64

Mode¶

Mode is the most frequently repeated observation in a distribution. If all the numbers in the given data appear a single time, then there is no Mode.

In [30]:

# Importing Mobileproducts.csv dataset

modeimpute =  pd.read_csv("Mobileproducts.csv")

In [31]:

# Checking for missing values

modeimpute.isnull().sum()

Out[31]:

Customerid      0
Mobile          2
no.of orders    0
dtype: int64

There are 2 missing values in the “Mobile” feature

In [32]:

# Finding the mode of the "no.of orders" feature using ode()

modeimpute['Mobile'].mode()

Out[32]:

0    Oneplus
Name: Mobile, dtype: object

In [33]:

# Imputing the missing values with median of the "no.of orders" column

mode_impute=modeimpute['Mobile'].fillna(modeimpute['Mobile'].mode)

In [45]:

# Checking for missing values

mode_impute.isnull().sum()

Out[45]:

Outlier Treatment¶

It is important to check outliers in a dataset before training the data into model. This step comes under data preprocessing.

Outlier is the variation in the data, which means when some variable value differs from others in a particular data. The effects of an outlier on measures of central tendency are in such a way that it will cause a wrong analysis of the dataset. Outlier mostly, effects on mean other than the median and mode.

In [35]:

# Importing outliers dataset

outliers = pd.read_csv('LV_OUTLIERS.csv')

In [36]:

# Using boxplot() from seaborn finding the outliers

outliers.boxplot(column="Score",vert=False)

Out[36]:

<AxesSubplot:>

Quartiles¶

Quartile method used to find out the interquartile range, which is measures variability around the median. It divides the data into lower quartiles, middle quartiles, and upper quartiles.

Finding the 75 percentile and 25 percentile value from the datastet and storing it to the variables

In [37]:

# creating q75 and q25 variables to store the output

q75, q25 = np.percentile(outliers.Score, [75 ,25])

In [38]:

# Output of the 75 percentile value from the dataset

q75

Out[38]:

30.0

In [39]:

# Output of the 25 percentile value from the dataset

q25

Out[39]:

19.0

In [40]:

# Sbtracting the 75 perentile value with the 25 percentile value and storing the output into the iqr vaiable

iqr = q75-q25
iqr

Out[40]:

11.0

In [41]:

# creating a variable called bench and multiplying the iqr value with 1.5

bench = q75 + (iqr*1.5)
bench

Out[41]:

46.5

In [46]:

# Replacing the outliers with the bench value

outliers.loc[outliers.Score > 28, 'Score'] = bench
outliers

Out[46]:

	Studentid	Score
0	101	10.0
1	102	20.0
2	103	19.0
3	104	25.0
4	105	46.5
5	106	46.5
6	107	46.5
7	108	9.0
8	109	46.5

In [43]:

# Plotiing boxplot() after treating the outliers

outliers.boxplot(column="Score",vert=False)

Out[43]:

<AxesSubplot:>

Standard Deviation¶

Standard deviation in descriptive statistics is the degree of dispersion of the dataset related to its mean. It helps to compare the data with the same mean but a different range.

In [44]:

# Using standard deviation values we can cap the values and treat the outliers

bench = outliers['Score'].mean() + 3*(outliers['Score'].std())
bench

Out[44]:

79.36489317379451

In [ ]:

Miscellaneous Methods In Python

DATA EXPLORATION & PREPARATION IN PYTHON (MISCELLANEOUS METHODS)¶

Miscellaneous methods.¶

1. Consolidation of datasets.¶

1.1 Appending¶

1.1.1 Appending Datasets Row-wise¶

1.1.2 Appending Datasets Column-wise¶

1.2 Merging¶

1.3.1 Inner join¶

1.3.2 Outer Join¶

Right Join¶

Left¶

2. Missing value treatment¶

2.1 Checking Missing Values¶

2.2 Methods of Treating Missing Values¶

2.2.1 Removing the missing values¶

2.2.2 Mean, Median,mode imputation¶

Mean¶

Median¶

Mode¶

Outlier Treatment¶

Quartiles¶

Standard Deviation¶

Leave a Reply Cancel reply

USEFUL LINK

POLICIES

CONTACT INFORMATION

DATA EXPLORATION & PREPARATION IN PYTHON (MISCELLANEOUS METHODS)¶

Miscellaneous methods.¶

1. Consolidation of datasets.¶

1.1 Appending¶

1.1.1 Appending Datasets Row-wise¶

1.1.2 Appending Datasets Column-wise¶

1.2 Merging¶

1.3.1 Inner join¶

1.3.2 Outer Join¶

Right Join¶

Left¶

2. Missing value treatment¶

2.1 Checking Missing Values¶

2.2 Methods of Treating Missing Values¶

2.2.1 Removing the missing values¶

2.2.2 Mean, Median,mode imputation¶

Mean¶

Median¶

Mode¶

Outlier Treatment¶

Quartiles¶

Standard Deviation¶

You Might Also Like

MEASURES OF CENTRAL TENDENCY

WRAPPER METHODS

ANOVA REPEATED MEASURES

Leave a Reply Cancel reply

USEFUL LINK

POLICIES

CONTACT INFORMATION