The various actions performed on data can be classified in Data Exploration and Preparation. One of these actions is merging data such as bi-variate or uni-variate analysis and missing value and outlier treatment.
We can perform various univariate and bi-variate analyses to assist us in exploring data.
Consolidation of the data is a crucial process in which different datasets are combined by appending, merging, etc.
Outlier Treatment is an additional crucial step in identifying outliers and then treating them. There are also extremely sophisticated methods for identifying outliers, which were studied in The Anomaly Detected (Section 3 modeling).
Values that are not present, as stated in the introduction of the section on theory, could be extremely harmful, and a variety of easy and sophisticated procedures are available to deal with the issue.
These methods need to be implemented in the data preparation for modeling. These methods are discussed in the subsequent blogs.
The appending method and merging are the two most common methods for consolidating data. When appending, various datasets are joined vertically. When merging, the data are combined horizontally. In this way, we can also recognize the different types of relationships between two datasets like One-to-1, One-to-Many Many-to-Many. When the relationships are established, the various ways of merging can be considered: the Inner Join and Left Join Full Join, Right Join, etc.
Different descriptive and inferential statistical techniques are employed to study the data and gain a greater understanding of the data. In univariate analysis, every aspect of the dataset is examined individually. Different descriptive statistics are employed to study a collection’s categorical and numerical aspects. Bi-variate analysis, however, deals with two features were combinations of features such as numerical-numerical, categorical-categorical, and numerical-categorical are analyzed and explored.
Outlier treatment is one of the most critical and challenging aspects of data processing because it could significantly affect the outputs generated by algorithms for learning. Methods of outlier treatment include deleting observations having outliers, identifying and replacing outliers using box-plots, quartile ranges, quantiles/percentiles, standard deviation, etc. Other methods to identify outliers involve clustering, as well as the many detection techniques for anomalies discussed in the Modeling section.
Different kinds of missing values can be discussed in this article. After identifying the missing value, various ways of treating them can be used to limit the negative effect that missing values can impact your model’s efficiency. The most popular methods are the treatment of missing values by eliminating observations or through means/median/mode imputation. Other advanced methods use predictive models like Linear/logistic or the most popular KNN.