This section explores basic statistics, such as Descriptive and Inferential Statistics, which are the fundamental building blocks for understanding how data analysis is carried out.
Let’s go back to the beginning and first understand what data means.
In the simplest terms, data is measured information; when it is stored and processed in computers, this “computer data” becomes the raw material for Data Analysis. At the most basic level, it consists of the binary digits 0 and 1, but today it takes many forms: text documents, images, videos, and software.
With the rapid advancement of computers’ processing and storage capabilities, the volume of data generated has become massive, and there is a constant need to bring order to the chaos. That is where Data Analysis comes in.
Before deciding what kinds of analysis can be conducted on data, it is crucial to know what kinds of data exist. In general, there are two types of data:
— Qualitative (Categorical)
— Quantitative (Numerical)
Qualitative Data, sometimes referred to as Categorical Data, is generally non-numeric: it is composed of words and cannot be quantified. Examples of qualitative data include gender, location, color, shape, etc.
Qualitative Data comprises three types: Binary, Nominal, and Ordinal.
Binary Data is a type of data that has only two distinct categories. A good example is the result of a coin toss, which can be either heads or tails.
Nominal Data is a type of Qualitative Data with no fixed limit on the number of categories (unlike Binary Data), though it contains at least two. The categories are mutually exclusive, and no category is superior to another, so they can be considered discrete. A classic example is colors, where, by definition, no color ranks above another. Keep in mind that Nominal Data may indeed be represented by numbers, e.g., 1 could be assigned to Red and 2 to Blue; however, these numbers are merely labels with no intrinsic value: 1 is not superior to 2, and no meaningful difference or “distance” can be read from them.
Ordinal Data is where categories are arranged in an organized, logical order, but the values carry no weight beyond that order. Examples include rankings such as the top 5 poorest countries, or clothing sizes (Small, Medium, Large, etc.). The distance between the values or intervals is not defined.
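As a small illustration of these three qualitative types, the sketch below uses Python’s pandas library (the color, coin, and size values are invented for the example): an unordered Categorical models nominal data, an ordered one models ordinal data, and binary data is just the two-category special case.

```python
import pandas as pd

# Nominal: labels only; any numbers assigned to them would be mere labels.
colors = pd.Categorical(["Red", "Blue", "Red", "Green"], ordered=False)

# Binary: simply nominal data with exactly two categories.
coin = pd.Categorical(["Heads", "Tails", "Heads"], ordered=False)

# Ordinal: the order is meaningful, but the distance between values is not.
sizes = pd.Categorical(["Small", "Large", "Medium"],
                       categories=["Small", "Medium", "Large"], ordered=True)

print(colors.categories.tolist())  # ['Blue', 'Green', 'Red'] -- no ranking implied
print(sizes.min(), sizes.max())    # Small Large -- comparisons make sense here
```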
Quantitative Data is numerical in nature; as the name suggests, it is data that can be quantified. Unlike qualitative data, it carries weight and conveys information about magnitude. Quantitative Data can be further subdivided into two categories: Interval and Ratio.
Interval Data is like Ordinal Data, with the key distinction that the intervals between values are equally spaced, so the difference between two values can be quantified with great accuracy. It lacks a true zero, however: temperature in degrees Celsius is a good example, since 0 °C is an arbitrary reference point rather than an absence of temperature.
Ratio Data has all the properties of Interval Data plus an absolute zero that marks the complete absence of the quantity. A person’s height in inches is a good example: a height of zero means no height at all, which is what makes ratios such as “twice as tall” meaningful.
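A tiny numeric illustration of the difference (the values are invented for the example): ratios are meaningful only when the data has a true zero.

```python
# Ratio data (true zero): "twice as tall" is a meaningful statement.
height_a, height_b = 180.0, 90.0   # heights in cm
print(height_a / height_b)         # 2.0 -- a meaningful ratio

# Interval data (arbitrary zero): differences are meaningful, ratios are not.
temp_a, temp_b = 20.0, 10.0        # degrees Celsius
print(temp_a - temp_b)             # 10.0 -- a meaningful difference
print(temp_a / temp_b)             # 2.0, but 20 C is not "twice as hot" as 10 C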
Quantitative Data can also be divided along another axis into Continuous and Discrete. Continuous Data can take fractional values and includes every value within its range, such as height, temperature, and so on.
Discrete Data cannot be split into arbitrary fractions; it is measured on a set of fixed values, such as the number of pupils in a classroom.
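For instance (the numbers below are invented for illustration):

```python
heights_cm = [172.4, 165.08, 180.1]   # continuous: any fraction in the range is possible
pupils_per_class = [28, 31, 25]       # discrete: only whole counts make sense
```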
To fully comprehend the types of analysis that can be performed on different data types, it is necessary to know what we mean by Statistics, Population, and Sample.
The first thought when you hear the word “population” is typically the number of people in a nation, while a sample is a small portion of that population chosen to represent the whole. If this is your working definition of Population and Sample, then you are not far from what these terms mean in Statistics.
In Statistics, a Population refers to the entirety of a specific group: every person or item that can be the subject of a statistical study. It is crucial to remember that a population does not have to be massive; it could be as small as two members, as long as it covers the entire group being studied. For instance, if we measure the length of every 1969 Chevrolet AstroVette, the population comprises only three cars, since only three were ever built, and it includes no other Chevrolet but this particular model. The population is the whole, and a Sample is simply a subset of that population. Various methods of analysis can be run on sample data, and the figures computed from a sample are referred to as Statistics, whereas the same figures computed from the entire population are called Parameters. For instance, a mean calculated over the whole population is a parameter, while a mean drawn from a sample is a statistic.
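Here is a minimal sketch of the parameter-versus-statistic distinction in Python; the population of 10,000 heights, its normal distribution, and the sample size are all assumptions made for the example.

```python
import random
import statistics

random.seed(42)
population = [random.gauss(170, 10) for _ in range(10_000)]  # invented heights (cm)
sample = random.sample(population, 100)                      # a simple random sample

parameter = statistics.mean(population)  # mean of the population: a parameter
statistic = statistics.mean(sample)      # mean of the sample: a statistic
print(f"parameter: {parameter:.2f}, statistic: {statistic:.2f}")
```

The two numbers will be close but rarely identical; much of inferential statistics is about quantifying how far a statistic can be expected to stray from the parameter it estimates.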
In most instances, it is impractical to study every member of the population, so different methods of selecting a suitable sample are employed, such as:
Simple Random Sampling: The word “random” means impartial: every member of the population has the same chance of being selected for the sample. This method is frequently employed in customer satisfaction studies. (A code sketch of this and the other random schemes appears after the list.)
Representative (Stratified) Sampling: This is also random, but the sample is built to mirror the patterns and proportions found in the actual population, so that it represents the larger group’s characteristics. One example is creating a representative sample of the people of Mumbai by randomly selecting 100 individuals while ensuring that 55 of them are male and 45 female, with the members of each gender chosen at random. In this way, the genders are represented in the same proportions as in the population (as per the Mumbai City District 2011 Census data).
Convenience Sampling: Here the sample is selected based on accessibility and people’s willingness to participate. We see this type of sampling daily, with company representatives handing out pamphlets and survey forms. It is important to understand that Convenience Sampling is not inherently a faulty or incorrect method of collecting samples; it is acceptable as long as it accurately represents the population of interest.
Cluster Sampling: This method is typically used in marketing studies or exit polls. The population is divided into subgroups (clusters) that vary internally yet resemble one another, and the survey is then run on randomly chosen units within those clusters. A good example is predicting Delhi’s election results by dividing Delhi into six zones, dividing each zone into three localities, and then randomly sampling two blocks from each locality.
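Below is a rough sketch, with invented populations throughout, of how the three random schemes above might be coded in Python. random.sample and pandas’ DataFrame.groupby(...).sample are standard calls, while the customer IDs, the 55/45 gender split, and the zone/locality/block structure (including the choice of two blocks per locality) are assumptions made to mirror the examples.

```python
import random
import pandas as pd

random.seed(0)

# --- Simple Random Sampling: every member equally likely to be chosen ---
customers = list(range(1, 1001))               # hypothetical customer IDs
simple_sample = random.sample(customers, 50)   # 50 drawn without replacement

# --- Representative (Stratified) Sampling: keep the population's proportions ---
people = pd.DataFrame({"gender": ["Male"] * 55 + ["Female"] * 45})
stratified = (people.groupby("gender", group_keys=False)
                    .sample(frac=0.2, random_state=1))  # 20% from each stratum

# --- Cluster Sampling: zones -> localities -> randomly chosen blocks ---
zones = {f"Zone-{z}": {f"Locality-{z}.{l}": [f"Block-{b}" for b in range(1, 11)]
                       for l in range(1, 4)}
         for z in range(1, 7)}
surveyed = {loc: random.sample(blocks, 2)      # two blocks per locality (assumed)
            for localities in zones.values()
            for loc, blocks in localities.items()}

print(len(simple_sample))                           # 50
print(stratified["gender"].value_counts().to_dict())  # {'Male': 11, 'Female': 9}
print(len(surveyed))                                # 18 localities surveyed
```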
With the groundwork above in place, we can dive into the realm of statistics and discuss the basics required to conduct any sophisticated data analysis.
THEORY
This section focuses on the equations and distributions involved in calculating the various statistics used to analyze data: how a given calculation is carried out, from computing an Arithmetic Mean to comparing the means of two data sets, what conclusions we can draw from it, and how it feeds into more advanced types of analysis. Writing a one-line shortcut that summarizes what a set of numbers means is easy enough; but to fully understand what we are doing and, more importantly, why we are doing it, it is essential to know the theory that underlies it.
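To make that point concrete, here is the arithmetic mean computed both ways in Python: once as the one-line shortcut and once spelled out to mirror the formula x̄ = (x₁ + x₂ + … + xₙ) / n (the six numbers are arbitrary example data).

```python
values = [4, 8, 15, 16, 23, 42]   # arbitrary example data

# The one-line shortcut:
mean_shortcut = sum(values) / len(values)

# The same calculation spelled out, term by term:
total = 0
for x in values:
    total += x
mean_manual = total / len(values)

assert mean_shortcut == mean_manual == 18.0
```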
APPLICATION
In the digital era, it is essential to put this formal knowledge to work on machines, which deliver results that are faster, more reliable, and more durable. Once we understand the theory behind the statistics, we can enlist computers and use them to their full potential: fundamental statistics apply to vast datasets whose calculations would be complicated and enormously time-consuming if done manually. In our modern world, it is vital to strike the right balance between the theory behind the scenes and an understanding of its application. This part of the article looks at programming languages such as R and Python to see how the fundamental statistics covered in the theory section can be applied to huge datasets using simple code.
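As a taste of what follows, here is a minimal sketch, assuming a hypothetical CSV file measurements.csv with numeric columns, of how two lines of Python (pandas) summarize a dataset that would take hours to process by hand.

```python
import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical large dataset
print(df.describe())                  # count, mean, std, min, quartiles, max per column
```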