Data is no less than an asset in today’s world. But—
Can we really use this abundant data in its raw form for training machine learning algorithms?
Well, not exactly.
Data in the real world is often dirty: it is corrupted by inconsistencies, noise, incomplete information, and missing values. It is aggregated from diverse sources using data mining and warehousing techniques.
A common rule of thumb in machine learning is that the more data we have, the better the models we can train.
In this article, we will discuss all Data Preprocessing steps one needs to follow to convert raw data into the processed form.
Data Preprocessing includes the steps we need to follow to transform or encode data so that it may be easily parsed by the machine.
For a model to be accurate and precise in its predictions, the algorithm must be able to interpret the data's features easily.
The majority of real-world datasets are highly susceptible to missing, inconsistent, and noisy data due to their heterogeneous origin.
Applying data mining algorithms to this noisy data would not give quality results, as they would fail to identify patterns effectively. Data Preprocessing is, therefore, important to improve the overall data quality.
Quality decisions must be based on quality data. Data Preprocessing is important to get this quality data, without which it would just be a Garbage In, Garbage Out scenario.
Individual independent variables that operate as an input in our machine learning model are referred to as features. They can be thought of as representations or attributes that describe the data and help the models to predict the classes/labels.
For example, in a structured dataset such as a CSV file, each column is a feature: a measurable piece of data that can be used for analysis, such as Name, Age, Sex, or Fare.
Now, let's discuss the four main stages of data preprocessing in more depth.
Data Cleaning is the part of data preprocessing that cleans the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers.
Here are a few ways to handle missing values:
Ignore the tuples. This method should be considered when the dataset is huge and numerous missing values are present within a tuple.
Fill in the missing values. There are many ways to achieve this, such as filling in the values manually, predicting the missing values with a regression method, or using a numerical summary such as the attribute mean.
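As a minimal sketch of the two approaches above, the snippet below drops tuples that have too many missing values and then fills the remaining gaps with the attribute mean. The column names, the sample rows, and the `max_missing` threshold are assumptions made for illustration.

```python
from statistics import mean

def drop_sparse_rows(rows, max_missing=1):
    """Keep only rows (dicts) with at most `max_missing` missing (None) values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def fill_with_mean(rows, column):
    """Return new rows where None in `column` is replaced by the mean of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    col_mean = mean(observed)
    return [{**r, column: col_mean if r[column] is None else r[column]} for r in rows]

# Hypothetical sample data: None marks a missing value.
rows = [
    {"age": 22, "fare": 7.25},
    {"age": None, "fare": 71.0},
    {"age": 26, "fare": 8.0},
    {"age": None, "fare": None},  # too many gaps, will be ignored
]

kept = drop_sparse_rows(rows, max_missing=1)
filled = fill_with_mean(kept, "age")
print(filled)
```

Mean imputation is only one option; a median or a regression-based prediction may suit skewed or strongly correlated attributes better.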
Handling noisy data involves removing random error or variance in a measured variable. It can be done with the help of the following techniques:
Binning is a technique that works on sorted data values to smooth out any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All values in a bin can be replaced by the bin's mean, median, or boundary values.
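Smoothing by bin means can be sketched as follows. "Equal-sized" is read here as equal-frequency bins (the same number of values per bin), which is an assumption; the sample values are made up.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size` items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
# Three bins of three values each; the bin means are 9, 22 and 29.
print(smooth_by_bin_means(prices, bin_size=3))
```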
Regression is a data mining technique generally used for prediction. It helps to smooth out noise by fitting the data points to a regression function. Linear regression is used if there is only one independent attribute; otherwise, multiple regression is used.
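For the single-attribute case, a minimal sketch is an ordinary least-squares line fit, with each noisy value replaced by the fitted one. The data points below are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
# Smoothed values lie exactly on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept, smoothed)
```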
Clustering creates groups/clusters from data having similar values. The values that don't lie in any cluster can be treated as noisy data and removed.
Outliers can be removed in the same way: clustering techniques group together similar data points, and the tuples that lie outside the clusters are treated as outliers or inconsistent data.
Data Integration is one of the data preprocessing steps, used to merge the data present in multiple sources into a single larger data store such as a data warehouse.
Data Integration is needed especially when we are aiming to solve a real-world problem such as detecting nodules in CT scan images. Often the only option is to integrate the images from multiple medical nodes to form a larger database.
We might run into some issues while adopting Data Integration as one of the Data Preprocessing steps:
Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data using the Data Transformation strategies mentioned below.
Generalization converts low-level or granular data into high-level information by using concept hierarchies. For example, we can transform a primitive value in an address, such as the city, into higher-level information such as the country.
Normalization is one of the most widely used Data Transformation techniques. Numerical attributes are scaled up or down to fit within a specified range, so that attributes measured on very different scales become comparable. Normalization can be done in multiple ways, which are highlighted here:
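Two widely used variants, min-max scaling and z-score standardization, can be sketched as below. The choice of these two variants and the sample income values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant attribute: every value equals the mean
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [20_000, 35_000, 50_000, 80_000]
print(min_max_scale(incomes))
print(z_score(incomes))
```

Min-max scaling is sensitive to outliers because the extremes define the range; z-scores are usually the safer choice when a few values are far from the rest.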
New properties of the data are created from existing attributes to help in the data mining process. For example, a date of birth attribute can be transformed into another property such as is_senior_citizen for each tuple, which can directly influence predictions of diseases, chances of survival, and so on.
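The date of birth example can be sketched as follows. The field name `date_of_birth`, the age threshold of 60, and the fixed reference date are all assumptions chosen for illustration.

```python
from datetime import date

def age_on(birth_date, today):
    """Number of full years between birth_date and today."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

def add_is_senior_citizen(records, today, threshold=60):
    """Return new records with a derived boolean attribute is_senior_citizen."""
    return [
        {**r, "is_senior_citizen": age_on(r["date_of_birth"], today) >= threshold}
        for r in records
    ]

# Hypothetical records.
records = [
    {"name": "A", "date_of_birth": date(1950, 5, 17)},
    {"name": "B", "date_of_birth": date(1990, 11, 2)},
]

enriched = add_is_senior_citizen(records, today=date(2024, 1, 1))
print(enriched)
```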
Aggregation is a method of storing and presenting data in a summary format. For example, sales data can be aggregated and transformed to show totals per month and year.
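A minimal sketch of the sales example: individual transactions are rolled up into totals per (year, month). The transaction layout and the sample amounts are assumptions.

```python
from collections import defaultdict
from datetime import date

def monthly_totals(sales):
    """Aggregate (date, amount) pairs into totals keyed by (year, month)."""
    totals = defaultdict(float)
    for day, amount in sales:
        totals[(day.year, day.month)] += amount
    return dict(sorted(totals.items()))

# Hypothetical transaction-level sales data.
sales = [
    (date(2023, 1, 5), 120.0),
    (date(2023, 1, 20), 80.0),
    (date(2023, 2, 3), 50.0),
    (date(2024, 1, 9), 200.0),
]

totals = monthly_totals(sales)
print(totals)
```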
The size of the dataset in a data warehouse can be too large to be handled by data analysis and data mining algorithms.
One possible solution is to obtain a reduced representation of the dataset that is much smaller in volume but produces the same quality of analytical results.
Here is a walkthrough of various Data Reduction strategies.
Aggregation is a form of data reduction in which the gathered data is expressed in summary form.
Dimensionality reduction techniques are used to perform feature extraction. The dimensionality of a dataset refers to the number of attributes or individual features in the data. This technique aims to reduce the number of redundant features we consider in machine learning algorithms. Dimensionality reduction can be done using techniques such as Principal Component Analysis (PCA).
By using encoding technologies, the size of the data can be reduced significantly. Compression can be either lossy or lossless. If the original data can be reconstructed exactly from the compressed data, this is referred to as lossless reduction; otherwise, it is referred to as lossy reduction.
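The difference between the two kinds of compression can be illustrated with a small sketch. It uses Python's standard `zlib` module for the lossless case and simple rounding as a stand-in for a lossy encoding; both the sample data and the choice of rounding are assumptions made for illustration.

```python
import zlib

# Lossless: the original bytes are recovered exactly after decompression.
raw = ("age,fare\n" + "30,7.25\n" * 1000).encode("utf-8")
compressed = zlib.compress(raw, level=9)
restored = zlib.decompress(compressed)
print(len(raw), len(compressed), restored == raw)

# Lossy (illustrative): rounding discards detail that cannot be recovered.
readings = [3.14159, 2.71828, 1.41421]
lossy = [round(x, 1) for x in readings]
print(lossy, lossy == readings)
```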
Data discretization divides continuous attributes into intervals. This is done because the relationship between a continuous feature and the target variable can be hard to interpret directly. After discretizing a variable, the resulting groups can be interpreted with respect to the target. For example, the attribute age can be discretized into bins such as below 18, 18-44, 45-60, and above 60.
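The age example can be sketched with a sorted list of bin edges and a binary search. The edges below are chosen so that the intervals do not overlap (below 18, 18-44, 45-60, above 60), which is an assumption about the intended boundaries.

```python
from bisect import bisect_right

# Lower bounds of the second, third and fourth bins (assumed boundaries).
AGE_EDGES = [18, 45, 61]
AGE_LABELS = ["below 18", "18-44", "45-60", "above 60"]

def discretize_age(age):
    """Map a numeric age to the label of the interval it falls into."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

ages = [5, 18, 44, 45, 60, 61, 90]
print([discretize_age(a) for a in ages])
```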
The data can be represented as a model or equation, such as a regression model. Storing the model instead of the full dataset removes the burden of storing huge volumes of data.
It is very important to be selective when choosing attributes. Otherwise, we may end up with high-dimensional data, on which models are difficult to train and prone to problems such as overfitting. Only attributes that add value to model training should be kept, and the rest can be discarded.
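One simple way to decide which attributes "add value" is a correlation filter: keep only the attributes whose absolute Pearson correlation with the target reaches a threshold. This is a sketch of one possible filter method, not the only approach; the attribute names, sample values, and the 0.5 threshold are assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def select_attributes(columns, target, min_abs_corr=0.5):
    """Keep attribute names whose |correlation with target| >= min_abs_corr."""
    return [
        name for name, values in columns.items()
        if abs(pearson(values, target)) >= min_abs_corr
    ]

# Hypothetical dataset: three candidate attributes and a numeric target.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [42, 38, 42, 38, 40],
    "constant": [1, 1, 1, 1, 1],
}
target = [52, 58, 61, 70, 74]

selected = select_attributes(columns, target)
print(selected)
```

A correlation filter only captures linear, one-attribute-at-a-time relationships, so it can discard attributes that are useful in combination with others.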
Data Quality Assessment includes the statistical approaches one needs to follow to ensure that the data has no issues. Data is to be used for operations, customer management, marketing analysis, and decision making—hence it needs to be of high quality.
The main components of Data Quality Assessment include:
The Data Quality Assurance process involves three main activities.
Here's a short recap of everything we've learnt about data preprocessing: