Data wrangling is the process of gathering, selecting, and transforming data to support the next phase of analytics and learning. This data preparation takes up most of a project's time, so wrangling data efficiently dramatically shortens the machine learning life cycle and frees up time for the data scientist.
The main goal of effective data wrangling is to produce analytical data sets with well-defined structures (e.g., Analytic Base Tables) that support better modeling and, once built, can feed a myriad of machine learning strategies.
For the data scientist, data wrangling transforms and maps raw data into various formats based on use cases. In practice, that means looking for inconsistencies, missing information, or data format issues. Better data translates directly into efficient machine learning, so it pays to consider questions of consistency, completeness, and format up front.
After considering these questions, a deeper workflow is needed beyond data wrangling alone; it can be represented as a Machine Learning Life Cycle.
Disciplined machine learning (ML) follows a coherent, holistic life cycle of well-defined steps that, when executed correctly, derive business value and yield meaningful insights with practical benefits. It all starts with the data.
At the top level, when we deal with data, we generate an insight in order to take an action in a process of dynamic decision-making. We can see this in the diagram to the right.
If we remember that most ML projects never make it past the experimental phase, for reasons ranging from poor data quality to substandard data preparation to inadequate modeling, then what is the best way to start? If we follow certain steps, we can establish both a good workflow and good data wrangling to successfully develop and deploy an effective model.
With a thoughtful approach to machine learning, the focus is on the main activities: data wrangling, and model evaluation with hyperparameter tuning. Software automation during these stages is important because it augments manual human processes and increases speed while enhancing data quality.
To start, the data pipeline aligns dataset transformations with rules and governance, and then feeds data wrangling, where data exploration and cleaning take place. Once the data is "wrangled," the processed data moves to the feature engineering phase, after a train-test split is performed if the learning task requires it. This then feeds data modeling and eventual deployment for predictions.
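As an illustrative aside, these stages can be sketched in a few lines of Python with pandas and scikit-learn. This is a minimal, hypothetical example, not the Dextrus implementation; all column names and values are invented.

```python
# Hypothetical end-to-end sketch of the life cycle stages described above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1. Data pipeline: ingest a governed dataset (here, an in-memory frame).
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 37, 29, 52, 46],
    "income": [40e3, 55e3, 48e3, None, 61e3, 43e3, 80e3, 72e3],
    "churn":  [0, 0, 1, 1, 0, 1, 0, 1],
})

# 2. Data wrangling: impute missing values with each column's mean.
df = df.fillna(df.mean())

# 3. Train-test split before modeling, to avoid data leakage.
X, y = df[["age", "income"]], df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 4. Modeling and prediction.
model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)
```

Each later step in this paper expands on one of these stages.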
With the know-how of a data scientist and subject matter expert (SME), an automated software tool makes the ML Life Cycle process more complete. Using RightData's Dextrus ML Studio platform architecture, an effective, step-by-step approach is shown in the following pages.
1. Data Collection
Shown to the right is a screenshot of the data collection for uploading a CSV formatted file from local storage.
From raw data, APIs, or cloud-delivered data sources, data collection is a crucial step in the life cycle process. Today, Dextrus provides connectors to access data and feed it into a governed data pipeline, where enterprises can ingest data from any source and location, at speed and scale.
Note that Dextrus offers 150+ connectors for databases, applications, events, flat file data sources, cloud platform technologies, SAP sources, REST APIs, and social media platforms.
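For readers who prefer code, the CSV-upload step shown in the screenshot is analogous to the following pandas sketch; the file contents and column names are invented for illustration.

```python
# Hypothetical sketch of ingesting a CSV file into a DataFrame.
import io
import pandas as pd

# In practice the file would come from local storage or a connector;
# here an in-memory buffer stands in for the uploaded CSV.
csv_data = io.StringIO(
    "customer_id,age,plan\n"
    "1,34,basic\n"
    "2,29,premium\n"
    "3,41,basic\n"
)
df = pd.read_csv(csv_data)
print(df.shape)  # (3, 3): three rows, three columns ingested
```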
2. Data Wrangling
Data wrangling is a key step in the cycle because the quality of the results obtained from a model depends heavily on the quality of the data it uses. Raw data often contains inconsistencies and noise that must be resolved before it reaches a form suitable for modeling. So this phase is critical to machine learning model performance: you need high-quality data to trust the model outcomes that drive strategy and spending at the business level.
Dextrus provides a range of data wrangling capabilities through an easy-to-use, no-code user interface for identifying, cleaning, and fixing data issues. The platform makes use of open-source tools such as Apache Spark, which allow it to scale with the data size. In addition, because Dextrus data integration is integrated with ML Studio, it is easy to manage the workflow described in this paper.
3. Data Understanding
Shown below are some of the data quality metrics providing initial understanding of the imported data.
In order to perform any wrangling operations, it is first important to understand the data and the key metrics that describe it.
Tabular data in CSV format can be imported into the Dextrus platform with any required import customizations to further perform wrangling operations on it.
Once the data is imported, users can have an overview of the data. There are various important metrics that play a significant role in defining the data for each column. With the help of the Dextrus platform, it becomes easy to view all the essential information so users can understand the important metrics and identify potential issues.
The first step in understanding the data is identifying the categorical and numerical columns, which can be inferred from each column's data type and its number of distinct values relative to the total number of rows.
The platform provides the number of valid values for each column, which helps in identifying issues such as nulls and values inconsistent with each column's type and domain.
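The same per-column metrics can be sketched with pandas, as a hypothetical illustration of the kind of profile the platform surfaces; the sample data is invented.

```python
# Hypothetical per-column data profile: type, valid count, nulls, distincts.
import pandas as pd

df = pd.DataFrame({
    "plan":  ["basic", "premium", "basic", None, "basic"],
    "spend": [10.0, 25.5, 9.9, 30.0, 12.5],
})

profile = pd.DataFrame({
    "dtype":    df.dtypes.astype(str),
    "valid":    df.notna().sum(),   # non-null values per column
    "nulls":    df.isna().sum(),    # missing values to investigate
    "distinct": df.nunique(),       # low distinct count suggests categorical
})
print(profile)
```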
Shown below is a screenshot of the statistical information of the data.
It is important to understand the distribution of data in each column to gain the insights needed to choose the right engineering and modeling operations later in the cycle. To surface those insights, the platform provides descriptive statistical information for each column based on its data type.
For example, for numerical columns, users can view statistics such as the mean, median, and number of unique values, as well as a histogram plot for checking whether the distribution is normal.
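These descriptive statistics are the same ones pandas computes out of the box; the sketch below is a hypothetical illustration with invented values.

```python
# Hypothetical descriptive statistics for a numerical column.
import pandas as pd

spend = pd.Series([10.0, 25.5, 9.9, 30.0, 12.5, 11.0], name="spend")
print(spend.mean())      # average spend
print(spend.median())    # robust central value, less sensitive to outliers
print(spend.nunique())   # number of unique values
print(spend.describe())  # min, quartiles, max, std in one call

# A quick text "histogram" of the distribution, in lieu of a plot:
print(pd.cut(spend, bins=3).value_counts().sort_index())
```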
4. Data Processing
Shown below is a screenshot of custom replacement wrangling operations for values in the columns.
Getting an overview of the data from the previous steps helps to identify issues and then sets an approach for fixing them with different wrangling operations.
For instance, some issues could be missing values, or skewness in the distribution due to the presence of outliers.
One of the most common parts of this process is the replacement of missing data with appropriate values, such as the mean of a column's distribution.
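A minimal pandas sketch of mean imputation follows; the column name and values are hypothetical.

```python
# Hypothetical mean imputation for a numerical column with missing values.
import pandas as pd

df = pd.DataFrame({"income": [40.0, 55.0, None, 61.0, None, 44.0]})

mean_income = df["income"].mean()  # mean is computed over non-missing values
df["income"] = df["income"].fillna(mean_income)

print(df["income"].isna().sum())  # 0: no missing values remain
```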
Shown below is a screenshot of the wide range of data cleaning options.
Apart from fixing missing data, there are other operations such as removing outliers and encoding the data, as well as cleaning operations like masking, splitting, and others.
All the different versions of the data cleaning recipes can be accessed in one place and can then be used in the data modeling steps.
Shown below is a screenshot of the different versions of the processed and cleaner data after the data wrangling operations step.
Dextrus provides the capability of saving different versions of the processed and clean data which can then be independently fed into the modeling pipeline.
5. Data Preparation for Modeling
Shown below is a screenshot of the step where the data is split into train and test for the supervised learning task.
After the data cleaning process, the data needs to be divided into training and test sets so that model performance can be evaluated on unseen data and no data leakage introduces biased information into the training step.
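A minimal scikit-learn sketch of this split follows; the proportions and random seed are illustrative choices, not Dextrus defaults.

```python
# Hypothetical train-test split for a supervised learning task.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)          # binary target

# Hold out 30% for testing; stratify to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```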
6. Feature Engineering
Shown below is a screenshot of the different data transformations that can be performed on the columns, sorted by feature importance.
It is called feature engineering because we are engineering an ideal set of features for optimal extraction of the information.
After splitting the data into training data and testing data, feature engineering can be performed on the training data that helps generate the ideal set of features.
For example, Dextrus provides encoding capabilities such as one-hot encoding and ordinal encoding for categorical columns, as well as normalization options for numerical columns such as standard scaling, min-max scaling, and others.
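These operations correspond to familiar building blocks in pandas and scikit-learn; the sketch below is a hypothetical illustration with invented column names, not the platform's internals.

```python
# Hypothetical one-hot encoding and standard scaling of training features.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "plan":  ["basic", "premium", "basic", "premium"],
    "spend": [10.0, 30.0, 12.0, 28.0],
})

# One-hot encode the categorical column into binary indicator columns.
plan_encoded = pd.get_dummies(df["plan"], prefix="plan")

# Standard-scale the numerical column to zero mean and unit variance.
scaler = StandardScaler()
spend_scaled = scaler.fit_transform(df[["spend"]])

print(plan_encoded.shape)  # (4, 2): one indicator column per category
```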
Shown below is a screenshot of different feature selection options.
Dimensionality reduction is a major step in selecting the right features for modeling.
Dextrus enables this capability by providing different feature selection operations using techniques that make use of metrics such as variance or ANOVA F-value.
7. Modeling and Tuning Hyperparameters
Shown below is a screenshot of the different models that users can select for classification.
Dextrus supports a range of different algorithms for modeling, which can be used for classification tasks such as predicting binary outputs. It allows users to select the metrics (e.g., accuracy, precision, recall, F1 score) for evaluating those modeling algorithms.
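The metrics listed above can be sketched with scikit-learn on synthetic data; this is a generic illustration, not the platform's model catalog.

```python
# Hypothetical classification run computing the listed evaluation metrics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)

print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("recall:   ", recall_score(y_test, preds))
print("f1:       ", f1_score(y_test, preds))
```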
Shown below is a screenshot of customizing the parameter limits for the search for optimal hyperparameters.
Hyperparameter tuning is a key step where the parameters of the model are optimized to get the best model results for the selected metrics.
On the Dextrus platform, users can customize their search for the hyperparameter tuning and compare results for each of those selections.
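A customized hyperparameter search of this kind can be sketched with scikit-learn's grid search; the parameter limits below are illustrative choices, not Dextrus defaults.

```python
# Hypothetical grid search over user-specified hyperparameter limits.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=5, random_state=0)

# Each list defines the limits of the search for one hyperparameter.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)

print(search.best_params_)  # the combination with the best mean F1 score
```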
8. Testing and prediction
Shown below is a screenshot of all the relevant information related to the model. Once the modeling has been completed along with hyperparameter tuning, Dextrus provides detailed information for different versions of models on the dashboard.
The data science professional can make decisions about all the relevant metrics and model parameters, select the best model for the data, and use it for predictions.
Balance. For the data scientist there needs to be a balanced approach as data is managed in various stages, which can be automated for faster results and greater scalability.
Iterative. More importantly, creating machine learning solutions is an iterative process, where self-service and collaboration dramatically increases learning outcomes.
Integrated. If you want to deploy ML solutions faster, you need an agile platform that combines the different steps of the cycle and integrates easily with the rest of the existing system pipeline.
A widened role for learning scientists. Finally, moving from experimentation to operational machine learning requires ML modeling and deployment at scale. With an automated approach that is integrated into both the data wrangling phase and machine learning phase, the concept of a “data scientist” now blends into a new role of a “learning scientist,” where both analytical and data skills are combined and augmented with automated software for the data wrangling.
The Dextrus ML Studio meets these challenges today with an automated software platform covering the entire path from raw data to pipeline to wrangling to modeling to machine learning to predictions. With the speed of modern data management and processing, ML operational models are emerging quickly. Further, with a simplified user interface for business users and next-generation "learning scientists," software platforms that can both wrangle and model are making the job easier every day.
Dextrus ML Studio represents the future of machine learning at scale. Learn more about how Dextrus manages the entire process and Life Cycle.
Chirag Rank serves as RightData's Senior Software Engineer and has deep experience in both the science and art of machine learning. Chirag has two master's degrees – one in Information Technology and Analytics from Rutgers University and a Master of Data Science from the University of British Columbia. He is also completing a Ph.D. in Information Technology. firstname.lastname@example.org
RightData is a trusted total software company that empowers end-to-end capabilities for modern data, analytics, and machine learning using modern data lakehouse and data mesh frameworks. The combination of Dextrus software for data integration and the RDt for data quality and observability provides a comprehensive DataOps approach. With a commitment to a no-code approach and a user-friendly interface, RightData increases speed to market and provides significant cost savings to its customers. www.getrightdata.com