Whitepaper

Machine Learning with DataFactory ML Studio

September 11, 2023

This paper is the second in a series on machine learning using DataFactory ML Studio (formerly Dextrus ML Studio).

Use Case for RightData’s DataFactory ML Studio – Airline Satisfaction

Customer satisfaction results from customers comparing their actual experiences against their expectations. Understanding this dynamic helps airlines predict which services have the highest payoff and decide how those services can provide a competitive advantage in the market. The outcomes of machine learning also provide a baseline for tracking evolving needs in the industry.

RightData investigated how data and machine learning work together in an airline satisfaction model. This use case illustrates both data management and how it feeds into effective machine learning.

The survey data was sourced from Kaggle and was collected and consolidated from different U.S.-based airlines. To gain hands-on experience, the data scientists at RightData modelled the data to understand how a customer's level of satisfaction can be predicted from their experiences of numerous services. To address a relevant business problem, a classification model was created using DataFactory ML Studio (formerly called “Dextrus ML Studio”) to further emphasize that satisfaction outcomes are predictable and not random.

Let’s look at the use case in further detail and walk through its steps.

1) Data Analysis: Exploratory data analysis is the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. Understanding the data first and gathering insights is critical for any analysis, and it feeds the steps that follow.

2) Correlation Matrix: The correlation graph helps us understand the relationships between different variables. In the airline satisfaction example, the graph shows that “easeofonlinebooking” and “inflightwifiservices” are more strongly correlated than other variables (shown on the x-axis of the graph).
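As a rough illustration of what a correlation matrix contains, the sketch below computes Pearson correlation coefficients for every pair of columns using only the Python standard library. The column names and ratings are hypothetical values invented for this example; they are not taken from the Kaggle dataset, and this is not a description of how ML Studio computes its graph.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    # A constant column has no variance, so its correlation is reported as 0.
    return cov / denom if denom else 0.0

def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical 1-5 survey ratings, invented for illustration only.
ratings = {
    "booking_rating": [1, 2, 3, 4, 5, 3],
    "wifi_rating": [1, 2, 3, 5, 4, 3],
    "legroom_rating": [3, 1, 3, 2, 3, 5],
}

matrix = correlation_matrix(ratings)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

In this invented sample, the booking and wifi ratings move together much more closely than booking and legroom, which is the kind of pattern the correlation graph makes visible.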

Descriptive statistics can also be viewed. These are statistical measures that describe the data through numbers such as the mean, median, and mode, and they are reported numerically in text and/or in tables. Numerical statistics are provided for numeric columns: count, mean, standard deviation, min, max, 25%, 50%, 75%, number of nulls, percentage of nulls, kurtosis, number of outliers, etc. Categorical statistics are provided for categorical columns: count, number of nulls, percentage of nulls, count distinct, and most frequent value.
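The sketch below shows one way such summaries could be computed with the Python standard library. It covers most of the statistics listed above, leaving out kurtosis for brevity. The outlier rule (values more than 1.5 times the interquartile range beyond the quartiles) is an assumption chosen for illustration; the paper does not state which rule ML Studio uses. The sample columns are invented.

```python
import statistics

def numeric_summary(values):
    """Summarize a numeric column; None marks a missing (null) value."""
    present = [v for v in values if v is not None]
    nulls = len(values) - len(present)
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Assumed outlier rule: more than 1.5 * IQR outside the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),
        "min": min(present),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(present),
        "nulls": nulls,
        "pct_nulls": 100.0 * nulls / len(values),
        "outliers": sum(1 for v in present if v < low or v > high),
    }

def categorical_summary(values):
    """Summarize a categorical column; None marks a missing (null) value."""
    present = [v for v in values if v is not None]
    nulls = len(values) - len(present)
    return {
        "count": len(present),
        "nulls": nulls,
        "pct_nulls": 100.0 * nulls / len(values),
        "count_distinct": len(set(present)),
        "most_frequent": statistics.mode(present),
    }

# Hypothetical columns, invented for illustration only.
delay_minutes = [5, 7, 8, 10, 12, None, 15, 120]
travel_class = ["Eco", "Business", "Eco", None, "Eco Plus", "Business", "Eco"]

num_stats = numeric_summary(delay_minutes)
cat_stats = categorical_summary(travel_class)
print(num_stats)
print(cat_stats)
```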

3) Select Target: The target variable is the variable whose values are modeled and predicted from the other variables. In ML Studio, target validation is performed so that only columns that could serve as the target are enabled for selection.

For the airline satisfaction example, the target column is Airline Passenger Satisfaction with two levels, either:

1. neutral or dissatisfied

2. satisfied

The target selection in ML Studio is depicted for this example.

4) Train-Test Validation Split: This split is used to estimate the performance of machine learning algorithms used for prediction. In ML Studio, the user has the option to select a validation dataset and the ratio by which the dataset is divided into train, test, and validation datasets. The ML algorithms are then trained on the training data and evaluated on the test and validation data.

Testing on both the test and validation datasets gives a more reliable estimate of model performance. Keeping the same random seed value ensures that the same rows are selected each time sampling is done.
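A minimal sketch of a seeded three-way split is shown below. The function name, the default ratios, and the seed value are all assumptions made for this example and are not ML Studio settings. It illustrates the point about the random seed: running the split twice with the same seed yields exactly the same rows in each part.

```python
import random

def train_test_validation_split(rows, test_ratio=0.2, validation_ratio=0.1, seed=42):
    """Shuffle rows with a fixed seed, then cut them into three parts."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    n_val = round(len(shuffled) * validation_ratio)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, test, validation

# Row identifiers stand in for survey records.
rows = list(range(100))
train, test, validation = train_test_validation_split(rows, seed=7)
print(len(train), len(test), len(validation))

# Same seed, same split.
repeat = train_test_validation_split(rows, seed=7)
print(repeat == (train, test, validation))
```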

5) Transform the Dataset and Feature Importance: Feature importance refers to techniques that assign a score or rank to input features based on how useful they are at predicting the target variable. Considerations include:

  • Encoding converts string values to numeric values. There are two encoding methods: 1) ordinal encoding; and 2) one-hot encoding.
  • Scaling normalizes the numeric data. There are three scaling techniques: 1) robust scaler, for data with outliers; 2) min-max scaler, for normalizing the data; and 3) standard scaler.
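The encoding and scaling techniques named above can be sketched in a few lines of standard-library Python. The function names and sample values are hypothetical, and the formulas are the textbook definitions (min-max maps to the range 0 to 1, standard scaling subtracts the mean and divides by the standard deviation, robust scaling subtracts the median and divides by the interquartile range). Whether ML Studio uses exactly these definitions is an assumption.

```python
import statistics

def ordinal_encode(values, order):
    """Map each category to its position in an explicit ordering."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]

def one_hot_encode(values):
    """Return one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def minmax_scale(xs):
    """Rescale so the smallest value becomes 0 and the largest becomes 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standard_scale(xs):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]

# Hypothetical columns, invented for illustration only.
travel_class = ["Eco", "Business", "Eco Plus", "Eco"]
flight_distance = [100, 200, 300, 400, 5000]  # the last value is an outlier

ordinal = ordinal_encode(travel_class, order=["Eco", "Eco Plus", "Business"])
one_hot = one_hot_encode(travel_class)
minmax = minmax_scale(flight_distance)
robust = robust_scale(flight_distance)
standard = standard_scale(flight_distance)

print(ordinal)
print(one_hot)
print([round(v, 3) for v in minmax])
print(robust)
```

With the outlier present, min-max scaling squeezes the four ordinary distances into a narrow band near 0, while robust scaling keeps them evenly spread. That difference is why a robust scaler is the usual choice when outliers are present.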

Most machine learning algorithms need input in numerical format. Selecting columns based on feature importance helps improve model performance. The example on the next page shows this:

6) Class Balance: When the target column is highly imbalanced, a machine learning classifier tends to be biased towards the majority class, causing poor classification of the minority class. Balancing the data is a feature that helps the model avoid such problems.
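One common way to balance classes is random oversampling, which duplicates minority-class rows until every class is the same size. The sketch below illustrates that technique only; the paper does not say which balancing method ML Studio applies, and the function name and data are hypothetical.

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target_size = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra rows (with replacement) only for classes below the target size.
        extra = [rng.choice(group) for _ in range(target_size - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical imbalanced target: 8 rows of one class, 2 of the other.
labels = ["neutral or dissatisfied"] * 8 + ["satisfied"] * 2
rows = list(range(10))

balanced_rows, balanced_labels = oversample(rows, labels)
counts = Counter(balanced_labels)
print(counts)
```

Balancing of this kind is normally applied to the training data only, so that the test and validation data still reflect the original class proportions.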

7) Feature Selection: Feature selection is the method of reducing the number of input variables to your model by using only relevant data and getting rid of noise in the data. In ML Studio, the user can select different feature selection algorithms based on the data. The percentage of columns setting lets the algorithm keep only that share of the most relevant columns, which helps avoid noise.
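As a sketch of percentage-based feature selection, the example below scores each column by the absolute value of its Pearson correlation with the target and keeps the top share of columns. Correlation is used here as a simple stand-in scoring method; it is an assumption for illustration, not one of the algorithms the paper attributes to ML Studio. All column names and values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0

def select_features(columns, target, percentage):
    """Keep the top `percentage` of columns, ranked by |correlation| with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    keep = max(1, math.ceil(len(columns) * percentage / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

# Hypothetical data: 1 = satisfied, 0 = neutral or dissatisfied.
target = [0, 0, 0, 1, 1, 1]
columns = {
    "wifi_rating": [1, 2, 2, 4, 5, 4],
    "booking_rating": [2, 1, 3, 4, 4, 5],
    "gate_number": [3, 1, 4, 1, 5, 2],
    "seat_row": [10, 25, 3, 18, 7, 30],
}

selected = select_features(columns, target, percentage=50)
print(selected)
```

In this invented sample, the two rating columns track the target closely and are kept, while the gate number and seat row carry little signal and are dropped.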

8) Select Estimators: After the data is transformed, different algorithms are applied to train models on the training dataset, and the models are evaluated against the test and validation datasets. Selecting an estimator is a critical part of any model building. In ML Studio, a user can select one or more estimators to evaluate and then choose the best-performing one. Each algorithm has its own mathematical way of training and evaluating the model.
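The idea of training several estimators and keeping the best one can be sketched as follows. The two toy estimators here (a majority-class baseline and a nearest-centroid classifier) are hypothetical stand-ins written for this example, not the estimators offered in ML Studio, and the data is invented.

```python
from collections import Counter

class MajorityClass:
    """Baseline: always predict the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

class NearestCentroid:
    """Predict the class whose mean feature vector is closest to the row."""
    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
        self.centroids_ = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def squared_distance(row, centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return [
            min(self.centroids_, key=lambda c: squared_distance(row, self.centroids_[c]))
            for row in X
        ]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate(estimators, X_train, y_train, X_test, y_test):
    """Fit every estimator on the training data and score it on the test data."""
    scores = {}
    for name, estimator in estimators.items():
        estimator.fit(X_train, y_train)
        scores[name] = accuracy(y_test, estimator.predict(X_test))
    return scores

# Hypothetical two-feature data: 1 = satisfied, 0 = neutral or dissatisfied.
X_train = [[1, 2], [2, 1], [1, 1], [2, 2], [4, 5], [5, 4], [5, 5]]
y_train = [0, 0, 0, 0, 1, 1, 1]
X_test = [[1, 3], [4, 4], [5, 3], [2, 2]]
y_test = [0, 1, 1, 0]

estimators = {"majority_class": MajorityClass(), "nearest_centroid": NearestCentroid()}
scores = evaluate(estimators, X_train, y_train, X_test, y_test)
best = max(scores, key=scores.get)
print(scores, best)
```

Giving every estimator the same fit and predict interface is what makes it straightforward to compare one or many of them on identical data.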

9) Score Card: The score card uses evaluation metrics to assess the performance of the machine learning algorithms. Considerations include:

  • Accuracy is the fraction of predictions the model got right.
  • Precision describes how many of the items the model predicted as positive are truly positive.
  • Recall is calculated by dividing the true positives by everything that should have been predicted as positive (true positives plus false negatives).

The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall. The confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes.
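The metrics described above can be computed directly from the actual and predicted labels. The sketch below uses the two satisfaction levels from this use case with invented predictions; the function names are hypothetical, and "satisfied" is treated as the positive class by assumption.

```python
def confusion_matrix(y_true, y_pred, classes):
    """N x N matrix: rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

def binary_scores(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

S, N = "satisfied", "neutral or dissatisfied"
# Hypothetical actual labels and model predictions, invented for illustration.
y_true = [S, S, S, S, N, N, N, N, N, N]
y_pred = [S, S, S, N, S, N, N, N, N, N]

cm = confusion_matrix(y_true, y_pred, classes=[N, S])
card = binary_scores(y_true, y_pred, positive=S)
print(cm)
print(card)
```

Here the model finds 3 of the 4 satisfied passengers (recall 0.75), and 3 of its 4 "satisfied" predictions are correct (precision 0.75).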

10) Tuning: Tuning is the process of maximizing a model's performance without overfitting or introducing too much variance. This is done by finding the set of hyperparameter values that maximizes model performance.
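A simple form of tuning is a grid search: try each candidate hyperparameter value, score the resulting model on validation data, and keep the best. The sketch below tunes k, the number of neighbors in a toy one-feature k-nearest-neighbors classifier. The model, the candidate values, and the data are all assumptions made for this example and do not describe ML Studio's tuning method.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (one feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train, validation, k):
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)

def grid_search(train, validation, k_values):
    """Score every candidate k on the validation data and return the best."""
    scores = {k: validation_accuracy(train, validation, k) for k in k_values}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical (feature, label) pairs; the point (2.6, 1) is deliberate noise.
train = [(1, 0), (2, 0), (2.6, 1), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
validation = [(2.5, 0), (2.7, 0), (5.5, 1), (7.5, 1), (1.5, 0)]

best_k, scores = grid_search(train, validation, k_values=[1, 3, 9])
print(best_k, scores)
```

In this invented data, k=1 follows the noisy training point too closely (overfitting), k=9 always predicts the majority class, and k=3 scores best on the validation data. Scoring on data the model was not trained on is what lets the search avoid rewarding overfitting.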

11) Model Details: The model is now ready to be used, and all the model details can be found on this page. The user can run predictions with the model or edit it as needed. Afterwards, the model is ready to be consumed to predict outcomes on new data.

12) Prediction with New Data: This is the point where we get answers and the value of machine learning is realized. With DataFactory, once the new data is ingested into the model, the user can get predictions with a few clicks. Depending on the use case, these predictions help solve real business problems.

About the Author

Bindu serves as Senior Data Scientist and is developing RightData's machine learning platform, DataFactory ML Studio. She has been instrumental in the development of data science components and architecture surrounding machine learning strategies. She holds a Master’s degree in Data Science and Business Analytics from Wayne State University. bindu.bharatha@getrightdata.com

About RightData

RightData is a trusted total software company that empowers end-to-end capabilities for modern data, analytics, and machine learning using modern data lakehouse and data mesh frameworks. The combination of Dextrus software for data integration and DataTrust software for data quality and observability provides a comprehensive DataOps approach. With a commitment to a no-code approach and a user-friendly interface, RightData increases speed to market and provides significant cost savings to its customers. www.getrightdata.com
