Abstract: Feature engineering is one of the most important areas of machine learning, yet it has been seriously neglected. The most mature tool in this area is Featuretools, an open-source Python library. In this article, we will use this library to see how automated feature engineering can change the way you do machine learning.
With the rapid development of technology, the libraries, tools, and algorithms of data science are always changing. One trend, however, has been constant: the level of automation keeps rising.
In recent years some progress has been made in automated model selection and hyperparameter tuning, but the most important area of machine learning, feature engineering, has been seriously neglected. The most mature tool in this area is Featuretools, an open-source Python library. In this article, we will use this library to see how automated feature engineering can change the way you do machine learning.
Automated feature engineering is a relatively new technique, but it solves many practical problems on real data sets. Here, using code available in Jupyter Notebooks on GitHub, we will look at the results and conclusions from two such projects.
Each project highlights some of the benefits of Feature Engineering automation:
- Loan repayment prediction: compared with manual feature engineering, automated feature engineering cut machine learning development time by 10x while delivering better modeling performance. (notebook)
- Consumer spending prediction: automated feature engineering enabled a successful model deployment by creating meaningful features with built-in time-series filtering that prevents data leakage. (notebook)
Feature Engineering: Manual vs. Automated
Feature engineering is the process of taking a data set and constructing interpretable variables (features) used to train a machine learning model for a prediction problem. Usually the data is spread across multiple tables and must be aggregated into a single table, with observations in the rows and features in the columns.
The traditional approach is to create features one at a time using relevant domain knowledge, a long, tedious, and error-prone process known as manual feature engineering. Manual feature engineering code is problem-specific and must be rewritten for each new data set.
Automated feature engineering improves on this standard workflow by automatically extracting useful and meaningful features from a set of related tables, using a framework that can be applied to any problem. It not only cuts the time spent on feature engineering, but also creates interpretable features and prevents data leakage by filtering time-dependent data.
Loan repayment: building a better model, faster
When data scientists work with the home credit loan data, the main challenge is the size and spread of the data: the full data set contains 58 million rows distributed across seven tables.
Using traditional manual feature engineering, I spent 10 hours creating a set of features. First, I read the work of other data scientists, explored the data, and studied the problem domain to acquire the necessary domain knowledge. Then I translated that knowledge into code, building one feature at a time. As an example of a single manual feature, I computed the total amount overdue on a client's previous loans, which required combining three different tables.
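To make the manual process concrete, here is a minimal pandas sketch of that kind of three-table feature. The table and column names (`clients`, `previous_loans`, `installments`, `amount_overdue`) are made up for illustration and are not the actual columns of the credit data set:

```python
import pandas as pd

# Hypothetical stand-ins for the three real tables:
# clients, their previous loans, and the installments on those loans.
clients = pd.DataFrame({'client_id': [1, 2]})
previous_loans = pd.DataFrame({
    'loan_id': [10, 11, 12],
    'client_id': [1, 1, 2],
})
installments = pd.DataFrame({
    'loan_id': [10, 10, 11, 12],
    'amount_overdue': [100.0, 50.0, 0.0, 25.0],
})

# Join installments back to clients through loans, then aggregate:
# one hand-written feature, one chain of merges and groupbys.
overdue = (installments
           .merge(previous_loans, on='loan_id')
           .groupby('client_id')['amount_overdue']
           .sum()
           .rename('total_overdue'))
features = clients.merge(overdue.reset_index(), on='client_id', how='left')
print(features)  # client 1 -> 150.0, client 2 -> 25.0
```

Every manual feature is a small pipeline like this, which is exactly why producing them one at a time is so slow.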
In the end, the hand-engineered features performed quite well, a 65% improvement over the baseline features, demonstrating the importance of good feature design.
The process was very inefficient, however: each feature took me more than 15 minutes to build, because I produced them one at a time in the traditional way.
Besides being tedious and time-consuming, manual feature engineering is also:
- Problem-specific: the code I spent many hours writing cannot be applied to any other problem.
- Error-prone: every line of code is another opportunity for a mistake.
Moreover, the final hand-designed features are limited by human creativity and patience: there are only so many features we can think of, and only so much time we can spend building them.
The promise of automated feature engineering is to overcome these limitations by automatically creating hundreds of useful features from a set of related tables, with code that can be applied to any problem.
From Manual to Automated Feature Engineering
Automated feature engineering allows even a novice like me to create thousands of relevant features from a set of related tables. All we need to know is the basic structure of the tables and the relationships between them, which we track in a single data structure called an EntitySet. Once we have an EntitySet, a method called Deep Feature Synthesis (DFS) can build thousands of features in a single function call.
DFS works using functions called "primitives" to aggregate and transform the data. These primitives can be as simple as taking the max or mean of a column, or they can encode complex domain expertise, because Featuretools lets us define our own primitives.
Feature primitives include many of the operations we would perform by hand, but with Featuretools we can use the same exact syntax on any relational data set instead of rewriting the code for each one. Moreover, the real power of DFS emerges when we stack primitives on top of each other to create deep features.
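To illustrate what "stacking" means, here is a pandas sketch of the deep feature MAX(loans.SUM(payments.amount)): a SUM primitive at depth one, with a MAX primitive stacked on top. The table and column names are invented for the example; this is what DFS computes, not Featuretools code itself:

```python
import pandas as pd

# Hypothetical payments on each of a client's previous loans.
loans = pd.DataFrame({'loan_id': [10, 11], 'client_id': [1, 1]})
payments = pd.DataFrame({
    'loan_id': [10, 10, 11],
    'amount': [200.0, 300.0, 400.0],
})

# Depth 1: the SUM primitive -- total payment amount per loan.
loan_totals = payments.groupby('loan_id')['amount'].sum()

# Depth 2: the MAX primitive stacked on top -- for each client,
# the largest of those per-loan totals.
deep = (loans.assign(loan_total=loans['loan_id'].map(loan_totals))
             .groupby('client_id')['loan_total']
             .max())
print(deep)  # client 1 -> 500.0
```

DFS generates combinations like this automatically, across every valid path through the table relationships, which is how it reaches thousands of features from one call.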
Deep Feature Synthesis is flexible, which allows it to be applied to any data science problem, and powerful, creating deep features that can reveal insights in our data.
I'll spare you the few lines of code needed to set up the environment; DFS itself runs in a single call. Here, we use all seven tables in the data set to generate thousands of features for each client:
# Deep feature synthesis
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='clients',
                                  agg_primitives=agg_primitives,
                                  trans_primitives=trans_primitives)
Below are some of the 1,820 features Featuretools built for us automatically:
- The maximum total amount of a client's previous loans, obtained by stacking a MAX primitive on a SUM primitive across three tables.
- A percentile-rank feature, built from a PERCENTILE and a MEAN primitive across two tables.
- Whether the client submitted two documents during the application process, built from an AND transform primitive and one table.
None of these features needs more than simple aggregations to create. Featuretools built many of the same features I had made by hand, plus thousands I would never have thought of. Not every feature is relevant to the problem, and some are highly correlated with each other; still, having too many features is a much better problem to have than having too few.
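The "too many features" problem is routinely handled with simple filters. Below is a minimal pandas/NumPy sketch of one common approach (the 0.95 correlation threshold and the toy feature matrix are illustrative, not from the project):

```python
import numpy as np
import pandas as pd

# Toy feature matrix: f2 duplicates f1, f3 is constant.
fm = pd.DataFrame({
    'f1': [1.0, 2.0, 3.0, 4.0],
    'f2': [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with f1
    'f3': [5.0, 5.0, 5.0, 5.0],   # zero variance
    'f4': [1.0, 0.0, 1.0, 0.0],
})

# Drop zero-variance columns first.
fm = fm.loc[:, fm.var() > 0]

# Then drop one of each pair of highly correlated features.
corr = fm.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
fm = fm.drop(columns=to_drop)
print(list(fm.columns))  # ['f1', 'f4']
```

Filters like this cut a feature matrix of thousands of columns down to the informative ones before model training.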
After some feature selection and model optimization, these features performed slightly better in the prediction model than the hand-made ones, with an overall development time of 1 hour, a 10x reduction compared to the manual process. Featuretools is faster because it requires less domain knowledge and far fewer lines of code.
I admit it takes a little time to learn Featuretools, but it is a rewarding investment. After spending about an hour learning it, you can apply Featuretools to any machine learning problem.
The following chart summarizes my experience with the loan repayment problem:
- Development time: 10 hours manual vs. 1 hour automated;
- Number of features created: about 30 manual vs. 1,820 automated;
- Improvement relative to baseline: 65% manual vs. 66% automated.
My conclusion is that automated feature engineering will not replace data scientists, but by significantly improving efficiency it will free them to spend more time on other aspects of the machine learning pipeline.
Furthermore, the Featuretools code I wrote for this first project can be applied to any data set, while the manually engineered code cannot be reused at all.
Consumer spending: creating meaningful features and preventing data leakage
The second data set consists of online, timestamped customer transactions. The prediction problem is to classify customers into two groups: those who will spend more than $500 in the next month, and those who will not. However, instead of a single label per customer for one month, each customer is labeled multiple times: we can label their spending in May, then in June, and so on.
In deployment we will never have future data available, so we cannot use it to train the model. Companies run into this problem constantly, and often deploy a model that performs much worse in the real world than in development because it was trained on invalid data.
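The repeated monthly labeling can be sketched in a few lines of pandas. The transaction values and column names here are invented for illustration; the idea is one label per customer per calendar month:

```python
import pandas as pd

# Hypothetical timestamped transactions for one customer.
tx = pd.DataFrame({
    'customer_id': [1, 1, 1, 1],
    'time': pd.to_datetime(['2019-05-02', '2019-05-20',
                            '2019-06-03', '2019-06-25']),
    'amount': [300.0, 250.0, 100.0, 150.0],
})

# One label per customer per month: did spending exceed $500?
monthly = (tx.groupby(['customer_id', tx['time'].dt.to_period('M')])
             ['amount'].sum())
labels = (monthly > 500).rename('over_500').reset_index()
print(labels)  # May: True (550), June: False (250)
```

Each row of `labels` then becomes one training example, which is exactly why the features for that row must not see data from after its month.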
Fortunately, making sure our data is valid in a time-series problem is simple in Featuretools. We pass the deep feature synthesis function a dataframe of cutoff times (shown in the figure above), where each cutoff time marks the point past which we cannot use any data for that label, and Featuretools automatically takes time into account when building features.
The features for each customer are built using only data filtered to before the given month. Note that the call to create the feature set is the same as for the loan repayment problem, with the addition of cutoff times.
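To show what that cutoff-time filtering does under the hood, here is a pandas sketch. The two-column shape of `cutoff_times` mirrors the instance-id-plus-cutoff dataframe Featuretools expects, but the names and values are illustrative:

```python
import pandas as pd

# For each label, we may only use data strictly before its cutoff.
cutoff_times = pd.DataFrame({
    'customer_id': [1, 1],
    'cutoff_time': pd.to_datetime(['2019-06-01', '2019-07-01']),
})
tx = pd.DataFrame({
    'customer_id': [1, 1, 1],
    'time': pd.to_datetime(['2019-05-15', '2019-06-10', '2019-06-20']),
    'amount': [100.0, 40.0, 60.0],
})

# For each cutoff, aggregate only the transactions before it,
# so no feature ever sees data from the label's future.
rows = []
for _, row in cutoff_times.iterrows():
    valid = tx[(tx['customer_id'] == row['customer_id']) &
               (tx['time'] < row['cutoff_time'])]
    rows.append({'customer_id': row['customer_id'],
                 'cutoff_time': row['cutoff_time'],
                 'total_before_cutoff': valid['amount'].sum()})
feature_matrix = pd.DataFrame(rows)
print(feature_matrix)  # totals: 100.0, then 200.0
```

Doing this by hand for thousands of features is where leaks creep in; Featuretools applies the same filtering to every feature automatically.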
# Deep feature synthesis with cutoff times
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='customers',
                                  agg_primitives=agg_primitives,
                                  trans_primitives=trans_primitives,
                                  cutoff_time=cutoff_times)
Running deep feature synthesis produces a table of features, one row per customer per month. We can use these features to train a model on the labels and then predict any month, confident that the model's features use no future information, which would give an unfair advantage and produce misleading training scores.
Using the automated features, I built a machine learning model that predicts a customer's spending category a month ahead with an ROC AUC of 0.90, compared with a baseline of 0.69.
Besides impressive predictive power, the Featuretools implementation gave me something equally valuable: interpretable features. Look at the 15 most important features of the random forest model:
The feature importances tell us that the most important predictors of how much a customer will spend next month are how much they have spent before (a SUM of past amounts) and how much they have purchased. These are features we could have created by hand, but then we would have had to worry about data leakage and about building a model that does better in development than in deployment.
If a tool already exists that creates meaningful features without any need to worry about their validity, why implement them by hand? Moreover, the automated features are fully interpretable in the context of the problem and can inform our real-world reasoning.
Automated feature engineering identified the most important signals, achieving the main goal of data science: revealing the insights hidden in mountains of data.
Even spending far longer on manual feature engineering than I spent with Featuretools, I could not develop a set of features with comparable performance. The figure below shows the ROC curves for classifying customer spending over the next month, using models trained on the two feature sets; a curve closer to the upper left represents more accurate predictions:
I am not even entirely sure the manual features were built from valid data, but with Featuretools I do not have to worry about data leakage in time-dependent problems.
We rely on automated safety systems in everyday life; likewise, automated feature engineering in Featuretools is a safe way to create meaningful machine learning features in time-series problems, while delivering excellent predictive performance.
After these projects, I am convinced that automated feature engineering should be an integral part of the machine learning workflow. The technology is not perfect, but it still delivers significant efficiency gains.
The main conclusions are that automated feature engineering:
- reduced implementation time by 10x;
- delivered modeling performance at the same level or better;
- produced interpretable features with real-world meaning;
- prevented invalid data from invalidating the model;
- fit into existing workflows and machine learning models.
This article is original content from the Yunqi Community and may not be reproduced without permission.