The Way of Feature Engineering
Feature Engineering is arguably the most important part of any Machine Learning system. It is where signals are created from raw data, noise is minimized, and data is transformed into an appropriate vector space.
Whether it be user behavior data that needs to be converted into features or an image that needs to be transformed into a feature matrix, engineering features is the backbone and core of modeling and statistical inference.
In this article I have tried to share a few of my learnings from years of working on a large-scale ML system in production. I highlight these learnings as practices you should consider when building your own system.
What is a Simple Feature Engineering Flow?
Feature engineering is not just creating the mathematical and statistical logic to transform raw data into high-signal variables. It is an entire system within the larger ML system, with many moving parts, and it is a continuous process that evolves over time. As your organization/team grows bigger, your model receives a larger volume of requests, more models want to access the features, and the data grows bigger, the complexity of the system will grow rapidly.
Therefore, understanding that features are built over time brings qualities like scalability, usability (ease of use), latency, security, and maintainability into the picture.
Models read features as input. Features are the electricity to the electric motor we call a model. We go from collecting raw data, to building useful features from that raw data, to training the model and running inference.
What design considerations should we have?
- Features should update and be rebuilt over time
- Features should be built on request and/or on a regular cadence
- Features can be preprocessed and/or cached in many cases
- Features should have a monitoring and cataloging service
- Features can or should have a privacy and retention policy
- Not all features need to be used. Features should be selected and reduced to the most useful ones among a forest of candidates to avoid overfitting or underfitting.
- Features should have a quality analysis and anomaly alerting system
- Features should be easy to use and flexible enough to be shared across different software projects
- Feature builds should have their metadata stored and made accessible
Packaging the benefits above into a service is usually called a Feature Store.
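To make the cataloging and metadata ideas above concrete, here is a minimal sketch of what a feature build record in such a store might look like. The FeatureBuild class, its field names, and the in-memory catalog are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FeatureBuild:
    """One versioned build of a feature set, as a feature store might catalog it."""
    name: str            # e.g. "user_engagement_features"
    version: str         # e.g. a date or semantic version of the build
    storage_path: str    # e.g. an S3 prefix holding the Parquet partitions
    created_at: datetime
    source_tables: list  # raw inputs this build was derived from
    retention_days: int = 365   # retention/privacy policy attached to the build
    description: str = ""

# The catalog can start as a simple in-memory mapping keyed by (name, version)
# and later be replaced by a proper metadata service.
catalog = {}

def register(build: FeatureBuild) -> None:
    catalog[(build.name, build.version)] = build
```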
Should the feature store be a service or an application?
The answer is both, ideally. You need to build applications (ex: Spark/Python/Scala applications that perform the feature engineering logic, i.e. transforming raw data, aka ETL). In addition, you need to build a service which orchestrates the applications and catalogs the pre-processed features. The model should ideally consume the features via the service and not directly. The service should provide the model with the specific feature build it requests from a range of feature builds. Think of feature builds as books on a library shelf.
The library shelf is the service and the books are the feature builds. The shelves help organize different books (feature builds) and let you access them easily.
The service acts as the interface layer for the application(s). The application interacts with the storage layer, i.e. reading in raw data and storing the engineered features, and it creates the features based on ETL and statistical logic. The diagram above illustrates the flow. To conclude, building both the application layer and the service layer is important for the longevity of your feature engineering process.
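As a rough sketch of that interface layer, the service might expose something like the following to models. The class and method names (lookup, application_for, and so on) are illustrative assumptions about how a catalog and storage client could be wired together, not a definitive design.

```python
# A minimal sketch of the service layer sitting between models and feature applications.
class FeatureStoreService:
    def __init__(self, catalog, storage):
        self.catalog = catalog    # metadata about available feature builds
        self.storage = storage    # client for the storage layer (e.g. S3)

    def get_feature_build(self, name: str, version: str = "latest"):
        """Return the requested feature build; the model never touches storage directly."""
        build = self.catalog.lookup(name, version)
        return self.storage.read(build.storage_path)

    def run_build(self, name: str) -> None:
        """Orchestrate the Spark/Python application that produces a new build."""
        app = self.catalog.application_for(name)
        app.run()  # the application reads raw data, transforms it, writes features
```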
Note: You can start off with just the standalone application. This can be a great first step and a wonderful prototype.
Should features be stored as files or in a database?
Where and how you are going to store your features is critical. Many factors need to be considered: where do I store my features, an object store or a block store? Which file format should I store my features in? How often will the features be used, i.e. read by the model? Are there any usage restrictions or security concerns around the features? And so on.
Key factors to consider are scalability, usability (ease of use), latency (data I/O), functionality, performance, and security.
I am going to compare two popular storage approaches.
Possible storage options:
- Parquet file on Object Store (Ex: S3)
- Table in Block Store (Ex: MySQL)
An example case: I want to store my features where they are easily accessible by a number of different Spark/Python/Scala applications, on average 100 times every day. I do not want any restrictions on the accessibility of the data and want most applications to access the entire feature set with ease. I do not need to offer any rich read operations to applications (developers). By rich read operations I mean the extra functionality you might get with a database, like filtering data in the query as you read it. My Spark/Python/Scala applications will most probably use the features as a DataFrame after reading them. Latency is not critical in this particular case; it does not matter if the application runs a little slower.
Since my client applications are going to be in Python and Spark and the data will be analyzed as DataFrames, the storage to consider is a distributed file format like Parquet on object storage like S3. Parquet on S3 is stored in distributed partitions and is easily converted to a distributed DataFrame in the memory of the Python and Spark applications. Parquet on S3 is simple and easily accessible to applications (developers) when they write code for the model to read the features. In most cases, my developers will just read the entire feature set and will not require rich in-query read operations. Parquet and S3 fit nicely here as well.
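As a rough sketch of this setup, a Spark application could write a feature build as Parquet on S3 and any other application could read it back as a DataFrame. The bucket name, paths, and column names below are placeholders, and the cluster is assumed to have S3 credentials configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-store-io").getOrCreate()

# A tiny stand-in for an engineered feature set.
features_df = spark.createDataFrame(
    [("user_1", 4.2, 0.7), ("user_2", 1.5, 0.1)],
    ["user_id", "avg_visits_per_week", "night_usage_ratio"],
)

# Write the build as partitioned Parquet to an S3 prefix (bucket/path are placeholders).
features_df.write.mode("overwrite").parquet("s3a://my-feature-bucket/user_features/v1/")

# Any downstream Spark/Python application can read the whole feature set back
# as a distributed DataFrame with one line, no rich query layer required.
features_back = spark.read.parquet("s3a://my-feature-bucket/user_features/v1/")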
Look for what fits well with your use case and choose the right storage approach. Always remember there are different options for storage.
Another use case is where I have highly sensitive features about the users of my product and I want them in a secured location without direct, open access. Block storage like a MySQL database is a good fit in this case. You get richer functionality, with permission grants to specific applications.
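If the features live in MySQL instead, a Spark application can read them over JDBC with credentials scoped to only the tables it is allowed to see. This is a sketch under assumptions: the host, database, table, and account names are placeholders, and the MySQL JDBC driver is assumed to be on the cluster classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensitive-feature-read").getOrCreate()

# Read a restricted feature table from MySQL over JDBC. The account used here
# should be granted SELECT only on the tables this application may read.
sensitive_features = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://feature-db.internal:3306/features")  # placeholder host
    .option("dbtable", "user_sensitive_features")                     # placeholder table
    .option("user", "model_reader")                                   # placeholder account
    .option("password", "<from-secret-manager>")                      # never hard-code secrets
    .load()
)
```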
How should we process and transform the data?
Processing and transforming the data is where raw data is turned into features. This is where the compute happens and the feature creation logic takes place. For example, you might have raw user logs from your mobile app. Now you want to create features about those users: how often do they use the app, i.e. average visits per week? When do they use the app, day or night? How many interactions (likes, shares, comments, etc.) per session do the users have?
Raw data is the input to your feature engineering application. Inside the application is where you transform the raw log data into features as described above. This requires statistical and data-wrangling logic such as aggregations, group-bys, window functions, dropping null values, ranking, using feature reduction libraries, etc. So a lot of compute on the raw data happens at this stage.
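As a rough illustration of that logic, a Spark application could derive the user features mentioned above from a raw event log along these lines. The log schema (user_id, session_id, event_time, event_type) and the S3 path are assumptions made for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("user-feature-build").getOrCreate()

# Raw app events; the schema (user_id, session_id, event_time, event_type) is assumed.
events = spark.read.parquet("s3a://my-raw-bucket/app_events/")

user_features = (
    events
    .withColumn("week", F.weekofyear("event_time"))
    .withColumn("is_night", (F.hour("event_time") >= 20) | (F.hour("event_time") < 6))
    .groupBy("user_id")
    .agg(
        # average visits per week: distinct sessions divided by distinct active weeks
        (F.countDistinct("session_id") / F.countDistinct("week")).alias("avg_visits_per_week"),
        # share of events that happen at night (day vs. night usage)
        F.avg(F.col("is_night").cast("double")).alias("night_usage_ratio"),
        # interactions (likes, shares, comments) per session
        (F.sum(F.when(F.col("event_type").isin("like", "share", "comment"), 1).otherwise(0))
         / F.countDistinct("session_id")).alias("interactions_per_session"),
    )
)
```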
Try to keep your application as simple as possible. If the application handles a small amount of data and/or the data size will not scale over time, use Python and run it on a single compute instance.
In contrast, if you can see your data becoming very large and scaling fast over time, use Spark and run the application on a cluster of computers (a Spark cluster). Spark offers faster processing and runs your application in less time. It is a distributed framework which processes data in parallel across different compute nodes. This distributed approach allows it to process data faster and also store it in a distributed manner with compatible file formats like Parquet. Spark also has a wide range of ETL functions that can be used for transforming the data. A single-node Python application will not run as fast as Spark; the parallelism in Spark accelerates the compute. Spark is open-source and offers Python and Scala APIs.
Spark also scales very well as the data size increases over time. This is because Spark clusters (all major cloud providers offer them) can scale horizontally and add new compute nodes when required, almost instantly. A Python application on a single compute instance does not offer that benefit. Even if you have vertically scaled your compute instance, i.e. a very big instance with huge memory and storage, instantly getting more capacity as the data grows is not possible; you will have to upgrade the compute node, and that takes more time and work.
For the application, you can use feature engineering libraries like scikit-learn and Spark MLlib. These libraries have useful functions and transformers that help you transform your data in better and more desirable ways.
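For instance, Spark MLlib can assemble engineered columns into a feature vector and standardize them. This is a minimal sketch that reuses the illustrative user_features DataFrame from the earlier example; the column names are assumptions carried over from that sketch.

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Assemble the engineered columns into a single feature vector, then standardize it.
assembler = VectorAssembler(
    inputCols=["avg_visits_per_week", "night_usage_ratio", "interactions_per_session"],
    outputCol="raw_features",
)
scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)

assembled = assembler.transform(user_features)
scaled = scaler.fit(assembled).transform(assembled)
```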
Conclusion and parting thoughts
Firstly, thank you for reading this article. I hope it was useful for you. I have tried to briefly explain a few learnings and practices that I strongly believe are crucial to any Machine Learning (ML) system. These are not based on any book or course; I have written them based on my experience over days, weeks, and years of working with large-scale Feature Stores and ML systems.
Always see your feature engineering system as a process that changes and scales over time. See it as a service and as part of the larger ML system. To make an analogy, if an ML system is a university campus, the Feature Store is the library and the feature builds in the library are the books. Libraries have existed in the physical world for hundreds of years; like them, we need to make our Feature Stores last for lifetimes.