In simple terms, feature engineering involves feeding knowledge into a machine learning model. As a refresher, a machine learning model is an algorithm that takes features as input and produces a prediction or classification as output. A feature is a measurable item that represents some facet of the knowledge the model will use.
Feature engineering isn’t necessarily a rigorous topic. In fact, even the definition is a bit fuzzy; is it just picking features or cleaning them and transforming them too? What is clear is how important this topic is to applied machine learning.
“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.” (Pedro Domingos)
As an example, it’s easy for people to differentiate an image of an apple from an image of a mango. And it’s obvious to a human that the Apple logo isn’t a real apple. But think about how you might describe these differences to another person. You would talk about the shape, the color, and whether there is natural shading or not. These distinguishing characteristics (lines, color, shape, etc.) are exactly the features that would feed a machine learning model! Picking these features, whether or not you also clean them up and transform them in some way, is what feature engineering is all about.
In the table below, we list a few machine learning problems and the major features used to address those problems:
Problem | Major features
Detecting faces in an image | eye colors, outlines, contours, edges
Recognizing speech | phonemes, noise ratios, length of sounds, relative power
Filtering email spam | words, parts-of-speech, email structure, grammatical correctness
Table: Examples of machine learning problems and their features
Importance – Why should we care?
Many machine learning models only know about the data they are given to learn from, and therefore follow a “garbage in, garbage out” principle. The exceptions are models that have embedded common sense and knowledge about the world. Think of baking cookies: you can mix flour, butter, eggs, sugar, and so on, but unless you include all the necessary ingredients and mix them in the right way and in the right proportions, you will end up with a gross mess. In the same way, predictive models are heavily influenced by the features you choose and prepare. If you choose good features, you end up with a clean and predictive model. If you don’t, you get a gross mess.
But good feature engineering is about more than just producing a good model. If done correctly, feature engineering can yield features that are general and flexible, so they can be re-used in other models for other purposes. They will also be easier to understand and maintain, and should lead to simpler models that need less manual tuning and optimization. Simply put, better features mean better results and more productive data science.
Feature Engineering Process
Features are important, and producing good ones can have a huge impact, so let’s talk about how we approach the feature engineering process. Our process is iterative, and alternates between feature selection and model evaluation. The process might look as follows:
Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can use from them. Imagine you have a categorical attribute in your data like “Item_Color” that can be Red, Blue or Unknown. You could create a new binary feature called “Has_Color” and assign it a value of “1” when an item has a color and “0” when the color is unknown. You can go a level further and create a binary feature for each value of Item_Color (e.g., Is_Red, Is_Blue and Is_Unknown). These additional features could be used instead of the Item_Color feature.
Devise features: The next step is to actually define your features. This involves two processes:
Feature extraction is the process of defining and pulling out a set of features that efficiently and meaningfully represent some information that is important for the analysis. Depending on the problem set-up, this can be automated, a manual process, or some combination of both. As an example, suppose we are building a classifier to determine how many syllables are in a word. Some potentially useful features to extract would be the length of the word, the number of contiguous vowels, the number of contiguous consonants, the type of character of the first letter, etc.
Feature construction involves transforming a given set of input features to generate a new set of more powerful features which can then be used for prediction (sometimes these are called “derived” features). As an example, in healthcare research, it may be useful to know someone’s age as a feature. But if the data only has “birthday” then age needs to be derived as age = today – birthday. Of course, features can be arbitrarily complex, but as we note above, we prefer simpler features with equivalent predictive power to their complex counterparts, since they are easier to maintain and re-use across models.
Select features: Once a bit is known about the data, and the potential features are defined, the next step involves picking the right features. Again, this has two major components. The first is feature selection, which is the process of selecting the subset of features most relevant to the task at hand. Doing this automatically continues to be an active area of machine learning research. For example, a spam classifier might select words, parts of speech, and email structure, but it might note during its feature selection process that grammatical correctness has little extra to add. The second is feature scoring, which is an assessment of the usefulness of a feature for prediction. This is usually done in an automated way, for instance by adding features one at a time and noting the resulting increase in accuracy. Using our spam classifier, we may find that the words themselves have a strong influence on accuracy, but the length of the email isn’t predictive.
Evaluate models: Finally, we evaluate the chosen features by measuring the model’s accuracy on unseen data.
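The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a recommended implementation: the toy spam data, the feature names has_free_word and long_email, and the helper names (color_features, age_in_years, fit_lookup, score_features) are all hypothetical, and the lookup-table classifier is a deliberately simple stand-in for a real model. It shows the Item_Color binary features, the derived age feature, and feature scoring by adding candidates one at a time and recording the change in accuracy on held-out data.

```python
from collections import Counter, defaultdict
from datetime import date


def color_features(item_color):
    """Binary features brainstormed from the categorical Item_Color."""
    return {
        "Has_Color": int(item_color != "Unknown"),
        "Is_Red": int(item_color == "Red"),
        "Is_Blue": int(item_color == "Blue"),
        "Is_Unknown": int(item_color == "Unknown"),
    }


def age_in_years(birthday, today):
    """Constructed feature: age = today - birthday, in whole years."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def fit_lookup(rows, labels, features):
    """Toy stand-in model: majority label per combination of feature values."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[tuple(row[f] for f in features)][label] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen keys
    return lambda row: table.get(tuple(row[f] for f in features), default)


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def score_features(train, train_y, valid, valid_y, candidates):
    """Add candidates one at a time; keep those that raise held-out accuracy."""
    chosen, gains = [], {}
    best = accuracy(fit_lookup(train, train_y, chosen), valid, valid_y)
    for feature in candidates:
        model = fit_lookup(train, train_y, chosen + [feature])
        acc = accuracy(model, valid, valid_y)
        gains[feature] = acc - best  # gain relative to the current feature set
        if acc > best:
            chosen.append(feature)
            best = acc
    return chosen, gains


# Hypothetical toy spam data: 1 means the property is present.
train = [
    {"has_free_word": 1, "long_email": 0},
    {"has_free_word": 1, "long_email": 1},
    {"has_free_word": 0, "long_email": 0},
    {"has_free_word": 0, "long_email": 1},
    {"has_free_word": 1, "long_email": 1},
    {"has_free_word": 0, "long_email": 0},
    {"has_free_word": 0, "long_email": 0},
]
train_y = ["spam", "spam", "ham", "ham", "spam", "ham", "ham"]
valid = [
    {"has_free_word": 1, "long_email": 0},
    {"has_free_word": 0, "long_email": 1},
    {"has_free_word": 1, "long_email": 1},
    {"has_free_word": 0, "long_email": 0},
]
valid_y = ["spam", "ham", "spam", "ham"]

chosen, gains = score_features(
    train, train_y, valid, valid_y, ["long_email", "has_free_word"]
)
print(chosen, gains)
# prints: ['has_free_word'] {'long_email': 0.0, 'has_free_word': 0.5}
```

On this toy data the word-based feature is kept and the length feature is dropped, mirroring the spam example above. In practice the gains would be measured on a validation split, with a separate test set reserved for the final model evaluation.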
About the Author
Avi Sanadhya is a Data Scientist at Retention Science. His interest lies in solving real-world problems at the intersection of Big Data and Machine Learning. He is pursuing his Master’s in Data Science at the University of Southern California, and received his B.E. in Computer Science from Manipal University.
Reference: Pedro Domingos, “A Few Useful Things to Know about Machine Learning,” Communications of the ACM, 2012. http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf