Feature Engineering A-Z

Author

Emil Hvitfeldt

Published

January 28, 2024

Preface

Welcome to “Feature Engineering A-Z”! This book is written to be used as a reference guide to nearly all feature engineering methods you will encounter. This is reflected in the chapter structure. Any question a practitioner has should be answerable by looking at the index and finding the right chapter.

Each section tries to be as comprehensive as possible in the number of different methods and solutions it presents. A section on dimensionality reduction should list all the practical methods that could be used, as well as a comparison between the methods to help the reader decide what would be most appropriate. This does not mean that every method is recommended. Some of these methods have few, narrow use cases. Methods that are deemed too domain-specific have been excluded from this book.

Missing methods

If you think this book is missing a method, then please file an issue and we will evaluate if it should be added.

Each chapter will cover a specific method or small group of methods. This will include motivations and explanations for the method. Whenever possible each method will be accompanied by mathematical formulas and visualizations to illustrate the mechanics of the method. A small pros and cons list is provided for each method. Lastly, each section will include code snippets showcasing how to implement the methods. This is done in R and Python, using tidymodels and scikit-learn respectively. This book is a methods book first, and a coding book second.

Empty chapters

A chapter is prefixed with the emoji 🏗️ to indicate that it hasn’t been fully written yet.

What does this book not cover?

To keep the scope of this book as focused as possible, the following topics will not be covered in this book:

  • whole process modeling
  • case studies
  • deployment details
  • domain-specific methods

For whole process modeling, “Hands-On Machine Learning with Scikit-learn, Keras & Tensorflow” (2017), “Tidy modeling with R” (2022), “Approaching (almost) any machine learning problem” (2020) and “Applied Predictive Modeling” (2013) are all great resources. For feature engineering books that tell more of a story by going through case studies, I recommend “Python Feature Engineering Cookbook” (2020), “Feature Engineering Bookcamp” (2022) and “Feature Engineering and Selection” (2019). I have found that books on deployment and domain-specific methods are closely tied to the field and stack that you are using, so I am not able to give broad advice.

Who is this book for?

This book is designed to be used by people involved in the modeling of data. These include, but are not limited to, data scientists, students, professors, data analysts and machine learning engineers. The reference-style nature of the book makes it useful for both beginners and seasoned professionals. A background in the basics of modeling, statistics and machine learning would be helpful. Feature engineering as a practice is tightly connected to the rest of the machine learning pipeline, so knowledge of the other components is key.

Many educational resources skip over the finer details of feature engineering methods, which is where this book tries to fill the gap.

License

This book is licensed to you under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Rendering details

This book is rendered using Quarto (1.4.553), R (4.3.3) and Python (3.11.7). The website source is hosted on GitHub.

The following R packages are used to render the book, with tidymodels, recipes, embed, themis, textrecipes and timetk being the main packages.

corrr (0.4.4) embed (1.1.4) extrasteps (0.0.0.9000)
ggforce (0.4.2) ggraph (2.2.1) janeaustenr (1.0.0)
jsonlite (1.8.8) lme4 (1.1-35.3) Matrix (1.6-5)
patchwork (1.2.0) readr (2.1.5) remotes (2.5.0)
reshape (0.8.9) reticulate (1.36.1) rmarkdown (2.26)
splines2 (0.5.1) stopwords (2.3) text2vec (0.6.4)
textfeatures (0.3.3) textrecipes (1.0.6) tidymodels (1.2.0)
tidyverse (2.0.0)

The following Python libraries are used to render the book, with scikit-learn and feature-engine being the main ones.

appnope (0.1.4) asttokens (2.4.1) category-encoders (2.6.3)
cffi (1.16.0) colorama (0.4.6) comm (0.2.1)
debugpy (1.8.1) decorator (5.1.1) executing (2.0.1)
feature-engine (1.6.2) feazdata (0.0.1) ipykernel (6.29.2)
ipython (8.21.0) jedi (0.19.1) joblib (1.3.2)
jupyter-client (8.6.0) jupyter-core (5.7.1) matplotlib-inline (0.1.6)
nest-asyncio (1.6.0) numpy (1.26.4) packaging (23.2)
pandas (2.2.0) parso (0.8.3) patsy (0.5.6)
pexpect (4.9.0) platformdirs (4.2.0) prompt-toolkit (3.0.43)
psutil (5.9.8) ptyprocess (0.7.0) pure-eval (0.2.2)
pycparser (2.21) pygments (2.17.2) python-dateutil (2.8.2)
pytz (2024.1) pywin32 (306) pyyaml (6.0.1)
pyzmq (25.1.2) scikit-learn (1.4.0) scipy (1.12.0)
six (1.16.0) stack-data (0.6.3) statsmodels (0.14.1)
threadpoolctl (3.3.0) tornado (6.4) traitlets (5.14.1)
tzdata (2024.1) wcwidth (0.2.13)

Can I contribute?

Please feel free to improve the quality of this content by submitting pull requests. A merged PR will make you appear in the contributor list. It will, however, be considered a donation of your work to this project. You are still bound by the conditions of the license, meaning that you are not considered an author, copyright holder, or owner of the content once it has been merged in.

Acknowledgements

I’m so thankful for the contributions, help, and perspectives of people who have supported me in this project. There are several I would like to thank in particular.

I would like to thank my Posit colleagues on the tidymodels team (Hannah Frick, Max Kuhn, and Simon Couch) as well as the rest of our coworkers on the Posit open-source team. I also thank Javier Orraca-Deatcu, Matt Dancho and Mike Mahoney for looking over some of the chapters before the first release.

This book was written in the open, and multiple people contributed via pull requests or issues. Special thanks go to the two people who contributed via GitHub pull requests (in alphabetical order by username): Javier Orraca-Deatcu (@JavOrraca), Sol Feuerwerker (@sfeuerwerker).