References

Angelov, Dimo. 2020. Top2Vec: Distributed Representations of Topics. https://arxiv.org/abs/2008.09470.
Araci, Dogu. 2019. FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models. https://arxiv.org/abs/1908.10063.
Asgari, Ehsaneddin, and Mohammad R. K. Mofrad. 2015. “Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics.” PLOS ONE 10 (11): 1–15. https://doi.org/10.1371/journal.pone.0141287.
Beltagy, Iz, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. https://arxiv.org/abs/1903.10676.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.
Buuren, S. van. 2012. Flexible Imputation of Missing Data. Chapman & Hall/CRC Interdisciplinary Statistics. CRC Press. https://books.google.com/books?id=elDNBQAAQBAJ.
Candès, Emmanuel J., Xiaodong Li, Yi Ma, and John Wright. 2011. “Robust Principal Component Analysis?” Journal of the ACM 58 (3). https://doi.org/10.1145/1970392.1970395.
Cañete, José, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, and Jorge Pérez. 2023. Spanish Pre-Trained BERT Model and Evaluation Data. https://arxiv.org/abs/2308.02976.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), edited by Jill Burstein, Christy Doran, and Thamar Solorio. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423.
Galli, S. 2020. Python Feature Engineering Cookbook: Over 70 Recipes for Creating, Engineering, and Transforming Features to Build Machine Learning Models. Packt Publishing. https://books.google.com/books?id=2c_LDwAAQBAJ.
Géron, Aurélien. 2017. Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media.
Honnibal, Matthew, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303.
Huang, Kexin, Jaan Altosaar, and Rajesh Ranganath. 2020. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. https://arxiv.org/abs/1904.05342.
Kuhn, M., and K. Johnson. 2013. Applied Predictive Modeling. Springer New York. https://books.google.com/books?id=xYRDAAAAQBAJ.
Kuhn, M., and K. Johnson. 2019. Feature Engineering and Selection: A Practical Approach for Predictive Models. Chapman & Hall/CRC Data Science Series. CRC Press. https://books.google.com/books?id=q5alDwAAQBAJ.
Kuhn, M., and J. Silge. 2022. Tidy Modeling with R. O’Reilly Media. https://books.google.com/books?id=98J6EAAAQBAJ.
Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. https://arxiv.org/abs/1909.11942.
Le, Quoc V., and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. https://arxiv.org/abs/1405.4053.
Lee, Jieh-Sheng, and Jieh Hsiang. 2019. PatentBERT: Patent Classification with Fine-Tuning a Pre-Trained BERT Model. https://arxiv.org/abs/1906.02124.
Lee, Jinhyuk, Wonjin Yoon, Sungdong Kim, et al. 2019. “BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining.” Bioinformatics 36 (4): 1234–40. https://doi.org/10.1093/bioinformatics/btz682.
Lewis, David D., Yiming Yang, Tony G. Rose, and Fan Li. 2004. “RCV1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5: 361–97. https://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
Liu, Yinhan, Myle Ott, Naman Goyal, et al. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692.
Luhn, H. P. 1960. “Key Word-in-Context Index for Technical Literature (KWIC Index).” American Documentation 11 (4): 288–95. https://doi.org/10.1002/asi.5090110403.
Micci-Barreca, Daniele. 2001. “A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems.” SIGKDD Explorations Newsletter 3 (1): 27–32. https://doi.org/10.1145/507533.507538.
Mika, Sebastian, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. 1998. “Kernel PCA and De-Noising in Feature Spaces.” In Advances in Neural Information Processing Systems, edited by M. Kearns, S. Solla, and D. Cohn, vol. 11. MIT Press. https://proceedings.neurips.cc/paper_files/paper/1998/file/226d1f15ecd35f784d2a20c3ecf56d7f-Paper.pdf.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781.
Mougan, Carlos, David Masip, Jordi Nin, and Oriol Pujol. 2021. “Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems.” In Modeling Decisions for Artificial Intelligence, edited by Vicenç Torra and Yasuo Narukawa. Springer International Publishing.
Ng, Patrick. 2017. Dna2vec: Consistent Vector Representations of Variable-Length k-Mers. https://arxiv.org/abs/1701.06279.
Nothman, Joel, Hanmin Qin, and Roman Yurchak. 2018. “Stop Word Lists in Free Open-Source Software Packages.” In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), edited by Eunjeong L. Park, Masato Hagiwara, Dmitrijs Milajevs, and Liling Tan. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-2502.
Ozdemir, S. 2022. Feature Engineering Bookcamp. Manning. https://books.google.com/books?id=3n6HEAAAQBAJ.
Pargent, Florian, Florian Pfisterer, Janek Thomas, and Bernd Bischl. 2022. “Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High Cardinality Features.” Computational Statistics 37 (5): 2671–92. https://doi.org/10.1007/s00180-022-01207-6.
Porter, Martin F. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–37. https://doi.org/10.1108/eb046814.
Porter, Martin F. 2001. Snowball: A Language for Stemming Algorithms. https://snowballstem.org.
Prokhorenkova, Liudmila, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2019. CatBoost: Unbiased Boosting with Categorical Features. https://arxiv.org/abs/1706.09516.
Robertson, Stephen. 2004. “Understanding Inverse Document Frequency: On Theoretical Arguments for IDF.” Journal of Documentation 60 (5): 503–20.
Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92. https://doi.org/10.1093/biomet/63.3.581.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. https://arxiv.org/abs/1910.01108.
Sparck Jones, Karen. 1972. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation 28 (1): 11–21. https://doi.org/10.1108/eb026526.
Thakur, A. 2020. Approaching (Almost) Any Machine Learning Problem. Amazon Digital Services LLC - KDP. https://books.google.com/books?id=ZbgAEAAAQBAJ.
Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2006. “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics 15 (2): 265–86. http://www.jstor.org/stable/27594179.