References
Angelov, Dimo. 2020. "Top2Vec: Distributed Representations of
Topics." https://arxiv.org/abs/2008.09470.
Araci, Dogu. 2019. "FinBERT: Financial Sentiment Analysis with
Pre-Trained Language Models." https://arxiv.org/abs/1908.10063.
Asgari, Ehsaneddin, and Mohammad R. K. Mofrad. 2015. "Continuous
Distributed Representation of Biological Sequences for Deep Proteomics
and Genomics." PLOS ONE 10 (11): 1–15. https://doi.org/10.1371/journal.pone.0141287.
Beltagy, Iz, Kyle Lo, and Arman Cohan. 2019. "SciBERT: A
Pretrained Language Model for Scientific Text." https://arxiv.org/abs/1903.10676.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent
Dirichlet Allocation." Journal of Machine Learning Research 3:
993–1022.
Buuren, S. van. 2012. Flexible Imputation of Missing Data.
Chapman & Hall/CRC Interdisciplinary Statistics. CRC Press. https://books.google.com/books?id=elDNBQAAQBAJ.
Cañete, José, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang,
and Jorge Pérez. 2023. "Spanish Pre-Trained BERT Model and
Evaluation Data." https://arxiv.org/abs/2308.02976.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
"BERT: Pre-Training of Deep Bidirectional
Transformers for Language Understanding." In Proceedings of
the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), edited by Jill Burstein, Christy
Doran, and Thamar Solorio, 4171–86. Minneapolis, Minnesota: Association
for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423.
Galli, S. 2020. Python Feature Engineering Cookbook: Over 70 Recipes
for Creating, Engineering, and Transforming Features to Build Machine
Learning Models. Packt Publishing. https://books.google.com/books?id=2c_LDwAAQBAJ.
Géron, Aurélien. 2017. Hands-on Machine Learning with Scikit-Learn
and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent
Systems. Sebastopol, CA: O'Reilly Media.
Honnibal, Matthew, Ines Montani, Sofie Van Landeghem, and Adriane Boyd.
2020. "spaCy: Industrial-Strength Natural
Language Processing in Python." https://doi.org/10.5281/zenodo.1212303.
Huang, Kexin, Jaan Altosaar, and Rajesh Ranganath. 2020.
"ClinicalBERT: Modeling Clinical Notes and Predicting Hospital
Readmission." https://arxiv.org/abs/1904.05342.
Kuhn, M., and K. Johnson. 2013. Applied Predictive Modeling.
SpringerLink: Bücher. Springer New York. https://books.google.com/books?id=xYRDAAAAQBAJ.
———. 2019. Feature Engineering and Selection: A Practical Approach
for Predictive Models. Chapman & Hall/CRC Data Science Series.
CRC Press. https://books.google.com/books?id=q5alDwAAQBAJ.
Kuhn, M., and J. Silge. 2022. Tidy Modeling with R. O'Reilly
Media. https://books.google.com/books?id=98J6EAAAQBAJ.
Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush
Sharma, and Radu Soricut. 2020. "ALBERT: A Lite BERT for
Self-Supervised Learning of Language Representations." https://arxiv.org/abs/1909.11942.
Le, Quoc V., and Tomas Mikolov. 2014. "Distributed Representations
of Sentences and Documents." https://arxiv.org/abs/1405.4053.
Lee, Jieh-Sheng, and Jieh Hsiang. 2019. "PatentBERT: Patent
Classification with Fine-Tuning a Pre-Trained BERT Model." https://arxiv.org/abs/1906.02124.
Lee, Jinhyuk, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan
Ho So, and Jaewoo Kang. 2019. "BioBERT: A Pre-Trained Biomedical
Language Representation Model for Biomedical Text Mining." Edited
by Jonathan Wren. Bioinformatics 36 (4): 1234–40. https://doi.org/10.1093/bioinformatics/btz682.
Lewis, David D., Yiming Yang, Tony G. Rose, and Fan Li. 2004.
"RCV1: A New Benchmark Collection for Text
Categorization Research." Journal of Machine Learning
Research 5: 361–97. https://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi
Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
2019. "RoBERTa: A Robustly Optimized BERT Pretraining
Approach." https://arxiv.org/abs/1907.11692.
Luhn, H. P. 1960. "Key Word-in-Context Index for Technical
Literature (KWIC Index)." American Documentation 11 (4):
288–95. https://doi.org/10.1002/asi.5090110403.
Micci-Barreca, Daniele. 2001. "A Preprocessing Scheme for
High-Cardinality Categorical Attributes in Classification and Prediction
Problems." SIGKDD Explorations Newsletter 3 (1): 27–32. https://doi.org/10.1145/507533.507538.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013.
"Efficient Estimation of Word Representations in Vector
Space." https://arxiv.org/abs/1301.3781.
Ng, Patrick. 2017. "Dna2vec: Consistent Vector Representations of
Variable-Length k-Mers." https://arxiv.org/abs/1701.06279.
Nothman, Joel, Hanmin Qin, and Roman Yurchak. 2018. "Stop Word
Lists in Free Open-Source Software Packages." In Proceedings
of Workshop for NLP Open Source Software
(NLP-OSS), edited by Eunjeong L. Park,
Masato Hagiwara, Dmitrijs Milajevs, and Liling Tan, 7–12. Melbourne,
Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-2502.
Ozdemir, S. 2022. Feature Engineering Bookcamp. Manning. https://books.google.com/books?id=3n6HEAAAQBAJ.
Pargent, Florian, Florian Pfisterer, Janek Thomas, and Bernd Bischl.
2022. "Regularized Target Encoding Outperforms Traditional Methods
in Supervised Machine Learning with High Cardinality Features."
Computational Statistics 37 (5): 2671–92. https://doi.org/10.1007/s00180-022-01207-6.
Porter, Martin F. 1980. "An Algorithm for Suffix
Stripping." Program 14 (3): 130–37. https://doi.org/10.1108/eb046814.
———. 2001. "Snowball: A Language for Stemming Algorithms."
https://snowballstem.org.
Prokhorenkova, Liudmila, Gleb Gusev, Aleksandr Vorobev, Anna Veronika
Dorogush, and Andrey Gulin. 2019. "CatBoost: Unbiased Boosting
with Categorical Features." https://arxiv.org/abs/1706.09516.
Robertson, Stephen. 2004. "Understanding Inverse Document
Frequency: On Theoretical Arguments for IDF." Journal of
Documentation 60 (5): 503–20.
Rubin, Donald B. 1976. "Inference and Missing
Data." Biometrika 63 (3): 581–92. https://doi.org/10.1093/biomet/63.3.581.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020.
"DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper
and Lighter." https://arxiv.org/abs/1910.01108.
Sparck Jones, K. 1972. "A Statistical Interpretation of Term
Specificity and Its Application in Retrieval." Journal of
Documentation 28 (1): 11–21. https://doi.org/10.1108/eb026526.
Thakur, A. 2020. Approaching (Almost) Any Machine Learning
Problem. Amazon Digital Services LLC - KDP. https://books.google.com/books?id=ZbgAEAAAQBAJ.