61 BERT

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2019) is a different way to represent text as data than what we have seen before in this section. In the bag-of-words method, we counted the occurrence of each token, regardless of the order of the tokens. BERT, on the other hand, looks at the previous and following words, hence the name Bidirectional.
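
To see what gets lost when order is ignored, here is a small sketch using scikit-learn’s CountVectorizer (an assumption on our part; any bag-of-words implementation behaves the same way): two sentences with opposite meanings end up with identical counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the dog bit the man", "the man bit the dog"]

# Bag-of-words counts each token and discards the order they appear in.
counts = CountVectorizer().fit(texts)
print(counts.get_feature_names_out())     # ['bit' 'dog' 'man' 'the']
print(counts.transform(texts).toarray())  # both rows are identical
```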

One of the reasons why BERT excels is that, since it looks at the surrounding tokens when trying to give a token meaning, it can better distinguish between “duck” the animal and “duck” the movement. This added nuance to tokens in their context is what gives it an edge at times. This is of course much more complicated than counting tokens by themselves, but the added complexity pays off in many ways, as BERT models have been used in quite a lot of places, such as search and translation.
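
As a rough illustration, the sketch below (assuming the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint) compares the contextual vector BERT produces for “duck” in two different sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def duck_vector(sentence):
    # Encode the sentence and pull out the hidden state for the "duck" token
    # (assumes "duck" stays a single WordPiece token, which it does here).
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("duck")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[0, idx]

animal = duck_vector("the duck swam across the pond")
motion = duck_vector("you need to duck under the low beam")
print(torch.cosine_similarity(animal, motion, dim=0))  # noticeably below 1
```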

BERT works by using a transformer architecture. In simple terms, a transformer consists of an encoder and a decoder. The encoder maps the tokens to numeric arrays, and the decoder turns those numeric arrays into predictions. The BERT model in question is just the encoder part of the transformer.
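
A minimal sketch of the encoder half (again assuming transformers and torch): the encoder takes a sequence of tokens and returns one numeric array per token.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # the encoder only, no task head

inputs = tokenizer("text as data", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional array per token: (batch size, number of tokens, 768).
print(outputs.last_hidden_state.shape)
```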

BERT models are trained on ordered tokens. Special tokens are added to give the model more information about the sentence structure: [SEP] is a separator token that marks the end of a sentence or the boundary between a pair of sentences, and [CLS] (short for “classification”) is placed at the start of every sequence. Adding these tokens during preprocessing introduces structure into the data that the model is able to pick up on.
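
We can see the special tokens being added by the tokenizer itself in this sketch (assuming the transformers package and the bert-base-uncased checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a pair of sentences adds [CLS] at the start and [SEP] after each sentence.
encoded = tokenizer("the dog ran home.", "it was tired.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# approximately:
# ['[CLS]', 'the', 'dog', 'ran', 'home', '.', '[SEP]', 'it', 'was', 'tired', '.', '[SEP]']
```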

Once the data is properly structured, we can fit the model. BERT is a Masked Language Model (MLM), meaning that during the training loop a token is masked, and the model tries to predict the masked token. Remember that this is done using an encoder and a decoder, where the decoder tries to predict the masked token. Once training is done, we only need the encoder, which maps tokens into numeric arrays.
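
To get a feel for the masked-language-model objective, the sketch below (assuming the transformers package) uses a pre-trained BERT together with its MLM prediction head to fill in a [MASK] token:

```python
from transformers import pipeline

# bert-base-uncased with its masked-language-model head predicts the hidden token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The chef [MASK] a delicious meal."):
    print(f'{prediction["token_str"]:>12}  {prediction["score"]:.3f}')
```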

When we apply a BERT model, we are actually looking up the tokens in the embedding. These embeddings can be created from scratch, but it is much more common to use pre-trained embeddings and fine-tune them as needed. Fine-tuning the model involves additional training on our specific data to update the weights in the model.
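
The lookup table itself is easy to inspect, as in this sketch (assuming transformers and the bert-base-uncased checkpoint): every WordPiece token in the vocabulary has its own learned vector, which is what the encoder starts from.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# The input embedding is a lookup table: ~30,000 WordPiece tokens, 768 values each.
print(model.get_input_embeddings())  # Embedding(30522, 768, padding_idx=0)
```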

Important

When using a pre-trained model, it is important that you use the exact same tokenizer the model was trained with; otherwise you won’t have proper token coverage and you will get bad results. BERT uses WordPiece as its tokenizer.
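
In practice this means loading the tokenizer and the model from the same checkpoint, as in this sketch (assuming transformers and the bert-base-uncased checkpoint):

```python
from transformers import AutoTokenizer, AutoModel

# Pulling both pieces from the same checkpoint keeps the tokenizer and model in sync.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# WordPiece splits words it doesn't know into sub-word pieces it does know.
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']
```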

BERT as a model is very useful, and a number of spin-off models have been developed. ALBERT (Lan et al. 2020) and DistilBERT (Sanh et al. 2020) are smaller versions that offer comparable performance at a lower computational cost. RoBERTa (Liu et al. 2019) is trained using more data and a modified training procedure, giving it stronger performance.

Since we are able to fine-tune a BERT model on a specific data set, people have also released domain- and language-specific models, such as models trained on biomedical or legal text, or on languages other than English.

With all of this in mind, there are a couple of downsides. The first is the increased computational cost. This is felt twice: once for the initial training on the corpus and again when applying the model. You can alleviate this by using smaller models, which is in line with the general advice to start small and only add complexity as needed.

The second problem you can run into is that BERT has a maximum token limit (512 tokens for the original BERT models). One way to deal with this is to chunk up text that is over the limit into smaller, manageable pieces, process them separately, and combine the results.
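
A minimal sketch of that chunking strategy (assuming transformers and torch; mean-pooling the chunks is just one of several reasonable ways to combine them):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_long(text, chunk_size=510):
    # Tokenize without special tokens, then split into chunks that fit the
    # 512-token limit, leaving room for [CLS] and [SEP] in each chunk.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

    vectors = []
    for chunk in chunks:
        input_ids = torch.tensor([[tokenizer.cls_token_id, *chunk, tokenizer.sep_token_id]])
        with torch.no_grad():
            hidden = model(input_ids=input_ids).last_hidden_state
        vectors.append(hidden.mean(dim=1))  # mean-pool the tokens in this chunk

    # Combine the per-chunk vectors into a single representation for the document.
    return torch.cat(vectors).mean(dim=0)

print(embed_long("a very long document " * 400).shape)  # torch.Size([768])
```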

Lastly, it is known that the performance of any given BERT model depends very much on the data it was trained on. This means that a general BERT model is unlikely to work well on a domain-specific problem without fine-tuning.

61.2 Pros and Cons

61.2.1 Pros

  • Often improves predictive performance over simpler count-based representations
  • More nuanced token information that takes surrounding context into consideration

61.2.2 Cons

  • Computationally expensive
  • Has token input limits
  • Will need to be fine-tuned for domain-specific tasks

61.3 R Examples

61.4 Python Examples
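
Below is a minimal sketch of how a pre-trained BERT model could be used as a feature-engineering step in Python, assuming the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint. It mean-pools the token vectors into one fixed-length vector per document, which can then be used as features in any downstream model.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bert_features(texts):
    # Tokenize a batch of documents, padding and truncating to a shared length.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    # Mean-pool the token vectors, ignoring the padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

texts = [
    "I liked this movie a whole lot",
    "what a complete waste of time",
]
features = bert_features(texts)
print(features.shape)  # torch.Size([2, 768])
```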