Mark Tenenholtz

Over the last year, I've spent hundreds of hours training transformers for NLP. I went back over my most successful projects and competitions and distilled them into a solid, repeatable process that anyone can follow. 7 steps to train transformers:

1. Immerse yourself in the data

Most of the data preprocessing is handled for you when you use a pretrained tokenizer. However, you have to understand how your tokenizer interacts with your dataset. Sometimes important words get split up, some tokens are unknown, etc.
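
One way to check this is to count how many pieces each word is split into and how often the unknown token shows up. The following is a minimal sketch, assuming your tokenizer exposes a callable that maps a word to a list of string pieces (for a Hugging Face tokenizer, its `tokenize` method should fit). `audit_tokenization`, `toy_tokenize`, the toy vocabulary, and the `[UNK]` marker are all made up for illustration.

```python
from collections import Counter

def audit_tokenization(texts, tokenize, unk_token="[UNK]", max_pieces=3):
    """Report words that split into many pieces or map to the unknown token."""
    fragmented, unknown = Counter(), Counter()
    total_pieces = 0
    for text in texts:
        for word in text.split():
            pieces = tokenize(word)
            total_pieces += len(pieces)
            if unk_token in pieces:
                unknown[word] += 1
            elif len(pieces) > max_pieces:
                fragmented[word] += 1
    return {
        "total_pieces": total_pieces,
        "unknown": unknown.most_common(20),
        "fragmented": fragmented.most_common(20),
    }

# Hypothetical stand-in tokenizer: known words stay whole, non-alphabetic
# words become [UNK], everything else is cut into 2-character chunks.
TOY_VOCAB = {"the", "patient", "has"}

def toy_tokenize(word):
    w = word.lower()
    if not w.isalpha():
        return ["[UNK]"]
    if w in TOY_VOCAB:
        return [w]
    return [w[i:i + 2] for i in range(0, len(w), 2)]

report = audit_tokenization(
    ["The patient has tachycardia", "dose q8h"], toy_tokenize
)
print(report["unknown"])
print(report["fragmented"])
```

Swapping `toy_tokenize` for your real tokenizer and running this over a sample of your corpus gives you a short list of words worth looking at by hand.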

2. Create a baseline

Simple models like logistic regression or LightGBM on top of TF-IDF or bag-of-words features are usually a solid baseline. If a transformer can't significantly outperform one of those on your task, it's probably not worth pursuing.
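
A baseline like this takes only a few lines with scikit-learn. The sketch below assumes a classification task. The toy texts, the labels, and the hyperparameters (bigrams, `sublinear_tf`, 2-fold cross-validation) are placeholders, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data standing in for your own texts and labels.
texts = [
    "great movie, loved it",
    "wonderful acting and a great story",
    "loved the soundtrack",
    "a wonderful, moving film",
    "terrible movie, hated it",
    "awful acting and a boring story",
    "hated the soundtrack",
    "a boring, awful film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feeding a logistic regression classifier.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)

# Cross-validate with the same metric you plan to use for the transformer.
scores = cross_val_score(baseline, texts, labels, cv=2, scoring="accuracy")
print(f"baseline accuracy: {scores.mean():.3f}")
```

Whatever score this produces on your real data is the number the transformer has to beat.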

3. Set up a pipeline

I generally start off with something really simple and reliable:

• Fixed random seed
• DeBERTa V3 base or small
• AdamW optimizer, no schedule
• Experiment tracking and logging
• Short sequences (e.g. 256-512 tokens)

Make sure this runs, bug-free!
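
As a rough sketch of the starting point, the settings above can live in one small config object next to a seeding helper. The `Config` class, its field names, and the learning rate value are assumptions made for illustration. `microsoft/deberta-v3-base` is the Hugging Face model id for DeBERTa V3 base.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    seed: int = 42
    model_name: str = "microsoft/deberta-v3-base"
    max_length: int = 256          # short sequences to start with
    learning_rate: float = 2e-5    # assumption: a common fine-tuning starting value
    use_lr_schedule: bool = False  # AdamW with a constant learning rate

def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass

cfg = Config()
seed_everything(cfg.seed)
print(cfg)
```

Logging the full config with every run makes it much easier to tell later which change produced which result.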

4. Error analysis and QA

Now that you've run your pipeline, the easiest way to do QA is to make sure you're producing sensible outputs. This is where error analysis comes in. I usually start with sorting by the largest errors and seeing if I find any obvious issues.
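
Sorting by the largest errors can be as simple as the sketch below. It assumes numeric targets and predictions (regression values, or the predicted probability of the positive class). `largest_errors` and the example rows are hypothetical.

```python
def largest_errors(texts, targets, predictions, k=20):
    """Return the k examples with the largest absolute error, worst first."""
    rows = [
        {"text": t, "target": y, "prediction": p, "abs_error": abs(y - p)}
        for t, y, p in zip(texts, targets, predictions)
    ]
    return sorted(rows, key=lambda r: r["abs_error"], reverse=True)[:k]

# Made-up validation examples for illustration.
val_texts = ["loved it", "not bad at all", "great, just great..."]
val_targets = [1.0, 1.0, 0.0]
val_predictions = [0.9, 0.3, 0.9]

worst = largest_errors(val_texts, val_targets, val_predictions, k=2)
for row in worst:
    print(row["text"], row["target"], row["prediction"])
```

Reading the worst examples one by one tends to surface labeling mistakes, truncation problems, and preprocessing bugs quickly.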

5. Start improving your model

The first methods I try are usually:

• Longer sequences
• Text preprocessing
• Testing the effect of scaling up
• MLM pretraining on your corpus
• Adding new tokens to your tokenizer

When are these useful?

For example, adding new tokens can be useful when working in a specialized domain. Sometimes there are words like medical acronyms that your model won't otherwise understand.
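
With Hugging Face `transformers`, adding tokens comes down to two calls: `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))`. The helper below is a minimal sketch around those calls. `add_domain_tokens` and the example acronyms are hypothetical, and the exact-match vocabulary check is crude (subword tokenizers often store words with a prefix marker).

```python
def add_domain_tokens(tokenizer, model, candidates):
    """Add tokens missing from the vocabulary and resize the embedding matrix."""
    vocab = tokenizer.get_vocab()
    # Crude filter: only an exact string match counts as "already known".
    new_tokens = [tok for tok in candidates if tok not in vocab]
    num_added = tokenizer.add_tokens(new_tokens)
    if num_added > 0:
        # The new embedding rows have not been trained, so they only become
        # useful after fine-tuning (or MLM pretraining) on your data.
        model.resize_token_embeddings(len(tokenizer))
    return new_tokens

# Usage (needs the transformers library and a model download):
# from transformers import AutoModel, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
# model = AutoModel.from_pretrained("microsoft/deberta-v3-base")
# add_domain_tokens(tokenizer, model, ["COPD", "NSTEMI"])
```

Before adding a token, it is worth checking how the existing tokenizer splits it. A word that already maps to one or two sensible pieces may not need its own entry.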

As for text preprocessing, one example is chaotic capitalization, which can confuse your model. In these cases, lowercasing your text before tokenization can help.

As a sidebar: The performance of a transformer is heavily dependent on its capacity. It can be very helpful to add specialized heads, like an LSTM on the last hidden state.

However! Sometimes, gains you make on a "small" or "base" variant will vanish if you scale up to a large (or bigger) variant. I recommend you test the effect of scaling variants before testing other methods of scaling. Some experts don't even try base variants for this reason.

So, improve your model until you start to overfit. Once you start to overfit, it's time to regularize!
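
One simple way to flag the point where overfitting starts is to watch for validation loss stalling while training loss keeps falling. The heuristic below is only a sketch. The `is_overfitting` helper, the `patience` parameter, and the loss values are assumptions for illustration.

```python
def is_overfitting(train_losses, val_losses, patience=2):
    """True if val loss has not beaten its earlier best for `patience` epochs
    while train loss is still going down."""
    if len(train_losses) != len(val_losses):
        raise ValueError("need one train loss and one val loss per epoch")
    if len(val_losses) <= patience:
        return False
    best_val_before = min(val_losses[:-patience])
    val_stalled = all(v >= best_val_before for v in val_losses[-patience:])
    train_improving = train_losses[-1] < train_losses[-patience - 1]
    return val_stalled and train_improving

# Made-up per-epoch losses.
train = [0.90, 0.60, 0.40, 0.30, 0.20]
val = [0.90, 0.70, 0.60, 0.62, 0.65]
print(is_overfitting(train, val))
```

A check like this is easy to wire into your experiment tracking so that every run records the epoch where the gap opens up.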

6. Add regularization

There are a bunch of ways to regularize deep learning models, but my favorite (to start with) for transformers is dropout. With transformers, there are many places to try dropout.

You can try:

• Token dropout
• Internal hidden dropout
• Internal attention dropout
• Dropout on your output layer

But, you should note that dropout on regression tasks usually hurts your model performance, so I would avoid it.
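
Hidden, attention, and output-layer dropout are usually just rates set in the model config or head. Token dropout happens on the input side. One common interpretation, assumed here, is to replace random non-special token ids with the mask token id during training. `token_dropout`, the ids, and the default rate below are made up for illustration.

```python
import random

def token_dropout(input_ids, mask_token_id, p=0.1, special_ids=(), rng=random):
    """Replace each non-special token id with the mask id with probability p."""
    return [
        mask_token_id if tok not in special_ids and rng.random() < p else tok
        for tok in input_ids
    ]

# Hypothetical ids: 1 and 2 mark sequence start/end, 0 is padding, 103 is the mask.
ids = [1, 2054, 2003, 1037, 3231, 2, 0, 0]
rng = random.Random(0)
augmented = token_dropout(ids, mask_token_id=103, p=0.3, special_ids={0, 1, 2}, rng=rng)
print(augmented)
```

Apply this only to training batches, and leave validation and test inputs untouched.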

There are also adversarial training methods like Fast Gradient Method (FGM) and Adversarial Weight Perturbation (AWP). Basically, they perform adversarial attacks during training (FGM perturbs the model's embeddings, AWP perturbs its weights) to improve robustness at the cost of extra training time.
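
The core of FGM is a single perturbation: take the gradient g of the loss with respect to the embeddings and step in its direction with a fixed L2 norm, r = epsilon * g / ||g||. The sketch below shows only that step, on plain Python lists rather than tensors. `fgm_perturbation`, the `epsilon` value, and the numbers are assumptions for illustration.

```python
import math

def fgm_perturbation(grad, epsilon=1.0):
    """Return r = epsilon * g / ||g||_2 for a gradient given as a flat list."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0 or math.isnan(norm):
        # No usable gradient direction, so apply no perturbation.
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

# Made-up embedding vector and its loss gradient.
embedding = [0.2, -0.1, 0.4]
grad = [3.0, 0.0, -4.0]

r = fgm_perturbation(grad, epsilon=0.5)
adversarial_embedding = [e + d for e, d in zip(embedding, r)]
print(adversarial_embedding)
```

In a training loop, as FGM is commonly implemented, you would run a normal backward pass, add r to the embedding weights, run a second forward and backward pass to accumulate the adversarial gradients, restore the original embeddings, and then take the optimizer step. That second pass is where the extra training time comes from.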

7. Repeat 5+6!

At this point, you should continue to try to squeeze out more accuracy and then regularize when you begin to overfit. Just make sure you don't melt your GPU!

TL;DR:

1. Immerse yourself in the data
2. Create a baseline
3. Set up a pipeline
4. Error analysis and QA
5. Start improving your model
6. Add regularization
7. Repeat 5+6

Follow me @marktenenholtz for more high-signal ML content!

If you like this content, I go into much more detail on this topic and similar ones in my newsletter. Check it out:
