Replicating a Decoder-only Transformer Using the William Shakespeare Corpus
I started following the Deep Learning Curriculum written by Jacob Hilton, and here is what I learnt from the exercise in Topic 1 - Transformer. My solution is in the Colab notebook T1-Transformers-solution.ipynb.
It took me around 20 hours to finish the exercise, and it was totally worth it. Throughout the process I learnt:
- How to implement the transformer model end-to-end.
- How to gather and clean the data for a transformer model (see the data-preparation sketch after this list).
- How to implement positional embeddings, attention, the feed-forward network (FFN), and residual connections, and put all of them together into a transformer model (a sketch follows this list).
- Switching between `LayerNorm(x + SubLayer(x))` (post-LN) and `x + SubLayer(LayerNorm(x))` (pre-LN) didn't noticeably affect model performance on this task (both variants are sketched after this list).
- How to program in PyTorch more fluently, and I gathered a bunch of utility functions for later use.
- How to debug the model by inspecting gradient flow and using `torchviz.make_dot` to check the model structure (see the debugging sketch below).
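
To make the data step concrete, here is a minimal sketch of the kind of preparation involved, assuming a character-level tokenizer and a local shakespeare.txt file. The filename, split ratio, and batch shapes are illustrative choices, not the exact ones from my notebook.

```python
# Minimal data-preparation sketch: character-level tokenization of the corpus.
# "shakespeare.txt" and the 90/10 split are illustrative assumptions.
import torch

with open("shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Build a character-level vocabulary from the corpus.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return torch.tensor([stoi[c] for c in s], dtype=torch.long)

def decode(ids):
    return "".join(itos[int(i)] for i in ids)

data = encode(text)
n = int(0.9 * len(data))  # 90% train, 10% validation
train_data, val_data = data[:n], data[n:]

def get_batch(split, block_size=128, batch_size=32):
    """Sample random contiguous chunks; targets are inputs shifted by one."""
    src = train_data if split == "train" else val_data
    ix = torch.randint(len(src) - block_size - 1, (batch_size,))
    x = torch.stack([src[i : i + block_size] for i in ix])
    y = torch.stack([src[i + 1 : i + block_size + 1] for i in ix])
    return x, y
```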
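And here is a compact sketch of how those pieces fit together: learned positional embeddings, causal self-attention, an FFN, and residual connections around each sub-layer, written in the pre-LN form. The hyperparameters are illustrative defaults rather than the ones from my notebook.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, n_heads, T, head_dim).
        q, k, v = (t.reshape(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        # Causal mask: each position may only attend to itself and the past.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = att.masked_fill(mask, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

class Block(nn.Module):
    """Pre-LN decoder block: x + SubLayer(LayerNorm(x))."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.ffn(self.ln2(x))   # residual around FFN
        return x

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)  # learned positions
        self.blocks = nn.Sequential(*[Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        return self.head(self.ln_f(self.blocks(x)))
```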
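The two LayerNorm placements mentioned above differ only in where normalization sits relative to the residual addition. Reusing the Block class from the sketch above, the post-LN variant looks like this:

```python
class PostLNBlock(Block):
    """Post-LN variant: LayerNorm(x + SubLayer(x)).
    Normalization happens after the residual add, instead of
    inside the residual branch as in the pre-LN Block above."""
    def forward(self, x):
        x = self.ln1(x + self.attn(x))
        x = self.ln2(x + self.ffn(x))
        return x
```

At larger scales pre-LN is commonly reported to train more stably, but at this model size I saw no meaningful difference between the two.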
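Finally, a sketch of the debugging recipe: after one backward pass, plot the mean absolute gradient per parameter to spot vanishing gradients, and render the autograd graph with `torchviz.make_dot`. This reuses TinyDecoder and get_batch from the sketches above, and assumes torchviz (plus system Graphviz) is installed.

```python
# Debugging sketch: gradient-flow bar chart + computation-graph rendering.
# Requires `pip install torchviz` and a system Graphviz installation.
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchviz import make_dot

model = TinyDecoder(vocab_size=len(chars))
x, y = get_batch("train")
logits = model(x)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()

# Gradient flow: vanishing gradients show up as near-zero bars in early layers.
names, grads = [], []
for name, p in model.named_parameters():
    if p.grad is not None:
        names.append(name)
        grads.append(p.grad.abs().mean().item())
plt.figure(figsize=(10, 4))
plt.bar(range(len(grads)), grads)
plt.xticks(range(len(names)), names, rotation=90, fontsize=6)
plt.ylabel("mean |grad|")
plt.tight_layout()
plt.savefig("grad_flow.png")

# Render the autograd graph to check the model is wired as intended.
make_dot(loss, params=dict(model.named_parameters())).render("model_graph", format="png")
```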