Replicating a Decoder-only Transformer Using the William Shakespeare Corpus
I started following the Deep Learning Curriculum written by Jacob Hilton, and here is what I learnt from the exercise in Topic 1 - Transformer. My solution is in the Colab notebook T1-Transformers-solution.ipynb.
It took me around 20 hours to finish the exercise, and it was totally worth it. Throughout the process I learnt:
- How to implement the transformer model end-to-end.
- How to gather and clean the data for a transformer model (see the data-preparation sketch after this list).
- How to implement positional embeddings, attention, the feed-forward network (FFN), and residual connections, and put all of them together into a transformer model (a sketch follows this list).
- Switching between `LayerNorm(x + SubLayer(x))` (post-LN) and `x + SubLayer(LayerNorm(x))` (pre-LN) didn't noticeably affect model performance on this task (both variants are sketched after this list).
- How to program in PyTorch more fluently, and I gathered a bunch of utility functions for later use.
- How to debug the model by inspecting gradient flow and using `torchviz.make_dot` to check the model structure (see the debugging sketch below).
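
To make the data step concrete, here is a minimal sketch of the kind of preparation involved, assuming a character-level tokenizer and a local shakespeare.txt file. The filename, split ratio, and batch shapes are illustrative choices, not the exact ones from my notebook.

```python
# Minimal data-preparation sketch: character-level tokenization of the corpus.
# "shakespeare.txt" and the 90/10 split are illustrative assumptions.
import torch

with open("shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Build a character-level vocabulary from the corpus.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return torch.tensor([stoi[c] for c in s], dtype=torch.long)

def decode(ids):
    return "".join(itos[int(i)] for i in ids)

data = encode(text)
n = int(0.9 * len(data))  # 90% train, 10% validation
train_data, val_data = data[:n], data[n:]

def get_batch(split, block_size=128, batch_size=32):
    """Sample random contiguous chunks; targets are inputs shifted by one."""
    src = train_data if split == "train" else val_data
    ix = torch.randint(len(src) - block_size - 1, (batch_size,))
    x = torch.stack([src[i : i + block_size] for i in ix])
    y = torch.stack([src[i + 1 : i + block_size + 1] for i in ix])
    return x, y
```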
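And here is a compact sketch of how those pieces fit together: learned positional embeddings, causal self-attention, an FFN, and residual connections around each sub-layer, written in the pre-LN form. The hyperparameters are illustrative defaults rather than the ones from my notebook.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, n_heads, T, head_dim).
        q, k, v = (t.reshape(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        # Causal mask: each position may only attend to itself and the past.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = att.masked_fill(mask, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

class Block(nn.Module):
    """Pre-LN decoder block: x + SubLayer(LayerNorm(x))."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.ffn(self.ln2(x))   # residual around FFN
        return x

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)  # learned positions
        self.blocks = nn.Sequential(*[Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        return self.head(self.ln_f(self.blocks(x)))
```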
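The two LayerNorm placements mentioned above differ only in where normalization sits relative to the residual addition. Reusing the Block class from the sketch above, the post-LN variant looks like this:

```python
class PostLNBlock(Block):
    """Post-LN variant: LayerNorm(x + SubLayer(x)).
    Normalization happens after the residual add, instead of
    inside the residual branch as in the pre-LN Block above."""
    def forward(self, x):
        x = self.ln1(x + self.attn(x))
        x = self.ln2(x + self.ffn(x))
        return x
```

At larger scales pre-LN is commonly reported to train more stably, but at this model size I saw no meaningful difference between the two.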
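Finally, a sketch of the debugging recipe: after one backward pass, plot the mean absolute gradient per parameter to spot vanishing gradients, and render the autograd graph with `torchviz.make_dot`. This reuses TinyDecoder and get_batch from the sketches above, and assumes torchviz (plus system Graphviz) is installed.

```python
# Debugging sketch: gradient-flow bar chart + computation-graph rendering.
# Requires `pip install torchviz` and a system Graphviz installation.
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchviz import make_dot

model = TinyDecoder(vocab_size=len(chars))
x, y = get_batch("train")
logits = model(x)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()

# Gradient flow: vanishing gradients show up as near-zero bars in early layers.
names, grads = [], []
for name, p in model.named_parameters():
    if p.grad is not None:
        names.append(name)
        grads.append(p.grad.abs().mean().item())
plt.figure(figsize=(10, 4))
plt.bar(range(len(grads)), grads)
plt.xticks(range(len(names)), names, rotation=90, fontsize=6)
plt.ylabel("mean |grad|")
plt.tight_layout()
plt.savefig("grad_flow.png")

# Render the autograd graph to check the model is wired as intended.
make_dot(loss, params=dict(model.named_parameters())).render("model_graph", format="png")
```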