Chapter 1

GPT-3 is a foundation model.

Notes & Takeaways

Stages when creating LLMs

  1. Data collection, sampling, attention mechanism, building the architecture
  2. Pretraining to build the foundation model
  3. Finetuning the foundation model for a specific task or domain

Transformer architecture

Encoder-decoder

Encoder has a multi-head attention layer and a feed-forward layer.

Decoder has a masked multi-head attention layer, a second multi-head attention layer that attends over the encoder's output, and a feed-forward layer.
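
To make the "masked" part concrete: the mask stops each token from looking at tokens that come after it. Here is a minimal sketch in plain Python. The function name `causal_mask` is made up for illustration and is not from the book, which builds this with PyTorch later on.

```python
def causal_mask(num_tokens):
    """Build a lower-triangular mask.

    Entry [i][j] is True when token i is allowed to attend to token j,
    i.e. when j is at or before position i.
    """
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


# Print the mask for a 4-token sequence: "x" = allowed, "." = blocked.
for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each row has one more "x" than the row above it, so the first token only sees itself and the last token sees the whole sequence.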

GPT architecture

Decoder only (no encoder).

Other takeaways

The transformer architecture is not easy, man, at least for a noob who has never read a paper or implemented an architecture before. Truthfully speaking, I'm still stuck on the scaled dot-product attention notation (Q, K, V). Well, I think I'm getting slightly ahead of myself by trying to understand this in chapter 1 when it's discussed in chapter 3 of the book. I'll just take it step by step from here, I guess.
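
For reference, the formula behind the notation is Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where Q holds the queries, K the keys, V the values, and d_k is the key dimension. Below is a rough sketch of that formula in dependency-free Python. It is only an illustration under my own assumptions: the helper names and toy numbers are made up, there is no masking or multi-head logic, and the book's chapter 3 version uses PyTorch instead.

```python
import math


def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors (one row per token)."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return output


# Toy example: both keys are identical, so the query weights them equally
# and the result is the plain average of the two value vectors.
Q = [[1, 1]]
K = [[1, 0], [1, 0]]
V = [[0, 2], [4, 6]]
print(scaled_dot_product_attention(Q, K, V))  # [[2.0, 4.0]]
```

The way I read it: each query asks "which keys look like me?", the softmax turns those similarity scores into weights, and the output is a weighted mix of the values.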

Overall, a straightforward chapter with a very high-level overview of things. Hands-on approaches start from chapter 2, I think!