LLaMA 2 From Scratch
The full code is available in the project's GitHub repo. This project builds the LLaMA 2 architecture from scratch, incorporating the key advancements of modern transformer models: RMS-Normalization, the SwiGLU activation function, Rotary Positional Embeddings, and Grouped-Query Attention. Together, these improve model performance, particularly when handling longer context windows and reasoning about token positions.
Features ✨
- RMS-Normalization: A simplified form of layer normalization that stabilizes layer activations and aids model convergence.
- SwiGLU Activation Function: Replaces ReLU in the feed-forward layers for more efficient training.
- Rotary Positional Embeddings (RoPE): Improves positional awareness by encoding the relative distance between tokens directly into the attention computation, as introduced in RoFormer: Enhanced Transformer with Rotary Position Embedding (arXiv).
- Increased Context Length with Grouped-Query Attention (GQA): Expands the context window to 4096 tokens and uses grouped-query attention for more efficient processing of long documents.
- KV-Cache: Caches the keys and values of previously decoded tokens to improve decoding efficiency and speed.

Minimal PyTorch sketches of each of these components are shown below the list.
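A minimal sketch of RMS-Normalization (class and parameter names here are illustrative, not taken from this repo): unlike standard layer normalization, it rescales by the root mean square of the features, with no mean subtraction and no bias term.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, analogous to LayerNorm's gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the last dimension; no mean subtraction, no bias.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```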
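A sketch of the SwiGLU feed-forward block, assuming the usual gate/up/down projection layout; the layer names (`w_gate`, `w_up`, `w_down`) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: a SiLU-gated linear unit, then a projection back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```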
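A sketch of rotary positional embeddings implemented via complex-number rotation, assuming queries/keys shaped `(batch, seq_len, n_heads, head_dim)`; the function names are illustrative. Each pair of features is treated as a complex number and rotated by an angle proportional to the token's position, so attention scores depend on relative distances.

```python
import torch

def precompute_rope_freqs(head_dim: int, max_seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # Complex exponentials e^{i * m * theta_k} for each position m and frequency index k.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, freqs)                # (max_seq_len, head_dim / 2)
    return torch.polar(torch.ones_like(angles), angles)   # unit-magnitude complex numbers

def apply_rope(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); pair up features and rotate them as complex numbers.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs_cis = freqs_cis[: x.shape[1]].unsqueeze(0).unsqueeze(2)  # broadcast over batch and heads
    x_rotated = torch.view_as_real(x_complex * freqs_cis).flatten(-2)
    return x_rotated.type_as(x)
```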
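A combined sketch of grouped-query attention with a KV-cache for decoding, assuming `n_heads` is a multiple of `n_kv_heads`; the module and buffer names are hypothetical, RoPE application is indicated only by a comment, and the causal mask for the prefill pass is omitted for brevity. The cache is written in place, so this is intended for inference under `torch.no_grad()`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, seq_len, n_kv_heads, head_dim) -> (batch, seq_len, n_kv_heads * n_rep, head_dim)
    if n_rep == 1:
        return x
    bsz, seq_len, n_kv_heads, head_dim = x.shape
    return (
        x[:, :, :, None, :]
        .expand(bsz, seq_len, n_kv_heads, n_rep, head_dim)
        .reshape(bsz, seq_len, n_kv_heads * n_rep, head_dim)
    )

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int,
                 max_batch_size: int, max_seq_len: int):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.n_rep = n_heads // n_kv_heads           # query heads sharing one KV head
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        # KV-cache: keys/values of already-decoded positions are stored and reused.
        cache_shape = (max_batch_size, max_seq_len, n_kv_heads, self.head_dim)
        self.register_buffer("cache_k", torch.zeros(cache_shape), persistent=False)
        self.register_buffer("cache_v", torch.zeros(cache_shape), persistent=False)

    def forward(self, x: torch.Tensor, start_pos: int) -> torch.Tensor:
        bsz, seq_len, _ = x.shape
        q = self.wq(x).view(bsz, seq_len, self.n_heads, self.head_dim)
        k = self.wk(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim)
        # (RoPE would be applied to q and k here.)

        # Write the new keys/values into the cache, then read the whole prefix back
        # and repeat the KV heads so every query head has a matching key/value head.
        self.cache_k[:bsz, start_pos:start_pos + seq_len] = k
        self.cache_v[:bsz, start_pos:start_pos + seq_len] = v
        keys = repeat_kv(self.cache_k[:bsz, : start_pos + seq_len], self.n_rep).transpose(1, 2)
        values = repeat_kv(self.cache_v[:bsz, : start_pos + seq_len], self.n_rep).transpose(1, 2)

        q = q.transpose(1, 2)                        # (bsz, n_heads, seq_len, head_dim)
        scores = (q @ keys.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = F.softmax(scores, dim=-1) @ values
        out = out.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.wo(out)
```

During generation, the prompt is processed once with `start_pos=0`, and each subsequent token is fed one at a time with `start_pos` equal to the number of tokens already cached, so only a single new key/value pair is computed per step.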