In this seminar, participants will journey through the history and development of Large Language Models. The course begins with the origins of word embeddings and tokenization, then progresses through the significant architectural milestones that led to modern transformer-based LLMs.
Emphasis is placed on reading and understanding the most influential and widely cited papers that paved the way for contemporary approaches in natural language processing.
Requirements
============
Students should have a basic understanding of neural networks, backpropagation, recurrent neural networks (RNNs), LSTMs, etc.
Some key papers
===============
[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems 26 (2013).
[2] Bahdanau, Dzmitry, et al. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR (2015).
[3] Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (2017).
[4] Shazeer, Noam, et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR (2017).
[5] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2019).
[6] Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020).
[7] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022).
[8] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.120 (2022).
[9] Wang, Guan, et al. "Hierarchical Reasoning Model." arXiv preprint arXiv:2506.21734 (2025).