Machine Translation Learning
MT Bibliography (expanding…)
(entries marked with * are blog posts, tutorials, or visual explanations)
- Prerequisites: general ML concepts, blogs, tutorials
- Statistical Machine Translation & Language Models
- Evaluation
- Data Collection, Alignment
- Neural MT with RNNs
- Neural MT with CNNs
- Tokenizers
- Neural MT with Transformers
- Neural MT with Diffusion Models
- Back to the future
- Other Courses
Prerequisites: general ML concepts, blogs, tutorials
- Linear Algebra
- Multivariate Calculus
- Probability Course
- Beautiful Visualizations - Proba and Stats
- Bayesian Statistics
- Count Bayes Blog
- Five Minutes Stats
- Expectation Maximization (EM) Foundations
- EM for Gaussian Mixture Models
- Hidden Markov Models, EM, and Viterbi
- Information Theory, Entropy, KL-Divergence
- Monte Carlo / Metropolis
- DS Handbook
Prerequisites: general ML concepts, books
- Pattern Recognition And Machine Learning - huge book of the early 2000s, excellent coverage of probability distributions, graphical models, Bayesian inference
- Statistical Foundations of ML - course syllabus, good coverage of probabilities, statistical tests, general treatment of ML methods
- Information Theory, Inference, and Learning Algorithms - huge book of the early 2000s, excellent coverage of information theory and probabilistic inference
- Math for ML - when you want to start from the very basics of algebra, geometry, calculus, probabilities
- Probabilistic Machine Learning Book 1,2,3 - excellent coverage of foundations, and more advanced topics (like diffusion models)
- Deep Learning Book - good place to understand neural networks
Statistical Machine Translation & Language Models
- *Kevin Knight’s Workbook
- *Lena Voita’s explanations on LM
- Koehn’s SMT book (SMT from scratch)
- Kneser-Ney smoothing, 1995, *tutorial
- Och’s PhD thesis, 2002
- Mathematics of SMT, 2003
- N-gram Language Models, Jurafsky, SLP
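Before diving into the readings above, it helps to see the core mechanic of an n-gram language model in code. The sketch below trains a bigram model and scores words with simple linear interpolation against the unigram distribution; this is a toy stand-in for the Kneser-Ney smoothing covered in the readings (function names are my own, not from any of the cited works):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams from tokenized sentences (toy corpus)."""
    unigrams, bigrams = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def interp_prob(w_prev, w, unigrams, bigrams, lam=0.7):
    """Interpolated probability: lam * P(w | w_prev) + (1 - lam) * P(w).
    Much simpler than Kneser-Ney, but shows the same smoothing idea:
    back off to a lower-order distribution for unseen bigrams."""
    total = sum(unigrams.values())
    p_uni = unigrams[w] / total
    p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni
```

Because the interpolation weights sum to one, the conditional distribution over the vocabulary still sums to one for any observed context.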
Evaluation
- BLEU, Papineni et al., 2002
- Statistical significance tests, 2004
- Statistical significance tests of models’ correlation, 2014
- chrF++, 2015
- Comparison of metrics, Fomicheva & Specia, 2018
- A Call for Clarity in Reporting BLEU Scores, 2018, sacrebleu
- Good translation wrong in context, 2019
- BERTScore, 2020
- Scientific Credibility, 2021
- COMET, more recent paper 2022
- Lab Notebook: MT Eval 1, MT Eval 2
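To make the BLEU readings concrete, here is a toy sentence-level BLEU: clipped n-gram precision (the "modified precision" of Papineni et al.) combined into a geometric mean with a brevity penalty. This is a teaching sketch with a single reference; for reportable scores use sacrebleu, as the "Call for Clarity" paper argues:

```python
from collections import Counter
import math

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision (single reference, toy version)."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of n-gram precisions
    times a brevity penalty. Not comparable to published scores."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)
```

The clipping is the key idea: a candidate cannot gain credit by repeating a reference word more often than the reference contains it.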
Data Collection, Alignment
- *bitextor
- Parallel corpora for medium density languages, 2005, hunalign
- Word Alignment with Markov Chain Monte Carlo, 2016, efmaral
- Backtranslation, 2015
- Word Alignments Without Parallel Training Data, 2020, SimAlign
- Aligned segments from unclean parallel data, 2020
- Comparison of GIZA++ vs. Neural Word Alignment, 2020
- Massively Multilingual Sentence Embeddings, 2019
- Multilingual Sentence Embeddings, 2020
- Mining Using Distilled Sentence Representations, 2022, LASER
- MT for the next 1000 Lang, 2022
- Lab Notebook: Using LASER
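The embedding-based mining papers above share one core step: score candidate sentence pairs by similarity in a shared multilingual space and keep the confident ones. The sketch below does greedy nearest-neighbor mining with plain cosine similarity over toy vectors; real LASER-style pipelines use margin-based scoring and an approximate-nearest-neighbor index such as FAISS (the function names here are my own):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(src_embs, tgt_embs, threshold=0.8):
    """Greedy bitext mining sketch: for each source embedding, pick the
    closest target embedding and keep the pair if its similarity clears
    the threshold. Toy version: brute force, no margin criterion."""
    pairs = []
    for i, u in enumerate(src_embs):
        j, score = max(((j, cosine(u, v)) for j, v in enumerate(tgt_embs)),
                       key=lambda t: t[1])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

The threshold trades precision against recall: raising it yields cleaner but smaller mined corpora, which is exactly the tension the "unclean parallel data" paper above addresses.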
Neural MT with RNNs
- *Seq2seq Models With Attention
- *Seq2seq Models Tutorial
- *Another tutorial
- *Different attention types
- *Tutorial on training RNNs, 2002-2013
- Learning Long-Term Dependencies with Gradient Descent is Difficult, 1994
- LSTM, 1997
- Neural Probabilistic Language Model, 2003, also here
- Seq2seq learning with NNs, 2014
- RNN Encoder-Decoder, 2014
- Seq2seq with Attention, 2015
- More Types of Attention, 2015
- Lab Tutorial: Training an RNN seq2seq, Generic LLM training using Axolotl, Unbabel models
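The attention mechanism from the 2015 seq2seq papers reduces to three steps: score the decoder query against each encoder state, normalize the scores with a softmax, and return the weighted sum of the encoder values. A dependency-free sketch of dot-product attention (one of the scoring variants in the "More Types of Attention" reading; names are my own):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Dot-product attention: score each encoder state against the query,
    softmax the scores, and return the weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
    return context, weights
```

In a real RNN decoder the query is the current decoder hidden state and the keys/values are the encoder hidden states; the context vector is then fed into the next decoding step.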
Neural MT with CNNs
- Language Modeling with Gated Convolutional Networks, 2016
- Convolutional Sequence to Sequence Learning, 2017
- *Tutorial with code
- Lab Tutorial: Training a CNN seq2seq
Tokenizers
- *Byte Pair Encoding
- *Tokenizers
- *Understanding Sentencepiece
- *EM, Viterbi, Unigram LM
- Byte-Pair Encoding Compression, 1994
- Byte-Pair Encoding Tokenization, 2015
- Unigram LM Tokenizer, 2018
- sentencepiece library, 2018, code
- BPE Dropout, 2020
- Lab Tutorial: sentencepiece only
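The BPE papers above rest on one loop: count adjacent symbol pairs across the word-frequency table, merge the most frequent pair, repeat. A toy version of that loop (the naive string replace is fine for this demo; production implementations like sentencepiece track symbol boundaries explicitly):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency.
    vocab maps space-separated symbol strings to counts, e.g. {"l o w": 5}."""
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Learn BPE merges in the style of the 2015 tokenization paper:
    repeatedly merge the most frequent adjacent symbol pair."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

At apply time the learned merge list is replayed in order on new words, which is why the merges (not the final vocabulary) are the model.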
Transformers - Tutorials
- Illustrated Transformer
- Lena Voita’s Tutorial
- The Annotated Transformer
- Peter Bloem’s Tutorial
- Illustrated BERT
- Illustrated GPT-2
- Huggingface Transformers Tutorial
- Annotated GPT-2
- Transformer for people outside NLP
- E2ML School Tutorial
Transformers - Essential Readings
- Attention is all you Need, 2017
- BERT, 2019
- GPT-2, 2019
- WMT2021 Baselines and Models, 2021
- *Models in huggingface
Other Transformer Models
- GPT-3, 2020, open gpt flavors
- ELECTRA, 2019, hgfce
- RoBERTa, 2019, hgfce
- BART, 2020, hgfce
- mBART, 2020, hgfce
- *Reformer, 2020, hgfce
- T5, 2020, hgfce
- M2M-100, 2021, model, hgfce
- Lab Tutorial: T5
Transformers and Explainability
- Visualizing Attention, 2019
- Is Attention Interpretable?, 2019
- Quantifying Attention Flow in Transformers, 2020
- Transformer Interpretability Beyond Attention, 2021, code
Machine Translation Frameworks
- Marian MT, 2018
- OpenNMT, 2017
- fairseq, 2019
- JoeyNMT, 2019
- Huggingface seq2seq
Extra Readings on Machine Translation
- Gender Bias in MT, 2019
- MT Domain Robustness, 2019
- Fixed Encoder Self-Attention Patterns, 2020
- Translationese, 2020
- Character-level NMT, 2021
Recent / Interesting Research
- Synchronous Bidirectional Beam Search, 2019
- Specialized Heads Do the Heavy Lifting, 2019, code, tutorial
- Transformer Circuits, 2021
- Why Beam Search Works, 2021
- What Works Best for Zero-Shot, 2022
- Contrastive Text Generation, 2022, code
- Induction Heads, 2022
- Wide Attention vs Depth, 2022
- The 48 params of BERT, 2022
- Mixture of Experts, 2022
- The Importance of Attention, 2022
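Several entries in this section (notably "Why Beam Search Works") analyze the standard decoding algorithm itself. A minimal beam search sketch over a fixed table of per-step log-probabilities; a real decoder would condition each step's distribution on the prefix, but the bookkeeping is the same (names are my own):

```python
import math

def beam_search(step_scores, beam_size=2):
    """Toy beam search. step_scores[t] maps token -> log-probability at
    step t (context-free here for brevity). Keeps the beam_size highest
    scoring prefixes at every step and returns the final beam."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for scores in step_scores:
        candidates = [(seq + [tok], lp + tok_lp)
                      for seq, lp in beams
                      for tok, tok_lp in scores.items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

With beam size 1 this reduces to greedy decoding; the papers above probe why moderate beams outperform both greedy search and exact maximum-likelihood decoding in NMT.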
Neural MT with Diffusion Models
- *Energy-based Models
- Translation with DM, 2021
- Text Generation, 2021
- DiffuSeq, 2022
Back to the future
- Shannon’s Autoregressive Language Models, 1950
- ALPAC report, 1966, summary here
- Statistical Methods and Linguistics, 1995
- The Future of MT, seen from 1985
- MT in the USSR, 1984
- Early MT in Romania
- Soviet MT overview, Gordin, 2020
- Survey of MT in USSR, 2010