Module 01
NLP Foundations
3 courses · Tokenization, morphology, syntax
Module 02
Core Applications
4 courses · IR, MT, sentiment, speech
Module 03
Large Language Models
3 courses · Embeddings, transformers, APIs
Module 04
Evaluation & Responsibility
2 courses · Metrics, bias, acceptance
C01 · Module 1
Text, Language & Tokenization
From raw text to linguistic units - the entry point for any NLP pipeline.
Tokenization · Unicode & encoding · Regex · n-grams · Edit distance · BPE
Topics covered
- What is language technology? Scope and historical overview
- Text representations: characters, words, sentences, documents
- Tokenization strategies: whitespace, rule-based, subword (BPE, WordPiece, SentencePiece)
- Unicode, normalization, encoding pitfalls for multilingual text
- Regular expressions as a text-processing workhorse
- String similarity: edit distance (Levenshtein), Jaccard, cosine on bags-of-words
- N-gram language models: intuition and basic formalism
- Overview of the course structure and evaluation approach
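As an illustration of the edit-distance topic listed above, here is a minimal sketch of Levenshtein distance using the standard dynamic-programming recurrence with unit costs. The function name `levenshtein` is chosen for this example and is not taken from the course materials.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (unit costs assumed)."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory grows with the length of `b` rather than with the product of both lengths.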
C02 · Module 1
Morphology & the Lexicon
Word structure, inflection, derivation, and computational approaches to the lexicon.
Morphemes · Stemming & lemmatization · FSTs · POS tagging · NER · Lexical resources
Topics covered
- Morphological typology: inflectional vs derivational, agglutinative vs fusional
- Stemming (Porter, Snowball) and lemmatization
- Finite-state transducers (FSTs) for morphological analysis
- Part-of-speech tagging: rule-based, HMM, neural
- Named entity recognition: sequence labelling, IOB notation
- Lexical resources: WordNet, FrameNet, domain-specific terminologies
- Challenges in morphologically rich languages (Arabic, Finnish, Turkish)
- Impact of tokenization choices on downstream tasks
C03 · Module 1
Syntax & Parsing
Phrase-structure grammars, dependency trees, and the CKY parsing algorithm.
CFG · CKY algorithm · Dependency parsing · Treebanks · Ambiguity · Semantic roles
Topics covered
- Phrase-structure grammars: CFG, PCFG
- CKY (Cocke-Kasami-Younger) algorithm - step-by-step walkthrough
- Parsing ambiguity and garden-path sentences
- Dependency grammars and Universal Dependencies
- Statistical and neural parsers (transition-based, graph-based)
- Treebanks as annotation resources (Penn, UD)
- Semantic role labelling and shallow semantic parsing
- Limits of syntax: when do we need semantics?
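The CKY algorithm named in the topics above can be sketched as a recognizer over a grammar in Chomsky normal form. The toy grammar, the lexicon, and the function name `cky_recognize` are hypothetical and exist only for this example. The sketch only decides whether a sentence is derivable and does not build parse trees.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (left, right) -> parents
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Hypothetical lexicon: word -> preterminal categories
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "chased": {"V"},
}


def cky_recognize(words, start="S"):
    """Return True if the toy grammar derives the word sequence from start."""
    n = len(words)
    if n == 0:
        return False
    # chart[(i, j)] is the set of nonterminals that derive words[i:j]
    chart = defaultdict(set)
    for i, word in enumerate(words):
        chart[(i, i + 1)] |= LEXICON.get(word, set())
    for span in range(2, n + 1):          # span length, shortest first
        for i in range(n - span + 1):     # span start
            j = i + span                  # span end (exclusive)
            for k in range(i + 1, j):     # split point
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        chart[(i, j)] |= BINARY.get((left, right), set())
    return start in chart[(0, n)]


print(cky_recognize("the dog chased the cat".split()))
```

Filling spans from shortest to longest guarantees that both halves of every split are complete before they are combined.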
C04 · Module 2
Information Retrieval & Text Classification
Finding and categorizing documents: the backbone of search engines and content pipelines.
TF-IDF · BM25 · Inverted index · Naive Bayes · SVM · Neural IR · Dense retrieval
Topics covered
- Boolean retrieval and the inverted index
- TF-IDF and the vector space model
- BM25 and probabilistic retrieval frameworks
- Text classification: Naive Bayes, logistic regression, SVM
- Feature engineering vs learned representations
- Evaluation: precision, recall, F1, MAP, NDCG
- Neural retrieval: dense passage retrieval (DPR), bi-encoders
- Application cases: legal document search, medical literature, enterprise search
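To make the TF-IDF topic above concrete, here is a minimal sketch of one common weighting variant: raw term frequency multiplied by log(N / df). Many other variants exist, so this choice is an assumption of the example, as are the function name `tf_idf` and the sample documents.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using raw term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Illustrative documents, already tokenized
docs = [["legal", "search", "search"], ["medical", "search"]]
for vector in tf_idf(docs):
    print(vector)
```

Under this variant a term that occurs in every document, such as "search" here, gets weight zero, which is the intended effect of the IDF factor.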
C05 · Module 2
Sentiment Analysis & Summarization
Extracting stance and condensing information - core tasks for business and media analysis.
Sentiment · Aspect-based SA · Opinion mining · Extractive · Abstractive · ROUGE
Topics covered
- Document-level and sentence-level sentiment classification
- Aspect-based sentiment analysis (ABSA)
- Subjectivity detection and opinion target extraction
- Lexicon-based vs machine-learning approaches
- Extractive summarization: TextRank, sentence scoring
- Abstractive summarization: sequence-to-sequence, pointer networks
- Faithfulness and factuality issues in summarization
- Applications: customer feedback, social media monitoring, news aggregation
C06 · Module 2
Machine Translation & Sequence Modelling
From rule-based systems to neural MT and the broader family of seq2seq problems.
SMT · Neural MT · Attention · Beam search · BLEU · Low-resource MT · Domain adaptation
Topics covered
- Historical arc: rule-based to SMT to neural MT
- Statistical MT: alignment models (IBM models), phrase tables
- Encoder-decoder architecture and the attention mechanism
- Beam search decoding and length penalties
- Evaluation: BLEU, chrF, human post-editing effort
- Low-resource and zero-shot MT
- Domain adaptation (legal, medical, technical MT)
- Human-in-the-loop: professional post-editing workflows
C07 · Module 2
Speech, Dialogue & Domain-Specific NLP
Spoken language interfaces, conversational systems, and adapting NLP to specialised domains.
ASR · TTS · Dialogue systems · Legal NLP · Clinical NLP · Scientific NLP
Topics covered
- Automatic speech recognition (ASR): acoustic models, language models, CTC
- Text-to-speech synthesis (TTS): concatenative, parametric, neural
- Dialogue systems: task-oriented vs open-domain
- Dialogue state tracking, policy, and natural language generation
- Legal NLP: contract analysis, case law retrieval, jurisdiction-specific challenges
- Clinical NLP: EHR processing, ICD coding, privacy (de-identification)
- Scientific and technical NLP: chemistry, biology, engineering text
- Ethical and regulatory constraints across domains
C08 · Module 3
From Statistical LMs to Word Embeddings
The conceptual bridge from count-based models to dense vector representations.
N-gram LMs · Smoothing · Word2Vec · GloVe · FastText · Distributional semantics
Topics covered
- N-gram language models: MLE, perplexity, smoothing (Kneser-Ney)
- Distributional hypothesis and word co-occurrence matrices
- Word2Vec: CBOW and Skip-gram, negative sampling
- GloVe and FastText (subword embeddings)
- Evaluating embeddings: analogy tasks, word similarity benchmarks
- Contextualised representations: ELMo as a stepping stone
- Bias in word embeddings and debiasing techniques
- Practical: visualising embedding spaces (PCA, t-SNE)
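The MLE and perplexity items above can be illustrated with an unsmoothed bigram model. This is a sketch: the function names, the `<s>` and `</s>` boundary markers, and the tiny corpus are assumptions made for the example. Kneser-Ney smoothing is deliberately left out.

```python
import math
from collections import Counter


def train_bigram_mle(sentences):
    """Return a function prob(w, prev) giving the relative-frequency
    estimate of P(w | prev), with <s> and </s> sentence markers."""
    bigrams, contexts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            bigrams[(prev, w)] += 1
            contexts[prev] += 1

    def prob(w, prev):
        return bigrams[(prev, w)] / contexts[prev] if contexts[prev] else 0.0

    return prob


def perplexity(prob, sentences):
    """Perplexity of the sentences under prob, counting </s> as a prediction."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            p = prob(w, prev)
            if p == 0.0:
                return math.inf  # unsmoothed MLE gives unseen bigrams zero mass
            log_sum += math.log(p)
            n += 1
    return math.exp(-log_sum / n)


corpus = [["a", "b"], ["a", "c"]]  # illustrative corpus
model = train_bigram_mle(corpus)
print(perplexity(model, [["a", "b"]]))
print(perplexity(model, [["b", "a"]]))
```

The second call returns infinity because the test sentence contains a bigram never seen in training, which is the practical motivation for smoothing.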
C09 · Module 3
Transformer Architecture & Pre-trained LLMs
Self-attention, BERT, GPT, and the pre-train / fine-tune paradigm.
Transformers · Self-attention · BERT · GPT · Fine-tuning · LoRA · Multimodal LLMs
Topics covered
- Recurrent nets and their limitations: vanishing gradients, sequential bottleneck
- The Transformer: multi-head self-attention, positional encoding, feed-forward layers
- Encoder-only models: BERT, RoBERTa - masked language modelling
- Decoder-only models: GPT family - causal language modelling
- Encoder-decoder models: T5, BART - span corruption and denoising
- Pre-training objectives and their inductive biases
- Fine-tuning, PEFT, LoRA, adapters - compute-efficient adaptation
- Multimodal LLMs: vision-language, speech-language
C10 · Module 3
Working with LLMs: APIs, RAG & Prompting
Practical skills for integrating and steering large language models in real applications.
Open vs closed · API usage · Prompt engineering · Chain-of-thought · RAG · Agents
Topics covered
- Open vs closed LLMs: capability, cost, privacy, and compliance trade-offs
- API design and usage patterns (OpenAI, Anthropic, Hugging Face)
- Prompt engineering: zero-shot, few-shot, chain-of-thought, self-consistency
- Instruction tuning and RLHF / RLAIF
- Retrieval-Augmented Generation (RAG): architecture, chunking, re-ranking
- LLM agents: tool use, function calling, multi-step reasoning
- Structured outputs, JSON mode, grammar-constrained generation
- Data privacy, security, and responsible deployment in organisations
C11 · Module 4
Intrinsic & Extrinsic Evaluation Metrics
Measuring NLP systems rigorously - from classification scores to generation quality.
Precision/Recall/F1 · Confusion matrix · BLEU · ROUGE · BERTScore · Kappa
Topics covered
- Evaluation philosophy: intrinsic vs extrinsic, offline vs online
- Classification metrics: precision, recall, F-measure, confusion matrix, macro/micro/weighted
- Sequence labelling: span-level evaluation, boundary matching
- Generation metrics: BLEU, ROUGE, METEOR, chrF, BERTScore
- Human evaluation: adequacy, fluency, coherence rating scales
- Inter-annotator agreement: Cohen's kappa, Fleiss' kappa, Krippendorff's alpha
- Statistical significance testing for NLP experiments
- Benchmark design, leaderboard pitfalls, and evaluation dataset contamination
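The classification metrics listed above can be sketched for single-label data as follows. The function names and the sample labels are assumptions made for this example, and scoring undefined ratios as 0.0 is one convention among several.

```python
from collections import Counter


def per_class_prf(gold, pred):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        prec = tp[label] / predicted if predicted else 0.0
        rec = tp[label] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1)
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    scores = per_class_prf(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


gold = ["pos", "pos", "neg", "neg"]  # illustrative labels
pred = ["pos", "neg", "neg", "neg"]
print(per_class_prf(gold, pred))
print(macro_f1(gold, pred))
```

In the single-label setting, micro-averaged F1 over all classes equals accuracy, which is why macro averaging is usually the more informative figure on imbalanced data.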
C12 · Module 4
Responsible NLP: Bias, Hallucinations & Acceptance
Critical perspectives on deployment - fairness, reliability, and defining done.
Hallucinations · Factuality · Bias & fairness · Acceptance criteria · EU AI Act · Oversight
Topics covered
- Hallucination taxonomy: intrinsic vs extrinsic, closed vs open-domain
- Measuring and mitigating hallucinations in generation systems
- Bias: types (representation, measurement, aggregation), sources, mitigation strategies
- Fairness metrics: demographic parity, equalized odds, calibration
- Robustness and adversarial evaluation (CheckList methodology)
- Acceptance criteria: defining readiness for deployment in regulated sectors
- AI regulation landscape: EU AI Act, GDPR implications, sector-specific rules
- Human oversight, audit trails, and continuous monitoring in production