Faculty of Computer Science  ·  Master's Level

Language Technology

A rigorous, interdisciplinary curriculum bridging linguistic theory and modern AI — from tokenization and syntax to large language models, retrieval-augmented generation, and responsible deployment. Designed for a mixed cohort of computer scientists, lawyers, medical practitioners, and entrepreneurs.

12 Courses  ·  11 Labs  ·  4 Modules  ·  1h30 per session

Curriculum — 12 courses

Module 01
NLP Foundations
3 courses  ·  Tokenization, morphology, syntax
Module 02
Core Applications
4 courses  ·  IR, MT, sentiment, speech
Module 03
Large Language Models
3 courses  ·  Embeddings, transformers, APIs
Module 04
Evaluation & Responsibility
2 courses  ·  Metrics, bias, acceptance
C01  ·  Module 1
Text, Language & Tokenization
From raw text to linguistic units: the entry point for any NLP pipeline.
Tokenization · Unicode & encoding · Regex · n-grams · Edit distance · BPE
Topics covered
  • What is language technology? Scope and historical overview
  • Text representations: characters, words, sentences, documents
  • Tokenization strategies: whitespace, rule-based, subword (BPE, WordPiece, SentencePiece)
  • Unicode, normalization, encoding pitfalls for multilingual text
  • Regular expressions as a text-processing workhorse
  • String similarity: edit distance (Levenshtein), Jaccard, cosine on bags-of-words
  • N-gram language models: intuition and basic formalism
  • Overview of the course structure and evaluation approach
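To make the subword tokenization topic concrete, here is a minimal sketch of how byte-pair encoding (BPE) learns its merge rules. It assumes a toy word-frequency table and omits end-of-word markers and byte-level handling; the function name `learn_bpe` and the sample corpus are illustrative, not part of the course material.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> Tuple[List[Tuple[str, str]], Dict[Tuple[str, ...], int]]:
    """Learn up to num_merges BPE merges from a word -> frequency table (hypothetical helper)."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy corpus (assumed for illustration only).
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, vocab = learn_bpe(toy_counts, num_merges=3)
print(merges)
```

Production tokenizers add details this sketch leaves out (pre-tokenization, special tokens, deterministic tie-breaking), so treat it as an intuition aid only.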
C02  ·  Module 1
Morphology & the Lexicon
Word structure, inflection, derivation, and computational approaches to the lexicon.
Morphemes · Stemming & lemmatization · FSTs · POS tagging · NER · Lexical resources
Topics covered
  • Morphological typology: inflectional vs derivational, agglutinative vs fusional
  • Stemming (Porter, Snowball) and lemmatization
  • Finite-state transducers (FSTs) for morphological analysis
  • Part-of-speech tagging: rule-based, HMM, neural
  • Named entity recognition: sequence labelling, IOB notation
  • Lexical resources: WordNet, FrameNet, domain-specific terminologies
  • Challenges in morphologically rich languages (Arabic, Finnish, Turkish)
  • Impact of tokenization choices on downstream tasks
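As a small illustration of IOB notation for named entity recognition, the sketch below converts a tagged token sequence into labelled entity spans. The function name `iob_to_spans` and the example sentence are assumptions made for illustration. The decoder is deliberately lenient: a stray `I-` tag simply opens a new entity.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Turn parallel token / IOB-tag lists into (label, text) entity spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues and label is not None:
            # Any tag that does not continue the open entity closes it.
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # B- always opens an entity; a stray I- is treated the same way.
            start, label = i, tag[2:]
    if label is not None:
        # Entity running to the end of the sentence.
        spans.append((label, " ".join(tokens[start:])))
    return spans


tokens = ["Marie", "Curie", "worked", "in", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, tags))
```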
C03  ·  Module 1
Syntax & Parsing
Phrase-structure grammars, dependency trees, and the CKY parsing algorithm.
CFG · CKY algorithm · Dependency parsing · Treebanks · Ambiguity · Semantic roles
Topics covered
  • Phrase-structure grammars: CFG, PCFG
  • CKY (Cocke-Kasami-Younger) algorithm - step-by-step walkthrough
  • Parsing ambiguity and garden-path sentences
  • Dependency grammars and Universal Dependencies
  • Statistical and neural parsers (transition-based, graph-based)
  • Treebanks as annotation resources (Penn, UD)
  • Semantic role labelling and shallow semantic parsing
  • Limits of syntax: when do we need semantics?
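The CKY walkthrough can be previewed with a minimal recognizer. The sketch assumes a toy grammar already in Chomsky normal form. The rules, the words, and the function name `cky_recognize` are invented for illustration. It only answers "is this sentence in the language?" and does not build parse trees or handle probabilities.

```python
from collections import defaultdict
from typing import List

# Hypothetical toy grammar in Chomsky normal form.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Det", "N"): {"NP"},
}
LEXICAL_RULES = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "chased": {"V"},
}


def cky_recognize(words: List[str], start: str = "S") -> bool:
    """Return True if the toy grammar derives the word sequence from `start`."""
    n = len(words)
    if n == 0:
        return False
    # chart[(i, j)] holds the nonterminals that can span words[i:j].
    chart = defaultdict(set)
    for i, word in enumerate(words):
        chart[(i, i + 1)] = set(LEXICAL_RULES.get(word, ()))
    # Fill longer spans from shorter ones, trying every split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        chart[(i, j)] |= BINARY_RULES.get((left, right), set())
    return start in chart[(0, n)]


print(cky_recognize("the dog chased the cat".split()))
```

The three nested loops over span length, start position, and split point give the cubic running time usually quoted for CKY.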
C04  ·  Module 2
Information Retrieval & Text Classification
Finding and categorizing documents: the backbone of search engines and content pipelines.
TF-IDF · BM25 · Inverted index · Naive Bayes · SVM · Neural IR · Dense retrieval
Topics covered
  • Boolean retrieval and the inverted index
  • TF-IDF and the vector space model
  • BM25 and probabilistic retrieval frameworks
  • Text classification: Naive Bayes, logistic regression, SVM
  • Feature engineering vs learned representations
  • Evaluation: precision, recall, F1, MAP, NDCG
  • Neural retrieval: dense passage retrieval (DPR), bi-encoders
  • Application cases: legal document search, medical literature, enterprise search
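For the BM25 topic, a compact scoring sketch may help. It assumes documents are already tokenized, uses the common `k1 = 1.5`, `b = 0.75` defaults, and applies the "+1 inside the log" IDF variant so scores stay non-negative. The function name `bm25_scores` and the three toy documents are illustrative.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    ["contract", "breach", "damages"],
    ["medical", "trial", "results"],
    ["contract", "law", "contract", "clause"],
]
print(bm25_scores(["contract"], docs))
```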
C05  ·  Module 2
Sentiment Analysis & Summarization
Extracting stance and condensing information - core tasks for business and media analysis.
Sentiment · Aspect-based SA · Opinion mining · Extractive · Abstractive · ROUGE
Topics covered
  • Document-level and sentence-level sentiment classification
  • Aspect-based sentiment analysis (ABSA)
  • Subjectivity detection and opinion target extraction
  • Lexicon-based vs machine-learning approaches
  • Extractive summarization: TextRank, sentence scoring
  • Abstractive summarization: sequence-to-sequence, pointer networks
  • Faithfulness and factuality issues in summarization
  • Applications: customer feedback, social media monitoring, news aggregation
C06  ·  Module 2
Machine Translation & Sequence Modelling
From rule-based systems to neural MT and the broader family of seq2seq problems.
SMT · Neural MT · Attention · Beam search · BLEU · Low-resource MT · Domain adaptation
Topics covered
  • Historical arc: rule-based to SMT to neural MT
  • Statistical MT: alignment models (IBM models), phrase tables
  • Encoder-decoder architecture and the attention mechanism
  • Beam search decoding and length penalties
  • Evaluation: BLEU, chrF, human post-editing effort
  • Low-resource and zero-shot MT
  • Domain adaptation (legal, medical, technical MT)
  • Human-in-the-loop: professional post-editing workflows
C07  ·  Module 2
Speech, Dialogue & Domain-Specific NLP
Spoken language interfaces, conversational systems, and adapting NLP to specialised domains.
ASR · TTS · Dialogue systems · Legal NLP · Clinical NLP · Scientific NLP
Topics covered
  • Automatic speech recognition (ASR): acoustic models, language models, CTC
  • Text-to-speech synthesis (TTS): concatenative, parametric, neural
  • Dialogue systems: task-oriented vs open-domain
  • Dialogue state tracking, policy, and natural language generation
  • Legal NLP: contract analysis, case law retrieval, jurisdiction-specific challenges
  • Clinical NLP: EHR processing, ICD coding, privacy (de-identification)
  • Scientific and technical NLP: chemistry, biology, engineering text
  • Ethical and regulatory constraints across domains
C08  ·  Module 3
From Statistical LMs to Word Embeddings
The conceptual bridge from count-based models to dense vector representations.
N-gram LMs · Smoothing · Word2Vec · GloVe · FastText · Distributional semantics
Topics covered
  • N-gram language models: MLE, perplexity, smoothing (Kneser-Ney)
  • Distributional hypothesis and word co-occurrence matrices
  • Word2Vec: CBOW and Skip-gram, negative sampling
  • GloVe and FastText (subword embeddings)
  • Evaluating embeddings: analogy tasks, word similarity benchmarks
  • Contextualised representations: ELMo as a stepping stone
  • Bias in word embeddings and debiasing techniques
  • Practical: visualising embedding spaces (PCA, t-SNE)
C09  ·  Module 3
Transformer Architecture & Pre-trained LLMs
Self-attention, BERT, GPT, and the pre-train / fine-tune paradigm.
Transformers · Self-attention · BERT · GPT · Fine-tuning · LoRA · Multimodal LLMs
Topics covered
  • Recurrent nets and their limitations: vanishing gradients, sequential bottleneck
  • The Transformer: multi-head self-attention, positional encoding, feed-forward layers
  • Encoder-only models: BERT, RoBERTa - masked language modelling
  • Decoder-only models: GPT family - causal language modelling
  • Encoder-decoder models: T5, BART - span corruption and denoising
  • Pre-training objectives and their inductive biases
  • Fine-tuning, PEFT, LoRA, adapters - compute-efficient adaptation
  • Multimodal LLMs: vision-language, speech-language
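The core of self-attention fits in a few lines. The sketch below implements single-head scaled dot-product attention on plain Python lists. It leaves out the learned query/key/value projections, multiple heads, masking, and positional encoding, and the function names are illustrative.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (outputs, attention weights) for single-head attention."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(row, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights


out, attn = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(out, attn)
```

With two identical keys the query attends to both equally, so the output is simply the mean of the two value vectors.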
C10  ·  Module 3
Working with LLMs: APIs, RAG & Prompting
Practical skills for integrating and steering large language models in real applications.
Open vs closed · API usage · Prompt engineering · Chain-of-thought · RAG · Agents
Topics covered
  • Open vs closed LLMs: capability, cost, privacy, and compliance trade-offs
  • API design and usage patterns (OpenAI, Anthropic, Hugging Face)
  • Prompt engineering: zero-shot, few-shot, chain-of-thought, self-consistency
  • Instruction tuning and RLHF / RLAIF
  • Retrieval-Augmented Generation (RAG): architecture, chunking, re-ranking
  • LLM agents: tool use, function calling, multi-step reasoning
  • Structured outputs, JSON mode, grammar-constrained generation
  • Data privacy, security, and responsible deployment in organisations
C11  ·  Module 4
Intrinsic & Extrinsic Evaluation Metrics
Measuring NLP systems rigorously - from classification scores to generation quality.
Precision/Recall/F1 · Confusion matrix · BLEU · ROUGE · BERTScore · Kappa
Topics covered
  • Evaluation philosophy: intrinsic vs extrinsic, offline vs online
  • Classification metrics: precision, recall, F-measure, confusion matrix, macro/micro/weighted
  • Sequence labelling: span-level evaluation, boundary matching
  • Generation metrics: BLEU, ROUGE, METEOR, chrF, BERTScore
  • Human evaluation: adequacy, fluency, coherence rating scales
  • Inter-annotator agreement: Cohen's kappa, Fleiss' kappa, Krippendorff's alpha
  • Statistical significance testing for NLP experiments
  • Benchmark design, leaderboard pitfalls, and evaluation dataset contamination
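Inter-annotator agreement is easy to compute by hand for two annotators. The sketch below implements Cohen's kappa as observed agreement corrected for the agreement expected by chance. The function name and the four-item example are illustrative, and the edge case where chance agreement is 1 is resolved by returning 1.0, which is an assumption of this sketch.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))
```

In this example the annotators agree on 3 of 4 items (0.75) while chance agreement is 0.5, giving a kappa of 0.5.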
C12  ·  Module 4
Responsible NLP: Bias, Hallucinations & Acceptance
Critical perspectives on deployment - fairness, reliability, and defining done.
Hallucinations · Factuality · Bias & fairness · Acceptance criteria · EU AI Act · Oversight
Topics covered
  • Hallucination taxonomy: intrinsic vs extrinsic, closed vs open-domain
  • Measuring and mitigating hallucinations in generation systems
  • Bias: types (representation, measurement, aggregation), sources, mitigation strategies
  • Fairness metrics: demographic parity, equalized odds, calibration
  • Robustness and adversarial evaluation (CheckList methodology)
  • Acceptance criteria: defining readiness for deployment in regulated sectors
  • AI regulation landscape: EU AI Act, GDPR implications, sector-specific rules
  • Human oversight, audit trails, and continuous monitoring in production

Labs — 11 × 90 min

L00  ·  Before Module 2  ·  Foundation lab
Software development practices for NLP projects
Git versioning, project structure, basic testing, and deployment — a shared toolkit reused in every subsequent lab.
90 min timeline
Framing (20 min)
No-code track (35 min)
Advanced track (35 min)
Synthesis & check (20 min)
Shared framing (20 min): The instructor shows a single real failure: an NLP script that "works on my machine" but breaks in a colleague's environment, has no version history, and cannot be tested automatically. The lab fixes all three problems. Both tracks end with the same artefact: a structured project folder, version-controlled on GitHub, with at least one automated check, ready to be reused in every subsequent lab.
Four concepts covered
Version control
Git + GitHub: commit, branch, pull request, merge conflict. Why history matters for reproducibility in research.
Project structure
Cookiecutter-style layout: data/, src/, notebooks/, tests/, README, requirements.txt / pyproject.toml.
Test automation
A single pytest smoke test plus a GitHub Actions workflow that runs it on every push. Green badge = deployable.
Deployment basics
Packaging a script as a CLI tool or a minimal Gradio / Streamlit app and sharing a public link — not production, but demonstrable.
Low / no-code track
20-55 min (35 min)
GitHub Desktop + GitHub Projects + HuggingFace Spaces
  • Install GitHub Desktop (GUI client, no terminal). Clone the provided course template repo.
  • Create a branch called lab-l00-yourname using the UI. Edit the README — add your name and a one-line project description.
  • Commit the change with a meaningful message (feat: add project description) — introduces the conventional commits convention.
  • Open a Pull Request on GitHub.com. Review a classmate's PR, leave one comment, approve it.
  • Set up a GitHub Projects Kanban board: columns To Do / In Progress / Done. Add 3 cards for the next 3 labs.
  • Deploy the provided Gradio demo to HuggingFace Spaces via drag-and-drop — get a live public URL.
  • Share the URL in the class chat. This URL is reused in Lab L04 for the IR classifier.
Tools: GitHub Desktop · GitHub Projects · HuggingFace Spaces · browser only
Advanced IT track
20-55 min (35 min)
Git CLI + pytest + GitHub Actions + Docker basics
  • Fork and clone the course template repo via CLI. Inspect the folder structure: data/, src/, tests/, notebooks/, pyproject.toml.
  • Create a feature branch, add a tokenizer utility function to src/text_utils.py, write 3 pytest unit tests (normal input, empty string, Unicode).
  • Run pytest -v locally — all green. Commit and push; open a PR with a short description.
  • Inspect the provided .github/workflows/ci.yml — understand the trigger, environment, install, and test steps. Push a deliberately failing test, observe the red check, then fix it.
  • Add a Makefile with targets: make test, make lint (ruff), make run. This becomes the standard interface for all subsequent labs.
  • Build a minimal Docker image for the Gradio app (FROM python:3.11-slim, copy code, install deps, expose port). Run it locally.
Tools: Git CLI · pytest · GitHub Actions · ruff · Docker Desktop · Makefile
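The tokenizer utility and its three unit tests might look like the sketch below. The function name `simple_tokenize`, its lowercasing and NFC-normalization behaviour, and the test sentences are assumptions made for illustration; the course template may define things differently. In the real repo the function would live in src/text_utils.py and the tests in tests/, run with `pytest -v`.

```python
import re
import unicodedata
from typing import List

_TOKEN_RE = re.compile(r"\w+")


def simple_tokenize(text: str) -> List[str]:
    """Lowercase, NFC-normalize, and split text into word tokens (hypothetical helper)."""
    normalized = unicodedata.normalize("NFC", text)
    return _TOKEN_RE.findall(normalized.lower())


def test_normal_input():
    assert simple_tokenize("Hello, NLP world!") == ["hello", "nlp", "world"]


def test_empty_string():
    assert simple_tokenize("") == []


def test_unicode_input():
    # "e" followed by a combining acute accent must behave like precomposed "é".
    decomposed = "Cafe\u0301 ouvert"
    assert simple_tokenize(decomposed) == ["caf\u00e9", "ouvert"]
```

The Unicode test is the one most likely to fail on a naive implementation, because without normalization the combining accent splits the word.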
Synthesis & wrap-up: Every student confirms (a) a repo on GitHub, (b) a green CI badge, and (c) a live HuggingFace Spaces URL. The instructor records these URLs; they become the submission mechanism for all future labs. Discussion: why does reproducibility matter in NLP? Announce the running convention: every deliverable is a commit; the PR description is the lab report.
L04  ·  Course 4
IR & text classification in practice
Build and evaluate a retrieval and classification pipeline.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: Explore a live search engine. Why does "jaguar" return different results in different contexts? Collect examples on the board before touching any classifier. (10 min)
Low / no-code track
10-40 min (30 min)
Orange Data Mining classifier
  • Load a news or review dataset in Orange — no code needed.
  • Apply Bag-of-Words then Naive Bayes / SVM widget.
  • Inspect the confusion matrix widget — screenshot it.
  • Adjust vocabulary size; observe precision / recall shift.
  • Export your confusion matrix to the class shared spreadsheet for the synthesis.
  • Dev hook: commit a markdown results table to your course repo using GitHub Desktop.
Tools: Orange Data Mining (desktop, free) · class shared spreadsheet
Advanced IT track
10-40 min (30 min)
TF-IDF + sklearn pipeline
  • Build a TF-IDF + SVM pipeline on 20 Newsgroups.
  • Grid-search over min_df and ngram_range.
  • Compute per-class F1 plus the macro / micro / weighted averages.
  • Add a dense retrieval stage (sentence-transformers) and compare with BM25 on 20 queries.
  • Dev hook: add 2 pytest tests (each asserting the predicted label for a different class), push via PR; CI must be green.
Tools: scikit-learn · sentence-transformers · rank-bm25
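To see how the three F1 averages differ, the sketch below computes them from scratch. In the lab itself scikit-learn would normally do this; the function name `f1_report` and the tiny label lists are illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def f1_report(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[str, object]:
    """Per-class F1 plus macro, micro, and support-weighted averages."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(y_true)
    return {
        "per_class": per_class,
        # Macro: every class counts equally, however rare.
        "macro": sum(per_class.values()) / len(labels),
        # Micro: pool all decisions first, then compute one F1.
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Weighted: per-class F1 weighted by true-class frequency.
        "weighted": sum(per_class[label] * support[label] for label in labels) / len(y_true),
    }


report = f1_report(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(report)
```

On imbalanced data the macro average drops fastest when a rare class is handled badly, which is why it is often the number to watch in legal or clinical settings.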
Synthesis & wrap-up: Compare confusion matrices across groups. Which categories are hardest to separate? Discuss the stakes of misclassification in law vs spam filtering. (25 min + 10 min wrap-up)
L05  ·  Course 5
Sentiment analysis & summarization audit
Compare model outputs against human judgement; surface faithfulness failures.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: Manually annotate 10 sentences for sentiment (positive / negative / neutral). Keep your labels; you will compare them against model output during synthesis. (10 min)
Low / no-code track
10-40 min (30 min)
HuggingFace Spaces + Resoomer
  • Run your 10 warm-up sentences through 3 HuggingFace sentiment models.
  • Fill in a comparison table: human vs model A vs model B vs model C.
  • Use Resoomer or SMMRY to summarise a 500-word news article.
  • Manually flag at least 2 faithfulness errors in the generated summaries.
  • Dev hook: commit the comparison table as results/l05_sentiment.md in your repo.
Tools: HuggingFace Inference API · Resoomer · SMMRY · shared spreadsheet
Advanced IT track
10-40 min (30 min)
Transformers pipeline + ROUGE scoring
  • Run aspect-based sentiment analysis on product reviews with PyABSA.
  • Generate abstractive summaries with BART / PEGASUS.
  • Compute ROUGE-1, ROUGE-2, ROUGE-L against reference summaries.
  • Probe for hallucinations: run an NLI model on (summary, source) pairs.
  • Dev hook: log all ROUGE scores to results/l05_rouge.csv, commit alongside the code.
Tools: PyABSA · transformers (BART/PEGASUS) · rouge-score · scipy
Synthesis & wrap-up: Where do models and humans disagree most? Revisit the faithfulness errors found by both groups. What would a reliability threshold look like for a clinical summarization tool? (25 min + 10 min wrap-up)
L06  ·  Course 6
Machine translation quality & post-editing
Evaluate MT quality through manual post-editing and automatic metrics.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: Feed the same specialised sentence (a contract clause) into 3 MT engines. Spot the first error in each. Is it the same error type? (10 min)
Low / no-code track
10-40 min (30 min)
Post-editing exercise with error classification
  • Translate a legal or medical paragraph with DeepL and Google Translate.
  • Use the provided post-editing template to correct the MT output.
  • Classify each error: terminology, grammar, omission, mistranslation.
  • Count errors per 100 words — compute a rough HTER estimate.
  • Rank the two MT systems based on post-edit effort.
  • Dev hook: commit your error log as results/l06_postedit.csv via GitHub Desktop.
Tools: DeepL · Google Translate · provided post-editing template (shared doc)
Advanced IT track
10-40 min (30 min)
MarianMT + sacrebleu benchmarking
  • Load a Helsinki-NLP MarianMT model via HuggingFace.
  • Translate the same test set with a generic and a domain-adapted model.
  • Compute BLEU, chrF, and TER with sacrebleu.
  • Run COMET or BERTScore for neural MT evaluation.
  • Dev hook: pin MarianMT and sacrebleu versions in requirements.txt; CI reruns the benchmark — score must match to 2 decimal places.
Tools: Helsinki-NLP MarianMT · sacrebleu · comet-score · pinned requirements.txt
Synthesis & wrap-up: Does BLEU correlate with post-edit effort? Present cases where a legally inadmissible translation still scores well. When is MT unsafe without human review? (25 min + 10 min wrap-up)
L07  ·  Course 7
Speech interfaces & domain NLP deployment
Transcribe, dialogue-design, and process domain-specific text end to end.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: Record a 30-second voice note with background noise. We will use it to illustrate ASR robustness across both tracks. (10 min)
Low / no-code track
10-40 min (30 min)
Whisper web demo + Voiceflow dialogue sketch
  • Transcribe 3 audio clips (general speech, medical dictation, legal deposition) using the Whisper web demo.
  • Annotate transcription errors by type: ASR error, punctuation, proper noun, domain term.
  • In Voiceflow, sketch a 5-turn FAQ dialogue for a legal helpdesk.
  • Map intents, entities, and fallback paths on the canvas.
  • Export the intent list — how many intents are needed for a minimal viable bot?
  • Dev hook: commit a markdown summary of your intent list to your repo.
Tools: Whisper web demo · Voiceflow (free tier) · shared annotation sheet
Advanced IT track
10-40 min (30 min)
Local Whisper + scispaCy NER pipeline
  • Run Whisper (small) on the same audio clips locally; compute WER against the web demo output.
  • Process a clinical note with spaCy + scispaCy (NER, negation detection).
  • Extract medications, dosages, and diagnoses into a structured table.
  • Build a minimal slot-filling dialogue manager in Python (rule-based, ~50 lines).
  • Dev hook: each student works on a named domain branch. Merging requires one classmate review — simulates a real team handoff.
Tools: Whisper (small model, local) · spaCy · scispaCy · Python slot-filling
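Word error rate (WER) is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the reference length. The sketch below assumes whitespace tokenization and applies no text normalization (casing, punctuation); the function name and the dictation example are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] = edit distance between the processed reference prefix and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "the patient takes ten milligrams"
hypothesis = "the patient takes two milligrams daily"
print(word_error_rate(reference, hypothesis))
```

Here one substitution and one insertion against five reference words give a WER of 0.4, yet the single substituted word changes the dose, which is the kind of gap between metric and real-world risk the synthesis discussion targets.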
Synthesis & wrap-up: Present your dialogue designs. What assumptions did they encode? What regulatory constraints apply in a clinical context? How does WER translate into real-world risk? (25 min + 10 min wrap-up)
L08  ·  Course 8
Exploring word embeddings & distributional bias
Navigate embedding spaces, probe for analogies, and surface bias patterns.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: Word association: write 5 words each for "bank", "nurse", and "engineer". Pool them on the board and observe stereotypes before touching any model. (10 min)
Low / no-code track
10-40 min (30 min)
TensorFlow Embedding Projector
  • Load pretrained Word2Vec / GloVe in projector.tensorflow.org.
  • Explore k-NN for polysemous words (bank, spring, bat).
  • Test 5 analogy queries built on pairs such as man-woman, king-queen, doctor-nurse, lawyer-?.
  • Find 3 clear examples of gender or ethnic bias in the embedding space.
  • Document findings in the class shared spreadsheet.
  • Dev hook: commit a markdown bias observation report to your repo.
Tools: projector.tensorflow.org · shared results spreadsheet
Advanced IT track
10-40 min (30 min)
Domain Word2Vec with gensim + bias quantification
  • Train Word2Vec on a domain corpus (legal judgments or biomedical abstracts).
  • Compare domain vs GloVe nearest neighbours for 10 domain terms.
  • Visualise with t-SNE — highlight domain-specific clusters.
  • Compute a WEAT score to quantify gender bias (wefe library).
  • Attempt a simple debiasing (project out gender direction); re-run WEAT.
  • Dev hook: save t-SNE plot to results/figures/; add a pytest assertion that the output file exists after running the script.
Tools: gensim · scikit-learn (t-SNE) · wefe (WEAT) · matplotlib
Synthesis & wrap-up: Present bias examples from both tracks. How would you debias? What are the downstream consequences if you do not? Who should be responsible for bias in a deployed system? (25 min + 10 min wrap-up)
L09  ·  Course 9
Pre-trained LLMs: model cards & fine-tuning
Critically evaluate model documentation and adapt a BERT model to a new task.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: Fill-mask live demo: "The patient was diagnosed with [MASK]". What does BERT predict? What does that reveal about its training data? (10 min)
Low / no-code track
10-40 min (30 min)
HuggingFace model hub audit
  • Find 3 models for a domain (legal, medical, multilingual).
  • Read model cards: training data, known limitations, intended use, licence.
  • Test each model on 5 domain sentences via the Inference API.
  • Fill in an audit table: data source, bias disclosure, licence compatibility.
  • Flag which models would be deployable under GDPR.
  • Dev hook: commit the audit table to your repo as results/l09_model_audit.md.
Tools: HuggingFace Hub Inference API · shared audit spreadsheet
Advanced IT track
10-40 min (30 min)
BERT fine-tuning with PEFT / LoRA
  • Fine-tune BERT (or RoBERTa) on a small classification task (<=500 examples).
  • Compare zero-shot pipeline vs fine-tuned — delta in F1.
  • Apply LoRA via the PEFT library to reduce trainable parameters by 90%+.
  • Inspect attention weights for a few examples with BertViz.
  • Dev hook: log fine-tuning runs to W&B or MLflow; commit the run ID. Experiment tracking is now part of every future lab.
Tools: transformers · PEFT · BertViz · Weights & Biases (or MLflow)
Synthesis & wrap-up: Is the model card adequate for deploying in a hospital? Which information is missing? Encoder vs decoder: when does the choice matter? Compare fine-tuned F1 with the zero-shot baseline. (25 min + 10 min wrap-up)
L10  ·  Course 10
Prompt engineering & RAG pipeline build
Systematically test prompting strategies and build a minimal retrieval-augmented system.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: Ask a chat LLM the same factual question in 3 different ways. Record how much the answer varies; this motivates systematic prompt testing. (10 min)
Low / no-code track
10-40 min (30 min)
Systematic prompt comparison (web UI)
  • Task: extract key obligations from a 1-page contract clause.
  • Test 4 strategies: zero-shot, few-shot (2 examples), chain-of-thought, role-prompting.
  • Record outputs in a structured scoring table (completeness, accuracy, format).
  • Try adversarial prompts — can you make the model ignore the contract?
  • Write a 1-paragraph prompting guide for a non-technical colleague.
  • Dev hook: commit your scoring table and prompting guide to your repo.
Tools: ChatGPT / Claude web UI · shared scoring spreadsheet
Advanced IT track
10-40 min (30 min)
Minimal RAG pipeline (LangChain / LlamaIndex)
  • Index 15-20 domain documents (chunk, embed, store in a local vector DB).
  • Build a retrieval + generation chain with a local or API model.
  • Compare RAG vs bare LLM on 10 factual questions about the corpus.
  • Test retrieval failure modes: adversarial query, wrong chunk size.
  • Log precision@3 for the retriever separately from generation quality.
  • Dev hook: introduce .env + python-dotenv for the API key. Add detect-secrets as a pre-commit hook — CI fails if a key is ever committed in plain text.
Tools: LangChain or LlamaIndex · ChromaDB · OpenAI or Anthropic API · .env + python-dotenv
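Logging retriever quality separately from generation quality can be as simple as the sketch below, which computes precision@k for one query and its mean over several queries. The function names and document IDs are illustrative, and relevance judgements are assumed to be available as a set of IDs per query.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 3) -> float:
    """Share of the top-k retrieved chunks that are actually relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not len(top_k)) so short result lists are penalized.
    return hits / k


def mean_precision_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 3) -> float:
    """Average precision@k over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["d1", "d7", "d3", "d9"], {"d1", "d3"}),
    (["d4", "d5", "d6"], {"d2"}),
]
print(mean_precision_at_k(runs, k=3))
```

Keeping this number apart from answer-quality scores makes it possible to tell whether a wrong answer came from bad retrieval or from the generator ignoring good context.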
Synthesis & wrap-up: Share best prompt strategies. When does RAG still hallucinate? What chunking strategy worked best and why? What would it take to deploy this in a law firm? (25 min + 10 min wrap-up)
L-AG  ·  After Course 10  ·  Bonus lab
Building an agentic AI workflow
Design, build, and stress-test a multi-step AI agent that uses tools, memory, and conditional logic.
90 min timeline
Shared framing (20 min)
No-code track (35 min)
Advanced track (35 min)
Demo & debrief (35 min)
Shared framing (20 min): The instructor walks through the agent loop live (perceive, plan, act, observe, loop), running a demo agent in verbose mode so every tool call and failure is visible. Students pick their own scenario: contract review assistant, clinical literature monitor, competitor news digest, or grant deadline tracker.
Low / no-code track
20-55 min (35 min)
Visual agent builder — n8n or Flowise
  • Wire a 4-node chain: trigger, retrieve (HTTP/RSS), LLM summarise, output (email / Slack / Google Sheet).
  • Add a conditional branch: if the LLM flags high urgency, route to a different output node.
  • Add a memory node (window buffer) so the agent can answer follow-up questions about its last run.
  • Run the workflow 3 times with different inputs — screenshot each trace.
  • Deliberately break one tool (wrong API key, bad URL) — observe how the agent fails and whether it recovers.
Tools: n8n.io (cloud free tier) · Flowise (hosted) · Zapier AI (fallback) — browser only
Advanced IT track
20-55 min (35 min)
Code-first agent — Cursor + LangGraph
  • Open the provided starter repo in Cursor — use the AI chat to understand the LangGraph state machine skeleton.
  • Ask Cursor to generate a new tool node (e.g. PubMed search, CURIA case lookup, or company filings fetch).
  • Implement a simple ReAct loop: the agent decides whether to call a tool or produce a final answer.
  • Add a conditional edge: if confidence score < 0.6, loop back to retrieval rather than answering.
  • Test a prompt injection attack: embed an instruction in a retrieved document and see if the agent follows it.
  • Dev hook: commit every agent execution as a JSON trace to runs/. Add a pytest fixture that replays a trace and asserts the final answer is unchanged — a regression test for agent behaviour.
Tools: Cursor (IDE) · LangGraph · OpenAI / Anthropic API · provided starter repo (~50 lines to complete)
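The conditional edge ("if confidence < 0.6, loop back to retrieval") can be prototyped without any framework. The sketch below is plain Python rather than LangGraph. `AgentState`, `run_agent`, and the two stub functions are invented for illustration, and the retrieval and generation steps are stand-ins that return fixed values.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentState:
    question: str
    documents: List[str] = field(default_factory=list)
    answer: str = ""
    confidence: float = 0.0
    steps: List[str] = field(default_factory=list)  # trace of executed nodes


def run_agent(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, List[str]], Tuple[str, float]],
    threshold: float = 0.6,
    max_loops: int = 3,
) -> AgentState:
    """Retrieve, generate, and loop back to retrieval while confidence is low."""
    state = AgentState(question=question)
    for attempt in range(max_loops):
        state.documents.extend(retrieve(question, attempt))
        state.steps.append("retrieve")
        state.answer, state.confidence = generate(question, state.documents)
        state.steps.append("generate")
        if state.confidence >= threshold:
            break  # confident enough: take the "final answer" edge
    else:
        # Loop budget exhausted without a confident answer.
        state.steps.append("escalate_to_human")
    return state


# Stub tools (assumptions): each attempt finds one more document,
# and confidence grows by 0.25 per document.
def fake_retrieve(question: str, attempt: int) -> List[str]:
    return [f"doc-{attempt}"]


def fake_generate(question: str, documents: List[str]) -> Tuple[str, float]:
    return f"answer based on {len(documents)} documents", min(1.0, 0.25 * len(documents))


final_state = run_agent("Which clause limits liability?", fake_retrieve, fake_generate)
print(final_state.steps, final_state.confidence)
```

The `max_loops` cap and the explicit escalation step are design choices of this sketch: without them a low-confidence agent would retry forever, which ties into the discussion of when an agent should ask for human confirmation.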
Synthesis & wrap-up: Live demo: each group runs their agent and demonstrates the deliberate failure. Discussion: which approach is easier to audit? Safety panel: when should an agent ask for human confirmation? Map to EU AI Act risk categories. Exit ticket: each student writes "My agent should not be allowed to autonomously do X because..." (collected for L12). (35 min)
L11  ·  Course 11
Annotation, metrics & significance testing
Annotate a real dataset, measure inter-annotator agreement, run a model comparison.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: Everyone rates the same model output 1-5. Pool scores on the board and observe the spread. This is the problem the whole lab is designed to solve. (10 min)
Low / no-code track
10-40 min (30 min)
Annotation exercise + Cohen's kappa
  • Annotate 20 sentences for sentiment (3-class) in pairs.
  • Compute Cohen's kappa using an online calculator.
  • Reconcile disagreements — draft a 5-rule annotation guideline.
  • Re-annotate 10 ambiguous sentences with the new guideline.
  • Does kappa improve? By how much? Commit the final annotation guideline to your repo.
Tools: Online kappa calculator · shared annotation spreadsheet · provided template
Advanced IT track
10-40 min (30 min)
Evaluation suite in Python
  • Compute macro/micro/weighted F1; plot confusion matrix with seaborn.
  • Compute ROUGE-1/2/L and BERTScore on a summarization output.
  • Run a bootstrap significance test between two model outputs (scipy).
  • Compute Krippendorff's alpha across 4 annotators.
  • Dev hook: the full evaluation suite runs in CI on every PR. A PR that drops macro-F1 below a defined threshold automatically fails — acceptance criteria enforced in code.
Tools: scikit-learn · rouge-score · bert-score · scipy · krippendorff library · seaborn
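A paired bootstrap test for two systems scored on the same test items can be sketched with the standard library alone (the lab itself points to scipy). The function name, the default of 2000 resamples, and the fixed seed are assumptions. The returned value is the share of resamples in which system A fails to beat system B, which is commonly read as an approximate one-sided p-value.

```python
import random
from typing import Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of bootstrap resamples where A's total score does not exceed B's."""
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    n = len(scores_a)
    # Work on per-item differences so each resample keeps the pairing intact.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples


# Per-item correctness (1 = correct) for two hypothetical models.
model_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
model_b = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
print(paired_bootstrap(model_a, model_b))
```

A small value suggests A's advantage is unlikely to be a sampling accident; with only a handful of test items the estimate is coarse, which is itself a useful point for the synthesis discussion.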
Synthesis & wrap-up: Is model A significantly better than model B? What does better mean when metrics disagree? Benchmark contamination discussion. Do the thresholds set today match the L-AG exit tickets? (25 min + 10 min wrap-up)
L12  ·  Course 12
Bias auditing, red-teaming & acceptance criteria
Red-team a live system, audit for bias, and write deployment acceptance criteria.
90 min timeline
Warm-up (10 min)
No-code track (30 min)
Advanced track (30 min)
Synthesis & wrap-up (35 min)
Warm-up: 5-minute red-team: try to elicit a biased or factually wrong response from a commercial LLM in your domain (law, medicine, business). Document what you found. (10 min)
Low / no-code track
10-40 min (30 min)
Bias audit + deployment readiness report
  • Score 30 model responses on demographically varied inputs using a provided rubric.
  • Identify which demographic variables correlate with output quality drops.
  • Complete a deployment readiness checklist (provided template).
  • Draft a 1-page readiness report: risk level, mitigations, open questions.
  • Map the system to an EU AI Act risk category (prohibited / high-risk / limited / minimal).
  • Dev hook: commit the readiness report to your repo as docs/readiness_report.md.
Tools: Provided bias scoring rubric · deployment readiness checklist template · EU AI Act reference card
Advanced IT track
10-40 min (30 min)
CheckList tests + fairlearn metrics
  • Implement 3 CheckList test types (MFT, INV, DIR) for a sentiment classifier.
  • Compute demographic parity and equalized odds with fairlearn.
  • Plot fairness metrics across protected attributes (gender, ethnicity proxies).
  • Run a perturbation-based hallucination probe on an LLM response.
  • Dev hook: translate your prose acceptance criteria into tests/test_acceptance.py. The semester ends with a repo whose CI enforces the fairness and hallucination thresholds you defined — acceptance criteria as executable code.
Tools: checklist library · fairlearn · seaborn · pytest
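The two fairness metrics can be written out by hand before reaching for fairlearn. The sketch below assumes binary labels and predictions (0/1) and one protected attribute. The function names mirror common terminology but are defined here, and taking the larger of the TPR and FPR gaps as the equalized-odds summary is a choice of this sketch.

```python
from typing import Dict, List, Optional, Sequence


def _rate(values: List[int]) -> Optional[float]:
    """Mean of 0/1 values, or None if the subgroup is empty."""
    return sum(values) / len(values) if values else None


def _gap(rates: Dict[str, Optional[float]]) -> float:
    """Largest difference between groups, ignoring undefined rates."""
    known = [r for r in rates.values() if r is not None]
    return max(known) - min(known) if known else 0.0


def demographic_parity_difference(y_pred: Sequence[int], groups: Sequence[str]) -> float:
    """Gap in positive-prediction rate between the most and least favoured group."""
    rates = {
        g: _rate([p for p, gg in zip(y_pred, groups) if gg == g])
        for g in set(groups)
    }
    return _gap(rates)


def equalized_odds_difference(y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[str]) -> float:
    """Larger of the true-positive-rate gap and the false-positive-rate gap."""
    tpr, fpr = {}, {}
    for g in set(groups):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        tpr[g] = _rate([p for t, p in rows if t == 1])
        fpr[g] = _rate([p for t, p in rows if t == 0])
    return max(_gap(tpr), _gap(fpr))


y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(y_pred, groups))
print(equalized_odds_difference(y_true, y_pred, groups))
```

A threshold on either value can then be asserted in tests/test_acceptance.py, turning a prose fairness requirement into a CI check.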
Synthesis & wrap-up: Present acceptance criteria documents. What would you require before deploying in a hospital or law firm? Revisit the L-AG exit tickets: which agent behaviours would fail today's tests? Which systems are EU AI Act high-risk? (25 min + 10 min wrap-up)

Dev practices — integration across labs

After L00, software development practices are woven into each lab as lightweight hooks — never the main focus, always the delivery mechanism. The no-code track uses GitHub Desktop and HuggingFace Spaces throughout; the advanced track accumulates a real, tested, CI-gated project repo across the full semester.
L04
Submit classifier as a PR
Advanced track adds the sklearn pipeline to src/, writes 2 tests, pushes. No-code track updates their HF Space with the Orange-exported model.
git commit · pytest · HF Spaces
L05
Log experiment results
Advanced track logs ROUGE scores to a results/ CSV committed alongside the code. No-code track documents their comparison table as a markdown file.
git history · results tracking
L06
Reproducible MT benchmark
Advanced track pins the MarianMT model and sacrebleu version in requirements.txt. CI reruns the benchmark — score must match to 2 decimal places.
pinned deps · CI benchmark
L07
Branch per domain
Each student works on a named domain branch (legal, clinical, scientific). Merging into main via PR requires one classmate's review.
branching · code review
L08
Visualisation as a deliverable
The t-SNE plot is saved to results/figures/ and committed. Advanced track adds a pytest assertion that the output file exists after running the script.
artifact testing · figures/
L09
Experiment tracking
Advanced track logs fine-tuning runs to W&B or MLflow, commits the run ID to the repo. No-code track links a structured experiment sheet from the README.
W&B / MLflow · experiment ID
L10
Secrets management
Introduce .env + python-dotenv. CI uses GitHub Secrets for the API key. A detect-secrets pre-commit hook ensures no key is ever committed in plain text.
.env pattern · GitHub Secrets · pre-commit
L-AG
Agent run traces committed
Every agent execution logs a JSON trace to runs/. Advanced track adds a pytest fixture that replays a trace and asserts the final answer is unchanged — a regression test for agent behaviour.
run traces · regression test
L11
Evaluation as CI step
The full evaluation suite runs in CI on every PR. A PR that drops macro-F1 below a threshold automatically fails — acceptance criteria enforced in code.
CI gate · threshold check
L12
Acceptance criteria as code
Students translate their prose acceptance criteria into tests/test_acceptance.py. The semester ends with CI enforcing the fairness and hallucination thresholds they defined.
test_acceptance.py · fairness CI gate