paul · spencer
case · klinkende taal
home work github
case · klinkende taal Dutch B1 Language Models klinkende taal · slimtaal · 2025
portfolio/ selected work/ klinkende taal
lead ai/ml engineer · 2025

Klinkende Taal

I led the development of AI systems to help Dutch government organisations communicate more clearly with citizens. The core challenge was fine-tuning large language models to produce text at B1 reading level — making letters from the government accessible to 80% of the population.

B1 compliance
12→76%
LLMs evaluated
36
training pairs
36k
phases
DA · SFT · DPO

i. Overview

The work spanned three training phases — Domain Adaptation, Supervised Fine-Tuning, and Direct Preference Optimization — running on H100 80GB GPUs against curated Dutch corpora. The headline number: a 63 percentage-point jump in B1 compliance, from 12.4% baseline to 75.6% with Mistral-Small-24B + DPO.

Klinkende Taal / SlimTaal, 2025.

ii. Evaluation & Model Selection

I designed and implemented a comprehensive evaluation system to identify the optimal base model for fine-tuning:

  • Built a multi-service evaluation platform (Java/Spring Boot + Kotlin) comparing 36 LLMs across 4 providers (Mistral, OpenAI, Groq, Nebius).
  • Developed an LLM-as-judge framework with 4 specialised judges evaluating data correctness, style guide compliance, quality adherence, and content rules.
  • Integrated the Klinkende Taal API for automated B1 / B2 language complexity scoring.
  • Achieved 75.6% B1 compliance with Mistral-Small-24B + DPO, up from a 12.4% baseline (≈12% to 76%) — a 63 percentage-point improvement.
Model evaluation pipeline diagram showing 36 LLMs scored against four LLM-as-judge specialists and the Klinkende Taal B1 API
plate i — model evaluation pipeline

iii. Fine-Tuning Pipeline

I developed a complete ML pipeline for training language models on simplified Dutch:

  • Implemented three-phase training: Domain Adaptation, Supervised Fine-Tuning (SFT), and Direct Preference Optimization (DPO).
  • Built async data pipelines processing 160k+ Wikipedia sentences through B1 filtering and LLM quality assessment.
  • Developed curriculum learning approaches with 43k preference pairs sorted by difficulty gap.
  • Deployed training infrastructure on Nebius H100 80GB GPUs using the Axolotl framework.
  • Created SGLang and vLLM inference pipelines for high-throughput model evaluation.
Three-phase fine-tuning pipeline showing domain adaptation, supervised fine-tuning, and DPO training stages on H100 GPUs
plate ii — three-phase training, domain adaptation through DPO

iv. Synthetic Data Generation

Real training data for government correspondence is privacy-sensitive, requiring synthetic alternatives:

  • Built LetterProcessing for standardising and pseudonymising correspondence templates.
  • Developed SyntheticLetters pipeline generating DPO training pairs from templates.
  • Created 36k combined training examples (32k sentence pairs + 5.3k letter pairs).
  • Implemented quality filtering with automatic rejection of ambiguous or low-quality pairs.

v. Human Feedback Platform

To continuously improve model outputs, I built the LveRLHF expert feedback system:

  • Developed a gamified web platform for language experts to evaluate model outputs.
  • Implemented Kubernetes deployment with secure authentication for domain-restricted access.
  • Created API infrastructure for batch loading evaluation data and collecting feedback.
  • Deployed at expertportaal.slimtaal.nl for production use by Lve language specialists.
Stacked screenshots of the RLHF gamified expert-feedback platform
plate iii — rlhf platform · click to expand

vi. Key Contributions

  • Established the complete ML infrastructure for B1 language model development.
  • Identified Mistral-Small-24B as the optimal base model through rigorous comparative study.
  • Built reusable evaluation frameworks with objective metrics for language complexity.
  • Created comprehensive documentation enabling knowledge transfer to the wider team.
  • Developed privacy-preserving synthetic data generation for sensitive government correspondence.

vii. Stack

PythonJavaKotlinSpring Boot PyTorchAxolotlvLLMSGLang DPOSFTDomain Adaptation H100 GPUsNebius PostgreSQLFlyway MistralOpenAIGroq KubernetesDocker Async pipelines

No mail program is configured for this browser, so here is the address:

spencerpj@gmail.com