
Stéphane Sobucki
Head of Data Engineering
Published on 15 February 2026
We build a Knowledge Base of regulatory content — tax rulings, legislation, court decisions, social security guidance — for accountants and auditors across Europe. Each country has its own tax authority, its own legislation database, its own court archive. Each source needs a web spider, an ETL cleaner, a text chunker, database registrations, an Airflow DAG, and quality control gates.
Imagine the dream scraping setup: one that helps build itself.
Every regulatory source we cover goes through the same stages:
Scrape raw content → Clean HTML → Chunk text → Vectorize → Upload to search index

The interesting part isn't the pipeline itself — it's that adding a new source means repeating this pattern with source-specific logic at each stage. A Finnish tax authority and a Polish court archive need different spiders, different cleaning rules, different category mappings. But the shape of the work is identical.
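As a sketch, the five stages compose like ordinary functions. The bodies below are illustrative stubs, not our production code, which runs each stage as a separate Airflow task:

```python
def scrape(url: str) -> str:
    """Download raw content; no parsing happens at this stage."""
    return f"<html>raw content from {url}</html>"  # stub download

def clean(raw_html: str) -> str:
    """Strip markup and boilerplate, keep the regulatory text."""
    return raw_html.replace("<html>", "").replace("</html>", "")

def chunk(text: str, size: int = 40) -> list[str]:
    """Split cleaned text into fixed-size chunks for embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def vectorize(chunks: list[str]) -> list[tuple[str, list[float]]]:
    """Pair each chunk with an embedding (stubbed here)."""
    return [(c, [0.0]) for c in chunks]

def upload(vectors: list[tuple[str, list[float]]]) -> int:
    """Push vectors to the search index; return the count uploaded."""
    return len(vectors)

def run_pipeline(url: str) -> int:
    return upload(vectorize(chunk(clean(scrape(url)))))
```

The source-specific logic lives inside each stage; the composition never changes.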
That repetition is what made it possible to hand off to agents.
We split pipeline development into three roles:
| # | Role | Does what | Produces |
|---|---|---|---|
| 1 | Accounting Engineer | Researches a country's regulatory landscape | Source inventory with priority tiers |
| 2 | Product Owner | Translates business requirements into technical specs | YAML source specifications |
| 3 | Developer | Implements the pipeline from those specs | Spider, cleaner, chunker, DAG config |
The Accounting Engineer is human (sometimes AI-assisted). The other two are AI agents running in Claude Code.
What makes this work isn't the agents themselves — it's the handoff between them. The Product Owner agent produces a structured YAML spec. The Developer agent reads that spec and scaffolds the full pipeline from it. No human needs to re-explain anything in between.
Here's a fragment of what one of those specs looks like:
We have 73 of these across 15 countries. Each one is a complete technical blueprint.
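A minimal sketch of how a Developer agent, or any script, might consume such a spec. This assumes PyYAML; the embedded fragment is abridged, and the required-field list is an assumption for illustration:

```python
import yaml  # PyYAML

# Abridged spec fragment, using field names from one of our specs.
SPEC = """
name: agenziaentrate
country_code: it
spider_type: search
categories:
  - id: circolari
    label: Circolari
"""

def load_spec(text: str) -> dict:
    spec = yaml.safe_load(text)
    # Fail fast on missing required fields rather than letting an
    # agent guess: the schema is the contract between the two roles.
    for field in ("name", "country_code", "spider_type"):
        if field not in spec:
            raise ValueError(f"spec missing required field: {field}")
    return spec

spec = load_spec(SPEC)
```

Because every spec follows the same schema, the scaffolding step is the same for all 73 of them.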
Early on, we wrote long prompts every session: "here's the database schema, here's how spiders work, here's the file structure, now write a cleaner for this source." Every session started from zero. Mistakes repeated.
We replaced this with skills — self-contained procedure documents that get loaded into the agent's context when invoked. A skill isn't "write me a spider." A skill is the full workflow: what a spider is responsible for, what it delegates to the cleaner, which base class to extend, how to handle metadata, and how to verify the output.
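For illustration, a skill file might look like the sketch below. The name, frontmatter, and steps are invented for this example, though the spider/cleaner boundary it encodes is the real one:

```markdown
---
name: create-cleaner
description: Write an ETL cleaner for a scoped source. Use when a
  YAML spec exists and the spider already downloads raw content.
---

# Create a cleaner

1. Read the source spec; confirm `content_format` and categories.
2. Extend the base cleaner class. The cleaner owns ALL HTML parsing
   and metadata extraction — never the spider.
3. Map source categories to the Knowledge Base taxonomy.
4. Run the cleaner on a sample batch and verify chunk output.
```

The point is that the skill carries the whole procedure, so no session starts from zero.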
/scope-country, /scope-website
/create-spec, /create-implementation-plan, /create-tasks, /gap-analysis
/scaffold-pipeline, /create-scrape, /create-cleaner
/run-scrape, /run-etl, /airflow, /investigate-failures, /trace-document
/db-query, /db-migrate, /db-restore, /inspect-blob, /show-config

The scoping skills research regulatory landscapes and conduct structured interviews to produce YAML specs. The building skills consume those specs to scaffold pipelines, implement spiders (download-only — no HTML parsing), and write cleaners (which own all content extraction). The operating skills run pipelines, manage DAGs, and trace documents through every stage.
Each skill enforces boundaries that matter. Spiders download raw content and nothing else. Cleaners own all HTML parsing and metadata extraction. This separation exists because agents produce better code when their scope is narrow and unambiguous.
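A sketch of how narrow those boundaries can be made in code. The class and method names here are hypothetical, but the division of labour is the one described above:

```python
from abc import ABC, abstractmethod

class BaseSpider(ABC):
    """Downloads raw content and nothing else -- no parsing here."""

    @abstractmethod
    def fetch(self, url: str) -> bytes:
        """Return raw bytes exactly as served by the source."""

class BaseCleaner(ABC):
    """Owns all HTML parsing and metadata extraction."""

    @abstractmethod
    def extract(self, raw: bytes) -> dict:
        """Return cleaned text plus metadata from a raw download."""

# A concrete pair for one source. Because the interfaces are this
# narrow, an agent implementing one class cannot drift into the
# other's responsibilities.
class ExampleSpider(BaseSpider):
    def fetch(self, url: str) -> bytes:
        return b"<h1>Circolare n. 1</h1><p>testo</p>"  # stub download

class ExampleCleaner(BaseCleaner):
    def extract(self, raw: bytes) -> dict:
        html = raw.decode()
        title = html.split("<h1>")[1].split("</h1>")[0]
        return {"title": title, "raw_length": len(raw)}
```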
Every source moves through a state machine:
needs_scoping → planned → in_progress → review → live

Each country has a progress file that both agents and humans read. After scoping a website, the agent updates the status to planned. After building a pipeline, it moves to in_progress. A human reviews, marks review, and promotes to live.
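The transitions can be sketched as a small lookup table. The exact set of allowed moves, such as whether review can bounce back to in_progress, is an assumption here:

```python
# Allowed transitions for a source's lifecycle; anything not in this
# table is rejected, which keeps the progress files trustworthy.
TRANSITIONS = {
    "needs_scoping": {"planned"},
    "planned": {"in_progress"},
    "in_progress": {"review"},
    "review": {"live", "in_progress"},  # assumed: review may bounce back
    "live": set(),
}

def advance(status: str, new_status: str) -> str:
    if new_status not in TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition: {status} -> {new_status}")
    return new_status

status = "needs_scoping"
for step in ("planned", "in_progress", "review", "live"):
    status = advance(status, step)
```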
This gave us something we didn't expect: visibility without meetings. At any point we can check where things stand across every country without asking anyone.
This all happened on a single branch.
These are production pipelines with failure tracking, quality control, batch processing, and automated Kubernetes orchestration. They weren't generated and forgotten — they went through the same review → live process as anything we'd write by hand.
Process clarity is the bottleneck. The agents weren't limited by capability. They were limited by how clearly we could describe our workflows. Building skills forced us to articulate things that previously lived in engineers' heads — and that made our human engineers faster too.
Structured handoffs matter more than smart agents. The biggest quality jump didn't come from better prompts. It came from replacing natural language handoffs with structured YAML specs. Agents are good at following schemas. They're unreliable at interpreting ambiguous prose.
The hard part moved. After this, adding a new country wasn't an engineering problem. It was a domain problem: which sources matter, what content is in scope, how should categories map. The humans who understood European regulatory frameworks became the bottleneck, not the humans who could write Python.
We're still iterating on the skill system. Some skills are too broad, some too narrow, and the boundaries between scoping and building aren't always clean. But the basic pattern — structured specs, procedural skills, narrow agent roles — has held up across ten countries and counting.
We're building an AI-powered accounting workspace at Taxxa. The Knowledge Base described here powers our RAG application and regulatory newsfeed for accountants and auditors across Europe.
The spec fragment referenced earlier, for Italy's Agenzia delle Entrate:

```yaml
name: agenziaentrate
country_code: it
language: it
spider_type: search
technical:
  content_format: pdf
  javascript_required: false
categories:
  - id: circolari
    label: Circolari
  - id: risoluzioni
    label: Risoluzioni
```