
About & methodology

What sources we ingest, how the daily pipeline runs, and how threads, embeddings and coverage checks are produced. Last updated 2026-05-08.

What this site is

Policy Threads is a tracker for live UK policy. It pulls policy artefacts — consultations, draft legislation, statutory instruments, parliamentary scrutiny, regulator decisions, library briefings — from the bodies that publish them, and assembles them into threaded timelines per policy issue. The current product direction is threads on demand: a user query reconstructs a thread from the underlying corpus at request time, rather than relying solely on pre-curated issues.

It is built as a Django application, runs on Postgres, and refreshes daily.

Sources we ingest

We run ~38 ingest commands across the categories below. Each is a self-contained scraper or API client that writes to the same PolicyDocument table; a synthetic govuk_base_path prefix (e.g. hse:<slug>, commons-library:<ref>, si:<year>/<number>) serves as the dedup key.
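As a minimal sketch (the helper name is ours, not the codebase's), the synthetic key is just a source prefix joined to the source's own identifier, so re-running an ingest updates the existing row instead of duplicating it:

```python
def synthetic_base_path(prefix: str, identifier: str) -> str:
    """Build the synthetic govuk_base_path used as the dedup key for
    sources that don't live on GOV.UK."""
    return f"{prefix}:{identifier}"

# Re-running an ingest reproduces the same key, so an upsert on
# govuk_base_path updates in place rather than inserting a duplicate.
keys = {
    synthetic_base_path("hse", "ladder-safety"),
    synthetic_base_path("si", "2026/412"),
    synthetic_base_path("hse", "ladder-safety"),  # repeat ingest, same key
}
```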

UK government

  • GOV.UK — Search API + Content API. Catches every department and most attached agencies that publish to GOV.UK (CMA, MHRA, Environment Agency, Companies House, Ofsted etc.).
  • Citizen Space portals — five English departmental portals (Justice, Defra, Education, Communities, Energy) plus consult.gov.scot for Scottish Government consultations. Off-site PDFs are merged into the canonical GOV.UK record where applicable.
  • Welsh Government — scraped directly from gov.wales/consultations.
  • legislation.gov.uk — Atom feeds for UK, Scottish, Welsh, and Northern Ireland statutory instruments.

Westminster Parliament

  • Bills, stages, publications, chamber debates — Parliament Bills API.
  • Select committees — Reports, government responses, special reports, oral evidence; correspondence opt-in.
  • Written Ministerial Statements and written parliamentary questions — Written Questions & Statements API.
  • Hansard — narrow scope: Delegated Legislation Committee + Public Bill Committee debates. Wider Hansard ingestion deferred.

Parliamentary research / analytical briefings

  • House of Commons Library — Commons Briefing Papers, Commons Debate Packs, Standard Notes, Lords Library Notes, etc. Discovered via the Library's public XML sitemaps; metadata extracted from JSON-LD on each briefing page.
  • Lords Library — sister source, ingested the same way (sitemap discovery plus JSON-LD metadata).
  • Parliamentary Office of Science and Technology (POST) — POSTnotes and POSTbriefs.

Statutory regulators

Each scraped from the regulator's own website (sitemap-driven where possible, paginated listings where not):

  • FCA, PRA / Bank of England, Ofcom, Ofgem, ORR, PSR, ICO
  • CAA, NICE, Ofwat, HSE (and Building Safety Regulator), OEP
  • The Pensions Regulator, Takeover Panel, FRC, SRA, Bar Standards Board
  • Gambling Commission
  • National Energy System Operator (NESO)
  • AI Safety Institute (AISI)

Other

  • Law Commission — WordPress publication taxonomy.
  • Mayoral combined authorities — currently TVCA and GMCA; others scaffolded.
  • Inquiry-tracker ETL — cross-import of cached parliamentary activity records (committee recommendations, oral evidence, hansard mentions, NAO and PFD reports) from a sibling project.

Devolved parliaments (in flight)

  • Scottish Parliament Bills and Senedd Bills are scaffolded; not yet on the daily cron. NI Assembly deferred.

Daily pipeline

A single GitHub Actions workflow runs at 06:00 UTC each day. Every ingest step is wrapped in continue-on-error so one slow or blocked source doesn't stop the rest.

  1. Ingest — every source listed above with a --since 35-days-ago window. Per-row save points are wrapped with @retry_on_db_error: four attempts with exponential backoff, closing the database connection between attempts so reconnects survive Railway's proxy drops.
  2. Dedup — two passes:
    • dedup_cs_govuk collapses Citizen Space rows that duplicate GOV.UK rows on the same issue.
    • dedup_regulator_sources collapses GOV.UK rows that duplicate own-site regulator rows on the same body, by normalised title similarity (≥0.85) and date proximity (±30 days). Own-site is treated as canonical for the body's own publications.
  3. Attachment backfill — refetches GOV.UK content for documents whose attachments field is empty, then scrapes Citizen Space for off-site PDFs.
  4. Pair consultation outcomes to their open consultation documents (rule-based, no LLM).
  5. Lifecycle stages — assigns each Issue's lifecycle stage from its underlying document events.
  6. Embeddings — embeds every new Issue and PolicyDocument via Google's gemini-embedding-001 model (3072 dimensions). Stored as JSON on each row. Load-bearing: this is the index used by every semantic-search and on-demand thread feature.
  7. Retag legacy "other" rows — bounded LLM pass that drains the legacy document_type='other' backlog, capped at 2,000 documents per night with a high-confidence threshold using Gemini 2.5 Flash.
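The @retry_on_db_error wrapper in step 1 might look roughly like this. A sketch: the parameter names are assumptions, and in production the between-attempts hook closes Django's database connection so the next attempt reconnects cleanly after a proxy drop.

```python
import functools
import time


def retry_on_db_error(func=None, *, attempts=4, base_delay=1.0, on_retry=None):
    """Retry a per-row save with exponential backoff. Between attempts,
    `on_retry` runs; in production it closes the DB connection so the
    retry gets a fresh connection after a proxy drop."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the error
                    if on_retry is not None:
                        on_retry()
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    # Support both @retry_on_db_error and @retry_on_db_error(...)
    return decorator if func is None else decorator(func)
```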

A weekly job runs classify_issues (Gemini 2.5 Flash) to mint new Issues from clusters of unassigned documents, in 40-document batches.
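The matching rule in the cross-source dedup pass (step 2 above) can be sketched as follows. The helper names are illustrative, and difflib stands in for whatever similarity measure the real command uses; only the 0.85 threshold and ±30-day window come from the text.

```python
import re
from datetime import date, timedelta
from difflib import SequenceMatcher

SIM_THRESHOLD = 0.85          # normalised title similarity
DATE_WINDOW = timedelta(days=30)  # publication-date proximity


def normalise(title: str) -> str:
    """Lowercase and strip punctuation so cosmetic differences don't matter."""
    return re.sub(r"[^a-z0-9 ]+", "", title.lower()).strip()


def is_cross_source_duplicate(govuk_title, govuk_date, own_title, own_date):
    """True when a GOV.UK row duplicates an own-site regulator row:
    similar enough title, published within +/-30 days. The own-site row
    is then kept as canonical."""
    sim = SequenceMatcher(None, normalise(govuk_title), normalise(own_title)).ratio()
    return sim >= SIM_THRESHOLD and abs(govuk_date - own_date) <= DATE_WINDOW
```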

What we deliberately don't do daily

As of 2026-05-08, the daily link_docs_to_issues step is removed. Pre-linking forced each new document into a single Issue foreign key, using cosine embedding similarity and an Opus borderline-arbitration pass. Threads-on-demand replaces this: a document finds its threads at query time, naturally supports many-to-many membership, and can use the user's query as context for borderline disambiguation. The command itself is retained for ad-hoc backfills.

Embeddings

  • Model: gemini-embedding-001 (Google).
  • Dimensions: 3,072.
  • Input: for documents, the title concatenated with the summary. For Issues, title + summary + accumulated search tags.
  • Storage: JSON-encoded vector on the row itself; no vector database. Postgres JSONB + cosine in Python is enough at the current corpus size (~200K rows).
  • Refresh: embedded once on first save; re-embedded only when title or summary materially changes.
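At this scale the "index" is just brute force: decode each row's JSON vector and rank by cosine similarity. A sketch (function names are ours, not the codebase's):

```python
import json
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def top_k(query_vec, rows, k=20):
    """rows: iterable of (doc_id, json_encoded_vector) pulled from Postgres.
    Brute-force cosine ranking is fine at ~200K rows; no vector DB needed."""
    scored = [(cosine(query_vec, json.loads(vec)), doc_id) for doc_id, vec in rows]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```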

Threads on demand

When a user submits a topic, the on-demand thread builder:

  1. Embeds the query with the same Gemini embedding model.
  2. Performs a cosine search over the document corpus to retrieve a candidate set.
  3. Filters candidates by basic relevance (similarity threshold, recency, document type if specified).
  4. Asks an LLM to assemble the candidates into a chronological thread, returning a structured set of events (consultations opened, responses published, bills introduced, regulatory decisions, etc.) with citations back to the source documents.
  5. Persists the resulting thread (so subsequent visitors see a cached version) and runs Coverage Check in the background to surface anything the corpus is missing.
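Step 3 is a plain filter over the candidate set. A sketch with illustrative thresholds (the production values may differ):

```python
from datetime import date, timedelta


def filter_candidates(candidates, *, min_similarity=0.70, max_age_days=1825,
                      doc_type=None, today=None):
    """Keep candidates above the similarity threshold, inside the recency
    window, and (optionally) of the requested document type.
    candidates: dicts with 'similarity', 'published', 'document_type'.
    The 0.70 / 5-year defaults are illustrative, not the shipped values."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [
        c for c in candidates
        if c["similarity"] >= min_similarity
        and c["published"] >= cutoff
        and (doc_type is None or c["document_type"] == doc_type)
    ]
```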

Coverage Check / Research Sweep

Coverage Check is a per-issue gap-finding feature. It is also the engine that powers threads on demand for novel queries.

  • Model: Anthropic Claude Sonnet 4.6, with Anthropic's native web_search tool enabled.
  • Search modes: official_only (gov.uk + parliament.uk + regulator domains), official_plus (adds devolved + ALB sites), public_web (broader UK web), news.
  • Output: a structured set of candidates, each placed in one of five buckets:
    • missing — likely a real gap in our index
    • covered — already in our DB (matched by URL or title)
    • related — adjacent thread; LLM proposes either an existing-issue link or a draft new Issue
    • low_conf — surfaced but low confidence
    • background — context, not a discrete event
  • Cost / rate limits: roughly $0.43–$0.69 per sweep. Public access is allowed, with a daily quota of 5 sweeps per user; anonymous traffic shares a single quota.
  • Async: a sweep takes 60–120 seconds, longer than Railway's edge-proxy timeout. The view spawns a daemon thread and immediately redirects to the inbox, which auto-refreshes via <meta http-equiv="refresh"> until the run finishes.
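Stripped of the Django view machinery, the async pattern boils down to this (names are hypothetical):

```python
import threading


def start_sweep_async(run_sweep, *args, **kwargs):
    """Run the sweep on a daemon thread so the view can redirect to the
    inbox immediately instead of holding the request open past the
    edge-proxy timeout. The inbox page polls until the run finishes."""
    t = threading.Thread(target=run_sweep, args=args, kwargs=kwargs, daemon=True)
    t.start()
    return t
```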

Briefings

The briefing generator turns a watchlist (a curated set of Issues) into an HTML report:

  • Themed executive summary written by Claude (Sonnet) from the underlying issue summaries.
  • Per-issue sections that summarise the most recent material events on each thread, with footnoted citations to the source documents.
  • Global footnotes deduplicated across the document set.
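Global footnote dedup amounts to assigning each unique citation a number in first-seen order across all sections. A sketch, under the assumption that citations are keyed by URL:

```python
def dedupe_footnotes(sections):
    """sections: list of per-issue lists of citation URLs.
    Returns (numbered, per_section): `numbered` maps each unique URL to a
    global footnote number in first-seen order; `per_section` gives each
    section's citations as those global numbers, so a source cited by two
    issues gets one footnote, not two."""
    numbered = {}
    per_section = []
    for urls in sections:
        nums = []
        for url in urls:
            if url not in numbered:
                numbered[url] = len(numbered) + 1
            nums.append(numbered[url])
        per_section.append(nums)
    return numbered, per_section
```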

Linkage and curation

For internal QA and the ad-hoc backfill path, the legacy three-pass linker is still available:

  1. Exact match between a document's proposed slug and an existing Issue slug — free.
  2. Cosine embedding similarity between document and Issue, threshold 0.70 — uses already-computed embeddings.
  3. Mid-confidence band only: a Claude Opus batch judges 25 candidates at a time and either auto-links above a high threshold or leaves the document unassigned.
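The cascade can be sketched as below. Only the 0.70 floor comes from the text; the 0.90 auto-link threshold and the judge interface are illustrative, and the real pass 3 batches 25 candidates per Opus call rather than judging one at a time.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def link_document(doc_slug, doc_vec, issues, llm_judge=None, floor=0.70, auto=0.90):
    """issues: list of (issue_slug, issue_vector).
    Returns (issue_slug, method) or (None, 'unassigned')."""
    # Pass 1: exact slug match -- free.
    for slug, _ in issues:
        if slug == doc_slug:
            return slug, "exact"
    # Pass 2: cosine over already-computed embeddings.
    best_slug, best_sim = None, 0.0
    for slug, vec in issues:
        sim = cosine(doc_vec, vec)
        if sim > best_sim:
            best_slug, best_sim = slug, sim
    if best_sim >= auto:
        return best_slug, "embedding"
    # Pass 3: only the mid-confidence band goes to the LLM judge.
    if best_sim >= floor and llm_judge is not None and llm_judge(doc_slug, best_slug):
        return best_slug, "llm"
    return None, "unassigned"
```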

Borderline-band Opus calls have totalled ~400 across the project's lifetime — a small spend.

Source role and body resolution

Every PolicyDocument is automatically tagged with:

  • A source role (regulator_response, committee_scrutiny, consultation, parliamentary_record, analysis, announcement, etc.) by a small rule engine in tracker/source_role.py. Domain-prefix rules fire first; document-type rules fall through.
  • One or more responsible bodies (departments, agencies, regulators, ALBs, parliamentary committees) by a resolver that walks alias maps with hardening guards for ambiguous tokens (committee bodies, fire/police, portfolio collisions).
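A sketch of the source-role precedence; the rule tables here are invented examples, not the real contents of tracker/source_role.py:

```python
# Illustrative rule tables -- the real ones live in tracker/source_role.py.
DOMAIN_RULES = [
    ("hse:", "regulator_response"),
    ("commons-library:", "analysis"),
]
TYPE_RULES = {
    "consultation": "consultation",
    "wms": "parliamentary_record",
}


def source_role(base_path, document_type, default="announcement"):
    """Domain-prefix rules fire first; document-type rules fall through;
    anything unmatched gets the default role."""
    for prefix, role in DOMAIN_RULES:
        if base_path.startswith(prefix):
            return role
    return TYPE_RULES.get(document_type, default)
```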

Models in use

  • Embeddings: Google gemini-embedding-001 (3,072 dimensions).
  • Issue classification + retag passes: Google gemini-2.5-flash.
  • Coverage Check / Research Sweep + briefing prose: Anthropic Claude Sonnet 4.6 with native web_search.
  • Borderline linkage (ad-hoc only): Anthropic Claude Opus 4.x.

Storage and infrastructure

  • Django on Python 3.12.
  • Postgres on Railway in production; SQLite for local development.
  • Daily backups enabled on the Railway Postgres instance.
  • Cron via GitHub Actions (Railway's scheduler only supports a single job per service).
  • Caching: Redis when available, locmem fallback.
  • Static files: Whitenoise.
  • Authentication: django-allauth.

Licence and re-use

Documents sourced from GOV.UK, Parliament, devolved governments, and most regulator sites are made available under the Open Government Licence v3.0 or equivalent. Where a regulator publishes under a more restrictive licence (a small minority), we link out to the original rather than redistribute.

Recent material changes

  • 2026-05-08 — added 16 own-site / sister-source ingests (Lords Library, POST, Takeover Panel, TPR, AISI, NESO, Combined Authorities, Welsh Government consultations, HSE/BSR, CAA, OEP, NICE, Ofwat, SRA, BSB, FRC, Gambling Commission). Removed the daily document-to-issue pre-linker step in favour of threads-on-demand. Added dedup_regulator_sources for cross-source deduplication.
  • 2026-05-06 — Coverage Check / Research Sweep shipped publicly. Devolved SI ingestion (Scottish, Welsh, NI) added. Calibration tooling for linkage thresholds added.