
About & methodology

What sources we ingest, how the daily pipeline runs, and how threads, embeddings and coverage checks are produced. Last updated 2026-05-08.

What this site is

Policy Threads is a tracker for live UK policy. It pulls policy artefacts — consultations, draft legislation, statutory instruments, parliamentary scrutiny, regulator decisions, library briefings — from the bodies that publish them, and assembles them into threaded timelines per policy issue. The current product direction is threads on demand: a user query reconstructs a thread from the underlying corpus at request time, rather than relying solely on pre-curated issues.

It is built as a Django application, runs on Postgres, and refreshes daily.

Sources we ingest

We run ~38 ingest commands across the categories below. Each is a self-contained scraper or API client that writes to the same PolicyDocument table; a synthetic govuk_base_path prefix (e.g. hse:<slug>, commons-library:<ref>, si:<year>/<number>) serves as the dedup key.
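As a minimal sketch (the helper name is ours, not the codebase's), the synthetic key is just a source prefix joined to the source's own identifier, so re-running an ingest updates the existing row instead of duplicating it:

```python
def synthetic_base_path(prefix: str, identifier: str) -> str:
    """Build the synthetic govuk_base_path used as the dedup key for
    sources that don't live on GOV.UK."""
    return f"{prefix}:{identifier}"

# Re-running an ingest reproduces the same key, so an upsert on
# govuk_base_path updates in place rather than inserting a duplicate.
keys = {
    synthetic_base_path("hse", "ladder-safety"),
    synthetic_base_path("si", "2026/412"),
    synthetic_base_path("hse", "ladder-safety"),  # repeat ingest, same key
}
```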

UK government

  • GOV.UK — Search API + Content API. Catches every department and most attached agencies that publish to GOV.UK (CMA, MHRA, Environment Agency, Companies House, Ofsted etc.).
  • Citizen Space portals — five English departmental portals (Justice, Defra, Education, Communities, Energy) plus consult.gov.scot for Scottish Government consultations. Off-site PDFs are merged into the canonical GOV.UK record where applicable.
  • Welsh Government — scraped directly from gov.wales/consultations.
  • legislation.gov.uk — Atom feeds for UK, Scottish, Welsh, and Northern Ireland statutory instruments.

Westminster Parliament

  • Bills, stages, publications, chamber debates — Parliament Bills API.
  • Select committees — Reports, government responses, special reports, oral evidence; correspondence opt-in.
  • Written Ministerial Statements and written parliamentary questions — Written Questions & Statements API.
  • Hansard — narrow scope: Delegated Legislation Committee + Public Bill Committee debates. Wider Hansard ingestion deferred.

Parliamentary research / analytical briefings

  • House of Commons Library — Commons Briefing Papers, Commons Debate Packs, Standard Notes, Lords Library Notes, etc. Discovered via the Library's public XML sitemaps; metadata extracted from JSON-LD on each briefing page.
  • Lords Library — sister source, ingested the same way (sitemap discovery plus JSON-LD metadata).
  • Parliamentary Office of Science and Technology (POST) — POSTnotes and POSTbriefs.

Statutory regulators

Each scraped from the regulator's own website (sitemap-driven where possible, paginated listings where not):

  • FCA, PRA / Bank of England, Ofcom, Ofgem, ORR, PSR, ICO
  • CAA, NICE, Ofwat, HSE (and Building Safety Regulator), OEP
  • The Pensions Regulator, Takeover Panel, FRC, SRA, Bar Standards Board
  • Gambling Commission
  • National Energy System Operator (NESO)
  • AI Safety Institute (AISI)

Other

  • Law Commission — WordPress publication taxonomy.
  • Mayoral combined authorities — currently TVCA and GMCA; others scaffolded.
  • Inquiry-tracker ETL — cross-import of cached parliamentary activity records (committee recommendations, oral evidence, hansard mentions, NAO and PFD reports) from a sibling project.

Devolved parliaments (in flight)

  • Scottish Parliament Bills and Senedd Bills are scaffolded; not yet on the daily cron. NI Assembly deferred.

Daily pipeline

A single GitHub Actions workflow runs at 06:00 UTC each day. Every ingest step is wrapped in continue-on-error so one slow or blocked source doesn't stop the rest.

  1. Ingest — every source listed above with a --since 35-days-ago window. Per-row save points are wrapped with @retry_on_db_error: four attempts with exponential backoff, closing the database connection between attempts so reconnects survive Railway's proxy drops.
  2. Dedup — two passes:
    • dedup_cs_govuk collapses Citizen Space rows that duplicate GOV.UK rows on the same issue.
    • dedup_regulator_sources collapses GOV.UK rows that duplicate own-site regulator rows on the same body, by normalised title similarity (≥0.85) and date proximity (±30 days). Own-site is treated as canonical for the body's own publications.
  3. Attachment backfill — refetches GOV.UK content for documents whose attachments field is empty, then scrapes Citizen Space for off-site PDFs.
  4. Pair consultation outcomes to their open consultation documents (rule-based, no LLM).
  5. Lifecycle stages — assigns each Issue's lifecycle stage from its underlying document events.
  6. Embeddings — embeds every new Issue and PolicyDocument via Google's gemini-embedding-001 model (3072 dimensions). Stored as JSON on each row. Load-bearing: this is the index used by every semantic-search and on-demand thread feature.
  7. Retag legacy "other" rows — bounded LLM pass that drains the legacy document_type='other' backlog, capped at 2,000 documents per night with a high-confidence threshold using Gemini 2.5 Flash.
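The @retry_on_db_error wrapper in step 1 might look roughly like this. A sketch: the parameter names are assumptions, and in production the between-attempts hook closes Django's database connection so the next attempt reconnects cleanly after a proxy drop.

```python
import functools
import time


def retry_on_db_error(func=None, *, attempts=4, base_delay=1.0, on_retry=None):
    """Retry a per-row save with exponential backoff. Between attempts,
    `on_retry` runs; in production it closes the DB connection so the
    retry gets a fresh connection after a proxy drop."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the error
                    if on_retry is not None:
                        on_retry()
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    # Support both @retry_on_db_error and @retry_on_db_error(...)
    return decorator if func is None else decorator(func)
```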

A weekly job runs classify_issues (Gemini 2.5 Flash) to mint new Issues from clusters of unassigned documents, in 40-document batches.
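The matching rule in the cross-source dedup pass (step 2 above) can be sketched as follows. The helper names are illustrative, and difflib stands in for whatever similarity measure the real command uses; only the 0.85 threshold and ±30-day window come from the text.

```python
import re
from datetime import date, timedelta
from difflib import SequenceMatcher

SIM_THRESHOLD = 0.85          # normalised title similarity
DATE_WINDOW = timedelta(days=30)  # publication-date proximity


def normalise(title: str) -> str:
    """Lowercase and strip punctuation so cosmetic differences don't matter."""
    return re.sub(r"[^a-z0-9 ]+", "", title.lower()).strip()


def is_cross_source_duplicate(govuk_title, govuk_date, own_title, own_date):
    """True when a GOV.UK row duplicates an own-site regulator row:
    similar enough title, published within +/-30 days. The own-site row
    is then kept as canonical."""
    sim = SequenceMatcher(None, normalise(govuk_title), normalise(own_title)).ratio()
    return sim >= SIM_THRESHOLD and abs(govuk_date - own_date) <= DATE_WINDOW
```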

What we deliberately don't do daily

As of 2026-05-08, the daily link_docs_to_issues step is removed. Pre-linking forced each new document into a single Issue foreign key, using cosine embedding similarity and an Opus borderline-arbitration pass. Threads-on-demand replaces this: a document finds its threads at query time, naturally supports many-to-many membership, and can use the user's query as context for borderline disambiguation. The command itself is retained for ad-hoc backfills.

Embeddings

  • Model: gemini-embedding-001 (Google).
  • Dimensions: 3,072.
  • Input: for documents, the title concatenated with the summary. For Issues, title + summary + accumulated search tags.
  • Storage: JSON-encoded vector on the row itself; no vector database. Postgres JSONB + cosine in Python is enough at the current corpus size (~200K rows).
  • Refresh: embedded once on first save; re-embedded only when title or summary materially changes.
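At this scale the "index" is just brute force: decode each row's JSON vector and rank by cosine similarity. A sketch (function names are ours, not the codebase's):

```python
import json
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def top_k(query_vec, rows, k=20):
    """rows: iterable of (doc_id, json_encoded_vector) pulled from Postgres.
    Brute-force cosine ranking is fine at ~200K rows; no vector DB needed."""
    scored = [(cosine(query_vec, json.loads(vec)), doc_id) for doc_id, vec in rows]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```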

Threads on demand

When a user submits a topic, the on-demand thread builder:

  1. Embeds the query with the same Gemini embedding model.
  2. Performs a cosine search over the document corpus to retrieve a candidate set.
  3. Filters candidates by basic relevance (similarity threshold, recency, document type if specified).
  4. Asks an LLM to assemble the candidates into a chronological thread, returning a structured set of events (consultations opened, responses published, bills introduced, regulatory decisions, etc.) with citations back to the source documents.
  5. Persists the resulting thread (so subsequent visitors see a cached version) and runs Coverage Check in the background to surface anything the corpus is missing.
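Step 3 is a plain filter over the candidate set. A sketch with illustrative thresholds (the production values may differ):

```python
from datetime import date, timedelta


def filter_candidates(candidates, *, min_similarity=0.70, max_age_days=1825,
                      doc_type=None, today=None):
    """Keep candidates above the similarity threshold, inside the recency
    window, and (optionally) of the requested document type.
    candidates: dicts with 'similarity', 'published', 'document_type'.
    The 0.70 / 5-year defaults are illustrative, not the shipped values."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [
        c for c in candidates
        if c["similarity"] >= min_similarity
        and c["published"] >= cutoff
        and (doc_type is None or c["document_type"] == doc_type)
    ]
```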

Coverage Check / Research Sweep

Coverage Check is a per-issue gap-finding feature. It is also the engine that powers threads on demand for novel queries.

  • Model: Anthropic Claude Sonnet 4.6, with Anthropic's native web_search tool enabled.
  • Search modes: official_only (gov.uk + parliament.uk + regulator domains), official_plus (adds devolved + ALB sites), public_web (broader UK web), news.
  • Output: a structured set of candidates, each placed in one of five buckets:
    • missing — likely a real gap in our index
    • covered — already in our DB (matched by URL or title)
    • related — adjacent thread; LLM proposes either an existing-issue link or a draft new Issue
    • low_conf — surfaced but low confidence
    • background — context, not a discrete event
  • Cost / rate limits: roughly $0.43–$0.69 per sweep. Public access is allowed, with a daily quota of 5 sweeps per user; anonymous traffic shares a single quota.
  • Async: a sweep takes 60–120 seconds, longer than Railway's edge-proxy timeout. The view spawns a daemon thread and immediately redirects to the inbox, which auto-refreshes via <meta http-equiv="refresh"> until the run finishes.
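Stripped of the Django view machinery, the async pattern boils down to this (names are hypothetical):

```python
import threading


def start_sweep_async(run_sweep, *args, **kwargs):
    """Run the sweep on a daemon thread so the view can redirect to the
    inbox immediately instead of holding the request open past the
    edge-proxy timeout. The inbox page polls until the run finishes."""
    t = threading.Thread(target=run_sweep, args=args, kwargs=kwargs, daemon=True)
    t.start()
    return t
```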

Briefings

The briefing generator turns a watchlist (a curated set of Issues) into an HTML report:

  • Themed executive summary written by Claude (Sonnet) from the underlying issue summaries.
  • Per-issue sections that summarise the most recent material events on each thread, with footnoted citations to the source documents.
  • Global footnotes deduplicated across the document set.
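Global footnote dedup amounts to assigning each unique citation a number in first-seen order across all sections. A sketch, under the assumption that citations are keyed by URL:

```python
def dedupe_footnotes(sections):
    """sections: list of per-issue lists of citation URLs.
    Returns (numbered, per_section): `numbered` maps each unique URL to a
    global footnote number in first-seen order; `per_section` gives each
    section's citations as those global numbers, so a source cited by two
    issues gets one footnote, not two."""
    numbered = {}
    per_section = []
    for urls in sections:
        nums = []
        for url in urls:
            if url not in numbered:
                numbered[url] = len(numbered) + 1
            nums.append(numbered[url])
        per_section.append(nums)
    return numbered, per_section
```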

Linkage and curation

For internal QA and the ad-hoc backfill path, the legacy three-pass linker is still available:

  1. Exact match between a document's proposed slug and an existing Issue slug — free.
  2. Cosine embedding similarity between document and Issue, threshold 0.70 — uses already-computed embeddings.
  3. Mid-confidence band only: a Claude Opus batch judges 25 candidates at a time and either auto-links above a high threshold or leaves the document unassigned.
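The cascade can be sketched as below. Only the 0.70 floor comes from the text; the 0.90 auto-link threshold and the judge interface are illustrative, and the real pass 3 batches 25 candidates per Opus call rather than judging one at a time.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def link_document(doc_slug, doc_vec, issues, llm_judge=None, floor=0.70, auto=0.90):
    """issues: list of (issue_slug, issue_vector).
    Returns (issue_slug, method) or (None, 'unassigned')."""
    # Pass 1: exact slug match -- free.
    for slug, _ in issues:
        if slug == doc_slug:
            return slug, "exact"
    # Pass 2: cosine over already-computed embeddings.
    best_slug, best_sim = None, 0.0
    for slug, vec in issues:
        sim = cosine(doc_vec, vec)
        if sim > best_sim:
            best_slug, best_sim = slug, sim
    if best_sim >= auto:
        return best_slug, "embedding"
    # Pass 3: only the mid-confidence band goes to the LLM judge.
    if best_sim >= floor and llm_judge is not None and llm_judge(doc_slug, best_slug):
        return best_slug, "llm"
    return None, "unassigned"
```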

Borderline-band Opus calls have totalled ~400 across the project's lifetime — a small spend.

Source role and body resolution

Every PolicyDocument is automatically tagged with:

  • A source role (regulator_response, committee_scrutiny, consultation, parliamentary_record, analysis, announcement, etc.) by a small rule engine in tracker/source_role.py. Domain-prefix rules fire first; document-type rules fall through.
  • One or more responsible bodies (departments, agencies, regulators, ALBs, parliamentary committees) by a resolver that walks alias maps with hardening guards for ambiguous tokens (committee bodies, fire/police, portfolio collisions).
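A sketch of the source-role precedence; the rule tables here are invented examples, not the real contents of tracker/source_role.py:

```python
# Illustrative rule tables -- the real ones live in tracker/source_role.py.
DOMAIN_RULES = [
    ("hse:", "regulator_response"),
    ("commons-library:", "analysis"),
]
TYPE_RULES = {
    "consultation": "consultation",
    "wms": "parliamentary_record",
}


def source_role(base_path, document_type, default="announcement"):
    """Domain-prefix rules fire first; document-type rules fall through;
    anything unmatched gets the default role."""
    for prefix, role in DOMAIN_RULES:
        if base_path.startswith(prefix):
            return role
    return TYPE_RULES.get(document_type, default)
```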

Models in use

  • Embeddings: Google gemini-embedding-001 (3,072 dimensions).
  • Issue classification + retag passes: Google gemini-2.5-flash.
  • Coverage Check / Research Sweep + briefing prose: Anthropic Claude Sonnet 4.6 with native web_search.
  • Borderline linkage (ad-hoc only): Anthropic Claude Opus 4.x.

Storage and infrastructure

  • Django on Python 3.12.
  • Postgres on Railway in production; SQLite for local development.
  • Daily backups enabled on the Railway Postgres instance.
  • Cron via GitHub Actions (Railway's scheduler only supports a single job per service).
  • Caching: Redis when available, locmem fallback.
  • Static files: Whitenoise.
  • Authentication: django-allauth.

Licence and re-use

Documents sourced from GOV.UK, Parliament, devolved governments, and most regulator sites are made available under the Open Government Licence v3.0 or equivalent. Where a regulator publishes under a more restrictive licence (a small minority), we link out to the original rather than redistribute.

Recent material changes

  • 2026-05-08 — added 16 own-site / sister-source ingests (Lords Library, POST, Takeover Panel, TPR, AISI, NESO, Combined Authorities, Welsh Government consultations, HSE/BSR, CAA, OEP, NICE, Ofwat, SRA, BSB, FRC, Gambling Commission). Removed the daily document-to-issue pre-linker step in favour of threads-on-demand. Added dedup_regulator_sources for cross-source deduplication.
  • 2026-05-06 — Coverage Check / Research Sweep shipped publicly. Devolved SI ingestion (Scottish, Welsh, NI) added. Calibration tooling for linkage thresholds added.