What this site is
Policy Threads is a tracker for live UK policy. It pulls policy artefacts — consultations, draft legislation, statutory instruments, parliamentary scrutiny, regulator decisions, library briefings — from the bodies that publish them, and assembles them into threaded timelines per policy issue. The shipping product direction is threads on demand: a user query reconstructs a thread from the underlying corpus at request time, rather than relying solely on pre-curated issues.
It is built as a Django application, runs on Postgres, and refreshes daily.
Sources we ingest
There are ~38 ingest commands across the categories below. Each is a self-contained scraper or API client that writes to the same PolicyDocument table. The dedup key is the govuk_base_path field, which for non-GOV.UK sources carries a synthetic source prefix (e.g. hse:<slug>, commons-library:<ref>, si:<year>/<number>).
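The prefixed key format can be illustrated with a minimal sketch. The helper names make_dedup_key and si_key are hypothetical and are not taken from the codebase; only the key shapes (hse:<slug>, si:<year>/<number>) come from the description above.

```python
def make_dedup_key(source: str, identifier: str) -> str:
    """Build a synthetic dedup key of the form '<source>:<identifier>'.

    Hypothetical helper: the real ingest commands may build keys differently.
    """
    source = source.strip().lower()
    identifier = identifier.strip().strip("/")
    if not source or not identifier:
        raise ValueError("source and identifier are both required")
    return f"{source}:{identifier}"


def si_key(year: int, number: int) -> str:
    """Key a statutory instrument by year and number, e.g. 'si:2024/123'."""
    return make_dedup_key("si", f"{year}/{number}")
```

Because every source writes a key in its own namespace, two scrapers can never collide on the same value by accident, and re-running an ingest can look up an existing row by key instead of inserting a duplicate.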
UK government
- GOV.UK — Search API + Content API. Catches every department and most attached agencies that publish to GOV.UK (CMA, MHRA, Environment Agency, Companies House, Ofsted etc.).
- Citizen Space portals — five English departmental portals (Justice, Defra, Education, Communities, Energy) plus consult.gov.scot for Scottish Government consultations. Off-site PDFs are merged into the canonical GOV.UK record where applicable.
- Welsh Government — gov.wales/consultations direct.
- legislation.gov.uk — Atom feeds for UK, Scottish, Welsh, and Northern Ireland statutory instruments.
Westminster Parliament
- Bills, stages, publications, chamber debates — Parliament Bills API.
- Select committees — Reports, government responses, special reports, oral evidence; correspondence opt-in.
- Written Ministerial Statements and written parliamentary questions — Q&S API.
- Hansard — narrow scope: Delegated Legislation Committee + Public Bill Committee debates. Wider Hansard ingestion deferred.
Parliamentary research / analytical briefings
- House of Commons Library — Commons Briefing Papers, Commons Debate Packs, Standard Notes, Lords Library Notes, etc. Discovered via the Library's public XML sitemaps; metadata extracted from JSON-LD on each briefing page.
- Lords Library — sister source.
- Parliamentary Office of Science and Technology (POST) — POSTnotes and POSTbriefs.
Statutory regulators
Each scraped from the regulator's own website (sitemap-driven where possible, paginated listings where not):
- FCA, PRA / Bank of England, Ofcom, Ofgem, ORR, PSR, ICO
- CAA, NICE, Ofwat, HSE (and Building Safety Regulator), OEP
- The Pensions Regulator, Takeover Panel, FRC, SRA, Bar Standards Board
- Gambling Commission
- National Energy System Operator (NESO)
- AI Safety Institute (AISI)
Other
- Law Commission — WordPress publication taxonomy.
- Mayoral combined authorities — currently TVCA and GMCA; others scaffolded.
- Inquiry-tracker ETL — cross-import of cached parliamentary activity records (committee recommendations, oral evidence, hansard mentions, NAO and PFD reports) from a sibling project.
Devolved parliaments (in flight)
- Scottish Parliament Bills and Senedd Bills are scaffolded; not yet on the daily cron. NI Assembly deferred.
Daily pipeline
A single GitHub Actions workflow runs at 06:00 UTC each day. Every ingest step is wrapped continue-on-error so one slow or blocked source doesn't stop the rest.
- Ingest — runs every source listed above with a --since 35-days-ago window. Per-row save points are wrapped with @retry_on_db_error: four attempts with exponential backoff, closing the database connection between attempts so reconnects survive Railway's proxy drops.
- Dedup — two passes: dedup_cs_govuk collapses Citizen Space rows that duplicate GOV.UK rows on the same issue. dedup_regulator_sources collapses GOV.UK rows that duplicate own-site regulator rows on the same body, by normalised title similarity (≥0.85) and date proximity (±30 days). Own-site is treated as canonical for the body's own publications.
- Attachment backfill — refetches GOV.UK content for documents whose attachments field is empty, then scrapes Citizen Space for off-site PDFs.
- Pair consultation outcomes to their open consultation documents (rule-based, no LLM).
- Lifecycle stages — assigns each Issue's lifecycle stage from its underlying document events.
- Embeddings — embeds every new Issue and PolicyDocument via Google's gemini-embedding-001 model (3,072 dimensions). Stored as JSON on each row. Load-bearing: this is the index used by every semantic-search and on-demand thread feature.
- Retag legacy "other" rows — bounded Gemini 2.5 Flash pass that drains the legacy document_type='other' backlog, capped at 2,000 documents per night and gated by a high-confidence threshold.
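The retry behaviour described in the ingest step (four attempts, exponential backoff, connection closed between attempts) might look roughly like the sketch below. It reuses the name retry_on_db_error for readability, but the parameters, the default exception type, and the one-second base delay are assumptions; in a Django project the exceptions would likely be django.db.OperationalError and django.db.InterfaceError, and close_connection would be django.db.connection.close.

```python
import functools
import time


def retry_on_db_error(exceptions=(ConnectionError,), attempts=4, base_delay=1.0,
                      close_connection=lambda: None, sleep=time.sleep):
    """Retry the wrapped call with exponential backoff (illustrative sketch).

    `exceptions`, `base_delay`, `close_connection` and `sleep` are assumed
    parameters, injected here so the sketch runs without Django.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the original error
                    # Drop the dead connection so the next attempt reconnects.
                    close_connection()
                    # Delays double each time: 1s, 2s, 4s with the defaults.
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Closing the connection before sleeping matters more than the backoff itself: without it, the next attempt would reuse the same broken socket and fail again immediately.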
A weekly job runs classify_issues (Gemini 2.5 Flash) to mint new Issues from clusters of unassigned documents, in 40-document batches.
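Returning to the second dedup pass in the daily pipeline: the duplicate test combines a title-similarity floor (0.85) with a date window (30 days). The sketch below uses difflib.SequenceMatcher from the standard library as the similarity measure, which is an assumption; the source does not say which measure dedup_regulator_sources uses, and the normalisation rules here are also illustrative.

```python
import re
from datetime import date
from difflib import SequenceMatcher


def normalise_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    title = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(title.split())


def is_duplicate(title_a: str, date_a: date, title_b: str, date_b: date,
                 min_similarity: float = 0.85, max_days: int = 30) -> bool:
    """True when two rows look like the same publication from two sources."""
    # Cheap check first: rows published far apart are never merged.
    if abs((date_a - date_b).days) > max_days:
        return False
    ratio = SequenceMatcher(
        None, normalise_title(title_a), normalise_title(title_b)
    ).ratio()
    return ratio >= min_similarity
```

When a pair matches, the own-site row would be kept and the GOV.UK row collapsed into it, following the canonical-source rule stated in the pipeline description.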
What we deliberately don't do daily
As of 2026-05-08, the daily link_docs_to_issues step is removed. Pre-linking forced each new document into a single Issue foreign key, using cosine embedding similarity and an Opus borderline-arbitration pass. Threads-on-demand replaces this: a document finds its threads at query time, naturally supports many-to-many membership, and can use the user's query as context for borderline disambiguation. The command itself is retained for ad-hoc backfills.
Embeddings
- Model: gemini-embedding-001 (Google).
- Dimensions: 3,072.
- Input: for documents, the title concatenated with the summary. For Issues, title + summary + accumulated search tags.
- Storage: JSON-encoded vector on the row itself; no vector database. Postgres JSONB + cosine in Python is enough at the current corpus size (~200K rows).
- Refresh: embedded once on first save; re-embedded only when title or summary materially changes.
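As a minimal sketch of the "cosine in Python" approach, the code below scores JSON-encoded vectors against a query vector and returns the best matches. The function names and the (row_id, json_vector) row shape are assumptions for illustration; a production version would more plausibly vectorise the arithmetic with NumPy than loop in pure Python.

```python
import json
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, rows, k=5):
    """Rank rows by similarity to the query.

    rows: iterable of (row_id, json_encoded_vector) pairs (assumed shape).
    Returns up to k (row_id, score) pairs, best first.
    """
    scored = [(cosine(query_vec, json.loads(raw)), row_id) for row_id, raw in rows]
    scored.sort(reverse=True)
    return [(row_id, score) for score, row_id in scored[:k]]
```

The trade-off is a full scan per query in exchange for having no separate index to keep in sync with the rows.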
Threads on demand
When a user submits a topic, the on-demand thread builder:
- Embeds the query with the same Gemini embedding model.
- Performs a cosine search over the document corpus to retrieve a candidate set.
- Filters candidates by basic relevance (similarity threshold, recency, document type if specified).
- Asks an LLM to assemble the candidates into a chronological thread, returning a structured set of events (consultations opened, responses published, bills introduced, regulatory decisions, etc.) with citations back to the source documents.
- Persists the resulting thread (so subsequent visitors see a cached version) and runs Coverage Check in the background to surface anything the corpus is missing.
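The filtering step in the list above can be sketched as follows. The Candidate type, the 0.6 similarity floor, the ten-year recency window, and the limit of 50 are all hypothetical values chosen for illustration; the chronological ordering before hand-off to the LLM is likewise an assumption rather than a documented behaviour.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Candidate:
    doc_id: str
    similarity: float   # cosine similarity to the query embedding
    published: date
    document_type: str


def filter_candidates(candidates, today, min_similarity=0.6, max_age_days=3650,
                      document_types=None, limit=50):
    """Apply similarity, recency and optional document-type filters."""
    cutoff = today - timedelta(days=max_age_days)
    kept = [
        c for c in candidates
        if c.similarity >= min_similarity
        and c.published >= cutoff
        and (document_types is None or c.document_type in document_types)
    ]
    # Keep the strongest matches, then order them by date so the
    # thread-assembly prompt receives events in chronological order.
    kept.sort(key=lambda c: c.similarity, reverse=True)
    return sorted(kept[:limit], key=lambda c: c.published)
```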
Coverage Check / Research Sweep
Coverage Check is a per-issue gap-finding feature. It is also the engine that powers threads on demand for novel queries.
- Model: Anthropic Claude Sonnet 4.6, with Anthropic's native web_search tool enabled.
- Search modes: official_only (gov.uk + parliament.uk + regulator domains), official_plus (adds devolved + ALB sites), public_web (broader UK web), news.
- Output: a structured set of candidates, each placed in one of five buckets:
- missing — likely a real gap in our index
- covered — already in our DB (matched by URL or title)
- related — adjacent thread; LLM proposes either an existing-issue link or a draft new Issue
- low_conf — surfaced but low confidence
- background — context, not a discrete event
- Cost / rate limits: roughly $0.43–$0.69 per sweep. Public access is allowed; anonymous traffic shares a daily quota of 5 sweeps per user.
- Async: a sweep takes 60–120 seconds, longer than Railway's edge-proxy timeout. The view spawns a daemon thread and immediately redirects to the inbox, which auto-refreshes via <meta http-equiv="refresh"> until the run finishes.
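The async pattern in the last bullet (start the sweep on a daemon thread, return at once, let the page poll) can be sketched without Django as below. The in-memory RUNS registry and the function names are hypothetical; a real deployment would presumably record run status on a database row so the auto-refreshing inbox can read it from any worker process.

```python
import threading
import uuid

# Hypothetical in-memory registry of run state, keyed by run id.
RUNS = {}


def run_sweep(run_id, topic, do_sweep):
    """Worker body: execute the sweep and record the outcome."""
    try:
        RUNS[run_id] = {"status": "done", "result": do_sweep(topic)}
    except Exception as exc:
        # Record the failure so the polling page can stop refreshing.
        RUNS[run_id] = {"status": "failed", "error": str(exc)}


def start_sweep(topic, do_sweep):
    """Start a sweep in the background and return immediately."""
    run_id = uuid.uuid4().hex
    RUNS[run_id] = {"status": "running"}
    worker = threading.Thread(
        target=run_sweep, args=(run_id, topic, do_sweep), daemon=True
    )
    worker.start()
    return run_id, worker
```

A view built on this would call start_sweep, then redirect to a page that checks the run's status on each refresh. One design consequence worth noting: daemon threads are killed when the process exits, so a run in flight during a redeploy is lost unless its status is reconciled afterwards.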
Briefings
The briefing generator turns a watchlist (a curated set of Issues) into an HTML report:
- Themed executive summary written by Claude (Sonnet) from the underlying issue summaries.
- Per-issue sections that summarise the most recent material events on each thread, with footnoted citations to the source documents.
- Global footnotes deduplicated across the document set.
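One way to deduplicate footnotes globally is to number each source URL the first time it is cited and reuse that number afterwards. The sketch below assumes sections arrive as (issue_title, [source_url, ...]) pairs; that shape and the function name are illustrative, not taken from the briefing generator.

```python
def number_footnotes(sections):
    """Assign one footnote number per distinct URL across all sections.

    sections: list of (issue_title, [source_url, ...]) pairs (assumed shape).
    Returns (per_section_refs, footnotes), where footnotes is a list of
    (number, url) pairs in first-cited order.
    """
    numbers = {}
    per_section = []
    for title, urls in sections:
        refs = []
        for url in urls:
            if url not in numbers:
                numbers[url] = len(numbers) + 1  # first citation gets the next number
            refs.append(numbers[url])
        per_section.append((title, refs))
    # dicts preserve insertion order, so this is already in numeric order.
    footnotes = [(n, url) for url, n in numbers.items()]
    return per_section, footnotes
```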
Linkage and curation
For internal QA and the ad-hoc backfill path, the legacy three-pass linker is still available:
- Exact match between a document's proposed slug and an existing Issue slug — free.
- Cosine embedding similarity between document and Issue, threshold 0.70 — uses already-computed embeddings.
- Mid-confidence band only: a Claude Opus batch judges 25 candidates at a time and either auto-links above a high threshold or leaves the document unassigned.
Borderline-band Opus calls have totalled ~400 cumulative across the project's lifetime — small spend.
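The three passes can be sketched as a single decision function. This is one reading of the description, under stated assumptions: a cosine score at or above 0.70 links directly, scores in a lower band go to a judge, and the judge's confidence must clear a high bar. The band floor (0.55), the judge threshold (0.90), and the similarity and judge callables are hypothetical, and the 25-candidate batching of the Opus call is omitted.

```python
def link_document(doc_slug, doc_vec, issues, similarity, judge,
                  link_at=0.70, band_floor=0.55, judge_link_at=0.90):
    """Return (issue_slug_or_None, reason) for one document.

    issues: dict of issue_slug -> embedding.
    similarity: callable(vec_a, vec_b) -> float.
    judge: callable(doc_slug, issue_slug) -> confidence in [0, 1]; stands in
           for the LLM arbitration pass (hypothetical interface).
    """
    # Pass 1: an exact slug match costs nothing.
    if doc_slug in issues:
        return doc_slug, "slug"

    # Pass 2: best cosine match against already-computed embeddings.
    best_slug, best_score = None, 0.0
    for slug, vec in issues.items():
        score = similarity(doc_vec, vec)
        if score > best_score:
            best_slug, best_score = slug, score
    if best_score >= link_at:
        return best_slug, "cosine"

    # Pass 3: only the mid-confidence band reaches the expensive judge.
    if best_score >= band_floor and judge(doc_slug, best_slug) >= judge_link_at:
        return best_slug, "judge"
    return None, "unassigned"
```

Ordering the passes from cheapest to most expensive is what keeps the arbitration call count low: most documents resolve on slug or cosine and never reach the judge.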
Source role and body resolution
Every PolicyDocument is automatically tagged with:
- A source role (regulator_response, committee_scrutiny, consultation, parliamentary_record, analysis, announcement, etc.) by a small rule engine in tracker/source_role.py. Domain-prefix rules fire first; document-type rules apply as the fallback.
- One or more responsible bodies (departments, agencies, regulators, ALBs, parliamentary committees) by a resolver that walks alias maps with hardening guards for ambiguous tokens (committee bodies, fire/police, portfolio collisions).
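The rule ordering for source roles (domain-prefix rules first, document-type rules as the fallback) might be sketched like this. The specific prefixes, document-type names, mappings, and the "other" default are all hypothetical; only the role names and the precedence order come from the description above, and the real rules in tracker/source_role.py may match on something other than URL host and path.

```python
from urllib.parse import urlparse

# Hypothetical domain-prefix rules, checked in order before anything else.
DOMAIN_RULES = [
    ("committees.parliament.uk", "committee_scrutiny"),
    ("hansard.parliament.uk", "parliamentary_record"),
    ("commonslibrary.parliament.uk", "analysis"),
]

# Hypothetical document-type fallback rules.
DOCUMENT_TYPE_RULES = {
    "open_consultation": "consultation",
    "press_release": "announcement",
}


def source_role(url: str, document_type: str, default: str = "other") -> str:
    """Resolve a source role: domain prefix first, then document type."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").removeprefix("www.")
    target = host + parsed.path
    for prefix, role in DOMAIN_RULES:
        if target.startswith(prefix):
            return role
    return DOCUMENT_TYPE_RULES.get(document_type, default)
```

Putting the domain rules first means a committee publication is tagged committee_scrutiny even if its document type would otherwise map to something generic.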
Models in use
- Embeddings: Google gemini-embedding-001 (3,072 dimensions).
- Issue classification + retag passes: Google gemini-2.5-flash.
- Coverage Check / Research Sweep + briefing prose: Anthropic Claude Sonnet 4.6 with native web_search.
- Borderline linkage (ad-hoc only): Anthropic Claude Opus 4.x.
Storage and infrastructure
- Django on Python 3.12.
- Postgres on Railway in production; SQLite for local development.
- Daily backups enabled on the Railway Postgres instance.
- Cron via GitHub Actions (Railway's scheduler only supports a single job per service).
- Caching: Redis when available, locmem fallback.
- Static files: Whitenoise.
- Authentication: django-allauth.
Licence and re-use
Documents sourced from GOV.UK, Parliament, devolved governments, and most regulator sites are made available under the Open Government Licence v3.0 or equivalent. Where a regulator publishes under a more restrictive licence (a small minority), we link out to the original rather than redistribute.
Recent material changes
- 2026-05-08 — added 16 own-site / sister-source ingests (Lords Library, POST, Takeover Panel, TPR, AISI, NESO, Combined Authorities, Welsh Government consultations, HSE/BSR, CAA, OEP, NICE, Ofwat, SRA, BSB, FRC, Gambling Commission). Removed the daily document-to-issue pre-linker step in favour of threads-on-demand. Added dedup_regulator_sources for cross-source deduplication.
- 2026-05-06 — Coverage Check / Research Sweep shipped publicly. Devolved SI ingestion (Scottish, Welsh, NI) added. Calibration tooling for linkage thresholds added.