WIP — Data and labels are in active iteration

About

This page describes the dataset and explains, in plain language, how we turn annual-report text into the dashboard metrics.

Dataset

The AI Risk Observatory dataset includes a metadata.csv file that maps each excerpt to its company, year, source report, and other metadata. It also provides the excerpts themselves, each annotated with classifier-assigned labels: mention type, adoption maturity, risk taxonomy, vendor references, signal strength, and substantiveness scores. The LLM classifiers' reasoning is included as well, to support quality-assurance review.

The full dataset, processing pipeline, and documentation are available on GitHub.

View on GitHub

Method Summary

This dashboard currently covers 150 company-year reports (2022-2024) from 50 companies across 16 sectors. The pipeline extracted 1538 AI-related text chunks from 45 of those companies, then classified each chunk into structured labels.

The method is intentionally staged: find potential AI text first, then classify what that text is about, then aggregate labels to report-level trends.

Company-Year Reports: 150
Extracted Chunks: 1538
Reports With AI Signal: 109
Reports With AI Risk Signal: 87

Methodology in a Nutshell

The pipeline has three stages: pre-processing, processing, and post-processing. The diagram below shows the end-to-end flow.

Pre Processing

1.1 List companies and years to analyze, with metadata
1.2 Fetch annual reports
1.3 Convert reports to Markdown
1.4 Extract all excerpts that mention AI

Processing

2.1 For all AI mention excerpts
2.2 Run a mention type classifier
2.3 Run Phase 2 classifiers (risk, adoption, vendor)
2.4 Run the boilerplate level classifier

Post Processing

3.1 Aggregate classifications across chunks and reports
3.2 Compute metrics and trends
3.3 Produce a structured dataset
3.4 Visualize on the dashboard

1. Pre-processing

We collect annual reports, convert them to normalized markdown text, then detect AI keyword hits and build chunk windows around those hits.

1.1 Candidate retrieval is recall-first: keyword matching is intentionally broad (for example AI, artificial intelligence, machine learning, LLM, GPT, GenAI, Copilot). This catches more true candidates at the cost of false positives, which later classification stages filter out.

1.2 Context windows are merged: overlapping hits are deduplicated into a single chunk so nearby mentions are analyzed together.

1.3 Long/noisy blocks are cleaned: very long table rows and formatting noise are reduced so classifiers see readable text.

1.4 Traceability is preserved: each chunk keeps source report identifiers, section hints, offsets, and matched keywords.
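The steps above can be sketched in Python. This is an illustrative, recall-first matcher under assumed settings: the keyword list is taken from the examples given, while the window size, regex form, and function name are hypothetical, not the pipeline's actual configuration.

```python
import re

# Broad, recall-first keyword patterns (examples from the method text).
# \b word boundaries keep "AI" from matching inside other words, while
# IGNORECASE deliberately over-matches; false positives are filtered later.
AI_KEYWORDS = re.compile(
    r"\b(AI|artificial intelligence|machine learning|LLM|GPT|GenAI|Copilot)\b",
    re.IGNORECASE,
)
WINDOW = 200  # hypothetical context window, in characters, on each side of a hit


def find_chunks(text: str) -> list[tuple[int, int]]:
    """Return merged (start, end) character spans around keyword hits."""
    spans: list[tuple[int, int]] = []
    for m in AI_KEYWORDS.finditer(text):
        start = max(0, m.start() - WINDOW)
        end = min(len(text), m.end() + WINDOW)
        if spans and start <= spans[-1][1]:
            # Overlapping windows merge into one chunk, so nearby
            # mentions are analyzed together.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans
```

Each resulting span would then be cut into a chunk that carries its source report identifiers, offsets, and matched keywords.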

2. Processing

Processing is two-phase. Phase 1 decides what type of AI mention is in a chunk. Phase 2 adds deeper labels.

2.1 Mention type labels: adoption, risk, harm, vendor, general_ambiguous, none. Labels can co-occur, with one exception: none stands alone and means no real AI mention (a false positive).

This chart is shown here because mention type is the Phase 1 gate: it determines how chunks are routed to downstream classifiers, so changes here flow through the rest of the pipeline.

Mention Types Over Time

Chart series: Adoption, Risk, Vendor, General / Ambiguous, Harm.

Each bar shows how many reports per year were tagged with each mention type (confidence ≥ 0.2).
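The caption's counting rule can be sketched as follows. Only the 0.2 threshold comes from the text; the per-report data shape and function name are assumptions for illustration.

```python
from collections import Counter

THRESHOLD = 0.2  # default confidence cutoff stated in the method text


def reports_per_year_type(reports):
    """Count reports tagged with each mention type, per year.

    `reports` is a list of (year, {mention_type: confidence}) pairs;
    this shape is illustrative, not the dataset's actual schema.
    """
    counts = Counter()
    for year, labels in reports:
        for mtype, conf in labels.items():
            if conf >= THRESHOLD:
                counts[(year, mtype)] += 1
    return counts
```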

2.2 Routing logic: the risk classifier only runs on chunks tagged risk, adoption classifier only on adoption chunks, and vendor classifier only on vendor chunks. If Phase 1 misses a tag, Phase 2 for that branch will not run.

2.3 Taxonomies used: adoption type = non_llm, llm, agentic. Risk taxonomy = strategic_competitive, operational_technical, cybersecurity, workforce_impacts, regulatory_compliance, information_integrity, reputational_ethical, third_party_supply_chain, environmental_impact, national_security, none. Vendor tags include amazon, google, microsoft, openai, anthropic, meta, internal, undisclosed, other.

2.4 Signal and substantiveness: adoption uses a 0-3 signal scale, while risk and vendor use 1-3 (weak implicit to explicit). Risk chunks also get a substantiveness label (boilerplate/moderate/substantive).
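The routing rule in 2.2 is simple enough to state as code. A minimal sketch, assuming hypothetical classifier names; the mapping itself follows the text, with each Phase 2 branch running only when its tag is present:

```python
# Phase 1 tags gate Phase 2: each deeper classifier runs only on chunks
# carrying the matching mention-type tag. Classifier names are hypothetical.
PHASE2_ROUTES = {
    "risk": "risk_classifier",
    "adoption": "adoption_classifier",
    "vendor": "vendor_classifier",
}


def route(mention_types: set[str]) -> list[str]:
    """Return the Phase 2 classifiers to run for one chunk's Phase 1 tags."""
    # If Phase 1 missed a tag, the corresponding branch never runs.
    return sorted(PHASE2_ROUTES[t] for t in mention_types if t in PHASE2_ROUTES)
```

Note that a chunk tagged only harm or general_ambiguous triggers no Phase 2 classifier at all.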

Exact Taxonomy Reference (Canonical Labels)

Labels are shown exactly as stored in classifier outputs and dataset fields for transparency.

Mention Type Taxonomy

adoption: Current use, rollout, pilot, implementation, or delivery of AI systems by the company (or for clients).
risk: AI is described as a downside or exposure for the company.
harm: AI is described as causing or enabling harm (for example misinformation, fraud, abuse, safety incidents).
vendor: A named AI model/vendor/platform provider is referenced.
general_ambiguous: AI is mentioned, but the text is too high-level or vague for adoption/risk/harm/vendor.
none: No real AI mention / false positive. This label is exclusive (it should not co-occur with others).

Adoption Taxonomy

non_llm: Traditional AI/ML (non-LLM), such as predictive models, computer vision, and detection/classification systems.
llm: Large language model / GenAI use (for example GPT/ChatGPT/Gemini/Claude/Copilot-style deployments).
agentic: Autonomous or agent-based workflows with limited human intervention (can co-occur with llm).

Adoption signal scale: 0 absent, 1 weak implicit, 2 strong implicit, 3 explicit.

Risk Taxonomy

strategic_competitive: AI-driven competitive disadvantage, disruption, or failure to adapt.
operational_technical: Reliability/accuracy/model-risk failures that degrade operations or decisions.
cybersecurity: AI-enabled attacks/fraud/breach pathways or adversarial AI abuse.
workforce_impacts: AI-related displacement, skills gaps, or risky employee AI usage.
regulatory_compliance: AI-specific legal/regulatory/privacy/IP liability and compliance burden.
information_integrity: AI-enabled misinformation, deepfakes, or authenticity manipulation.
reputational_ethical: AI-linked trust, fairness, ethics, or rights concerns.
third_party_supply_chain: Dependency on external AI vendors/providers and concentration exposure.
environmental_impact: AI-related energy, carbon, or resource-burden risk.
national_security: AI-linked geopolitical/security destabilization or critical-systems exposure.
none: No attributable AI-risk category (or too vague to assign one).

Risk signal scale: 1 weak implicit, 2 strong implicit, 3 explicit.

Vendor Taxonomy

amazon: Amazon / AWS / Bedrock / Titan / related Amazon AI model platforms.
google: Google / Vertex AI / Gemini / DeepMind / related Google AI model platforms.
microsoft: Microsoft / Azure AI / Copilot / Azure OpenAI Service.
openai: OpenAI / GPT / ChatGPT references.
anthropic: Anthropic / Claude references.
meta: Meta AI / Llama references.
internal: Explicitly in-house or proprietary model development/deployment.
undisclosed: Third-party AI provider is implied but not named.
other: Named provider outside the predefined list (with free-text vendor name in metadata).

Vendor signal scale: 1 weak implicit, 2 strong implicit, 3 explicit.

Substantiveness Levels

boilerplate: Generic AI language with low information density; could appear in many reports unchanged.
moderate: Specific area is identified, but with limited mechanism, metrics, or mitigation detail.
substantive: Concrete mechanism and/or tangible action, commitment, metric, system detail, or timeline.

3. Post-processing

Chunk outputs are normalized and aggregated into both per-chunk and per-report views.

3.1 Confidence handling: report-level adoption/risk trend counts apply a confidence threshold (default 0.2) wherever per-label confidence scores exist; explicit risk-signal entries are retained. Signal heatmaps bin values into weak/strong/explicit.

3.2 Legacy compatibility: older risk labels (for example regulatory, workforce) are mapped to current canonical labels so longitudinal charts stay comparable.

3.3 Report denominator is explicit: we keep no-signal reports in the report-level dataset to show blind spots, not just positive cases.
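Two of the normalization rules above lend themselves to a short sketch: the legacy label mapping from 3.2 and the signal binning from 3.1. Only the two mapping entries named in the text are included; the repo may define more.

```python
# Map legacy risk labels to current canonical labels (examples from the
# method text; the actual mapping in the repo may contain more entries).
LEGACY_RISK_MAP = {
    "regulatory": "regulatory_compliance",
    "workforce": "workforce_impacts",
}

# Bin 1-3 signal strengths for heatmaps.
SIGNAL_BINS = {1: "weak", 2: "strong", 3: "explicit"}


def canonical_risk(label: str) -> str:
    """Return the canonical form of a possibly-legacy risk label."""
    return LEGACY_RISK_MAP.get(label, label)


def bin_signal(signal: int) -> str:
    """Bin a 1-3 signal value into weak/strong/explicit."""
    return SIGNAL_BINS[signal]
```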

Quality Controls

We use schema-constrained outputs, deterministic settings, and explicit validation/reconciliation tools to reduce noise and improve reproducibility.

Structured outputs: classifiers write to strict response schemas (Pydantic + JSON schema), reducing malformed labels.
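The closed label set and the exclusivity rule for none can both be enforced at the schema level. The repo uses Pydantic response schemas; the stdlib sketch below illustrates the same two constraints without that dependency (class and field names are hypothetical):

```python
from dataclasses import dataclass, field

# Canonical mention-type labels from the taxonomy reference.
MENTION_TYPES = {"adoption", "risk", "harm", "vendor", "general_ambiguous", "none"}


@dataclass
class MentionTypeOutput:
    labels: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        unknown = self.labels - MENTION_TYPES
        if unknown:
            # Reject labels outside the taxonomy (malformed classifier output).
            raise ValueError(f"unknown mention types: {sorted(unknown)}")
        if "none" in self.labels and len(self.labels) > 1:
            # 'none' is exclusive and must not co-occur with other labels.
            raise ValueError("'none' cannot co-occur with other labels")
```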

Conservative prompting: prompts require explicit AI attribution and discourage category over-assignment.

Testing and reconciliation: repo scripts support QA checks, human-vs-LLM disagreement review, and merge-back of reconciled labels.

Known limitations: this release is still primarily LLM-labeled and keyword-seeded, so it can miss subtle non-keyword AI references and can still include some ambiguous cases.