ClimateKG — Surfacing Knowledge

A community knowledge graph based on FAIR Data & UNESCO Open Science Principles

Simon Worthington & Laura Oldenbourg — Open Science Lab, TIB – Leibniz Information Centre for Science and Technology

2026-06-10

About ClimateKG

  • Climate Knowledge Graph (ClimateKG) is a yearlong R&D project to create a knowledge graph of the IPCC Sixth Assessment Report (AR6).
  • The ClimateKG knowledge graph is built using Wikibase and MediaWiki.
  • The goal is to be a resource for:
    • data science use,
    • citizen science activities, and
    • distributing climate science metadata to Wikidata.

The project is funded by the TIB Innovation Fund and carried out in cooperation with the #semanticClimate community (India, Germany, UK, +).

What are the IPCC Reports?

It is a survival guide for humanity. As it shows, the 1.5-degree limit is achievable.
- UN Secretary-General António Guterres (2023)

  • The IPCC is a multilateral panel of 195 nations that reviews the global scientific literature to map out climate scenarios for the next 100 years, combining science, policy, and politics.
  • AR6 (≈ 5-8 year cycle since 1988): 932 authors; 7 reports; >8 million words; 10,047 pages; 48,400 citations; 66,834 data sets; 2,136 images; 1,910 acronyms; 920 glossary items; 5+ languages (partial).

ClimateKG Project Deliverables

Transform the PDF and web corpus, together with the web databases, into a knowledge graph using a reusable pipeline built by ClimateKG:

Full text & publishing: Paged Media

Metadata

API / SPARQL Endpoint

Claude & LLMs IDE integration

Schemas / Ontologies: DataCite, OpenAlex, Basic Formal Ontology (BFO), etc

Data workbench: JupyterLab / Quarto
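
As a rough illustration of how the SPARQL endpoint deliverable might be used from a data workbench, the sketch below builds a SPARQL 1.1 Protocol GET request in Python. The endpoint URL is a placeholder (the real ClimateKG endpoint is not given here), and the query and search term are illustrative assumptions only.

```python
from urllib.parse import urlencode

# Hypothetical endpoint URL: the real ClimateKG endpoint is not stated in these slides.
ENDPOINT = "https://example.org/sparql"

# Illustrative query: find English labels containing a search term.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?label WHERE {
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "en" && CONTAINS(LCASE(?label), "adaptation"))
}
LIMIT 10
"""


def build_request(endpoint, query):
    """Return (url, headers) for a SPARQL 1.1 Protocol query via GET."""
    url = endpoint + "?" + urlencode({"query": query.strip()})
    headers = {"Accept": "application/sparql-results+json"}
    return url, headers


if __name__ == "__main__":
    url, headers = build_request(ENDPOINT, QUERY)
    print(url)
    print(headers)
```

The request is only constructed, not sent, so the sketch runs without network access; passing the URL and headers to any HTTP client would complete the call.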

Problem: Public Trust in Climate Science Requires That It Is at the Public's Fingertips!

  • Headline issues:
    • Search and SEO limited
    • Not easy to reuse
    • Web CMS HTML and PDF formats not suitable for data science
    • Referenced materials not interlinked — only manual tracking possible
    • Parts not easily available — Methodology, Supplementary material, etc
  • Examples:
    • Links only to top levels: Reports, Glossary or Author websites, etc.
    • Glossary terms have no links to reports
    • Author lists have no link to chapters
    • Figures listed on web pages with no DOIs
    • Citations hosted on websites rather than in research repositories

Solution: Apply the vision of the Semantic Web

🕸️ Link all of the parts and make them machine readable, AKA a Knowledge Graph

  • Linked Data: Interlinked data with unique identifiers and standard formats
  • FAIR Data: Findable, Accessible, Interoperable, and Reusable
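
To make the Linked Data idea concrete, here is a minimal sketch in Python of how interlinked identifiers let a glossary term be traced to the report it appears in. All identifiers and predicate names are hypothetical placeholders, not real ClimateKG or Wikidata IDs, and a plain list of triples stands in for a real graph store.

```python
from collections import defaultdict

# Hypothetical identifiers and predicates, for illustration only.
TRIPLES = [
    ("term:adaptation", "label", "Adaptation"),
    ("term:adaptation", "mentioned_in", "chapter:example-ch1"),
    ("chapter:example-ch1", "part_of", "report:example-report"),
    ("author:0001", "contributed_to", "chapter:example-ch1"),
]


def index_by_subject(triples):
    """Group (predicate, object) pairs under each subject identifier."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def follow(index, start, *predicates):
    """Walk a chain of predicates from `start` and return the nodes reached."""
    frontier = [start]
    for predicate in predicates:
        frontier = [
            obj
            for node in frontier
            for pred, obj in index.get(node, [])
            if pred == predicate
        ]
    return frontier


if __name__ == "__main__":
    index = index_by_subject(TRIPLES)
    print(follow(index, "term:adaptation", "mentioned_in", "part_of"))
```

Once each part has a unique identifier, a question such as "which report uses this glossary term?" becomes a two-step traversal instead of manual tracking.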

Overview of work packages 📦

  • From Siloed Data to FAIR Data
  • Finding the sources → retrieving the data → storing it → preparing outputs and visualizations → making the data available for reuse

Scoping Data Sources & Data Modeling

  • In scope and out of scope: only the main text of AR6 (excluding front matter and back matter) has been imported so far; citations and data sets have not yet been linked.
  • Data modelling:
    • The simple (KISS) approaches of the Protein Data Bank (1970s) and GenBank (1980s) have been followed to encourage community uptake.
    • Method: bottom-up, top-down, then map to schemas
    • Literature review of Wikibase KGs consulted (Zotero collection)

Data Retrieval 📥

  • Building a semi-automated scraping system from Web to MediaWiki/Wikibase
  • Using the IPCC websites as a starting point, developing a method for scraping text, assets, and additional data
  • Preparing the data for import into a MediaWiki/Wikibase instance
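
The scraping step above could look roughly like the following Python sketch, which turns a fragment of HTML into wikitext and collects image URLs. It uses only the standard library and is an assumption about the general approach, not the project's actual scraper; the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser


class WikitextExtractor(HTMLParser):
    """Collect headings and paragraphs as wikitext lines, plus image URLs."""

    HEADINGS = {"h1": "=", "h2": "==", "h3": "==="}

    def __init__(self):
        super().__init__()
        self.lines = []    # finished wikitext lines
        self.assets = []   # image URLs found in the page
        self._tag = None   # block element currently being read
        self._buffer = []  # text fragments of that element

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._tag, self._buffer = tag, []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.assets.append(src)

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <em> do not close the block
        text = " ".join("".join(self._buffer).split())
        if text:
            marks = self.HEADINGS.get(tag)
            self.lines.append(f"{marks} {text} {marks}" if marks else text)
        self._tag = None


def html_to_wikitext(html):
    """Return (wikitext, asset_urls) for an HTML fragment."""
    parser = WikitextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines), parser.assets


if __name__ == "__main__":
    sample = '<h1>Chapter 1</h1><p>Some <em>example</em> text.</p><img src="/fig1.png">'
    print(html_to_wikitext(sample))
```

Keeping text and assets in separate outputs mirrors the split between wikitext pages and uploaded files on the MediaWiki side.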

Collecting Supporting Data 🧩

  • Additional data from various sources, e.g. IPCC web databases, Crossref, etc.
  • Preparing the supporting data for import into Wikibase
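
As one possible shape for the Crossref part of this step, the Python sketch below builds a Crossref REST API URL for a DOI and flattens a response into a CSV row ready for import. The DOI, the sample response, and the chosen CSV columns are illustrative assumptions; the response is hand-written to match the general layout of the Crossref works endpoint, and no network call is made.

```python
import csv
import io
import json
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"

# Hand-written sample in the general shape of a Crossref works response.
# The DOI and title are placeholders, not real records.
SAMPLE_RESPONSE = json.dumps({
    "status": "ok",
    "message": {
        "DOI": "10.5555/example",
        "title": ["An illustrative title"],
        "issued": {"date-parts": [[2021, 8, 9]]},
    },
})


def crossref_url(doi):
    """Build the works URL, accepting bare DOIs, doi: prefixes, or doi.org links."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    return CROSSREF_WORKS + quote(doi, safe="/")


def to_row(response_text):
    """Reduce one Crossref response to the columns used for import."""
    message = json.loads(response_text)["message"]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    return {
        "doi": message["DOI"],
        "title": (message.get("title") or [""])[0],
        "year": date_parts[0][0],
    }


def rows_to_csv(rows):
    """Serialize rows to CSV text with a fixed header."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["doi", "title", "year"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(crossref_url("https://doi.org/10.5555/example"))
    print(rows_to_csv([to_row(SAMPLE_RESPONSE)]))
```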

Import Scraped Data: Innovation That Allows for Corpus Text and Data Import

The main innovation of the project is the ability to semi-automatically web-scrape a corpus, import its text and images, and build a Linked Open Data corpus backbone that can be imported into Wikibase and MediaWiki.

Wikitext + CSV > XML > DTD > Python (WikibaseIntegrator) > Wikibase / MediaWiki
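
To give a feel for the XML-to-Wikibase stage of this pipeline, here is a minimal Python sketch that maps XML records to the labels and descriptions portion of Wikibase entity JSON. The XML element names and sample record are hypothetical, not the project's actual DTD, and the sketch stops at building the payload; it does not call WikibaseIntegrator, which the real pipeline uses for writing.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML layout for illustration; the project's real DTD may differ.
SAMPLE_XML = """<corpus>
  <chapter id="example-ch1">
    <title>Example chapter title</title>
    <description>Illustrative chapter record</description>
  </chapter>
</corpus>"""


def chapter_to_entity(chapter, language="en"):
    """Map one <chapter> element to a Wikibase-style item payload."""

    def term(value):
        # Wikibase keys each term by language code and repeats the code inside.
        return {language: {"language": language, "value": value}}

    return {
        "type": "item",
        "labels": term(chapter.findtext("title", default="").strip()),
        "descriptions": term(chapter.findtext("description", default="").strip()),
    }


def xml_to_entities(xml_text):
    """Parse the corpus XML and return one payload per chapter."""
    root = ET.fromstring(xml_text)
    return [chapter_to_entity(chapter) for chapter in root.iter("chapter")]


if __name__ == "__main__":
    print(json.dumps(xml_to_entities(SAMPLE_XML), indent=2))
```

Separating payload construction from the write step makes it possible to inspect or validate the generated entities before anything is sent to a Wikibase instance.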

Documentation: wiki.kewl.org/projects:ckgscrape

Code (Mercurial): hg clone https://hg.kewl.org/pub/ckg_s2mw (anti-bot credentials: username/password)

Thank you to Darron M. Broad of Runstop for the software development work.

Import Supporting Data

workflow

xml

Browsing the Corpus in MediaWiki 🔎

  • Cleaning the chapter content directly in MediaWiki
  • Building navigation and browsing templates for the corpus
  • Adding glossary and acronym content to MediaWiki

DevOps

A four-stage MediaWiki/Wikibase deployment is in place with support from #semanticClimate. It will move to WB4R and TIB hosting once initial development is complete.

┌──────────────┐                    ┌──────────────┐
│   LOCAL      │◄───pull-from-dev───│     DEV      │  DEV = DB source of truth
│ (workstation)│                    │ (178...88)   │  Content edited here
└──────┬───────┘                    └──────┬───────┘
       │                                   │
       │ sync-local-to-test                │ sync-dev-to-test
       │ (staging from local)              │ (standard promotion)
       │                                   │
       ↓                                   ↓
   ┌──────────────┐                  ┌──────────────┐
   │    TEST      │──────────────────│    TEST      │
   │ (46...24)    │  (same target)   │ (46...24)    │
   └──────────────┘                  └──────┬───────┘
                                            │
                                            │ sync-dev-to-prod  (or sync-test-to-prod)
                                            ↓
                                       ┌──────────────┐
                                       │    PROD      │  Public instance
                                       │ (178...174)  │
                                       └──────────────┘
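
The promotion paths in the diagram can be written down as data, which makes it easy to check which stages a change can reach. The Python sketch below is only an illustration: the stage and sync command names are taken from the diagram, while the helper functions are hypothetical and not part of the project's tooling.

```python
# Sync command names and stages come from the diagram above;
# each entry maps a command to its (source, target) stages.
SYNC_COMMANDS = {
    "pull-from-dev": ("DEV", "LOCAL"),
    "sync-local-to-test": ("LOCAL", "TEST"),
    "sync-dev-to-test": ("DEV", "TEST"),
    "sync-dev-to-prod": ("DEV", "PROD"),
    "sync-test-to-prod": ("TEST", "PROD"),
}


def commands_between(source, target):
    """Names of the sync commands that copy directly from source to target."""
    return sorted(
        name
        for name, (src, dst) in SYNC_COMMANDS.items()
        if (src, dst) == (source, target)
    )


def reachable_from(stage):
    """Stages that content in `stage` can reach by chaining sync commands."""
    seen, frontier = set(), [stage]
    while frontier:
        current = frontier.pop()
        for src, dst in SYNC_COMMANDS.values():
            if src == current and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


if __name__ == "__main__":
    print(commands_between("DEV", "TEST"))
    print(sorted(reachable_from("LOCAL")))
```

Nothing flows out of PROD in this map, which matches its role in the diagram as the public instance at the end of the promotion chain.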

Data Analysis & Visualisation

CSS Paged Media as Output 📖

Possible output option. Example chapter:

  • Using CSS Paged Media to typeset the corpus content into multi-format output

ClimateKG Next Steps 🚀

Planned use:

  • A community resource for data science — to use and contribute
  • Citizen science activities — Chapter annotation and enrichment #ChapterChampions
  • Distribute metadata to commons (Wikidata)

Goals:

  • Corpus KG service: Computational Publishing Service (CPS)
  • Work with IPCC and climate community
  • Become a KG repository for all climate literature

Thank you! Simon Worthington, Laura Oldenbourg, and team #semanticClimate — June ’26