ClimateKG — Surfacing Knowledge

A community knowledge graph:
Based on FAIR Data & UNESCO Open Science Principles

Simon Worthington & Laura Oldenbourg — Open Science Lab, TIB – Leibniz Information Centre for Science and Technology

2026-06-10

About ClimateKG

Climate Knowledge Graph (ClimateKG) is a yearlong R&D project to create a knowledge graph of the IPCC Sixth Assessment Report (AR6).
The ClimateKG knowledge graph is built using Wikibase and MediaWiki.
The goal is to be a resource for:
- data science use,
- citizen science activities, and
- distributing climate science metadata to Wikidata.

The project has been funded by the TIB Innovation Fund and is carried out in cooperation with the #semanticClimate community (India, Germany, UK, +).

What are the IPCC Reports?

It is a survival guide for humanity. As it shows, the 1.5-degree limit is achievable.
- UN Secretary-General António Guterres (2023)

A multilateral panel of 195 nations that reviews the global scientific literature to map out climate scenarios for the next 100 years, combining science, policy, and politics.
AR6 (≈ 5-8 year cycle since 1988): 932 authors; 7 reports; >8 million words; 10,047 pages; 48,400 citations; 66,834 data sets; 2,136 images; 1,910 acronyms; 920 glossary items; 5+ languages (partial).

ClimateKG Project Deliverables

Transform PDF & web corpus, and web databases into a knowledge graph with a reusable pipeline made by ClimateKG:

Full text & publishing: CSS Paged Media

Metadata

API / SPARQL Endpoint

Claude & LLM IDE integration

Schemas / Ontologies: DataCite, OpenAlex, Basic Formal Ontology (BFO), etc.

Data workbench: JupyterLab / Quarto
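
To illustrate how the SPARQL endpoint deliverable might be used from a JupyterLab notebook, here is a minimal sketch. The endpoint URL, the item ID `Q1`, and the property ID `P1` are placeholders invented for this example, not the real ClimateKG identifiers. It also assumes the query service predefines the `wd:`, `wdt:`, `wikibase:`, and `bd:` prefixes, as Wikibase query services usually do. The sample response is hand-written so the code runs offline.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real ClimateKG address is not given in these slides.
ENDPOINT = "https://climatekg.example.org/query/sparql"

# Hypothetical IDs: Q1 stands for a report item, P1 for a "part of" property.
QUERY = """
SELECT ?chapter ?chapterLabel WHERE {
  ?chapter wdt:P1 wd:Q1 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
""".strip()


def build_request_url(endpoint: str, query: str) -> str:
    """Return a GET URL that passes the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


def labels_from_results(results: dict) -> list[str]:
    """Pull the chapter labels out of a SPARQL JSON results document."""
    return [row["chapterLabel"]["value"] for row in results["results"]["bindings"]]


# Hand-written sample in the SPARQL JSON results layout (not real data).
sample = {
    "results": {
        "bindings": [{"chapterLabel": {"type": "literal", "value": "Chapter 1"}}]
    }
}

print(build_request_url(ENDPOINT, QUERY))
print(labels_from_results(sample))
```

In a notebook, the URL would be fetched with an `Accept: application/sparql-results+json` header and the parsed JSON passed to `labels_from_results`.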

Problem: Public Trust in Climate Science Requires It to Be at People’s Fingertips!

Headline issues:
- Search and SEO limited
- Not easy to reuse
- Web CMS HTML and PDF formats are not suitable for data science
- Referenced materials not interlinked — only manual tracking possible
- Parts not easily available: methodology, supplementary material, etc.

Examples:
- Links only to top levels: Reports, Glossary or Author websites, etc.
- Glossary terms have no links to reports
- Author lists have no link to chapters
- Figures listed on web pages with no DOIs
- Citations held on websites, not in research repositories

Solution: Apply the vision of the Semantic Web

🕸️ Link all of the parts and make them machine-readable
— AKA a Knowledge Graph

Linked Data: Interlinked data with unique identifiers and standard formats
FAIR Data: Findable, Accessible, Interoperable, and Reusable
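
As a rough illustration of what interlinking buys, the sketch below stores a few subject-predicate-object triples and follows them from a glossary term to its report, a traversal the current websites do not support. All identifiers and predicate names (`ckg:mentionedIn`, `ckg:partOf`, and so on) are invented for this example and are not the ClimateKG vocabulary.

```python
# Hypothetical identifiers; real ClimateKG item and property IDs will differ.
triples = {
    ("term:adaptation", "rdf:type", "ckg:GlossaryTerm"),
    ("term:adaptation", "ckg:mentionedIn", "chapter:wg2-ch1"),
    ("chapter:wg2-ch1", "ckg:partOf", "report:ar6-wg2"),
    ("author:a-example", "ckg:authorOf", "chapter:wg2-ch1"),
}


def objects(subject: str, predicate: str) -> list[str]:
    """All objects linked from `subject` via `predicate`."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def subjects(predicate: str, obj: str) -> list[str]:
    """All subjects that point at `obj` via `predicate`."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def reports_for_term(term: str) -> list[str]:
    """Follow term -> chapter -> report links."""
    return sorted(
        {
            report
            for chapter in objects(term, "ckg:mentionedIn")
            for report in objects(chapter, "ckg:partOf")
        }
    )


print(reports_for_term("term:adaptation"))
print(subjects("ckg:authorOf", "chapter:wg2-ch1"))
```

Because every node has a unique identifier, the same data answers both "which report uses this term?" and "who wrote this chapter?" without manual tracking.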

Overview of work packages 📦

From Siloed Data to FAIR Data
Finding the sources → retrieving the data → storing it → preparing different ways of outputting and visualizing it and making data available for reuse

Scoping Data Sources & Data Modeling

In-scope and out-of-scope: Only the main text of AR6 (excluding front matter and back matter) has been imported so far; citations and data sets have not yet been linked.
Data modelling:
- The simple (KISS) approaches of the Protein Data Bank (1970s) and GenBank (1990s) have been followed to encourage community uptake.
- Method: bottom-up, top-down, then map to schemas
- Literature review of Wikibase KGs consulted (Zotero collection)
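
The "bottom-up, top-down, then map to schemas" method could look something like the sketch below. The local field names are hypothetical stand-ins for whatever the scraped data contains (bottom-up). The targets are property names from the DataCite Metadata Schema (top-down). The mapping itself is an assumption made for illustration, not the project's actual data model.

```python
# Bottom-up: hypothetical field names found in scraped chapter data.
# Top-down: DataCite Metadata Schema property names used as targets.
LOCAL_TO_DATACITE = {
    "doi": "identifier",
    "authors": "creators",
    "chapter_title": "titles",
    "year": "publicationYear",
}


def map_record(record: dict) -> tuple[dict, list[str]]:
    """Rename known fields to schema properties and report the rest as unmapped."""
    mapped, unmapped = {}, []
    for key, value in record.items():
        target = LOCAL_TO_DATACITE.get(key)
        if target is None:
            unmapped.append(key)  # candidate for a new mapping or a local property
        else:
            mapped[target] = value
    return mapped, sorted(unmapped)


# Placeholder values, not real metadata.
record = {"doi": "10.0000/example", "year": 2021, "figure_count": 3}
mapped, unmapped = map_record(record)
print(mapped)
print(unmapped)
```

Returning the unmapped fields explicitly keeps the model simple: anything the target schema does not cover stays visible for a later modelling decision.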

Data Retrieval 📥

Building a semi-automated scraping system from Web to MediaWiki/Wikibase
Using the IPCC websites as a starting point, developing a method for scraping text, assets, and additional data
Preparing the data for import into a MediaWiki/Wikibase instance
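
A minimal sketch of the "web page to wikitext" step, using only the Python standard library. It is not the project's scraper. It assumes, purely for illustration, that chapter pages use `<h2>`/`<h3>` for section headings and `<p>` for body text, and it works on an inline sample rather than a live IPCC page.

```python
from html.parser import HTMLParser


class ChapterToWikitext(HTMLParser):
    """Convert <h2>, <h3>, and <p> elements to wikitext; ignore other markup."""

    HEADING_MARKS = {"h2": "==", "h3": "==="}

    def __init__(self):
        super().__init__()
        self.lines = []      # finished wikitext blocks
        self._tag = None     # block element currently being collected
        self._buffer = []    # text fragments inside that element

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_MARKS or tag == "p":
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <em> do not close the block
        text = " ".join("".join(self._buffer).split())  # normalise whitespace
        if text:
            mark = self.HEADING_MARKS.get(tag)
            self.lines.append(f"{mark} {text} {mark}" if mark else text)
        self._tag = None


def html_to_wikitext(html: str) -> str:
    parser = ChapterToWikitext()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


sample_html = "<div><h2>1.1 Introduction</h2><p>Sample <em>paragraph</em> text.</p></div>"
print(html_to_wikitext(sample_html))
```

The "semi-automated" part of the workflow matters here: real pages carry figures, footnotes, and cross-references that a converter this small would drop, so the output still needs manual review.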

Collecting Supporting Data 🧩

Additional data from various sources, e.g. IPCC data - web databases, Crossref, etc.
Preparing the supporting data for import into Wikibase
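
As one possible way to collect supporting data, the sketch below builds a request URL for the public Crossref REST API (`https://api.crossref.org/works/{doi}`) and flattens a response into a row ready for CSV export. The DOI and the sample response are placeholders written by hand in the Crossref response layout. No network call is made, and the output column names are an assumption.

```python
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """URL of the Crossref work record for a DOI."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def flatten_work(response: dict) -> dict:
    """Reduce a Crossref work response to a flat row (hypothetical column names)."""
    work = response["message"]
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in work.get("author", [])
    ]
    return {
        "doi": work["DOI"],
        "title": (work.get("title") or [""])[0],  # Crossref returns titles as a list
        "authors": "; ".join(authors),
    }


# Hand-written placeholder in the Crossref layout, not a real record.
sample_response = {
    "status": "ok",
    "message": {
        "DOI": "10.0000/example",
        "title": ["Example title"],
        "author": [
            {"given": "Ada", "family": "Example"},
            {"given": "Ben", "family": "Sample"},
        ],
    },
}

print(crossref_url("10.0000/example"))
print(flatten_work(sample_response))
```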

Import Scrape Data — Innovation That Allows for Corpus Text and Data Import

The main innovation of the project — being able to semi-automatically web-scrape a corpus, import text and images, and make a Linked Open Data corpus backbone as an import to Wikibase and MediaWiki

Wikitext + CSV > XML > DTD > Python (WikibaseIntegrator) > Wikibase / MediaWiki
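
The middle of this pipeline can be sketched with the standard library alone. The element names (`corpus`, `page`, `title`, `wikitext`) and CSV columns are hypothetical, not the project's actual format. The DTD validation step is replaced here by a simple required-field check, since Python's built-in XML parser does not validate against DTDs. The final write to Wikibase with WikibaseIntegrator is left out because it needs a live instance and credentials.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str) -> str:
    """Wrap each CSV row (id, title, wikitext) in a <page> element."""
    root = ET.Element("corpus")
    for row in csv.DictReader(io.StringIO(csv_text)):
        page = ET.SubElement(root, "page", id=row["id"])
        ET.SubElement(page, "title").text = row["title"]
        ET.SubElement(page, "wikitext").text = row["wikitext"]
    return ET.tostring(root, encoding="unicode")


def xml_to_records(xml_text: str) -> list[dict]:
    """Read pages back as import records, rejecting incomplete ones.

    The check below is a stand-in for DTD validation, not a replacement for it.
    """
    records = []
    for page in ET.fromstring(xml_text).findall("page"):
        title = page.findtext("title")
        if not page.get("id") or not title:
            raise ValueError("each page needs an id attribute and a title")
        records.append(
            {
                "id": page.get("id"),
                "label": title,
                "text": page.findtext("wikitext", default=""),
            }
        )
    return records


sample_csv = "id,title,wikitext\nch1,Framing,== Framing ==\n"
xml_text = csv_to_xml(sample_csv)
records = xml_to_records(xml_text)
print(xml_text)
print(records)
```

Keeping XML as the interchange step means the scrape and the import can be developed and checked separately: each record returned here would become one item or page in the import script.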

Documentation: wiki.kewl.org/projects:ckgscrape

Code (Mercurial): hg clone https://hg.kewl.org/pub/ckg_s2mw (anti-bot credentials: username/password)

Thank you to Darron M. Broad of Runstop for the software development work.

Import Supporting Data

Figures: workflow; XML

Browsing the Corpus in MediaWiki 🔎

Cleaning the chapter content directly in MediaWiki
Building navigation and browsing templates for the corpus
Adding glossary and acronym content to MediaWiki

DevOps

A four-stage MediaWiki/Wikibase deployment is in place with support from #semanticClimate. It will move to WB4R and TIB hosting once initial development is complete.

┌──────────────┐                    ┌──────────────┐
│   LOCAL      │◄───pull-from-dev───│     DEV      │  DEV = DB source of truth
│ (workstation)│                    │ (178...88)   │  Content edited here
└──────┬───────┘                    └──────┬───────┘
       │                                   │
       │ sync-local-to-test                │ sync-dev-to-test
       │ (staging from local)              │ (standard promotion)
       │                                   │
       ↓                                   ↓
   ┌──────────────┐                  ┌──────────────┐
   │    TEST      │──────────────────│    TEST      │
   │ (46...24)    │  (same target)   │ (46...24)    │
   └──────────────┘                  └──────┬───────┘
                                            │
                                            │ sync-dev-to-prod  (or sync-test-to-prod)
                                            ↓
                                       ┌──────────────┐
                                       │    PROD      │  Public instance
                                       │ (178...174)  │
                                       └──────────────┘
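
The promotion paths in the diagram can be written down as data and checked, as in the sketch below. The command names are taken from the diagram. Representing them as a Python dictionary is an illustrative choice, not how the deployment scripts are actually implemented.

```python
# Each sync command from the diagram as (source stage, target stage).
SYNC_COMMANDS = {
    "pull-from-dev": ("DEV", "LOCAL"),
    "sync-local-to-test": ("LOCAL", "TEST"),
    "sync-dev-to-test": ("DEV", "TEST"),
    "sync-dev-to-prod": ("DEV", "PROD"),
    "sync-test-to-prod": ("TEST", "PROD"),
}


def commands_between(source: str, target: str) -> list[str]:
    """Names of the commands that copy directly from `source` to `target`."""
    return sorted(
        name for name, edge in SYNC_COMMANDS.items() if edge == (source, target)
    )


def reachable_from(source: str) -> list[str]:
    """Stages that content from `source` can reach through any chain of syncs."""
    seen, stack = set(), [source]
    while stack:
        node = stack.pop()
        for src, dst in SYNC_COMMANDS.values():
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return sorted(seen)


print(commands_between("DEV", "TEST"))
print(reachable_from("DEV"))
print(reachable_from("PROD"))
```

No command targets DEV, which matches its role as the database source of truth, and nothing flows out of PROD.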

CSS Paged Media as Output 📖

Possible output option (example chapter):

Using CSS Paged Media to typeset the corpus content into multi-format output
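
For readers unfamiliar with CSS Paged Media, the sketch below wraps chapter text in HTML with an `@page` rule that sets the page size, margins, and a page-number footer. The page settings and the function name are assumptions for illustration, not the project's stylesheet. Turning the HTML into a PDF would additionally need a renderer that supports paged media.

```python
from html import escape

# Illustrative page setup; the project's real stylesheet is not shown in the slides.
PAGED_CSS = """
@page {
  size: A4;
  margin: 20mm;
  @bottom-center { content: counter(page); }
}
h1 { break-before: page; }
""".strip()


def chapter_to_html(title: str, paragraphs: list[str]) -> str:
    """Return a standalone HTML document styled for paged output."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title>\n'
        f"<style>\n{PAGED_CSS}\n</style></head>\n"
        f"<body><h1>{escape(title)}</h1>\n{body}\n</body></html>"
    )


html_doc = chapter_to_html("Chapter 1", ["Science & policy.", "Second paragraph."])
print(html_doc)
```

Because the same HTML also displays in a browser, one source can serve both the web view and the print layout, which is the multi-format aim described above.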

ClimateKG Next Steps 🚀

Planned use:

A community resource for data science — to use and contribute
Citizen science activities — Chapter annotation and enrichment #ChapterChampions
Distribute metadata to commons (Wikidata)

Goals:

Corpus KG service: Computational Publishing Service (CPS)
Work with IPCC and climate community
Become a KG repository for all climate literature

Thank you! Simon Worthington, Laura Oldenbourg, and team #semanticClimate — June ’26