UK Biobank Health Data Keeps Ending up on GitHub
Why It Matters
These takedowns highlight gaps in data‑privacy enforcement for research datasets, risking loss of valuable scientific code and slowing open‑science collaboration. The approach also raises questions about the suitability of copyright mechanisms for protecting participant confidentiality.
Key Takeaways
- UK Biobank has filed 110 DMCA takedown notices with GitHub since July 2025.
- Nearly half of the removed files were Jupyter or R notebooks containing data.
- The notices targeted developers in at least 14 countries, including 24 in the United States.
- Takedown activity paused in early 2026 and resumed after a Guardian investigation in March.
Pulse Analysis
UK Biobank, one of the world's largest biomedical resources, grants researchers access to genetic, phenotypic, and health-record data under strict licensing agreements. When a researcher inadvertently uploads raw or derived data to a public platform, the biobank invokes the U.S. Digital Millennium Copyright Act (DMCA) process to request removal, even though the underlying issue is privacy rather than copyright infringement. Because the United Kingdom lacks a dedicated privacy-breach takedown regime, the organization relies on GitHub's DMCA process. GitHub publicly logs each notice in its DMCA repository, providing a rare window into the scale of the problem.
An analysis of the GitHub DMCA archive shows 110 takedown notices filed between July 2025 and March 2026, with a noticeable pause from January to March 2026 that ended after The Guardian's March investigation. Almost half of the flagged items are Jupyter or R notebooks, lightweight scripts that can embed a few rows of participant data, while a quarter are raw genotype files in formats such as PLINK or BGEN. The notices span developers in at least 14 nations; the United States accounts for 24 of the 170 identified users, and China follows with 21, underscoring the global reach of the biobank's data-sharing ecosystem.
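The file types named above suggest a simple pre-publication check. The following is a minimal sketch, not a method used by UK Biobank or GitHub: the extension map, the category names, and the functions `classify` and `summarize` are all hypothetical, and a real project would need its own list.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Hypothetical extension-to-category map (an assumption for illustration).
# Note: ".bed" is also used for genomic interval files, so a match is only
# a prompt for manual review, not proof that participant data is present.
RISKY_EXTENSIONS = {
    ".ipynb": "notebook",   # Jupyter notebook
    ".rmd": "notebook",     # R Markdown notebook
    ".bed": "genotype",     # PLINK binary genotype file
    ".bim": "genotype",     # PLINK variant information
    ".fam": "genotype",     # PLINK sample information
    ".bgen": "genotype",    # BGEN genotype file
}


def classify(path: str) -> Optional[str]:
    """Return the risk category for a path, or None if it is not flagged."""
    suffix = PurePosixPath(path).suffix.lower()
    return RISKY_EXTENSIONS.get(suffix)


def summarize(paths: Iterable[str]) -> Counter:
    """Count flagged files per category across a list of repository paths."""
    counts: Counter = Counter()
    for path in paths:
        category = classify(path)
        if category is not None:
            counts[category] += 1
    return counts
```

Running `summarize` over the output of a file listing before pushing gives a quick count of files that deserve a second look. It checks file names only, so it cannot detect data pasted into an ordinary script.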
The reliance on copyright takedowns exposes a regulatory blind spot: privacy breaches are being policed through a mechanism designed for intellectual-property disputes. This mismatch can lead to over-removal of legitimate research code, chill open-source contributions, and create uncertainty for institutions that host collaborative notebooks. Policymakers and research funders may need to develop a dedicated, cross-jurisdictional framework that balances participant confidentiality with the benefits of reproducible science. Until such safeguards exist, organizations like UK Biobank will likely continue to weaponize DMCA notices, prompting the research community to adopt stricter data-handling protocols before publishing code.
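One example of such a protocol is clearing saved cell outputs from Jupyter notebooks before committing them, since rendered tables are a common way for rows of data to end up in a repository. The sketch below assumes the nbformat 4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the function names `strip_outputs` and `strip_file` are hypothetical and not part of any UK Biobank guidance.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's dict untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered tables, plots, printed rows
            cell["execution_count"] = None  # reset the execution counter
    return cleaned


def strip_file(path: str) -> None:
    """Rewrite the notebook at `path` in place without saved outputs."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(strip_outputs(notebook), handle, indent=1, ensure_ascii=False)
        handle.write("\n")
```

This only removes saved outputs. Values typed directly into a cell's source, and copies already stored in version-control history, are untouched and still need manual review.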