MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content

doi:10.20350/digitalCSIC/17276

Published June 19, 2025 | Version 1.0

Dataset Restricted

1. TU Wien
2. Spanish National Research Council
3. Carlos III University of Madrid

Dataset for MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content

This dataset was created by gathering human-authored corpora from several public health sites and generating additional data with three different LLMs: GPT-4o, Mistral-7B and Llama3-1. It includes biomedical texts in English, Spanish, German and French. The current version contains 50% AI-generated and 50% human-written texts.

The following data sources were used:

Cochrane Library: This is a database of meta-analyses and systematic reviews of updated results of clinical studies. We used abstracts of systematic reviews in all four languages.
European Clinical Trials (EUCT): This agency supervises and evaluates pharmaceutical products of the European Union (EU). We downloaded parallel data from public assessment reports (EPARs) for 12 new medicinal products, and data from clinical trial protocols and eligibility criteria. We ensured the data were published only from January 2025 to date. The goal was to gather data that might not have been used to train the LLMs in our experiments.
European Medicines Agency (EMA): This agency supervises and evaluates pharmaceutical products of the European Union (EU). We downloaded parallel data from public assessment reports (EPARs) for 12 new medicinal products, and data from clinical trial protocols and eligibility criteria. We ensured the data were published only from January 2025 to date. The goal was to gather data that might not have been used to train the LLMs in our experiments.
European Food Safety Authority (EFSA): This website provides a comprehensive range of data about food consumption and chemical/biological monitoring. We selected only the topics relevant to our goals, 51 topics in total. Processing: we manually split articles with a word count above 1350 and manually checked their correctness and alignment in all languages.
European Vaccination Information Portal (EVIP): This portal provides up-to-date information on vaccines and vaccination. The factsheets are available in all languages, with 20 texts in each.
Immunize: Immunize.org (formerly known as the Immunization Action Coalition) is a U.S.-based organization dedicated to providing comprehensive immunization resources for healthcare professionals and the public. Vaccine Information Sheets (VISs) have been translated into several languages, but not every language has all VISs. They are provided as PDFs, with 25 in Spanish, French and English, but only 21 in German. Only PDFs available in all languages were used.
Migration und Gesundheit - German Ministry of Health (BFG): This portal provides multilingual health information tailored for migrants and refugees. Gesundheit für alle is a PDF file that provides a guide to the German healthcare system, and it is available in Spanish, English and German. Processing: two topics shorter than 100 words were each merged with the following topic so that context is preserved.
Orphadata (INSERM): A comprehensive knowledge base about rare diseases and orphan drugs, in re-usable and high-quality formats, released in 12 official EU languages. We gathered definitions, signs and symptoms, and phenotypes for 4389 rare diseases in English, German, Spanish and French. Processing: since the definitions are roughly the same size and share a similar format, we grouped 5 definitions together to make the text per topic longer.
PubMed (National Library of Medicine): We downloaded abstracts available in English, Spanish, French and German.
Wikipedia: A free, web-based, collaborative multilingual encyclopedia project; we selected (bio)medical contents available in English, German, Spanish and French. To ensure that the texts were not automatically generated, we only used articles dated before the release of ChatGPT, i.e. before 30th November 2022. Processing: some data cleaning was necessary; we also removed all topics with fewer than 5 words, and split those with more than 9 sentences into equally long parts. We kept only the resulting parts that contain a minimum of 100 words, and only those contents or topics that exist in all three languages.
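
The length-related processing steps described above (grouping 5 Orphadata definitions per topic, splitting Wikipedia texts with more than 9 sentences into equal parts, and keeping parts with at least 100 words) could be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, and the regular-expression sentence splitter is an assumption, since the description does not say how sentences were detected.

```python
import re
from typing import List


def group_definitions(definitions: List[str], size: int = 5) -> List[str]:
    """Join consecutive definitions in batches of `size` (the last batch may be shorter)."""
    return [" ".join(definitions[i:i + size]) for i in range(0, len(definitions), size)]


def split_long_text(text: str, max_sentences: int = 9) -> List[str]:
    """Split a text with more than `max_sentences` sentences into near-equal parts."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) <= max_sentences:
        return [text.strip()]
    n_parts = -(-len(sentences) // max_sentences)  # ceiling division
    base, extra = divmod(len(sentences), n_parts)
    parts, start = [], 0
    for k in range(n_parts):
        # The first `extra` parts receive one additional sentence.
        end = start + base + (1 if k < extra else 0)
        parts.append(" ".join(sentences[start:end]))
        start = end
    return parts


def has_min_words(text: str, min_words: int = 100) -> bool:
    """Check the minimum length using whitespace-separated tokens as words."""
    return len(text.split()) >= min_words
```

Under these assumptions, a 10-sentence text would be split into two parts of 5 sentences each, and each part would then be kept only if `has_min_words` returns true.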

Description of methods used for collection/generation of data

The corpus statistics and methods are explained in the following article: Patrick Styll, Leonardo Campillos-Llanos, Jorge Fernández-García, Isabel Segura-Bedmar (2025) "MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content".

Methods for processing the data

Web scraping of data from HTML content and PDF files available on the source health websites.
Postprocessing and cleaning of data (e.g., removal of redundant white spaces or line breaks), and homogenization of text length.
Generation of corresponding contents by means of generative AI using three large language models: GPT-4o, Mistral-7B and Llama3-1.
Formatting of contents into JSON format.
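
As an illustration of the cleaning step (removal of redundant white spaces or line breaks), a minimal sketch is shown below. The function name `clean_text` is hypothetical, and collapsing every whitespace run into a single space is an assumption; the actual pipeline may preserve paragraph breaks or apply further rules.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of spaces, tabs and line breaks into single spaces."""
    # \s+ matches any run of whitespace, including newlines from scraped HTML or PDF text.
    return re.sub(r"\s+", " ", raw).strip()
```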

Files

JSON files: These are split into TRAIN and TEST. Each file contains a list of hashes, one per text, and each hash contains the following fields:

text: the textual content.
data_source: the source repository of the text.
filename: the name of the original file from which the data were sourced.
source: label indicating whether the text is human-written (HUMAN) or, if AI-generated, which LLM produced it ("gpt4o", "mistral" or "llama").
language: the language code of the text: German ("de"), English ("en"), Spanish ("es") or French ("fr").
target: a binary label coding whether the text was written by humans ("0") or AI ("1").
ratio: the proportion of the text that was created with AI: "0.5" for AI-generated texts, and "null" for human texts.
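
For readers with access to the files, the record layout listed above could be checked with a short script like the one below. This is a sketch under stated assumptions: each JSON file is assumed to be a top-level list of objects with exactly the field names listed, the helper names are hypothetical, and `target` is converted to a string before counting because the description does not say whether labels are stored as strings or numbers.

```python
import json
from collections import Counter

# Field names taken from the list above.
EXPECTED_FIELDS = {"text", "data_source", "filename", "source",
                   "language", "target", "ratio"}


def validate_records(records):
    """Raise ValueError if any record lacks one of the documented fields."""
    for i, rec in enumerate(records):
        missing = EXPECTED_FIELDS - set(rec)
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
    return records


def load_records(path):
    """Load one split; assumes the file holds a JSON list of objects."""
    with open(path, encoding="utf-8") as fh:
        return validate_records(json.load(fh))


def summarize(records):
    """Count records per language code and per human/AI label."""
    by_language = Counter(rec["language"] for rec in records)
    by_target = Counter(str(rec["target"]) for rec in records)
    return by_language, by_target
```

Calling `summarize(load_records(path))` on a split, where `path` is whatever file name the TRAIN or TEST file actually has, gives a quick check of the language distribution and of the stated 50/50 balance between human-written and AI-generated texts.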

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Accepted: 2025-06-11

Acceptance at CLEF 2025 Conference Track
