Published March 9, 2026 | Version v1
Dataset Open

LongEval 2026 Test Dataset

  • 1. The Open University
  • 2. TH Köln - University of Applied Sciences
  • 3. TU Wien
  • 4. Friedrich-Schiller-Universität Jena
  • 5. University of Stavanger
  • 6. Université Grenoble Alpes

Contributors

  • 1. TU Wien

Description

This dataset serves as the official testing data for the 2026 LongEval Information Retrieval Lab (https://clef-longeval.github.io/), organized at CLEF.

The repository includes the datasets for three subtasks: (1) Task 1 - LongEval-Sci: Ad-Hoc Scientific Retrieval, (2) Task 3 - LongEval-USim: User Simulation, and (3) Task 4 - LongEval-RAG: Retrieval Augmented Generation (RAG).

The testing data for all tasks were collected in two periods: from June to August 2025 and from September to November 2025.

  • Task 1 - LongEval-Sci: Ad-Hoc Scientific Retrieval

The collection consists of queries and documents extracted from the CORE scholarly literature search engine (https://core.ac.uk/). The queries in this collection were issued by CORE users, and a specific pipeline was designed to capture user queries, their corresponding search results, and user interactions. Using this pipeline, the dataset for this task was extracted, consisting of two main components: 

  1. Search Information: unique, anonymous identifiers for individual user sessions, search queries, and the returned results. 
  2. Click Information: unique, anonymous identifiers for individual user sessions, the links clicked within the results list, and the positions of the clicked links. 

The documents in the collection were selected based on the search results of user queries. In addition to these selected documents, the collection also contains randomly chosen documents from the CORE index. 

In total, the collection comprises 381 testing queries. The document set consists of downloaded, cleaned, and filtered scholarly articles, divided into two snapshots: (1) June to August 2025: 1,322,720 documents, and (2) September to November 2025: 1,661,900 documents. The document sets are incremental: each later snapshot contains the documents of the previous snapshot plus additional documents.
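The incremental relationship between the two snapshots can be checked once the document IDs of each snapshot are loaded. The following is a minimal sketch, not part of the official tooling: the helper name `check_incremental` and the toy IDs are hypothetical, and how the IDs are read from the released files is left open because the file layout is not described here. Only the two document counts come from the description above.

```python
# Hypothetical helper: verifies that a later snapshot contains every
# document ID of an earlier one. Loading the IDs from the released
# files is not shown because their layout is not described here.
def check_incremental(earlier_ids, later_ids):
    """Return (is_superset, missing_ids, added_count) for two snapshots."""
    earlier = set(earlier_ids)
    later = set(later_ids)
    missing = earlier - later   # documents that disappeared (should be empty)
    added = later - earlier     # documents new in the later snapshot
    return (not missing, sorted(missing), len(added))


# Document counts as reported in the dataset description.
SNAPSHOT_1_DOCS = 1_322_720  # June to August 2025
SNAPSHOT_2_DOCS = 1_661_900  # September to November 2025

# If the second snapshot fully contains the first, this many documents are new.
expected_new_docs = SNAPSHOT_2_DOCS - SNAPSHOT_1_DOCS

# Toy example with made-up IDs.
is_superset, missing, added_count = check_incremental(
    ["d1", "d2"], ["d1", "d2", "d3"]
)
print(is_superset, missing, added_count, expected_new_docs)
```

Under the stated counts, roughly 339,180 documents would be new in the second snapshot, assuming no documents were removed between snapshots.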

  • Task 3 - LongEval-USim: User Simulation

The dataset consists of: (1) 118 sessions collected from June to August 2025, and (2) 181 sessions collected from September to November 2025. Each session contains the following features:

    - The sequence of queries submitted.

    - Timestamps for query submissions.

    - The top-10 SERP (Search Engine Results Page) retrieved for each query.

    - Which documents were clicked, along with their corresponding click type.
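One way to hold these session features in memory is sketched below. This is an illustration only: the class and field names (`Session`, `QueryEvent`, `Click`, `click_type`, and so on) are assumptions, not the schema of the released files, and the click type value used in the example is made up.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class Click:
    doc_id: str
    click_type: str  # hypothetical label; the real click types are not listed here


@dataclass
class QueryEvent:
    query: str
    timestamp: datetime
    serp: List[str]  # ranked document IDs, at most 10 (top-10 SERP)
    clicks: List[Click] = field(default_factory=list)

    def __post_init__(self):
        if len(self.serp) > 10:
            raise ValueError("a SERP holds at most 10 documents")


@dataclass
class Session:
    session_id: str
    events: List[QueryEvent]

    def clicked_ranks(self) -> List[Tuple[str, str, int]]:
        """Return (query, doc_id, 1-based rank) for each click on a SERP document."""
        ranks = []
        for event in self.events:
            for click in event.clicks:
                if click.doc_id in event.serp:
                    rank = event.serp.index(click.doc_id) + 1
                    ranks.append((event.query, click.doc_id, rank))
        return ranks


# Toy session with made-up values.
session = Session(
    session_id="s1",
    events=[
        QueryEvent(
            query="neural ranking",
            timestamp=datetime(2025, 7, 1, 12, 0, 0),
            serp=["a", "b", "c"],
            clicks=[Click(doc_id="b", click_type="example")],
        )
    ],
)
print(session.clicked_ranks())
```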

  • Task 4 - LongEval-RAG: Retrieval Augmented Generation (RAG)

The dataset contains 47 queries, each paired with a corresponding set of document IDs. Each query includes the following elements:

    - A question related to research papers available on the CORE website.

    - A set of 10 document IDs that participants must use to justify their answer, selecting at least two IDs.
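The constraint on justifying documents can be checked before submission. The sketch below assumes an answer carries a list of cited document IDs; the function name `validate_citations` and the ID strings are hypothetical, and the official submission format may differ.

```python
# Hypothetical pre-submission check: every cited ID must come from the
# query's provided set, and at least two distinct valid IDs must be cited.
def validate_citations(candidate_ids, cited_ids, min_citations=2):
    """Return a list of problems; an empty list means the citations look valid."""
    candidates = set(candidate_ids)
    unique_cited = list(dict.fromkeys(cited_ids))  # drop duplicates, keep order
    unknown = [doc_id for doc_id in unique_cited if doc_id not in candidates]
    valid = [doc_id for doc_id in unique_cited if doc_id in candidates]

    problems = []
    if unknown:
        problems.append("cited IDs outside the provided set: " + ", ".join(unknown))
    if len(valid) < min_citations:
        problems.append(
            f"only {len(valid)} valid cited ID(s); at least {min_citations} required"
        )
    return problems


# Toy example with made-up IDs standing in for the 10 provided documents.
provided = [f"doc{i}" for i in range(10)]
print(validate_citations(provided, ["doc1", "doc4"]))
print(validate_citations(provided, ["doc1", "doc1"]))
```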

Files

About_longeval_test_2026.pdf

Files (23.0 GiB)

Checksum | Size
md5:9b794710956b723e60f61b15ea8f7d64 | 178.2 KiB
md5:09bbdee27604ac0c48e5ce0a9ba27266 | 901.2 MiB
md5:3c8cbcc9c5bc587940678a405bbe01c5 | 9.6 GiB
md5:02b3cdc3f96257b2e301d8a1772aa194 | 1.1 GiB
md5:3aca66a46024a6c1013d892afa0b6ea7 | 11.4 GiB
md5:486c9157b198f2cc815f6b6d97b0b378 | 26.0 KiB
md5:226b73be77adda64503d72fef9b42fca | 69.5 KiB
md5:b5ff2c479a1eb8af5f22f9a380f98630 | 87.5 KiB
md5:80d43ac6056622812c2596655a206b49 | 13.8 KiB
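Because several of the files are multiple gigabytes, it may be worth verifying each download against its listed MD5 checksum. The sketch below uses Python's standard `hashlib` and reads the file in chunks to keep memory use low. The helper names and the local path in the usage comment are hypothetical; which file name each checksum belongs to must be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local path; the checksum is one of those listed above):
# matches_checksum("downloads/some_file", "md5:80d43ac6056622812c2596655a206b49")
```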