ORCAS-I

doi:10.48436/pp7xz-n9a06

Published April 22, 2022 | Version 1.0.0

Dataset Open

1. TU Wien, Vienna, Austria
2. Spinque, Utrecht, The Netherlands
3. Radboud University, Nijmegen, The Netherlands

ORCAS-I is a version of the ORCAS dataset (Craswell et al., 2020) annotated with user intents using weak supervision. It can be used to train classifiers on several types of user intent. The top-level intents follow Broder's (2002) classification: informational, navigational, and transactional. We refined this classification by adding two subcategories within the informational category: factual and instrumental. If a query did not receive either label within the informational category, it was classified as abstain.

ORCAS-I consists of the following files:

a) ORCAS-I-18M.tsv

The complete ORCAS dataset, which contains more than 18 million unique query-URL pairs.

dataset size: 18,823,602
unique queries: 10,405,339
unique URLs: 1,422,029
unique domains: 241,199

b) ORCAS-I-2M.tsv

A 2M subset of ORCAS-I-18M.tsv that we used for our experiments with different machine learning algorithms.

dataset size: 2,000,000
unique queries: 1,796,652
unique URLs: 618,679
unique domains: 126,001

Both ORCAS-I-18M and ORCAS-I-2M contain the following columns:

qid: the id of the query
query: the text of the query
url: the URL that the user clicked
did: the ID of the TREC Deep Learning Track document that the URL leads to
level_1: first level of annotation, with three top-level categories: informational, navigational, and transactional
level_2: second level of annotation, which only distinguishes the factual and instrumental subcategories; all other intents in this column are marked abstain
label: final intent label, covering the informational, navigational, and transactional categories as well as the factual, instrumental, and abstain subcategories
data_split: either 'train' or 'validation', corresponding to the split used in the original experiments

You can train your classifier either on the three top-level categories (column 'level_1') or on the full taxonomy (column 'label').
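As a minimal sketch of how the target column and the data_split column could be used together, the snippet below groups (query, target) pairs by split. It assumes the TSV files start with a header row that uses the column names listed above, which is not stated in this record; the helper name load_by_split and the two sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def load_by_split(tsv_file, target_column="label"):
    """Group (query, target) pairs by the value of the data_split column.

    Assumes the first row of the file is a header with the column names
    described in this record (an assumption, not confirmed by the source).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    splits = defaultdict(list)
    for row in reader:
        splits[row["data_split"]].append((row["query"], row[target_column]))
    return dict(splits)


# Tiny in-memory stand-in for ORCAS-I-2M.tsv; these rows are invented.
sample = io.StringIO(
    "qid\tquery\turl\tdid\tlevel_1\tlevel_2\tlabel\tdata_split\n"
    "1\thow to tie a tie\thttps://example.com/a\tD1\t"
    "informational\tinstrumental\tinstrumental\ttrain\n"
    "2\tfacebook login\thttps://example.com/b\tD2\t"
    "navigational\tabstain\tnavigational\tvalidation\n"
)

# Train on the three top-level categories by selecting 'level_1'.
coarse = load_by_split(sample, target_column="level_1")
```

For a real file, the same function could be called on open(path, newline="", encoding="utf-8"), passing target_column="label" to use the full taxonomy instead.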

c) ORCAS-I-gold.tsv

This is the test file. It contains 1,000 queries randomly selected from the full dataset and excluded from the 2M sample. These queries were manually annotated by two IR specialists.

dataset size: 1,000
unique queries: 1,000
unique URLs: 995
unique domains: 700

ORCAS-I-gold contains the following columns:

qid: the id of the query
query: the text of the query
url: the URL that the user clicked
did: the ID of the TREC Deep Learning Track document that the URL leads to
label_manual: the manually annotated intent
data_split: always equal to 'test'
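One way the gold file could be used is to score a classifier's predictions against the label_manual column. The sketch below computes plain accuracy; the function names, the trivial baseline predictor, and the two gold rows are hypothetical, and the label strings are placeholders rather than values copied from the file.

```python
def accuracy(gold_rows, predict):
    """Fraction of rows whose predicted intent equals label_manual.

    gold_rows: iterable of dicts with 'query' and 'label_manual' keys.
    predict:   callable mapping a query string to an intent label.
    """
    rows = list(gold_rows)
    if not rows:
        raise ValueError("gold_rows is empty")
    correct = sum(
        1 for row in rows if predict(row["query"]) == row["label_manual"]
    )
    return correct / len(rows)


def always_navigational(query):
    """Hypothetical baseline that ignores the query text."""
    return "navigational"


# Invented rows standing in for ORCAS-I-gold.tsv.
gold = [
    {"query": "facebook login", "label_manual": "navigational"},
    {"query": "how to tie a tie", "label_manual": "instrumental"},
]

score = accuracy(gold, always_navigational)  # 0.5 on these two rows
```

Accuracy is used here only because it is the simplest metric to show; per-class precision and recall may be more informative when the intent classes are imbalanced.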

Files

Files (2.5 GiB)

Name	MD5	Size
ORCAS-I-18M.tsv	b28684059bba8cfd2303fe28dc983e9f	2.3 GiB
ORCAS-I-2M.tsv	3de30cd4e716f9b4049760f022374c0f	250.2 MiB
ORCAS-I-gold.tsv	d7634ed17e676afb015ae16d867a0809	104.0 KiB
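The MD5 values in the listing can be used to check that a download completed without corruption. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and verify are illustrative, and the files are assumed to be saved locally under their listed names.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "ORCAS-I-18M.tsv": "b28684059bba8cfd2303fe28dc983e9f",
    "ORCAS-I-2M.tsv": "3de30cd4e716f9b4049760f022374c0f",
    "ORCAS-I-gold.tsv": "d7634ed17e676afb015ae16d867a0809",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Compare a local file against the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, verify("ORCAS-I-gold.tsv", "ORCAS-I-gold.tsv") would return True for an intact copy. Reading in chunks keeps memory use low for the 2.3 GiB file.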

Additional details

Continues: Preprint: arXiv:2006.05324 (arXiv); Conference Paper: 10.1145/792550.792552 (DOI)
Is supplement to: Conference Paper: 10.1145/3477495.3531737 (DOI)
