Download Data

Load text data from ArXiv, Common Crawl, Wikipedia, and custom sources using Curator.

Curator provides a task-centric pipeline for downloading and processing large-scale public text datasets. It runs on Ray and converts raw formats like Common Crawl’s .warc.gz into JSONL.

How it Works

Curator uses a 4-step pipeline pattern where data flows through stages as tasks. Each step uses a ProcessingStage that transforms tasks according to Curator’s pipeline-based architecture .

Data sources provide composite stages that combine these steps into complete download-and-extract pipelines, producing DocumentBatch tasks for further processing.


Data Sources & File Formats

Load data from public datasets and custom data sources using Curator stages.