Read Existing Data

Use Curator’s JsonlReader and ParquetReader to read existing datasets into a pipeline, then optionally add processing stages.

Reader Configuration

Common Parameters

Both JsonlReader and ParquetReader support these configuration options:

Parameter            Type                   Description                                                                         Default
file_paths           str | list[str]        File paths or glob patterns to read                                                 Required
files_per_partition  int | None             Number of files per partition. Overrides blocksize if both are provided.            None
blocksize            int | str | None       Target partition size (e.g., "128MB"). Ignored if files_per_partition is provided.  None
fields               list[str] | None       Column names to read (column selection)                                             None (all columns)
read_kwargs          dict[str, Any] | None  Extra arguments for the underlying reader                                           None
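
The precedence between files_per_partition and blocksize can be illustrated with a small standalone sketch. This is not Curator's implementation: parse_size and plan_partitions are hypothetical helpers, the decimal interpretation of "MB" is an assumption, and so is the one-file-per-partition fallback when neither option is set.

```python
import re

# Hypothetical helpers, not part of Curator. They only illustrate the
# documented rule that files_per_partition overrides blocksize.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}  # assumption: decimal units


def parse_size(size):
    """Convert an int or a string such as "128MB" to a byte count."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", size.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def plan_partitions(file_sizes, files_per_partition=None, blocksize=None):
    """Group (path, size_in_bytes) pairs into lists of paths, one per partition."""
    paths = [path for path, _ in file_sizes]
    if files_per_partition is not None:
        # files_per_partition wins; blocksize is ignored even if given.
        return [paths[i:i + files_per_partition]
                for i in range(0, len(paths), files_per_partition)]
    if blocksize is None:
        # Assumption for this sketch only: fall back to one file per partition.
        return [[path] for path in paths]
    limit = parse_size(blocksize)
    partitions, current, current_bytes = [], [], 0
    for path, size in file_sizes:
        # Start a new partition when adding this file would exceed the target.
        if current and current_bytes + size > limit:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions


files = [("a.jsonl", 60_000_000), ("b.jsonl", 60_000_000), ("c.jsonl", 60_000_000)]
by_size = plan_partitions(files, blocksize="128MB")
by_count = plan_partitions(files, files_per_partition=1, blocksize="128MB")
print(by_size)   # two files fit under the 128 MB target, the third starts a new partition
print(by_count)  # blocksize is ignored because files_per_partition is set
```

The greedy packing used here is only one plausible way to honor a target size; the point of the sketch is the order in which the two options are checked.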

Parquet-Specific Features

ParquetReader provides these optimizations:

  • PyArrow Engine: Uses the pyarrow engine by default for better performance
  • Storage Options: Supports cloud storage via storage_options in read_kwargs
  • Schema Handling: Automatic schema inference and validation
  • Columnar Efficiency: Optimized for reading specific columns
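
As a rough illustration of how these options fit together, the keyword arguments for a ParquetReader reading from cloud storage might be assembled as below. The bucket path and column names are hypothetical, and the sketch assumes storage_options is forwarded to an fsspec-style filesystem, where "anon" is the s3fs option for unauthenticated access.

```python
# Keyword arguments intended for ParquetReader, built as a plain dict so the
# configuration can be inspected or logged before the reader is created.
parquet_reader_args = {
    "file_paths": "s3://example-bucket/dataset/*.parquet",  # hypothetical bucket
    "fields": ["text", "url"],  # hypothetical column names; only these are read
    "read_kwargs": {
        # Assumption: forwarded to an fsspec-style filesystem such as s3fs.
        "storage_options": {"anon": True},
    },
}

# With Curator installed, the dict would be unpacked into the reader:
#   reader = ParquetReader(**parquet_reader_args)
print(parquet_reader_args["read_kwargs"])
```

Keeping credentials and other storage settings inside read_kwargs means the common parameters stay the same whether the files are local or remote.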

Performance Tips

  • Use the fields parameter to read only the columns you need; skipping unused columns reduces I/O and memory use
  • Set files_per_partition based on your cluster size and memory constraints
  • Use blocksize for fine-grained control over partition sizes
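
One way to turn the files_per_partition tip into a number is a back-of-the-envelope calculation like the one below. This is a rule of thumb, not guidance from Curator: suggest_files_per_partition and all of its parameters are hypothetical, and on-disk file size is used as a stand-in for in-memory size, which can be considerably larger for compressed formats.

```python
def suggest_files_per_partition(avg_file_bytes, worker_memory_bytes,
                                num_files, num_workers, memory_fraction=0.25):
    """Rule-of-thumb estimate; every parameter here is hypothetical.

    Caps a partition at a fraction of one worker's memory, then shrinks it
    if needed so that there is at least one partition per worker.
    """
    by_memory = int(worker_memory_bytes * memory_fraction) // avg_file_bytes
    by_parallelism = num_files // num_workers
    return max(1, min(by_memory, by_parallelism))


# 1,000 files of about 200 MB each, 8 workers with 16 GB of memory each.
suggested = suggest_files_per_partition(
    avg_file_bytes=200 * 10**6,
    worker_memory_bytes=16 * 10**9,
    num_files=1000,
    num_workers=8,
)
print(suggested)  # 20: the memory cap (20 files) is tighter than the parallelism cap (125)
```

Treat the result as a starting point and adjust it after observing actual memory use on your cluster.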

Output Integration

Both readers produce DocumentBatch tasks that integrate seamlessly with:

  • Processing Stages: Apply filters, transformations, and quality checks
  • Writer Stages: Export to JSONL, Parquet, or other formats
  • Analysis Tools: Convert to Pandas/PyArrow for inspection and debugging