Quality Assessment & Filtering

Use NeMo Curator’s heuristic filters and ML classifiers to score documents and remove low-quality content before model training.

Large datasets often contain many documents considered “low quality.” In this context, “low quality” means data we do not want downstream models to learn from, and “high quality” is data we do want them to learn from. The metrics that define quality can vary widely.

How It Works

NeMo Curator’s filtering framework is built around several key components that work within the data processing architecture.


Filtering Approaches

Usage

NeMo Curator provides programmatic interfaces for document filtering through the Pipeline framework:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.heuristic import WordCountFilter

# Create and configure pipeline
pipeline = Pipeline(name="document_filtering")

# Add data loading
reader = JsonlReader(
    file_paths="/path/to/input/data/*.jsonl",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

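# Add filtering stage: score each document by word count, store the score
# in the "word_count" field, and drop documents below the 80-word minimum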
filter_stage = ScoreFilter(
    filter_obj=WordCountFilter(min_words=80),
    text_field="text",
    score_field="word_count"
)
pipeline.add_stage(filter_stage)

# Add output stage
writer = JsonlWriter(path="/path/to/output/filtered/")
pipeline.add_stage(writer)

# Execute pipeline (uses XennaExecutor by default)
results = pipeline.run()
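
To make the score-then-filter pattern used by the filtering stage above more concrete, here is a minimal, library-free sketch. It is not NeMo Curator's implementation: the names `word_count` and `score_and_filter` are hypothetical, whitespace tokenization is an assumption, and the 80-word threshold is treated as inclusive. Each document is scored, the score is stored under a score field, and only documents that meet the threshold are kept.

```python
from typing import Callable


def word_count(text: str) -> int:
    """Hypothetical scorer: count whitespace-separated tokens."""
    return len(text.split())


def score_and_filter(
    docs: list[dict],
    score_fn: Callable[[str], int],
    keep_fn: Callable[[int], bool],
    text_field: str = "text",
    score_field: str = "word_count",
) -> list[dict]:
    """Score every document, record the score, keep those that pass keep_fn."""
    kept = []
    for doc in docs:
        score = score_fn(doc[text_field])
        if keep_fn(score):
            # Copy the document and attach its score under score_field
            kept.append({**doc, score_field: score})
    return kept


docs = [
    {"id": "a", "text": "too short"},
    {"id": "b", "text": "word " * 100},  # 100 whitespace-separated tokens
]

# Assumed rule for illustration: keep documents with at least 80 words
filtered = score_and_filter(docs, word_count, lambda n: n >= 80)
print([d["id"] for d in filtered])
```

Keeping the score alongside the text makes it possible to inspect the score distribution later and adjust the threshold without recomputing it.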

Best Practices

When filtering large datasets, consider these performance tips:

  1. Order matters: Place computationally inexpensive filters early in your pipeline so that expensive filters only process documents that pass the cheap checks
  2. Batch size tuning: Adjust batch sizes based on your hardware capabilities
  3. Use vectorization: Implement batched methods for compute-intensive filters
  4. Disk I/O: Consider compression and chunking strategies for large datasets
  5. Distributed processing: For TB-scale datasets, use distributed filtering with the XennaExecutor
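
The effect of filter ordering (tip 1) can be illustrated with a small, library-free sketch. Everything here is hypothetical and independent of NeMo Curator: `cheap_length_check` stands in for an inexpensive heuristic, and `ExpensiveScorer` stands in for a costly classifier whose calls are counted. Both orderings keep the same documents, but running the cheap check first reduces how often the expensive one is invoked.

```python
from typing import Callable


def cheap_length_check(doc: dict) -> bool:
    """Hypothetical inexpensive heuristic: require at least 5 words."""
    return len(doc["text"].split()) >= 5


class ExpensiveScorer:
    """Hypothetical stand-in for a costly classifier; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, doc: dict) -> bool:
        self.calls += 1
        text = doc["text"]
        letters = sum(ch.isalpha() for ch in text)
        # Keep documents where at least 60% of characters are letters
        return letters / max(len(text), 1) >= 0.6


def run_filters(docs: list[dict], filters: list[Callable[[dict], bool]]) -> list[dict]:
    """Keep documents that pass every filter, evaluated in list order."""
    # all() stops at the first filter that returns False,
    # so later filters never see documents rejected earlier.
    return [doc for doc in docs if all(f(doc) for f in filters)]


docs = [
    {"id": 1, "text": "short"},
    {"id": 2, "text": "12345 67890 12345 67890 12345 67890"},
    {"id": 3, "text": "this document has enough ordinary words to pass"},
    {"id": 4, "text": "hi"},
]

cheap_first = ExpensiveScorer()
kept_cheap_first = run_filters(docs, [cheap_length_check, cheap_first])

expensive_first = ExpensiveScorer()
kept_expensive_first = run_filters(docs, [expensive_first, cheap_length_check])

print("cheap first:", cheap_first.calls, "expensive calls")
print("expensive first:", expensive_first.calls, "expensive calls")
```

On a real dataset the savings depend on how many documents the inexpensive filters reject, so it is worth measuring rejection rates on a sample before fixing the stage order.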