Classifier-Based Filtering

Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.

How It Works

Unlike heuristic filtering, which relies on predefined rules and thresholds, classifier-based filtering learns the characteristics of high-quality documents from training data. This approach is particularly effective when:

  • You have a reference dataset of known high-quality documents
  • The distinction between high and low quality is complex or subtle
  • You want to filter based on domain-specific characteristics

NVIDIA NeMo Curator uses fastText to implement classifier-based filtering. fastText trains and scores text quickly on CPU, which makes it practical for text classification over large corpora.

Note: "fastText" (lowercase "f", capital "T") is the official name and capitalization of the library created by Facebook Research.

The classifier-based filtering process involves:

  1. Preparing training data by sampling from high-quality and low-quality datasets
  2. Training a binary skip-gram classifier using fastText
  3. Using the trained model to score documents in your dataset
  4. Filtering documents based on the classifier scores, optionally using Pareto-based sampling
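
Steps 1 and 2 can be illustrated with a minimal sketch of how training data might be laid out for fastText's supervised mode, which expects one example per line with a label prefix. The `__label__hq` label matches the default named later in this document; the `__label__lq` label and both helper functions are hypothetical names chosen for illustration, not part of NeMo Curator.

```python
import random

HQ_LABEL = "__label__hq"  # default high-quality label named in this document
LQ_LABEL = "__label__lq"  # assumed name for the low-quality label


def to_fasttext_line(text: str, label: str) -> str:
    """Format one document as a single fastText supervised-training line."""
    # fastText reads one example per line, so collapse every whitespace run
    # (including newlines) into a single space.
    flat = " ".join(text.split())
    return f"{label} {flat}"


def build_training_lines(hq_docs, lq_docs, n_per_class, seed=42):
    """Sample up to n_per_class documents from each pool and label them."""
    rng = random.Random(seed)
    hq = rng.sample(hq_docs, min(n_per_class, len(hq_docs)))
    lq = rng.sample(lq_docs, min(n_per_class, len(lq_docs)))
    lines = [to_fasttext_line(d, HQ_LABEL) for d in hq]
    lines += [to_fasttext_line(d, LQ_LABEL) for d in lq]
    rng.shuffle(lines)  # avoid all examples of one class coming first
    return lines
```

Writing these lines to a text file gives an input that the fasttext Python package should accept through `fasttext.train_supervised(input=path)`. Sampling the same number of documents from each pool keeps the two classes balanced, which is a design choice of this sketch rather than a requirement stated by the source.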

Usage

NeMo Curator provides two approaches for quality assessment:

  1. Classification: Use QualityClassifier to add quality predictions and optionally filter during classification
  2. Filtering: Use FastTextQualityFilter with ScoreFilter for document-level filtering with Pareto sampling

Note: If you need to train custom fastText models for specific domains or requirements, refer to the fastText documentation for comprehensive training guides.

Quality Classifier and Filter Parameters

QualityClassifier (DeBERTa)

The QualityClassifier accepts the following parameters:

  • filter_by (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
  • model_inference_batch_size (int, default=256): Batch size for inference
  • max_chars (int, default=6000): Maximum number of characters per document passed to the model
  • label_field (str, default="quality_pred"): Name of the prediction column
  • text_field (str, default="text"): Name of the text field in input data
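
To make the roles of these parameters concrete, here is a small plain-Python sketch of the behavior they describe. It does not call NeMo Curator or DeBERTa: `classify_and_filter` and `toy_predict` are hypothetical stand-ins, and the assumption that `max_chars` truncates the text before prediction is inferred from the parameter description above.

```python
def toy_predict(text: str) -> str:
    """Hypothetical stand-in for the real model: longer text scores higher."""
    return "High" if len(text) > 10 else "Low"


def classify_and_filter(records, predict=toy_predict, filter_by=None,
                        max_chars=6000, label_field="quality_pred",
                        text_field="text"):
    """Add a prediction to each record and optionally keep only some labels."""
    out = []
    for rec in records:
        # Assumption: only the first max_chars characters reach the model.
        label = predict(rec[text_field][:max_chars])
        # filter_by=None keeps every record; otherwise keep listed labels only.
        if filter_by is None or label in filter_by:
            out.append({**rec, label_field: label})
    return out
```

Calling `classify_and_filter(records, filter_by=["High"])` would keep only records predicted as "High", while the default `filter_by=None` annotates every record without dropping any.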

FastTextQualityFilter

The FastTextQualityFilter accepts the following parameters:

  • model_path (str, required): Path to the trained fastText model file
  • label (str, default="__label__hq"): The label for high-quality documents
  • alpha (float, default=3): Alpha parameter for Pareto distribution sampling
  • seed (int, default=42): Random seed for reproducible sampling
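
The effect of alpha and seed can be sketched with the Pareto-based rule described in Brown et al., 2020, where a document is kept when a Pareto draw exceeds one minus its high-quality score. This is a standard-library illustration only; whether FastTextQualityFilter uses exactly this comparison is an assumption, and `keep_document` and `filter_documents` are hypothetical names.

```python
import random


def keep_document(score: float, alpha: float = 3.0, rng=random) -> bool:
    """Decide whether to keep a document given its score in [0, 1]."""
    # random.paretovariate(alpha) returns values >= 1; subtracting 1 shifts
    # the draw to start at 0, the convention used by numpy.random.pareto.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


def filter_documents(scored_docs, alpha=3.0, seed=42):
    """Keep (doc, score) pairs that pass the Pareto test, reproducibly."""
    rng = random.Random(seed)  # a fixed seed makes the sampling repeatable
    return [doc for doc, score in scored_docs
            if keep_document(score, alpha, rng)]
```

Under this rule the probability of keeping a document is (2 - score)^(-alpha), so with alpha = 3 a document scored 0.0 survives about 12.5% of the time and a document scored 1.0 almost always survives. Raising alpha makes the filter stricter toward low-scoring documents, while the randomness still lets some of them through, which preserves diversity in the filtered set.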

Best Practices

For effective classifier-based filtering:

  1. Model selection: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
  2. Validation: Manually review a sample of filtered results to confirm effectiveness
  3. Quality level tuning: Adjust filter_by levels (DeBERTa) or alpha values (fastText) based on your quality requirements
  4. Batch size optimization: Tune model_inference_batch_size for DeBERTa models based on your available memory
  5. Combination with heuristics: Consider using heuristic filters as a pre-filter to improve efficiency
  6. Domain adaptation: For specialized corpora, consider training custom models using domain-specific data
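
The pre-filtering advice in item 5 can be sketched as follows: a cheap rule runs first, so the more expensive classifier only scores documents that pass it. The minimum-word rule, `passes_heuristics`, and `prefilter_then_score` are hypothetical examples and not NeMo Curator functions; `score_fn` stands in for any quality scorer.

```python
def passes_heuristics(text: str, min_words: int = 5) -> bool:
    """Hypothetical cheap rule: require a minimum number of words."""
    return len(text.split()) >= min_words


def prefilter_then_score(docs, score_fn, min_words=5):
    """Apply the cheap rule first, then score only the surviving documents."""
    scored = []
    for doc in docs:
        if not passes_heuristics(doc, min_words):
            continue  # rejected documents never reach the classifier
        scored.append((doc, score_fn(doc)))
    return scored
```

In this arrangement the classifier's cost scales with the number of documents that survive the heuristic rather than with the full corpus size.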