NeMo Curator
Latest · v1.2.0 (26.04)

Integrations

This section is currently being updated. Integration guides will be available in future releases.
