Fuzzy Duplicate Removal
Find and remove near-duplicate documents that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold, at scale, on GPUs.
For other approaches, refer to Deduplication.
How It Works
Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:
- Computes MinHash signatures over character n-grams
- Uses Locality Sensitive Hashing (LSH) to find candidate matches
- Builds a graph of duplicate relationships
- Identifies groups of near-duplicate documents
This approach is ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
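The four steps above can be illustrated with a small CPU-only sketch in pure Python. It is not the NeMo Curator implementation: the helper names (`char_ngrams`, `minhash_signature`, `lsh_candidate_pairs`, `connected_components`), the use of `blake2b` as the hash family, and the tiny parameter values are all assumptions chosen to keep the example readable.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of overlapping character n-grams in text."""
    return {text[i : i + n] for i in range(max(len(text) - n + 1, 1))}


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, n: int, num_hashes: int) -> list[int]:
    """Step 1: keep the minimum hash of the n-gram set under each hash function."""
    grams = char_ngrams(text, n)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures: dict[str, list[int]], num_bands: int, rows_per_band: int) -> set[tuple[str, str]]:
    """Step 2: documents whose signatures agree on every row of any band become candidates."""
    pairs = set()
    for band in range(num_bands):
        start = band * rows_per_band
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[start : start + rows_per_band])].append(doc_id)
        for members in buckets.values():
            pairs.update(combinations(sorted(members), 2))
    return pairs


def connected_components(doc_ids, pairs) -> list[set[str]]:
    """Steps 3 and 4: treat pairs as graph edges and group documents with union-find."""
    parent = {d: d for d in doc_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for d in parent:
        groups[find(d)].add(d)
    return [g for g in groups.values() if len(g) > 1]


# Toy input: "a" and "b" differ only in the final character; "c" is unrelated.
docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank today.",
    "b": "The quick brown fox jumps over the lazy dog near the river bank today!",
    "c": "Zebras graze calmly; zookeepers watch from afar.",
}
# n=5 suits these short strings; the workflow default is char_ngrams=24.
signatures = {k: minhash_signature(v, n=5, num_hashes=16) for k, v in docs.items()}
pairs = lsh_candidate_pairs(signatures, num_bands=8, rows_per_band=2)
groups = connected_components(docs, pairs)
print(groups)
```

The workflow performs the same logical steps on GPU across a Ray cluster, writing each stage's intermediate output to the cache directory.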
Before You Start
Prerequisites:
- Ray cluster with GPU support (required for distributed processing)
- Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)
Running in Docker: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that Ray workers can access the GPU. Without GPU access, you may see CUDARuntimeError or AttributeError: 'CUDARuntimeError' object has no attribute 'msg'. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.
Quick Start
Get started with fuzzy deduplication using the following example, which first identifies duplicates and then removes them:
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
ray_client = RayClient()
ray_client.start()
# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
result = fuzzy_workflow.run()
# result.metadata contains: total_time, num_duplicates, minhash_time, lsh_time, connected_components_pipeline_time, id_generator_path
# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json",
)
result = removal_workflow.run()
# result.metadata contains: total_time, num_duplicates_removed
Configuration
Configure fuzzy deduplication using these key parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_path | str \| list[str] | None | Path(s) to input files or directories |
| cache_path | str | Required | Directory to cache intermediate results |
| output_path | str | Required | Directory to write duplicate IDs and ID generator |
| text_field | str | "text" | Name of the text field in input data |
| char_ngrams | int | 24 | Character n-gram size for MinHash (recommended: >= 20) |
| num_bands | int | 20 | Number of LSH bands (affects similarity threshold) |
| minhashes_per_band | int | 13 | Number of hashes per LSH band |
| bands_per_iteration | int | 5 | Bands processed per iteration (memory tuning) |
| use_64_bit_hash | bool | False | Use 64-bit hash (more memory, fewer collisions) |
| seed | int | 42 | Random seed for MinHash permutations |
| input_filetype | str | "parquet" | Input file format ("parquet" or "jsonl") |
| input_blocksize | str \| int | "1GiB" | Size of input blocks for processing |
| lsh_num_output_partitions | int \| None | None | Total number of partitions to write during the LSH shuffle. If None, the partition count is chosen automatically as the closest power of 2 <= the number of input tasks. |
| lsh_rmm_pool_size | int \| "auto" \| None | "auto" | Size of the RMM GPU memory pool in bytes for the LSH stage. "auto" sets the pool to 90% of free GPU memory. None sets the pool to 50% of free GPU memory and allows expansion. |
| lsh_spill_memory_limit | int \| "auto" \| None | "auto" | Device memory limit in bytes for spilling to host during the LSH stage. "auto" sets the limit to 80% of the RMM pool size. None disables spilling. |
| perform_removal | bool | False | Reserved; must remain False. Fuzzy removal is performed with TextDuplicatesRemovalWorkflow. |
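The automatic values described in the table for the LSH parameters reduce to simple arithmetic. The sketch below only mirrors that arithmetic; the function names are hypothetical, and the library's internal rounding and its way of measuring free GPU memory may differ.

```python
def largest_power_of_two_at_most(num_input_tasks: int) -> int:
    """Partition count when lsh_num_output_partitions is None (per the table's description)."""
    if num_input_tasks < 1:
        raise ValueError("num_input_tasks must be at least 1")
    return 1 << (num_input_tasks.bit_length() - 1)


def auto_lsh_memory_settings(free_gpu_bytes: int) -> tuple[int, int]:
    """Pool size and spill limit when both LSH memory parameters are "auto"."""
    pool_size = free_gpu_bytes * 90 // 100  # 90% of free GPU memory
    spill_limit = pool_size * 80 // 100     # 80% of the RMM pool size
    return pool_size, spill_limit


partitions = largest_power_of_two_at_most(100)
pool_size, spill_limit = auto_lsh_memory_settings(1000)
print(partitions, pool_size, spill_limit)
```

With both settings on "auto", spilling to host therefore starts at roughly 72% of the GPU memory that was free when the pool was created.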
Similarity Threshold
Control matching strictness with num_bands and minhashes_per_band. Two documents become a candidate pair when all hashes in at least one band match, so adding bands gives a pair more chances to match, while adding hashes per band makes each match harder:
- More strict matching: Decrease num_bands or increase minhashes_per_band
- Less strict matching: Increase num_bands or decrease minhashes_per_band
Default (num_bands=20, minhashes_per_band=13) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.
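The effect of these two parameters can be estimated with the standard LSH banding analysis. This is textbook math rather than anything taken from the NeMo Curator code, and the function names are illustrative. For a pair with Jaccard similarity s, the probability of sharing at least one band is 1 - (1 - s^r)^b, where b is num_bands and r is minhashes_per_band. The similarity at which that curve rises most steeply is approximately (1/b)^(1/r).

```python
def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that a pair with the given Jaccard similarity matches in at least one band."""
    return 1.0 - (1.0 - similarity**minhashes_per_band) ** num_bands


def approximate_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Rule-of-thumb similarity where the detection probability rises most steeply."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


for bands, rows in [(20, 13), (25, 10), (15, 15)]:
    threshold = approximate_threshold(bands, rows)
    p_at_80 = candidate_probability(0.8, bands, rows)
    print(f"num_bands={bands} minhashes_per_band={rows} "
          f"threshold~{threshold:.3f} P(candidate | s=0.8)={p_at_80:.3f}")
```

Under this approximation the defaults correspond to a threshold of roughly 0.79. More bands pull the threshold down, and more hashes per band push it up.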
Custom Similarity Threshold
# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... input_path, cache_path, output_path, and other arguments as in Quick Start
    num_bands=15,  # Fewer bands = stricter matching
    minhashes_per_band=15,  # More hashes per band = stricter matching
)
# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... input_path, cache_path, output_path, and other arguments as in Quick Start
    num_bands=25,  # More bands = less strict matching
    minhashes_per_band=10,  # Fewer hashes per band = less strict matching
)
Removing Duplicates
After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",  # Required if IDs were auto-assigned
)
result = removal_workflow.run()
ID Field Configuration
When IDs were auto-assigned:
- id_generator_path is required
- Ensures consistent ID mapping between identification and removal stages
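Conceptually, removal is an anti-join: every input document whose ID appears in the duplicate ID list is dropped, and the rest pass through unchanged. The sketch below shows only that logic on in-memory records. `remove_duplicates` is a hypothetical helper, not part of the library, and the real workflow operates on Parquet or JSONL files through Ray.

```python
def remove_duplicates(documents: list[dict], ids_to_remove, id_field: str = "_curator_dedup_id") -> list[dict]:
    """Keep only the documents whose ID is not listed for removal."""
    removal_set = set(ids_to_remove)
    return [doc for doc in documents if doc[id_field] not in removal_set]


documents = [
    {"_curator_dedup_id": 0, "text": "original article"},
    {"_curator_dedup_id": 1, "text": "original article "},
    {"_curator_dedup_id": 2, "text": "unrelated article"},
]
kept = remove_duplicates(documents, ids_to_remove=[1])
print([doc["_curator_dedup_id"] for doc in kept])
```

This is why the same ID assignment must be used in both stages: if the removal stage assigned IDs differently, the listed IDs would point at the wrong documents.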
Output Format
The fuzzy deduplication process produces the following directory structure:
cache_path/
├── MinHashStage/ # MinHash signatures
│   └── *.parquet
├── LSHStage/ # LSH buckets
│   └── *.parquet
├── BucketsToEdges/ # Graph edges
│   └── *.parquet
└── ConnectedComponents/ # Connected components
    └── *.parquet
output_path/
├── FuzzyDuplicateIds/ # Duplicate identification results
│   └── *.parquet # Parquet files with document IDs to remove
└── fuzzy_id_generator.json # ID generator mapping (if IDs were auto-assigned)
File Formats
The workflow produces these output files:
- Duplicate IDs (FuzzyDuplicateIds/*.parquet):
  - Contains document IDs to remove
  - Format: Parquet files with column: ["_curator_dedup_id"]
  - Important: Contains only the IDs of documents to remove, not the full document content
- ID Generator (fuzzy_id_generator.json):
  - JSON file containing ID generator state
  - Required for removal workflow when IDs were auto-assigned
  - Ensures consistent ID mapping across workflow stages
- Cache Files (cache_path/):
  - Intermediate results for debugging and analysis
  - Can be reused if re-running with different parameters
  - Clear cache between runs if parameters change significantly
Performance Considerations
Performance characteristics:
- GPU-accelerated MinHash and LSH operations
- Scales across multiple GPUs and nodes using Ray
- bands_per_iteration controls memory usage
- Intermediate results are cached for efficiency
GPU requirements:
- NVIDIA GPU with CUDA support
- Ray cluster with GPU workers
Performance tuning:
- Memory: Adjust bands_per_iteration (lower = less memory, more iterations)
- GPU memory (LSH): Use lsh_rmm_pool_size to control GPU memory allocation and lsh_spill_memory_limit to tune host-spilling behavior during the LSH stage. Reducing the pool size or lowering the spill threshold can prevent out-of-memory errors on smaller GPUs.
- Shuffle partitions: Set lsh_num_output_partitions to control the number of output partitions during the LSH shuffle. More partitions reduce per-partition memory but increase I/O overhead.
- Accuracy: Use char_ngrams >= 20 to reduce false positives
- Best practices: Clear cache between runs, use input_blocksize="1GiB"
Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.
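To make the bands_per_iteration trade-off concrete, the sketch below splits the bands into batches. It assumes bands are processed in contiguous batches, which this page does not spell out, and `band_iterations` is a hypothetical helper rather than a library function.

```python
def band_iterations(num_bands: int, bands_per_iteration: int) -> list[tuple[int, int]]:
    """Return (start, end) band ranges, end exclusive, one per iteration."""
    if bands_per_iteration < 1:
        raise ValueError("bands_per_iteration must be at least 1")
    return [
        (start, min(start + bands_per_iteration, num_bands))
        for start in range(0, num_bands, bands_per_iteration)
    ]


default_batches = band_iterations(num_bands=20, bands_per_iteration=5)
smaller_batches = band_iterations(num_bands=20, bands_per_iteration=2)
print(len(default_batches), len(smaller_batches))
```

With the defaults this yields four iterations of five bands each. Lowering bands_per_iteration to 2 yields ten smaller iterations, trading runtime for a lower peak memory footprint.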
For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview.