This workflow covers the full image curation pipeline on Slurm, including model download, embedding generation, classification, filtering, and deduplication.


For details on image container environments and Slurm environment variables, see Container Environments.

Prerequisites

  • Create required directories for AWS credentials, NeMo Curator configuration, and local workspace:

    mkdir -p "$HOME/.aws"
    mkdir -p "$HOME/.config/nemo_curator"
    mkdir -p "$HOME/nemo_curator_local_workspace"

Prepare configuration files:


Model Download

  1. Copy the following script, which downloads all required image processing models, to the Slurm cluster and save it as 1_curator_download_image_models.sh.

    #!/bin/bash
    
    #SBATCH --job-name=download_image_models
    #SBATCH -p defq
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --exclusive
    #SBATCH --gres=gpu:1
    
    # Update Me! (The log directory must already exist; Slurm does not create it.)
    #SBATCH --output=/home/<username>/logs/%x_%j.log
    USER_DIR="/home/${USER}"
    CONTAINER_IMAGE="${USER_DIR}/path-to/curator.sqsh"
    #
    
    LOCAL_WORKSPACE="${USER_DIR}/nemo_curator_local_workspace"
    LOCAL_WORKSPACE_MOUNT="${LOCAL_WORKSPACE}:/config"
    NEMO_CONFIG_MOUNT="${HOME}/.config/nemo_curator/config.yaml:/nemo_curator/config/nemo_curator.yaml"
    CONTAINER_MOUNTS="${LOCAL_WORKSPACE_MOUNT},${NEMO_CONFIG_MOUNT}"
    
    export NEMO_CURATOR_RAY_SLURM_JOB=1
    export NEMO_CURATOR_LOCAL_DOCKER_JOB=1
    
    # Download Image Processing Models
    srun \
      --mpi=none \
      --container-writable \
      --no-container-remap-root \
      --export=NEMO_CURATOR_RAY_SLURM_JOB,NEMO_CURATOR_LOCAL_DOCKER_JOB \
      --container-image "${CONTAINER_IMAGE}" \
      --container-mounts "${CONTAINER_MOUNTS}" \
        --  python3 -c "
    import timm
    from nemo_curator.image.embedders import TimmImageEmbedder
    from nemo_curator.image.classifiers import AestheticClassifier, NsfwClassifier
    
    # Download and cache CLIP model
    embedder = TimmImageEmbedder('vit_large_patch14_clip_quickgelu_224.openai', pretrained=True)
    
    # Download aesthetic and NSFW classifiers
    aesthetic = AestheticClassifier()
    nsfw = NsfwClassifier()
    
    print('Image models downloaded successfully')
    "

  2. Update the SBATCH parameters and paths to match your username and environment.

  3. Run the script.

    sbatch 1_curator_download_image_models.sh
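
Step 2 is easy to miss, so a quick check before submitting can help. The following is a minimal sketch, not part of NeMo Curator: `find_placeholders` is a hypothetical helper that searches a job script for the two template placeholders visible in the script above (`<username>` and `path-to`). Extend the pattern if your scripts use other placeholders.

```shell
#!/bin/bash
# Sketch: report template placeholders still left in one or more job scripts.
# Assumption: the only placeholders are "<username>" and "path-to",
# as seen in the download script above.
find_placeholders() {
  # -n prints line numbers; -E enables the alternation pattern.
  # Exit status is 0 if a placeholder is found, 1 if none remain.
  grep -nE '<username>|path-to' "$@"
}

# Usage:
# if find_placeholders 1_curator_download_image_models.sh; then
#   echo "Placeholders remain; update the script before running sbatch."
# fi
```

The helper only defines a function, so it can be sourced into a login shell and reused for the other scripts in this workflow.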

Image Processing Pipeline

The workflow consists of three main Slurm scripts, to be run in order:

  1. curator_image_embed.sh: Generates embeddings and applies classifications to images.
  2. curator_image_filter.sh: Filters images based on quality, aesthetic, and NSFW scores.
  3. curator_image_dedup.sh: Performs semantic deduplication using image embeddings.

To run them:

  1. Update all # Update Me! sections in the scripts for your environment (paths, usernames, S3 buckets, etc.).
  2. Submit each job with sbatch, waiting for each job to finish before submitting the next:

    sbatch curator_image_embed.sh
    sbatch curator_image_filter.sh
    sbatch curator_image_dedup.sh
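
Because the three stages must run in order, one option is to let Slurm enforce the ordering with job dependencies instead of waiting manually. The sketch below is an illustration rather than part of the documented workflow: `submit_chain` is a hypothetical helper that uses the standard `sbatch --parsable` and `--dependency=afterok:<jobid>` options so that each stage starts only after the previous one exits successfully.

```shell
#!/bin/bash
# Sketch: submit scripts in sequence, each depending on the previous job.
# Assumption: the script names passed in are in the current directory.
submit_chain() {
  local prev_id="" job_id script
  for script in "$@"; do
    if [ -z "${prev_id}" ]; then
      # First stage: no dependency
      job_id=$(sbatch --parsable "${script}") || return 1
    else
      # Later stages: start only if the previous job completed successfully
      job_id=$(sbatch --parsable --dependency=afterok:"${prev_id}" "${script}") || return 1
    fi
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID
    job_id="${job_id%%;*}"
    echo "Submitted ${script} as job ${job_id}"
    prev_id="${job_id}"
  done
}

# Usage (on a Slurm login node):
# submit_chain curator_image_embed.sh curator_image_filter.sh curator_image_dedup.sh
```

With `afterok`, a failed stage leaves the later jobs pending with an unsatisfied dependency, so check `squeue` and cancel them with `scancel` if a stage fails.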

Monitoring and Logs

  1. Check job status:

    squeue

  2. View logs (the file name follows the %x_%j.log pattern set by --output, that is, <jobname>_<jobid>.log):

    tail -f /path/to/logs/<jobname>_<jobid>.log
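
When a job has been submitted several times, finding the right log file by job ID can be tedious. The following sketch assumes logs are written with the `%x_%j.log` pattern from the `--output` line in the download script; `latest_log` is a hypothetical helper, and `LOG_DIR` is a placeholder for your own log directory.

```shell
#!/bin/bash
# Sketch: print the most recently modified log file for a given job name.
# Assumption: logs are named <jobname>_<jobid>.log (the %x_%j.log pattern).
latest_log() {
  local log_dir="$1" job_name="$2"
  # ls -t sorts newest first; head keeps only the top entry.
  # Prints nothing if no matching log exists.
  ls -t "${log_dir}/${job_name}"_*.log 2>/dev/null | head -n 1
}

# Usage:
# tail -f "$(latest_log "${LOG_DIR}" download_image_models)"
```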

Performance Considerations

  • GPU Memory: Image processing requires significant GPU memory. Consider using nodes with high-memory GPUs (40GB+ VRAM) for large batch sizes.
  • Tar Archive Format: Ensure your input data is in tar archive format (.tar files containing JPEG images).
  • Network I/O: Image data can be large. Consider local caching or high-bandwidth storage for better performance.
  • Clustering Scale: For datasets with millions of images, increase n_clusters to 50,000+ to improve deduplication performance.
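
The tar archive requirement above can be spot-checked before submitting a job. The sketch below lists an archive's members without extracting them and counts the names ending in .jpg or .jpeg. `count_jpegs_in_tar` is a hypothetical helper. It checks file names only and does not validate image contents or any other layout the pipeline may expect.

```shell
#!/bin/bash
# Sketch: count JPEG members in a tar archive by file extension.
# Assumption: JPEG files use a .jpg or .jpeg extension (any letter case).
count_jpegs_in_tar() {
  # tar -tf lists member names; grep -c counts matching lines,
  # -i ignores case, -E enables the "jpe?g" pattern.
  tar -tf "$1" | grep -Eic '\.jpe?g$'
}

# Usage (the archive path is a placeholder):
# count_jpegs_in_tar /path/to/shard.tar
```

A count of 0 suggests the archive is not in the expected format, or that the images use a different extension.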