Description
GPU-accelerated data curation for LLM training, covering text, image, video, and audio. Features include fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use it to prepare high-quality training datasets, clean web data, or deduplicate large corpora.
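To make "fuzzy deduplication" concrete, here is a minimal CPU-only sketch of the general MinHash idea in plain Python. It is not this skill's API or implementation: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `fuzzy_dedup`), the 3-word shingle size, the 64 hash functions, and the 0.7 threshold are all assumptions chosen for illustration, and the pairwise comparison is quadratic, which is exactly the cost that GPU-based tooling is meant to avoid.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a document (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Build a MinHash signature: the minimum 64-bit hash under each of num_perm salted hashes."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_dedup(docs, threshold=0.7):
    """Keep the first document of each near-duplicate group (illustrative O(n^2) scan)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Calling `fuzzy_dedup` on a list of strings returns the list with near-duplicates removed, keeping the first occurrence of each group. Production systems typically replace the pairwise scan with locality-sensitive hashing so that only documents sharing a signature band are compared.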