Dataset Hub
Publish, manage, and explore datasets for machine learning training and evaluation.
What is a Dataset?
A dataset is a structured collection of data used for training, validating, or testing machine learning models.
- Structured data - Organized in rows and columns
- Schema - Defined data types for each column
- Splits - Train, validation, and test partitions
- Formats - Parquet, CSV, JSON, JSONL
- Metadata - Task types, tags, licensing
Creating a Dataset
Create a new dataset repository:
Via API:
POST /api/datasets
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "sentiment-reviews",
"namespace": "username",
"visibility": "public",
"taskTypes": ["text-classification"],
"tags": ["sentiment", "reviews", "nlp"],
"license": "mit",
"description": "10k product reviews labeled with sentiment"
}
Naming conventions:
- Use descriptive names that indicate the data content
- Include data size or version when applicable
- Use lowercase with hyphens or underscores
- Example: imdb-reviews-50k
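For reference, a minimal sketch of the creation request above from Python, assuming the `requests` library; the base URL and token are placeholders to substitute for your instance:
import requests

BASE_URL = "https://example.com"  # placeholder: your Hub's host
TOKEN = "your-token"              # placeholder: a valid API token

payload = {
    "name": "sentiment-reviews",
    "namespace": "username",
    "visibility": "public",
    "taskTypes": ["text-classification"],
    "tags": ["sentiment", "reviews", "nlp"],
    "license": "mit",
    "description": "10k product reviews labeled with sentiment",
}

resp = requests.post(
    f"{BASE_URL}/api/datasets",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,  # sends Content-Type: application/json
)
resp.raise_for_status()
print(resp.json())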
Dataset Formats
We support multiple data formats with automatic conversion:
Parquet (Recommended)
Columnar format optimized for analytics. Best for large datasets and fast queries.
CSV
Simple comma-separated values. Easy to create but slower for large datasets.
JSON / JSONL
Flexible format for nested data structures. Good for semi-structured data.
💡 Tip: Upload CSV or JSON and we'll automatically convert to Parquet for optimal performance.
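If you prefer to convert locally before uploading, a minimal sketch with pandas (assuming `pyarrow` is installed as the Parquet engine; file names are placeholders):
import pandas as pd

# Read the raw CSV, then write a Parquet file for faster downstream queries
df = pd.read_csv("reviews.csv")
df.to_parquet("reviews.parquet", index=False)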
Dataset Schema
Define your dataset structure with a schema:
Example schema:
{
"columns": [
{ "name": "id", "type": "int64" },
{ "name": "text", "type": "string" },
{ "name": "label", "type": "string" },
{ "name": "score", "type": "float64" },
{ "name": "metadata", "type": "struct", "fields": [
{ "name": "source", "type": "string" },
{ "name": "timestamp", "type": "timestamp" }
]}
],
"splits": [
{ "name": "train", "num_rows": 8000 },
{ "name": "validation", "num_rows": 1000 },
{ "name": "test", "num_rows": 1000 }
]
}
Supported types: int8, int16, int32, int64, float32, float64, string, boolean, list, struct, timestamp, date, binary, image, audio
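As an illustration, the example schema above maps naturally onto Arrow types; a sketch with `pyarrow` (the sample values and file name are placeholders):
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
    pa.field("label", pa.string()),
    pa.field("score", pa.float64()),
    pa.field("metadata", pa.struct([
        pa.field("source", pa.string()),
        pa.field("timestamp", pa.timestamp("us")),
    ])),
])

# Build a tiny table matching the schema and write it as Parquet
table = pa.table(
    {
        "id": [1, 2],
        "text": ["great product", "arrived broken"],
        "label": ["positive", "negative"],
        "score": [0.98, 0.91],
        "metadata": [
            {"source": "web", "timestamp": None},
            {"source": "app", "timestamp": None},
        ],
    },
    schema=schema,
)
pq.write_table(table, "train.parquet")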
Dataset Viewer
Explore your dataset with the interactive viewer:
Viewer features:
- Paginated row browsing
- Column filtering and sorting
- Value distribution statistics
- Sample data preview
- Schema inspection
Viewer API:
# Get dataset info (schema, splits, stats)
GET /api/datasets/:namespace/:name/viewer/info
# Get paginated rows
GET /api/datasets/:namespace/:name/viewer/rows?split=train&offset=0&limit=50
# Column filtering coming soon
GET /api/datasets/:namespace/:name/viewer/rows?columns=id,text,label
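To page through an entire split of a public dataset, a sketch that walks the rows endpoint with `requests`; note the response shape is an assumption (this sketch assumes rows arrive under a "rows" key):
import requests

BASE_URL = "https://example.com"  # placeholder: your Hub's host
url = f"{BASE_URL}/api/datasets/username/sentiment-reviews/viewer/rows"

offset, limit = 0, 50
while True:
    resp = requests.get(url, params={"split": "train", "offset": offset, "limit": limit})
    resp.raise_for_status()
    rows = resp.json().get("rows", [])  # assumption: rows live under a "rows" key
    if not rows:
        break
    for row in rows:
        ...  # process each row
    offset += limit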
Uploading Data
Upload your dataset files via the web interface or the API:
Web upload:
- Navigate to your dataset repository
- Click "Upload files"
- Drag and drop or select files
- Files are automatically processed and converted
API upload (multipart):
POST /api/datasets/:namespace/:name/upload
Authorization: Bearer <your-token>
Content-Type: multipart/form-data
file: <binary data>
split: train
format: csv
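A sketch of the same multipart upload from Python (base URL, token, and file name are placeholders):
import requests

BASE_URL = "https://example.com"  # placeholder: your Hub's host
TOKEN = "your-token"              # placeholder: a valid API token

with open("train.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/datasets/username/sentiment-reviews/upload",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"file": f},                          # the multipart file part
        data={"split": "train", "format": "csv"},   # extra form fields
    )
resp.raise_for_status()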
⚠️ Size Limits
- Free tier: 10 GB per dataset
- Pro tier: 100 GB per dataset
- Team/Enterprise: Custom limits
Downloading Datasets
Download datasets in your preferred format:
# Download specific split
POST /api/datasets/:namespace/:name/download
{
"split": "train",
"format": "parquet" # or "csv", "json", "jsonl"
}
# Response includes signed download URL
{
"downloadUrl": "https://storage.../dataset.parquet?token=...",
"expiresAt": "2026-01-15T12:00:00Z"
}
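A sketch of the two-step download flow in Python: request the signed URL, then stream the file to disk before it expires (placeholders as above):
import requests

BASE_URL = "https://example.com"  # placeholder: your Hub's host
TOKEN = "your-token"              # placeholder: a valid API token

resp = requests.post(
    f"{BASE_URL}/api/datasets/username/sentiment-reviews/download",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"split": "train", "format": "parquet"},
)
resp.raise_for_status()
download_url = resp.json()["downloadUrl"]

# Stream the signed URL to disk
with requests.get(download_url, stream=True) as dl:
    dl.raise_for_status()
    with open("train.parquet", "wb") as f:
        for chunk in dl.iter_content(chunk_size=1 << 20):
            f.write(chunk)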
Searching Datasets
Find datasets using filters and search:
GET /api/datasets?search=sentiment&taskTypes=text-classification&limit=20
Query parameters:
- search: text search in names and descriptions
- taskTypes: filter by task type
- tags: comma-separated tags
- license: filter by license
- sort: downloads, likes, size, created, updated
- limit: results per page
- offset: pagination offset
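For example, the same search from Python, letting `requests` encode the query string (base URL is a placeholder):
import requests

BASE_URL = "https://example.com"  # placeholder: your Hub's host

resp = requests.get(
    f"{BASE_URL}/api/datasets",
    params={
        "search": "sentiment",
        "taskTypes": "text-classification",
        "limit": 20,
    },
)
resp.raise_for_status()
results = resp.json()
print(results)  # response shape not specified above; inspect before parsing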
Dataset Splits
Organize your data into standard splits:
Train
Used for training models (typically 70-80% of data)
Validation
Used for hyperparameter tuning (10-15% of data)
Test
Used for final evaluation (10-15% of data)
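A minimal sketch of producing an 80/10/10 split locally with pandas, shuffling first so the partitions are representative (file name is a placeholder):
import pandas as pd

df = pd.read_csv("reviews.csv")
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle

n = len(df)
train = df.iloc[: int(0.8 * n)]
validation = df.iloc[int(0.8 * n) : int(0.9 * n)]
test = df.iloc[int(0.9 * n) :]

for name, part in [("train", train), ("validation", validation), ("test", test)]:
    part.to_parquet(f"{name}.parquet", index=False)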
Best Practices
- ✅ Document your data - Include data sources, collection methods, and preprocessing steps
- ✅ Check for bias - Document known biases and limitations in your dataset
- ✅ Use standard splits - Follow train/val/test conventions
- ✅ Choose Parquet - For datasets over 1 MB, use the Parquet format
- ✅ Specify licenses - Always include clear licensing information
- ✅ Validate quality - Check for duplicates, nulls, and outliers before publishing (see the sketch below)
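A minimal quality-check sketch with pandas; the file and column names are placeholders, and the outlier rule (values beyond three standard deviations) is one simple heuristic among many:
import pandas as pd

df = pd.read_parquet("train.parquet")

# Duplicates and nulls
print("duplicate rows:", df.duplicated().sum())
print("nulls per column:")
print(df.isna().sum())

# Simple outlier check on a numeric column
scores = df["score"]
outliers = df[(scores - scores.mean()).abs() > 3 * scores.std()]
print("potential outliers:", len(outliers))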