Dataset Hub

Publish, manage, and explore datasets for machine learning training and evaluation.

What is a Dataset?

A dataset is a structured collection of data used for training, validating, or testing machine learning models.

  • Structured data - Organized in rows and columns
  • Schema - Defined data types for each column
  • Splits - Train, validation, and test partitions
  • Formats - Parquet, CSV, JSON, JSONL
  • Metadata - Task types, tags, licensing
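
For example, two rows of a hypothetical sentiment dataset in JSONL form, where each line is one record and the keys play the role of columns:

{"id": 1, "text": "Battery life is excellent.", "label": "positive", "score": 0.97}
{"id": 2, "text": "Stopped working after a week.", "label": "negative", "score": 0.08}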

Creating a Dataset

Create a new dataset repository:

Via API:

POST /api/datasets
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "sentiment-reviews",
  "namespace": "username",
  "visibility": "public",
  "taskTypes": ["text-classification"],
  "tags": ["sentiment", "reviews", "nlp"],
  "license": "mit",
  "description": "10k product reviews labeled with sentiment"
}
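
If you prefer to script this call, the sketch below does the same thing with Python's requests library. The base URL and the HUB_TOKEN environment variable are placeholders for your hub host and access token, not part of the API:

# Minimal sketch: create the dataset repository via the API.
import os
import requests

BASE_URL = "https://hub.example.com"  # assumed host; replace with yours
TOKEN = os.environ["HUB_TOKEN"]       # assumed env var holding your token

resp = requests.post(
    f"{BASE_URL}/api/datasets",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "sentiment-reviews",
        "namespace": "username",
        "visibility": "public",
        "taskTypes": ["text-classification"],
        "tags": ["sentiment", "reviews", "nlp"],
        "license": "mit",
        "description": "10k product reviews labeled with sentiment",
    },
)
resp.raise_for_status()
print(resp.json())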

Naming conventions:

  • Use descriptive names that indicate the data content
  • Include data size or version when applicable
  • Use lowercase with hyphens or underscores
  • Example: imdb-reviews-50k

Dataset Formats

We support multiple data formats with automatic conversion:

Parquet (Recommended)

Columnar format optimized for analytics. Best for large datasets and fast queries.

CSV

Simple comma-separated values. Easy to create but slower for large datasets.

JSON / JSONL

Flexible format for nested data structures. Good for semi-structured data.

💡 Tip: Upload CSV or JSON and we'll automatically convert to Parquet for optimal performance.
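
If you would rather convert locally before uploading, a short pandas sketch does the same job (requires pandas with a Parquet engine such as pyarrow installed; the file names are hypothetical):

# Optional local pre-conversion: read a CSV and write Parquet.
import pandas as pd

df = pd.read_csv("reviews.csv")                # hypothetical input file
df.to_parquet("reviews.parquet", index=False)  # ready to upload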

Dataset Schema

Define your dataset structure with a schema:

Example schema:

{
  "columns": [
    { "name": "id", "type": "int64" },
    { "name": "text", "type": "string" },
    { "name": "label", "type": "string" },
    { "name": "score", "type": "float64" },
    { "name": "metadata", "type": "struct", "fields": [
      { "name": "source", "type": "string" },
      { "name": "timestamp", "type": "timestamp" }
    ]}
  ],
  "splits": [
    { "name": "train", "num_rows": 8000 },
    { "name": "validation", "num_rows": 1000 },
    { "name": "test", "num_rows": 1000 }
  ]
}

Supported types: int8, int16, int32, int64, float32, float64, string, boolean, list, struct, timestamp, date, binary, image, audio
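
As an illustration, the example schema above maps naturally onto a PyArrow schema if you build Parquet files locally; the hub's JSON schema is independent of this, so treat the mapping as a sketch rather than a required step:

# PyArrow equivalent of the example schema (illustrative only).
import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
    pa.field("label", pa.string()),
    pa.field("score", pa.float64()),
    pa.field("metadata", pa.struct([
        pa.field("source", pa.string()),
        pa.field("timestamp", pa.timestamp("us")),
    ])),
])
print(schema)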

Dataset Viewer

Explore your dataset with the interactive viewer:

Viewer features:

  • Paginated row browsing
  • Column filtering and sorting
  • Value distribution statistics
  • Sample data preview
  • Schema inspection

Viewer API:

# Get dataset info (schema, splits, stats)
GET /api/datasets/:namespace/:name/viewer/info

# Get paginated rows
GET /api/datasets/:namespace/:name/viewer/rows?split=train&offset=0&limit=50

# Select specific columns (coming soon)
GET /api/datasets/:namespace/:name/viewer/rows?columns=id,text,label
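
A small Python sketch for paging through a split with the rows endpoint; the base URL, token handling, and the "rows" key in the response are assumptions:

# Page through the train split, 50 rows at a time.
import os
import requests

BASE_URL = "https://hub.example.com"  # assumed host
TOKEN = os.environ["HUB_TOKEN"]       # assumed env var
url = f"{BASE_URL}/api/datasets/username/sentiment-reviews/viewer/rows"

offset, limit = 0, 50
while True:
    resp = requests.get(
        url,
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"split": "train", "offset": offset, "limit": limit},
    )
    resp.raise_for_status()
    rows = resp.json().get("rows", [])  # response shape is an assumption
    if not rows:
        break
    for row in rows:
        print(row)
    offset += limit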

Uploading Data

Upload your dataset files via the web interface or the API:

Web upload:

  1. Navigate to your dataset repository
  2. Click "Upload files"
  3. Drag and drop or select files
  4. Files are automatically processed and converted

API upload (multipart):

POST /api/datasets/:namespace/:name/upload
Authorization: Bearer <your-token>
Content-Type: multipart/form-data

file: <binary data>
split: train
format: csv
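
The same upload in Python, mirroring the field names above; the host and token handling are placeholders:

# Multipart upload of a CSV file into the train split.
import os
import requests

BASE_URL = "https://hub.example.com"  # assumed host
TOKEN = os.environ["HUB_TOKEN"]       # assumed env var

with open("train.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/datasets/username/sentiment-reviews/upload",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"file": ("train.csv", f, "text/csv")},
        data={"split": "train", "format": "csv"},
    )
resp.raise_for_status()
print(resp.status_code)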

⚠️ Size Limits

  • Free tier: 10 GB per dataset
  • Pro tier: 100 GB per dataset
  • Team/Enterprise: Custom limits

Downloading Datasets

Download datasets in your preferred format:

# Download a specific split; "format" may be "parquet", "csv", "json", or "jsonl"
POST /api/datasets/:namespace/:name/download
{
  "split": "train",
  "format": "parquet"
}

# Response includes signed download URL
{
  "downloadUrl": "https://storage.../dataset.parquet?token=...",
  "expiresAt": "2026-01-15T12:00:00Z"
}
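
End to end in Python, this is a two-step flow: request the signed URL, then fetch the file before it expires. Host and token handling are placeholders:

# Request a download URL for the train split, then stream the file.
import os
import requests

BASE_URL = "https://hub.example.com"  # assumed host
TOKEN = os.environ["HUB_TOKEN"]       # assumed env var

resp = requests.post(
    f"{BASE_URL}/api/datasets/username/sentiment-reviews/download",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"split": "train", "format": "parquet"},
)
resp.raise_for_status()
download_url = resp.json()["downloadUrl"]

# The signed URL itself needs no Authorization header.
with requests.get(download_url, stream=True) as dl:
    dl.raise_for_status()
    with open("train.parquet", "wb") as out:
        for chunk in dl.iter_content(chunk_size=1 << 20):
            out.write(chunk)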

Searching Datasets

Find datasets using filters and search:

GET /api/datasets?search=sentiment&taskTypes=text-classification&limit=20

Query parameters:

  • search - text search in names and descriptions
  • taskTypes - filter by task type
  • tags - comma-separated tags
  • license - filter by license
  • sort - downloads, likes, size, created, or updated
  • limit - results per page
  • offset - pagination offset
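
For example, the same search from Python; the host and the "datasets" key in the response are assumptions:

# List the 20 most-downloaded sentiment classification datasets.
import requests

BASE_URL = "https://hub.example.com"  # assumed host

resp = requests.get(
    f"{BASE_URL}/api/datasets",
    params={
        "search": "sentiment",
        "taskTypes": "text-classification",
        "sort": "downloads",
        "limit": 20,
    },
)
resp.raise_for_status()
for ds in resp.json().get("datasets", []):  # response key is an assumption
    print(ds["namespace"], ds["name"])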

Dataset Splits

Organize your data into standard splits:

Train

Used for training models (typically 70-80% of data).

Validation

Used for hyperparameter tuning (10-15% of data).

Test

Used for final evaluation (10-15% of data).
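
If your data is not already partitioned, a reproducible 80/10/10 split is a few lines of standard-library Python (the row list is a stand-in for your own records):

# Shuffle with a fixed seed, then slice into train/validation/test.
import random

rows = list(range(10_000))  # stand-in for your dataset's rows
random.Random(42).shuffle(rows)

n = len(rows)
train = rows[: int(0.8 * n)]
validation = rows[int(0.8 * n): int(0.9 * n)]
test = rows[int(0.9 * n):]
print(len(train), len(validation), len(test))  # 8000 1000 1000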

Best Practices

  • Document your data - Include data sources, collection methods, and preprocessing steps
  • Check for bias - Document known biases and limitations in your dataset
  • Use standard splits - Follow train/val/test conventions
  • Choose Parquet - For datasets over 1MB, use Parquet format
  • Specify licenses - Always include clear licensing information
  • Validate quality - Check for duplicates, nulls, and outliers before publishing