Spaces

Host interactive demos and applications powered by machine learning models.

What are Spaces?

Spaces are containerized applications that run on our infrastructure, perfect for:

  • Interactive demos - Showcase your models in action
  • Web applications - Build full-featured ML-powered apps
  • API services - Deploy custom APIs and microservices
  • Data visualizations - Create interactive dashboards
  • Prototypes - Rapid development and testing

Supported Runtimes

Gradio

Build ML interfaces with Python in minutes. Perfect for quick demos and prototypes.

import gradio as gr ...

Streamlit

Create data apps with Python. Great for dashboards and data exploration.

import streamlit as st ...

FastAPI

Build REST APIs with Python. Ideal for custom backend services.

from fastapi import FastAPI ...

Static Sites

Host HTML/CSS/JS sites. Perfect for documentation and portfolios.

Docker (Custom)

Use your own Dockerfile for complete control over the runtime environment.

Creating a Space

Create a Space through the web interface or the API:

Via API:

POST /api/spaces
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "sentiment-demo",
  "namespace": "username",
  "visibility": "public",
  "runtime": "gradio",
  "hardwareType": "cpu-basic",
  "description": "Interactive sentiment analysis demo"
}
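The same request can be scripted. The sketch below builds (but does not send) the POST using only the standard library; the base URL is a placeholder, and the payload mirrors the example above.

```python
import json
import urllib.request

API_BASE = "https://example.com/api"  # placeholder — substitute your platform's base URL

def build_create_space_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/spaces request."""
    return urllib.request.Request(
        f"{API_BASE}/spaces",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

payload = {
    "name": "sentiment-demo",
    "namespace": "username",
    "visibility": "public",
    "runtime": "gradio",
    "hardwareType": "cpu-basic",
    "description": "Interactive sentiment analysis demo",
}
req = build_create_space_request("<your-token>", payload)
# urllib.request.urlopen(req) would send it; omitted so the sketch stays offline.
```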

Quick start with Gradio:

# app.py
import gradio as gr

def predict(text):
    # Your ML model inference here
    return {"positive": 0.8, "negative": 0.2}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(label="Enter text"),
    outputs=gr.Label(label="Sentiment"),
    title="Sentiment Analysis"
)

demo.launch()
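A Gradio Space also needs its Python dependencies declared. A minimal requirements.txt might look like the following; the version pins are illustrative, so match the versions you actually test against.

```
# requirements.txt — pins are illustrative; use your own tested versions
gradio==4.44.0
torch==2.4.0
transformers==4.44.0
```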

Hardware Options

Choose the right hardware for your application:

CPU Basic

2 vCPU, 4GB RAM - Free tier

FREE

CPU Upgrade

8 vCPU, 16GB RAM

$0.50/hour

GPU T4 Small

1x T4 GPU, 4 vCPU, 16GB RAM, 16GB VRAM

$0.60/hour

GPU A10 Medium

1x A10 GPU, 8 vCPU, 32GB RAM, 24GB VRAM

$1.20/hour

GPU A10G

1x A10G GPU, 8 vCPU, 32GB RAM, 24GB VRAM

$1.50/hour

GPU L4

1x L4 GPU, 8 vCPU, 32GB RAM, 24GB VRAM

$0.90/hour

GPU L40S

1x L40S GPU, 12 vCPU, 48GB RAM, 48GB VRAM

$2.40/hour

GPU H100 Large

1x H100 GPU, 16 vCPU, 80GB RAM, 80GB VRAM

$3.00/hour

⚡ ZeroGPU H200 Large (PRO+)

1x H200 GPU, 16 vCPU, 70GB VRAM - Dynamic allocation

FREE*

⚡ ZeroGPU H200 XLarge (TEAM+)

1x H200 GPU, 16 vCPU, 141GB VRAM - Dynamic allocation

FREE*

🔥 Multi-GPU 2x A10 (TEAM+)

2x A10 GPUs, 16 vCPU, 64GB RAM, 48GB VRAM total

$2.40/hour

🔥 Multi-GPU 4x H100 (ENTERPRISE)

4x H100 GPUs, 64 vCPU, 320GB RAM, 320GB VRAM total

$12.00/hour
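To compare paid tiers, a back-of-the-envelope monthly estimate is just the hourly rate times billed hours. The sketch below uses the per-hour rates listed above; the tier keys are made-up identifiers for this example, not official hardwareType values, and auto-sleep can reduce the hours actually billed.

```python
# Per-hour rates from the hardware list above (USD). Keys are illustrative.
HOURLY_RATES = {
    "cpu-upgrade": 0.50,
    "gpu-t4-small": 0.60,
    "gpu-l4": 0.90,
    "gpu-a10-medium": 1.20,
    "gpu-a10g": 1.50,
    "gpu-l40s": 2.40,
    "gpu-h100-large": 3.00,
    "multi-gpu-4x-h100": 12.00,
}

def monthly_cost(hardware: str, hours_per_day: float = 24, days: int = 30) -> float:
    """Estimated cost in USD for running the given hardware tier."""
    return round(HOURLY_RATES[hardware] * hours_per_day * days, 2)

# A T4 Small running around the clock for a 30-day month:
print(monthly_cost("gpu-t4-small"))  # 432.0
```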

💡 About ZeroGPU: Dynamic GPU allocation allows you to run models with up to 70GB or 141GB VRAM without paying per-hour costs. Your Space joins a queue and receives GPU time based on your plan's priority. Free usage within daily quota limits (PRO: 45 mins/day, TEAM: 120 mins/day).

💪 About Multi-GPU: Run larger models that require tensor parallelism or distributed training across multiple GPUs. Supports advanced features like model parallelism and pipeline parallelism.

GPU Quotas by Plan:

  • Free: 0 GPU minutes/day
  • Pro ($9/mo): 45 GPU minutes/day + ZeroGPU
  • Team ($20/user): 120 GPU minutes/day + Multi-GPU
  • Enterprise: custom quota + dedicated pools

⚠️ Auto-Sleep: Free tier Spaces sleep after 48 hours of inactivity. GPU Spaces on Pro/Team plans sleep after 24 hours of idle time. Upgrade to keep your Space always running.

Environment Variables & Secrets

Configure your Space with environment variables:

Setting variables via API:

PATCH /api/spaces/:namespace/:name
{
  "envVars": {
    "MODEL_NAME": "bert-base-uncased",
    "MAX_TOKENS": "128",
    "CACHE_DIR": "/tmp/cache"
  }
}
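Inside your Space, these variables arrive as ordinary process environment variables. A minimal way to read them at startup, with fallback defaults for local development (the defaults here are illustrative):

```python
import os

# Read the variables set via the API above; the second argument to
# os.environ.get is a local-development fallback, not a platform default.
MODEL_NAME = os.environ.get("MODEL_NAME", "bert-base-uncased")
MAX_TOKENS = int(os.environ.get("MAX_TOKENS", "128"))
CACHE_DIR = os.environ.get("CACHE_DIR", "/tmp/cache")

print(MODEL_NAME, MAX_TOKENS, CACHE_DIR)
```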

For sensitive data (API keys, tokens):

  • Use encrypted secrets (not environment variables)
  • Secrets are stored securely and injected at runtime
  • Never commit secrets to your repository

Deploying Your Space

Deploy from Git or upload files directly:

Option 1: Link to Git repository

Connect your GitHub or GitLab repo. We'll auto-deploy on every push.

Option 2: Upload files

Drag and drop your app files. We'll build and deploy automatically.

Option 3: Use platform repo

Create a Space repo on the platform and push files via Git.

Custom Domains

Connect your own domain name to your Space:

  1. Add a CNAME record pointing to your Space's default URL
  2. Add your custom domain in the Space settings
  3. We'll automatically provision a TLS certificate (via Let's Encrypt)
  4. Your Space will be accessible at your custom domain

Requirements:

  • Available on Pro, Team, and Enterprise plans
  • DNS propagation may take up to 24 hours

Monitoring & Logs

Track your Space's performance and debug issues:

View logs:

GET /api/spaces/:namespace/:name/logs?lines=100&follow=true

# Streaming logs in real-time
# Tail the last 100 lines
# Filter by log level (info, warning, error)
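If you fetch a batch of log lines and want only a given level, you can filter client-side. This sketch assumes a "LEVEL message" line format, which is an assumption about the log layout, so adapt the parsing to your actual output.

```python
# Client-side filtering of fetched log lines by level.
# The "LEVEL message" format is an assumption; adjust for your real logs.
def filter_by_level(lines, level="error"):
    """Keep only lines whose first token matches the given level (case-insensitive)."""
    wanted = level.upper()
    return [ln for ln in lines if ln.split(" ", 1)[0].upper() == wanted]

logs = [
    "INFO model loaded in 3.2s",
    "WARNING slow request: 2100ms",
    "ERROR CUDA out of memory",
]
print(filter_by_level(logs, "error"))  # ['ERROR CUDA out of memory']
```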

Metrics available:

  • CPU and memory usage
  • GPU utilization (if applicable)
  • Request count and latency
  • Error rates
  • Build and deployment history

Best Practices

  • Start with CPU - Test your app on CPU before upgrading to GPU
  • Choose the right GPU - Match VRAM to your model size (use ZeroGPU for large models)
  • Optimize for queue - PRO/TEAM plans get higher priority in ZeroGPU queue
  • Monitor usage - Track GPU minutes in your dashboard to avoid quota limits
  • Add loading indicators - Provide feedback during model inference and queue wait time
  • Handle errors gracefully - Show user-friendly error messages
  • Optimize model loading - Cache models to reduce cold start time
  • Use requirements.txt - Pin package versions for reproducibility
  • Enable safetensors - Use safetensors format for faster, secure model loading
  • Consider quantization - 4-bit or 8-bit quantization reduces VRAM and speeds inference
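The model-caching tip above can be as simple as memoizing the loader so the expensive load happens once per process. A minimal sketch using the standard library (the loader body is a stand-in for a real model load):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_model(name: str):
    """Load the model once; repeated calls return the cached object.
    The body is a placeholder for a real (slow) model load."""
    return {"name": name, "weights": "..."}  # stand-in for a real model object

m1 = load_model("bert-base-uncased")
m2 = load_model("bert-base-uncased")
print(m1 is m2)  # True — the second call hits the cache instead of reloading
```

This pattern mainly helps warm requests; for cold starts, keep the download itself in a persistent cache directory as well.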

Inference Optimization

Advanced optimization features to maximize GPU performance:

Text Generation Inference (TGI)

Production-ready LLM serving with continuous batching, tensor parallelism, and token streaming. Up to 10x faster than standard inference for large language models.

Quantization

Reduce model size and speed up inference with 4-bit (bitsandbytes), 8-bit, FP8, GPTQ, AWQ, or GGUF formats. Run 70B models on 24GB GPUs with minimal quality loss.
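The VRAM savings follow directly from bytes-per-parameter. This back-of-the-envelope sketch counts weight storage only; real usage adds KV cache and activation overhead, so treat the numbers as lower bounds.

```python
# Approximate VRAM needed just for model weights at various precisions.
# Ignores KV cache and activations, so these are lower bounds.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    return round(params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9, 1)

print(weight_vram_gb(70, "fp16"))  # 140.0 GB — needs multiple large GPUs
print(weight_vram_gb(70, "int4"))  # 35.0 GB — a quarter of the FP16 footprint
```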

Optimum-NVIDIA (TensorRT-LLM)

Accelerate LLM inference up to 28x using TensorRT-LLM. Optimizes for NVIDIA GPUs with FP8 support.

FlashAttention-2

Memory-efficient attention implementation that speeds up training and inference by 2-4x.

WebGPU

Run models directly in the browser using hardware acceleration. No backend server required.

PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune large models on consumer GPUs using LoRA, QLoRA, and other efficient methods.

DeepSpeed

Gradient checkpointing and mixed precision (FP16/BF16) to optimize VRAM usage and training speed.

Safetensors

Secure, fast format for storing model weights. Loads 3-5x faster than pickle-based formats.

💡 Enable optimizations: Configure optimization settings in your model repository settings or when creating an inference endpoint. Most optimizations are automatically detected and enabled.

Hardware Compatibility

NVIDIA CUDA

Full support for CUDA-enabled GPUs including A100, H100, H200, L40S, A10G, L4, and T4.

A100 · H100 · H200 · L40S · A10G · L4 · T4

AMD ROCm (Coming Soon)

Planned support for AMD Instinct MI250 and MI300 GPUs through ROCm, including native support in TGI and the Transformers library.

MI250 · MI300

Cloud Integration

Optimized containers for major cloud providers:

Google GKE · Vertex AI · AWS SageMaker