Spaces
Host interactive demos and applications powered by machine learning models.
What are Spaces?
Spaces are containerized applications that run on our infrastructure, perfect for:
- Interactive demos - Showcase your models in action
- Web applications - Build full-featured ML-powered apps
- API services - Deploy custom APIs and microservices
- Data visualizations - Create interactive dashboards
- Prototypes - Rapid development and testing
Supported Runtimes
Gradio
Build ML interfaces with Python in minutes. Perfect for quick demos and prototypes.
import gradio as gr ...
Streamlit
Create data apps with Python. Great for dashboards and data exploration.
import streamlit as st ...
FastAPI
Build REST APIs with Python. Ideal for custom backend services (see the sketch after this list).
from fastapi import FastAPI ...
Static Sites
Host HTML/CSS/JS sites. Perfect for documentation and portfolios.
Docker (Custom)
Use your own Dockerfile for complete control over the runtime environment.
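As referenced above, a minimal FastAPI Space might look like the following sketch; the route names and response shape are illustrative, not a platform requirement:

# app.py - minimal FastAPI sketch (illustrative routes; run with uvicorn)
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Simple liveness check
    return {"status": "ok"}

@app.post("/predict")
def predict(payload: dict):
    # Replace with your real model inference
    text = payload.get("text", "")
    return {"input": text, "label": "positive", "score": 0.8}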
Creating a Space
Create a Space through the web interface or the API:
Via API:
POST /api/spaces
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "sentiment-demo",
"namespace": "username",
"visibility": "public",
"runtime": "gradio",
"hardwareType": "cpu-basic",
"description": "Interactive sentiment analysis demo"
}
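The same request from Python, shown as a minimal sketch with the requests library; the base URL is a placeholder for your platform host:

# create_space.py - sketch of the POST /api/spaces call above
import requests

API_BASE = "https://example.com"  # placeholder: your platform's host

resp = requests.post(
    f"{API_BASE}/api/spaces",
    headers={"Authorization": "Bearer <your-token>"},
    json={
        "name": "sentiment-demo",
        "namespace": "username",
        "visibility": "public",
        "runtime": "gradio",
        "hardwareType": "cpu-basic",
        "description": "Interactive sentiment analysis demo",
    },
)
resp.raise_for_status()
print(resp.json())

Quick start with Gradio: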
# app.py
import gradio as gr
def predict(text):
    # Your ML model inference here
    return {"positive": 0.8, "negative": 0.2}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(label="Enter text"),
    outputs=gr.Label(label="Sentiment"),
    title="Sentiment Analysis"
)
demo.launch()
Hardware Options
Choose the right hardware for your application:
CPU Basic
2 vCPU, 4GB RAM - Free tier
CPU Upgrade
8 vCPU, 16GB RAM
GPU T4 Small
1x T4 GPU, 4 vCPU, 16GB RAM, 16GB VRAM
GPU A10 Medium
1x A10 GPU, 8 vCPU, 32GB RAM, 24GB VRAM
GPU A10G
1x A10G GPU, 8 vCPU, 32GB RAM, 24GB VRAM
GPU L4
1x L4 GPU, 8 vCPU, 32GB RAM, 24GB VRAM
GPU L40S
1x L40S GPU, 12 vCPU, 48GB RAM, 48GB VRAM
GPU H100 Large
1x H100 GPU, 16 vCPU, 80GB RAM, 80GB VRAM
⚡ ZeroGPU H200 Large (PRO+)
1x H200 GPU, 16 vCPU, 70GB VRAM - Dynamic allocation
⚡ ZeroGPU H200 XLarge (TEAM+)
1x H200 GPU, 16 vCPU, 141GB VRAM - Dynamic allocation
🔥 Multi-GPU 2x A10 (TEAM+)
2x A10 GPUs, 16 vCPU, 64GB RAM, 48GB VRAM total
🔥 Multi-GPU 4x H100 (ENTERPRISE)
4x H100 GPUs, 64 vCPU, 320GB RAM, 320GB VRAM total
💡 About ZeroGPU: Dynamic GPU allocation allows you to run models with up to 70GB or 141GB VRAM without paying per-hour costs. Your Space joins a queue and receives GPU time based on your plan's priority. Free usage within daily quota limits (PRO: 45 mins/day, TEAM: 120 mins/day).
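To show how dynamic allocation typically surfaces in app code, here is a hypothetical sketch; the spaces module and @spaces.GPU decorator are illustrative assumptions, not a confirmed API of this platform:

# zerogpu_app.py - hypothetical sketch of decorator-based GPU allocation
import gradio as gr
import spaces  # hypothetical helper module; see note above

@spaces.GPU  # a GPU is attached only while this function runs
def generate(prompt):
    # Model inference would run here on the dynamically allocated GPU
    return f"echo: {prompt}"

gr.Interface(fn=generate, inputs="text", outputs="text").launch()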
💪 About Multi-GPU: Run larger models that require tensor parallelism or distributed training across multiple GPUs. Supports advanced features like model parallelism and pipeline parallelism.
⚠️ Auto-Sleep: Free tier Spaces sleep after 48 hours of inactivity. GPU Spaces on Pro/Team plans sleep after 24 hours of idle time. Upgrade to keep your Space always running.
Environment Variables & Secrets
Configure your Space with environment variables:
Setting variables via API:
PATCH /api/spaces/:namespace/:name
{
"envVars": {
"MODEL_NAME": "bert-base-uncased",
"MAX_TOKENS": "128",
"CACHE_DIR": "/tmp/cache"
}
}
For sensitive data (API keys, tokens):
- Use encrypted secrets (not environment variables)
- Secrets are stored securely and injected at runtime
- Never commit secrets to your repository
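At runtime your app reads both kinds of values the same way, assuming secrets are exposed to the process as environment variables (the API_TOKEN name below is illustrative):

# config.py - reading variables and secrets injected at runtime
import os

MODEL_NAME = os.environ.get("MODEL_NAME", "bert-base-uncased")
MAX_TOKENS = int(os.environ.get("MAX_TOKENS", "128"))
CACHE_DIR = os.environ.get("CACHE_DIR", "/tmp/cache")

# Secrets come from the platform's secret store, never from the repo
API_TOKEN = os.environ["API_TOKEN"]  # illustrative secret name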
Deploying Your Space
Deploy from Git or upload files directly:
Option 1: Link to Git repository
Connect your GitHub or GitLab repo. We'll auto-deploy on every push.
Option 2: Upload files
Drag and drop your app files. We'll build and deploy automatically.
Option 3: Use platform repo
Create a Space repo on the platform and push files via Git.
Custom Domains
Connect your own domain name to your Space:
- Add a CNAME record pointing to your Space's default URL
- Add your custom domain in the Space settings
- We'll automatically provision a TLS certificate (via Let's Encrypt)
- Your Space will be accessible at your custom domain
Requirements:
- Available on Pro, Team, and Enterprise plans
- DNS propagation may take up to 24 hours
Monitoring & Logs
Track your Space's performance and debug issues:
View logs:
GET /api/spaces/:namespace/:name/logs?lines=100&follow=true
# Streaming logs in real-time
# Tail the last 100 lines
# Filter by log level (info, warning, error)
Metrics available:
- CPU and memory usage
- GPU utilization (if applicable)
- Request count and latency
- Error rates
- Build and deployment history
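To consume the streaming log endpoint above from Python, a minimal sketch with the requests library (host, namespace, and token are placeholders):

# tail_logs.py - stream the last 100 lines and follow new output
import requests

API_BASE = "https://example.com"  # placeholder: your platform's host
url = f"{API_BASE}/api/spaces/username/sentiment-demo/logs"

with requests.get(
    url,
    headers={"Authorization": "Bearer <your-token>"},
    params={"lines": 100, "follow": "true"},
    stream=True,  # keep the connection open for real-time logs
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))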
Best Practices
- ✅ Start with CPU - Test your app on CPU before upgrading to GPU
- ✅ Choose the right GPU - Match VRAM to your model size (use ZeroGPU for large models)
- ✅ Optimize for queue - PRO/TEAM plans get higher priority in the ZeroGPU queue
- ✅ Monitor usage - Track GPU minutes in your dashboard to avoid quota limits
- ✅ Add loading indicators - Provide feedback during model inference and queue wait time
- ✅ Handle errors gracefully - Show user-friendly error messages
- ✅ Optimize model loading - Cache models to reduce cold start time (see the sketch after this list)
- ✅ Use requirements.txt - Pin package versions for reproducibility
- ✅ Enable safetensors - Use the safetensors format for faster, secure model loading
- ✅ Consider quantization - 4-bit or 8-bit quantization reduces VRAM and speeds up inference
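As referenced in the list, loading the model once at startup (instead of per request) is the simplest way to cut cold-start and per-request latency; a sketch using the open-source transformers pipeline (model choice is illustrative):

# app.py - load the model once when the Space boots, not per request
import gradio as gr
from transformers import pipeline

# Loaded a single time at startup and reused for every request
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative model
)

def predict(text):
    result = classifier(text)[0]
    # gr.Label accepts a {label: confidence} mapping
    return {result["label"]: result["score"]}

gr.Interface(fn=predict, inputs="text", outputs=gr.Label()).launch()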
Inference Optimization
Advanced optimization features to maximize GPU performance:
Text Generation Inference (TGI)
Production-ready LLM serving with continuous batching, tensor parallelism, and token streaming. Up to 10x faster than standard inference for large language models.
Quantization
Reduce model size and speed up inference with 4-bit (bitsandbytes), 8-bit, FP8, GPTQ, AWQ, or GGUF formats. Run 70B models on 24GB GPUs with minimal quality loss (see the sketch below).
Optimum-NVIDIA (TensorRT-LLM)
Accelerate LLM inference up to 28x using TensorRT-LLM. Optimizes for NVIDIA GPUs with FP8 support.
FlashAttention-2
Memory-efficient attention implementation that speeds up training and inference by 2-4x.
WebGPU
Run models directly in the browser using hardware acceleration. No backend server required.
PEFT (Parameter-Efficient Fine-Tuning)
Fine-tune large models on consumer GPUs using LoRA, QLoRA, and other efficient methods.
DeepSpeed
Gradient checkpointing and mixed precision (FP16/BF16) to optimize VRAM usage and training speed.
Safetensors
Secure, fast format for storing model weights. Loads 3-5x faster than pickle-based formats.
💡 Enable optimizations: Configure optimization settings in your model repository settings or when creating an inference endpoint. Most optimizations are automatically detected and enabled.
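As a concrete illustration of the quantization option above, 4-bit loading with bitsandbytes through the open-source transformers library looks roughly like this (the model ID is a placeholder):

# quantized_load.py - 4-bit quantization with bitsandbytes + transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # run compute in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-70b-model",             # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",                     # place layers on available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("your-org/your-70b-model")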
Hardware Compatibility
NVIDIA CUDA
Full support for CUDA-enabled GPUs including A100, H100, H200, L40S, A10G, L4, and T4.
AMD ROCm (Coming Soon)
Support for AMD Instinct MI250 and MI300 GPUs through ROCm, with native support in TGI and the Transformers library.
Cloud Integration
Optimized containers for major cloud providers: