Quickstart
From zero to your first H200 training run in under 90 seconds. This guide walks you through installing the Upod CLI, provisioning a pod, and pushing a training job to all eight GPUs.
Jump straight to First training run to skip provisioning.
1. Install the CLI
The upod CLI is a single static binary. It works on macOS, Linux,
and the WSL2 layer of Windows.
    # Install with the official one-liner
    curl -fsSL https://upod.vercel.app/install.sh | sh

    # Verify the install
    upod --version
    → upod 0.4.2 (build a91f0c)
Then authenticate with your workspace token from app.upod.com/keys:
upod auth login --token upd_live_*****************
2. Provision a pod
Pick an SKU and a region. Every pod is single-tenant and racked in under 60 seconds, and it comes back with a private VPC, a static IPv4 address, and an SSH key.
    # UP-H8:     8× H200, 1.1 TB HBM3e
    # us-west-2: PDX-A, liquid cooled
    upod pods create \
      --sku UP-H8 \
      --region us-west-2 \
      --name llama-pretrain \
      --ssh-key ~/.ssh/id_ed25519.pub
You'll see the pod show up as provisioning for ~45s, then
flip to ready:
    NAME            SKU    REGION     STATUS  AGE  ENDPOINT
    llama-pretrain  UP-H8  us-west-2  ready   47s  pod-9f3a.upod.run
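If you want to gate a script on the pod reaching ready, the status output above can be checked programmatically. The following is a minimal sketch, assuming the listing is plain whitespace-separated columns exactly as shown; `parse_pod_table` and `is_ready` are hypothetical helper names, not part of the upod CLI.

```python
def parse_pod_table(text: str) -> list[dict[str, str]]:
    """Parse whitespace-separated status output into one dict per pod."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def is_ready(text: str, name: str) -> bool:
    """Return True if the named pod is listed with STATUS equal to 'ready'."""
    return any(
        row.get("NAME") == name and row.get("STATUS") == "ready"
        for row in parse_pod_table(text)
    )


# The status listing from this guide, used here as sample input.
sample = """\
NAME            SKU    REGION     STATUS  AGE  ENDPOINT
llama-pretrain  UP-H8  us-west-2  ready   47s  pod-9f3a.upod.run
"""

print(is_ready(sample, "llama-pretrain"))  # True
```

Splitting on whitespace only works while no column value contains a space, which holds for the output shown here but is an assumption about the CLI in general.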
3. SSH in & check the GPUs
Use the convenience wrapper so you don't have to copy hostnames around.
The first connection will provision a shell session and mount your team's
shared scratch volume at /mnt/shared.
    upod ssh llama-pretrain
    → Connected to pod-9f3a.upod.run · 8× NVIDIA H200 141GB

    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
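To turn the nvidia-smi query into an automated check, you can parse its CSV output and confirm that all eight GPUs are visible. This is a rough sketch: `parse_gpu_csv` and `check_gpus` are hypothetical helpers, and the sample row below uses placeholder memory and driver values rather than real output from a pod.

```python
import csv
import io


def parse_gpu_csv(text: str) -> list[dict[str, str]]:
    """Parse `nvidia-smi --format=csv` output into one dict per GPU."""
    reader = csv.DictReader(io.StringIO(text.strip()), skipinitialspace=True)
    return [{k.strip(): v.strip() for k, v in row.items()} for row in reader]


def check_gpus(text: str, expected_count: int, name_fragment: str) -> bool:
    """Return True if the GPU count matches and every GPU name contains the fragment."""
    gpus = parse_gpu_csv(text)
    return len(gpus) == expected_count and all(
        name_fragment in gpu["name"] for gpu in gpus
    )


# Illustrative input only: the memory and driver values are placeholders.
header = "name, memory.total [MiB], driver_version"
row = "NVIDIA H200, 143771 MiB, 000.00.00"
sample = "\n".join([header] + [row] * 8)

print(check_gpus(sample, expected_count=8, name_fragment="H200"))  # True
```

On the pod itself, the `text` argument could come from `subprocess.run([...], capture_output=True, text=True, check=True).stdout` with the same nvidia-smi arguments as above.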
4. Run your first training step
Every pod ships pre-imaged with CUDA 12.6, PyTorch 2.5, vLLM, NCCL 2.21, and SLURM. Here's a minimal distributed Hello-World across all 8 GPUs:
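    import os

    import torch
    import torch.distributed as dist

    # torchrun sets LOCAL_RANK for every worker process it spawns
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group("nccl")
    rank = dist.get_rank()

    # Each rank contributes its own rank; all_reduce sums in place by default
    x = torch.ones(1, device=f"cuda:{local_rank}") * rank
    dist.all_reduce(x)
    print(f"rank {rank} → reduced sum = {x.item()}")

    dist.destroy_process_group()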
Save the script as hello.py, then launch it across all 8 GPUs with torchrun:
torchrun --nproc_per_node=8 hello.py
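To sanity-check the output, note that each rank contributes a tensor equal to its own rank and all_reduce sums by default, so every rank should print 0 + 1 + ... + 7 = 28.0. The CPU-only sketch below mimics that arithmetic without GPUs; `simulate_all_reduce_sum` is a hypothetical helper for illustration, not a PyTorch API.

```python
def simulate_all_reduce_sum(world_size: int) -> list[float]:
    """Mimic a sum all-reduce where rank r contributes the value r."""
    contributions = [1.0 * rank for rank in range(world_size)]
    total = sum(contributions)
    # After an all-reduce, every rank holds the same reduced value.
    return [total] * world_size


results = simulate_all_reduce_sum(8)
print(results[0])  # 28.0
```

If any rank on the pod prints a different value, the collective did not include all eight processes.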
UP-H8 uses NVSwitch (any-to-any 900 GB/s); UP-A8 uses NVLink3 (cube-mesh). For models > 70B params, prefer UP-H8 to avoid cross-mesh hops during tensor-parallel collective ops.
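For a rough feel of why interconnect bandwidth matters for tensor-parallel collectives, here is a back-of-the-envelope estimate using the textbook ring all-reduce cost model, in which each GPU moves 2(N-1)/N of the message. It is only a sketch under loud assumptions: the 900 GB/s figure is treated as fully usable per GPU, latency is ignored, and NCCL may pick a different algorithm on NVSwitch. `ring_all_reduce_seconds` is a hypothetical helper.

```python
def ring_all_reduce_seconds(
    message_bytes: float, num_gpus: int, bandwidth_bytes_per_s: float
) -> float:
    """Idealized ring all-reduce time: each GPU moves 2*(N-1)/N of the message."""
    if num_gpus < 2:
        return 0.0
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_bytes / bandwidth_bytes_per_s


GB = 1e9

# Assumption: the full 900 GB/s quoted for UP-H8 is achievable per GPU.
t = ring_all_reduce_seconds(1 * GB, num_gpus=8, bandwidth_bytes_per_s=900 * GB)
print(f"{t * 1e3:.2f} ms per 1 GB all-reduce (idealized lower bound)")
```

Under these assumptions a 1 GB all-reduce takes roughly 1.9 ms. Real timings will be higher, so treat the number as a floor for comparing topologies, not a prediction.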
Region availability
All twelve regions ship the same image and the same SLA. Provisioning
latency is the P50, measured end to end from upod pods create to
ready.
| Region | Code | UP-H8 | UP-A8 | P50 provision |
|---|---|---|---|---|
| US West (Portland) | us-west-2 | live | live | 42s |
| US East (Ashburn) | us-east-1 | live | live | 48s |
| EU Central (Frankfurt) | eu-central-1 | live | live | 51s |
| EU North (Stockholm) | eu-north-1 | live | live | 54s |
| Asia Pacific (Tokyo) | ap-northeast-1 | live | live | 62s |
| Asia Pacific (Singapore) | ap-southeast-1 | Q3 | live | — |
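If you choose a region in a script, the table above can be encoded as data and filtered by SKU availability. This is a minimal sketch that copies the listed rows by hand; the field names and `fastest_region` are hypothetical, and a missing P50 is represented as `None`.

```python
# Rows copied from the region table; p50_s is the P50 provision time in seconds.
REGIONS = [
    {"code": "us-west-2", "up_h8": "live", "up_a8": "live", "p50_s": 42},
    {"code": "us-east-1", "up_h8": "live", "up_a8": "live", "p50_s": 48},
    {"code": "eu-central-1", "up_h8": "live", "up_a8": "live", "p50_s": 51},
    {"code": "eu-north-1", "up_h8": "live", "up_a8": "live", "p50_s": 54},
    {"code": "ap-northeast-1", "up_h8": "live", "up_a8": "live", "p50_s": 62},
    {"code": "ap-southeast-1", "up_h8": "Q3", "up_a8": "live", "p50_s": None},
]


def fastest_region(regions: list[dict], sku_key: str) -> str:
    """Return the code of the live region with the lowest known P50 for a SKU."""
    candidates = [
        r for r in regions if r[sku_key] == "live" and r["p50_s"] is not None
    ]
    return min(candidates, key=lambda r: r["p50_s"])["code"]


print(fastest_region(REGIONS, "up_h8"))  # us-west-2
```

Regions without a published P50 are skipped instead of being treated as zero, so a region like ap-southeast-1 never wins by default.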
What's next?
You now have a working pod and a verified collective-ops call across 8 GPUs. From here, most teams head straight to the training-loop guide or wire up checkpoint persistence.