Quickstart

From zero to your first H200 training run in under 90 seconds. This guide walks you through installing the Upod CLI, provisioning a pod, and pushing a training job to all eight GPUs.

Already have a pod?

Jump straight to First training run to skip provisioning.

1. Install the CLI

The upod CLI is a single static binary. It works on macOS, Linux, and the WSL2 layer of Windows.

shell

# Install with the official one-liner
curl -fsSL https://upod.vercel.app/install.sh | sh

# Verify the install
upod --version
→ upod 0.4.2 (build a91f0c)

Then authenticate with your workspace token from app.upod.com/keys:

shell

upod auth login --token upd_live_*****************

2. Provision a pod

Pick a SKU and a region. Provisioning is single-tenant and typically completes in under 60 seconds. You get back a private VPC, a static IPv4 address, and SSH access with the key you supply.

shell

# UP-H8: 8× H200, 1.1 TB HBM3e
# us-west-2: PDX-A, liquid cooled
upod pods create \
  --sku UP-H8 \
  --region us-west-2 \
  --name llama-pretrain \
  --ssh-key ~/.ssh/id_ed25519.pub

You'll see the pod show up as provisioning for ~45s, then flip to ready:

output

NAME            SKU      REGION     STATUS    AGE   ENDPOINT
llama-pretrain  UP-H8    us-west-2  ready     47s   pod-9f3a.upod.run
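If you script around provisioning, the STATUS column is the part worth checking. The sketch below parses a listing like the one above, assuming the whitespace-separated layout shown here. `parse_pod_table` and `is_ready` are hypothetical helper names, not part of the upod CLI.

```python
# Hypothetical helpers: parse the pod listing shown above and check readiness.
# Assumes columns are whitespace-separated and values contain no spaces.

SAMPLE_OUTPUT = """\
NAME            SKU      REGION     STATUS    AGE   ENDPOINT
llama-pretrain  UP-H8    us-west-2  ready     47s   pod-9f3a.upod.run
"""


def parse_pod_table(text: str) -> list[dict[str, str]]:
    """Turn the listing into one dict per pod, keyed by lowercased header."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = [name.lower() for name in lines[0].split()]
    return [dict(zip(header, line.split())) for line in lines[1:]]


def is_ready(pods: list[dict[str, str]], name: str) -> bool:
    """True if a pod with this name exists and its status is 'ready'."""
    return any(
        pod.get("name") == name and pod.get("status") == "ready" for pod in pods
    )


if __name__ == "__main__":
    pods = parse_pod_table(SAMPLE_OUTPUT)
    print(is_ready(pods, "llama-pretrain"))
```

A wrapper script could call `is_ready` in a loop with a short sleep until the pod flips from provisioning to ready.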

3. SSH in & check the GPUs

Use the convenience wrapper so you don't have to copy hostnames around. The first connection will provision a shell session and mount your team's shared scratch volume at /mnt/shared.

shell

upod ssh llama-pretrain
→ Connected to pod-9f3a.upod.run · 8× NVIDIA H200 141GB

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
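The query above prints a CSV header followed by one row per GPU, so the result is easy to check from a script. Here is a minimal sketch that parses that CSV and confirms all eight H200s are visible. `parse_gpu_csv` and `check_gpus` are hypothetical helpers, and the sample memory and driver values are placeholders rather than output captured from a real pod.

```python
import csv
import io

# Illustrative sample only: the memory and driver values are placeholders,
# not output captured from a real pod.
SAMPLE_CSV = (
    "name, memory.total [MiB], driver_version\n"
    + "NVIDIA H200, 143771 MiB, 000.00.00\n" * 8
)


def parse_gpu_csv(text: str) -> list[dict[str, str]]:
    """Parse nvidia-smi CSV output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def check_gpus(gpus, expected_count=8, expected_name="H200"):
    """Return a list of problems; an empty list means the pod looks right."""
    problems = []
    if len(gpus) != expected_count:
        problems.append(f"expected {expected_count} GPUs, found {len(gpus)}")
    for index, gpu in enumerate(gpus):
        if expected_name not in gpu.get("name", ""):
            problems.append(f"GPU {index} is {gpu.get('name')!r}")
    return problems


if __name__ == "__main__":
    print(check_gpus(parse_gpu_csv(SAMPLE_CSV)) or "all GPUs present")
```

On a pod you would feed it the real command output, for example via `subprocess.run(..., capture_output=True, text=True).stdout`, instead of `SAMPLE_CSV`.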

4. Run your first training step

Every pod ships pre-imaged with CUDA 12.6, PyTorch 2.5, vLLM, NCCL 2.21, and SLURM. Here's a minimal distributed Hello-World across all 8 GPUs:

python

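import os

import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process
dist.init_process_group("nccl")
rank = dist.get_rank()

# Bind this process to its own GPU; LOCAL_RANK is the per-node GPU index
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes its rank number; all_reduce sums in place by default
x = torch.ones(1, device=f"cuda:{local_rank}") * rank
dist.all_reduce(x)

print(f"rank {rank} → reduced sum = {x.item()}")
dist.destroy_process_group()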

Save the script as hello.py, then launch it on all 8 GPUs with torchrun:

shell

torchrun --nproc_per_node=8 hello.py
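To recognise a healthy run, it helps to work out the expected value first. With the default sum reduction, every rank ends up holding the sum of all rank numbers. The sketch below models that arithmetic in plain Python, with no GPUs needed. `simulate_all_reduce_sum` is a hypothetical name, and it models only the result, not how NCCL moves the data.

```python
def simulate_all_reduce_sum(world_size: int) -> list[float]:
    """Model the result of a sum all-reduce where rank r contributes r."""
    contributions = [1.0 * rank for rank in range(world_size)]
    total = sum(contributions)
    # After all_reduce, every rank holds the same reduced value.
    return [total for _ in range(world_size)]


if __name__ == "__main__":
    for rank, value in enumerate(simulate_all_reduce_sum(8)):
        print(f"rank {rank} → reduced sum = {value}")
```

For 8 ranks this gives 0 + 1 + ... + 7 = 28.0 on every rank, so each line printed by hello.py should end in 28.0. The lines may arrive in any order, since the processes print independently.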

Heads up on NVLink topology

UP-H8 uses NVSwitch (any-to-any 900 GB/s); UP-A8 uses NVLink3 (cube-mesh). For models > 70B params, prefer UP-H8 to avoid cross-mesh hops during tensor-parallel collective ops.
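A rough way to see why models around this size end up tensor-parallel is to work out the weight memory. This back-of-the-envelope sketch assumes 2 bytes per parameter (bf16 or fp16), counts weights only (no optimizer state, gradients, or activations), and assumes an even split across GPUs. `weight_gb_per_gpu` is a hypothetical helper.

```python
def weight_gb_per_gpu(params: float, tp_degree: int, bytes_per_param: int = 2) -> float:
    """Weights-only memory per GPU in GB (1e9 bytes), assuming an even split."""
    return params * bytes_per_param / tp_degree / 1e9


if __name__ == "__main__":
    # 70B parameters on one GPU versus split eight ways.
    print(weight_gb_per_gpu(70e9, tp_degree=1))  # 140.0
    print(weight_gb_per_gpu(70e9, tp_degree=8))  # 17.5
```

Under these assumptions, the weights of a 70B model alone take about 140 GB, which nearly fills a single 141 GB H200. Splitting them eight ways leaves about 17.5 GB per GPU, at the cost of collective ops on every layer, which is where interconnect topology starts to matter.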

Region availability

All twelve regions ship the same image and the same SLA. Provisioning latency is reported as the P50 (median), measured end to end from pods create to ready.

Region	Code	UP-H8	UP-A8	P50 provision
US West (Portland)	`us-west-2`	live	live	42s
US East (Ashburn)	`us-east-1`	live	live	48s
EU Central (Frankfurt)	`eu-central-1`	live	live	51s
EU North (Stockholm)	`eu-north-1`	live	live	54s
Asia Pacific (Tokyo)	`ap-northeast-1`	live	live	62s
Asia Pacific (Singapore)	`ap-southeast-1`	Q3	live	—
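If you choose regions programmatically, the table can be treated as data. The sketch below copies the rows above into a list and picks the live region with the lowest listed P50 for a given SKU. The `REGIONS` structure and `fastest_region` are hypothetical, and Upod may expose this information differently.

```python
# Rows copied from the table above:
# (region code, UP-H8 status, UP-A8 status, P50 provision time in seconds).
# None means no P50 figure is listed.
REGIONS = [
    ("us-west-2", "live", "live", 42),
    ("us-east-1", "live", "live", 48),
    ("eu-central-1", "live", "live", 51),
    ("eu-north-1", "live", "live", 54),
    ("ap-northeast-1", "live", "live", 62),
    ("ap-southeast-1", "Q3", "live", None),
]

# Index of each SKU's status column within a row.
SKU_COLUMN = {"UP-H8": 1, "UP-A8": 2}


def fastest_region(sku: str, candidates=None):
    """Return the live region code with the lowest listed P50, or None."""
    column = SKU_COLUMN[sku]
    live = [
        row
        for row in REGIONS
        if row[column] == "live"
        and row[3] is not None
        and (candidates is None or row[0] in candidates)
    ]
    if not live:
        return None
    return min(live, key=lambda row: row[3])[0]


if __name__ == "__main__":
    print(fastest_region("UP-H8"))
    print(fastest_region("UP-H8", candidates={"eu-central-1", "eu-north-1"}))
```

Passing `candidates` restricts the choice to regions you are allowed to use, for example for data-residency reasons. Regions without a listed P50 are skipped rather than guessed at.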

What's next?

You now have a working pod and a verified collective-ops call across 8 GPUs. From here, most teams head straight to the training-loop guide or wire up checkpoint persistence.
