By Jagane Sundar and Nikhil Koduri

At OXMIQ Labs, we develop AI technologies across the whole stack, from GPU hardware to model orchestration software.

Running 70B, 120B, or even larger language models locally requires serious hardware. Cloud inference works for prototyping, but for production workloads, sensitive data, or simply avoiding per-token costs, you need local GPU power.

In this post, we'll show you how to build a single workstation with four NVIDIA RTX 6000 Blackwell GPUs: 384 GB of total VRAM and about 500 TFLOPS of FP32 compute. The catch? NVIDIA's software stack artificially limits you to two PCIe GPUs for collective operations. We'll show you how to work around that.

The RTX 6000 Blackwell advantage

The RTX 6000 Blackwell Workstation Edition is NVIDIA's flagship professional GPU, and four of them in a single system deliver remarkable capabilities:

FP32 compute: ~500 TFLOPS (four cards combined)
Total VRAM: 384 GB (96 GB per card)
Memory bandwidth: ~960 GB/s per card
FP4 AI performance: 1000+ TOPS per card
Interconnect: PCIe 5.0 x16 (128 GB/s bidirectional per card)

This is enough memory to hold the weights of a 120B-parameter model in FP16, or of significantly larger models with quantization. The Blackwell architecture's FP4 support makes it particularly well suited to quantized inference workloads.
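The sizing claim can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight memory only, as parameter count times bits per parameter. It ignores KV cache, activations, and runtime overhead, and it treats 1 GB as 10^9 bytes. The function name `weights_gb` is ours, chosen for illustration.

```shell
# Rough weight-memory estimate (ASSUMPTION: weights only, 1 GB = 10^9 bytes).
# Usage: weights_gb <params_in_billions> <bits_per_param>
weights_gb() {
  # billions of parameters * bits / 8 bits-per-byte = gigabytes of weights
  echo $(( $1 * $2 / 8 ))
}

TOTAL_VRAM_GB=384   # 4 cards x 96 GB, from the specs above

for bits in 16 8 4; do
  need=$(weights_gb 120 "$bits")
  echo "120B model at ${bits}-bit: ${need} GB of ${TOTAL_VRAM_GB} GB"
done
```

At 16 bits the weights of a 120B model come to 240 GB, which leaves the remainder of the 384 GB for KV cache and overhead. Actual headroom depends on context length and batch size.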

The hardware

Our Quad Blackwell Build:

Hardware challenges

Challenge #1: Power requirements

Each RTX 6000 Blackwell draws up to 350W under load. With four cards, that's 1400W for the GPUs alone before accounting for the AMD EPYC CPU (300W TDP), memory, storage, and cooling fans. Peak system draw can easily exceed 2000W.

We used a dual PSU configuration: two standard 1600W units connected with a Dual PSU Multiple Power Supply Adapter (ATX 24-pin to Molex 4-pin). This adapter synchronizes the start signal between PSUs so they power on together.
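The power budget above can be written out as a short calculation. The GPU, CPU, and PSU figures are the ones quoted in this section. The 300 W allowance for memory, storage, and fans is our assumption for illustration, not a measurement.

```shell
# Power budget sketch using the figures quoted in this section.
GPU_COUNT=4
GPU_WATTS=350      # per-card draw under load (from the text)
CPU_WATTS=300      # AMD EPYC TDP (from the text)
OTHER_WATTS=300    # ASSUMPTION: memory, storage, fans; not measured
PSU_COUNT=2
PSU_WATTS=1600     # each unit in the dual-PSU configuration

gpu_total=$(( GPU_COUNT * GPU_WATTS ))
system_total=$(( gpu_total + CPU_WATTS + OTHER_WATTS ))
supply_total=$(( PSU_COUNT * PSU_WATTS ))
headroom=$(( supply_total - system_total ))

echo "GPUs:             ${gpu_total} W"
echo "Estimated system: ${system_total} W"
echo "Supply:           ${supply_total} W (headroom ${headroom} W)"
```

Under these assumptions, a single 1600 W unit would fall short of the estimated load, which is why the build uses two.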

Challenge #2: PCIe instability

Running four RTX 6000 Blackwell cards caused frequent hangs and boot failures. What finally fixed them was forcing every slot to run at PCIe Gen4 instead of Gen5 (typically a BIOS/UEFI setting). In our experience the cards are unstable at Gen5 in this configuration, and server-class motherboards come with quirks of their own.
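After changing the slot speed, it helps to confirm what each GPU actually reports. The sketch below is one way to do that: it parses `nvidia-smi` CSV output and flags any card that is not at the expected generation. The helper name `check_pcie_gen` is ours. We assume `pcie.link.gen.max` reflects the forced Gen4 limit; the `pcie.link.gen.current` field can read lower at idle because of link power saving, so check it under load.

```shell
# Report any GPU whose PCIe link generation differs from the expected one.
# Input on stdin: CSV lines "index, gen" as printed by nvidia-smi with noheader.
# Usage: check_pcie_gen <expected_gen>
check_pcie_gen() {
  awk -F', *' -v want="$1" '
    $2 != want { printf "GPU %s: Gen%s (expected Gen%s)\n", $1, $2, want; bad = 1 }
    END { exit bad }
  '
}

# On the workstation itself (requires the NVIDIA driver):
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,pcie.link.gen.max --format=csv,noheader \
    | check_pcie_gen 4 && echo "All GPUs report PCIe Gen4"
fi
```

The function exits non-zero when any card is off target, so it can gate a startup script.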

Challenge #3: Physical fitting

We used the Phanteks Enthoo Pro 2 Full Tower with vertical mounting for one GPU and a PCIe riser cable for the final card, enabling all four cards in one box.

The software challenge: NCCL's two-GPU limit

NVIDIA's NCCL (NVIDIA Collective Communications Library) enforces a maximum of two PCIe-attached NVIDIA GPUs for peer-to-peer communication in a single process. Try to initialize four GPUs in one container and you'll hit errors or severely degraded performance.

The workaround: containerized GPU isolation

The solution is to split your GPUs across multiple Docker containers, with each container seeing at most two GPUs. We then use Ray to orchestrate distributed inference across containers, with NCCL communicating over TCP sockets instead of PCIe peer-to-peer.

Step 1 — Configure NCCL

Create a file called nccl.conf:

NCCL_NET=Socket
NCCL_SOCKET_FAMILY=AF_INET
NCCL_IB_DISABLE=1
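For convenience, the same file can be written from the shell in one step. This sketch only reproduces the three settings above and then counts them as a sanity check. To confirm which transport NCCL picks at runtime, one option is to set the `NCCL_DEBUG=INFO` environment variable in the containers (for example with `-e NCCL_DEBUG=INFO` on `docker run`) and read the startup log.

```shell
# Write the three settings from Step 1 into ./nccl.conf.
cat > nccl.conf <<'EOF'
NCCL_NET=Socket
NCCL_SOCKET_FAMILY=AF_INET
NCCL_IB_DISABLE=1
EOF

# Sanity check: count the KEY=VALUE lines that were written.
grep -c '^NCCL_[A-Z_]*=' nccl.conf
```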

Step 2 — Launch the head container (GPUs 0 & 1)

# Head container: sees GPUs 0 and 1, publishes the Ray dashboard (8265) and the vLLM API (8000)
docker run -it --rm \
  --gpus '"device=0,1"' \
  -p 8265:8265 -p 8000:8000 \
  --ipc=host --ulimit memlock=-1 \
  --shm-size=10.24gb \
  -v "$(pwd)/nccl.conf:/etc/nccl.conf" \
  -v /opt/models:/opt/models \
  --entrypoint /bin/bash \
  nvcr.io/nvidia/vllm:25.11-py3

Step 3 — Launch the worker container (GPUs 2 & 3)

# Worker container: sees GPUs 2 and 3, publishes no ports
docker run -it --rm \
  --gpus '"device=2,3"' \
  --ipc=host --ulimit memlock=-1 \
  --shm-size=10.24gb \
  -v "$(pwd)/nccl.conf:/etc/nccl.conf" \
  -v /opt/models:/opt/models \
  --entrypoint /bin/bash \
  nvcr.io/nvidia/vllm:25.11-py3
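Steps 2 and 3 start two containers, but they still need to join one Ray cluster before vLLM can span them. The commands for that are not listed here, so the following is a minimal sketch using the standard Ray CLI (`ray start`, `ray status`), assuming `ray` is available inside the image. The head IP is a placeholder, and the `run`, `start_head`, and `start_worker` helpers are illustrative names, not part of any tool.

```shell
# ASSUMPTION: this step is not shown in the original walkthrough.
RAY_PORT=6379                      # Ray's default head port
HEAD_IP=${HEAD_IP:-172.17.0.2}     # placeholder; run `hostname -i` in the head container

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

start_head()   { run ray start --head --port="$RAY_PORT"; }
start_worker() { run ray start --address="${HEAD_IP}:${RAY_PORT}"; }

# In the head container:    start_head
# In the worker container:  start_worker
# Then, in either one:      ray status   (should list 4 GPUs in total)
```

Running with `DRY_RUN=1` prints the commands without executing them, which is handy for checking the address before starting anything.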

Step 4 — Start vLLM

# Run inside the head container (the one that publishes port 8000)
vllm serve /opt/models/your-model \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --distributed-executor-backend ray
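The two parallelism settings in Step 4 multiply to the number of GPUs vLLM expects: 2 pipeline stages times 2-way tensor parallelism gives 4. Our reading, stated as an assumption, is that each tensor-parallel pair lands in one container and the pipeline stages span the two containers. A small pre-flight check of that arithmetic could look like this (`check_world_size` is a hypothetical helper):

```shell
# Check that tensor-parallel x pipeline-parallel matches the GPUs available.
# Usage: check_world_size <tensor_parallel> <pipeline_parallel> <total_gpus>
check_world_size() {
  world=$(( $1 * $2 ))
  if [ "$world" -eq "$3" ]; then
    echo "ok: ${1} x ${2} = ${world} GPUs"
  else
    echo "mismatch: ${1} x ${2} = ${world}, but ${3} GPUs available" >&2
    return 1
  fi
}

check_world_size 2 2 4   # values from Step 4 and the four-GPU build
```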

Step 5 — Test the endpoint

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/opt/models/your-model",
    "prompt": "The future of AI is",
    "max_tokens": 100
  }'
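Large models can take several minutes to load, so the first request may fail simply because the server is not up yet. The sketch below polls until the server answers; `wait_for_url` is a hypothetical helper, and the attempt count and delay are arbitrary defaults. vLLM's OpenAI-compatible server exposes `/v1/models`, which also lists the model name that the `"model"` field in a request has to match.

```shell
# Poll a URL until it answers with an HTTP success status, or give up.
# Usage: wait_for_url <url> [attempts] [delay_seconds]
wait_for_url() {
  url=$1; attempts=${2:-30}; delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# Example, run on the host once Step 4 is under way:
#   wait_for_url http://localhost:8000/v1/models && curl -s http://localhost:8000/v1/models
```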

Capsule: our in-house GPU cloud

At OXMIQ Labs, we use OxCapsule to manage and share our GPU-enabled hardware across the team. It functions as an in-house GPU cloud: we register all of our GPU systems and tag them with descriptive labels. When a developer needs specific hardware, they request it with a single command:

capsule launch Nvidia-RTX6000x4

This gives us the flexibility of cloud computing — on-demand access to powerful hardware — without the per-hour costs or data privacy concerns.
