Question 1

Where can I find community setup guides for NemoClaw + vLLM?

Accepted Answer

Several community guides cover practical local inference configurations. DEV Community has "Punching Through NemoClaw's Sandbox to Hit Local vLLM on RTX 5090", which walks through the RTX-specific configuration. Codersera published "NemoClaw + OpenClaw: Secure Sandbox Guide for Local vLLM Agents", covering the security boundary between the sandbox and the inference server. PacketMoat has "OpenClaw + NemoClaw + Nemotron: Local Setup Guide" for a full-stack walkthrough. For official model documentation, the vLLM Recipes repository includes the NVIDIA Nemotron-3-Nano-30B user guide with validated configurations.

Question 2

Can I run vLLM with NemoClaw on an RTX 4090 instead of the 5090?

Accepted Answer

Yes, but without NVFP4 support. The RTX 4090 uses the Ada Lovelace architecture, which supports FP8 but not NVFP4. You can run Nemotron 3 Super 12B in FP8 on the 4090's 24GB VRAM, but throughput will be approximately 4x lower than NVFP4 on an RTX 5090. The 120B model will not fit on a single 4090: at one byte per parameter, its FP8 weights alone need roughly 120GB. For the hardware comparison, see our NemoClaw Hardware Guide.
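The fit claims above follow from simple arithmetic, which can be sketched as below. This is a rough weight-only estimate, not a vLLM calculation: the bytes-per-parameter table, the 0.8 headroom factor, and the function names are illustrative assumptions, and the NVFP4 figure ignores the small overhead of block scale factors.

```python
# Rough weight-only VRAM estimate. It ignores KV cache, activations, and
# runtime overhead, so treat the result as a lower bound on real usage.

# Assumed storage cost per parameter (NVFP4 scale factors are ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "nvfp4": 0.5}


def weight_gib(params_billion: float, precision: str) -> float:
    """Approximate size of the model weights in GiB."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 2**30


def fits(params_billion: float, precision: str,
         vram_gib: float = 24.0, headroom: float = 0.8) -> bool:
    """True if the weights fit in the assumed usable share of VRAM.

    headroom is a hypothetical fraction of VRAM reserved for weights;
    the rest is left for KV cache and runtime buffers.
    """
    return weight_gib(params_billion, precision) <= vram_gib * headroom


print(f"12B  FP8: {weight_gib(12, 'fp8'):.1f} GiB, fits 24GB: {fits(12, 'fp8')}")
print(f"120B FP8: {weight_gib(120, 'fp8'):.1f} GiB, fits 24GB: {fits(120, 'fp8')}")
```

Under these assumptions a 12B model in FP8 leaves room for KV cache on a 24GB card, while a 120B model exceeds it several times over even before cache is counted.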

Question 3

Does the gateway VRAM management work with Docker deployments?

Accepted Answer

Yes, but the gateway must have access to the Docker socket to start and stop the vLLM container. Set vllm_command to a docker start / docker stop command pair instead of the direct vLLM binary. The VRAM release behavior depends on the NVIDIA Container Toolkit correctly unmapping GPU memory when the container stops, so run nvidia-smi after a gateway-triggered shutdown and confirm that memory usage has dropped back to its idle level.
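The nvidia-smi check can be scripted. The sketch below is not part of NemoClaw; it only wraps nvidia-smi's CSV query mode, and the 1024 MiB idle baseline is an assumed threshold that you should replace with the idle usage you observe on your own machine.

```python
import subprocess


def parse_memory_used(csv_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi's noheader,nounits CSV."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def gpu_memory_used_mib() -> list[int]:
    """Ask nvidia-smi for the memory currently in use on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def vram_released(used_mib: list[int], baseline_mib: int = 1024) -> bool:
    """True if every GPU is at or below the assumed idle baseline.

    baseline_mib is a hypothetical threshold; desktop compositors and
    other processes can hold some VRAM even when vLLM is stopped.
    """
    return all(used <= baseline_mib for used in used_mib)
```

After the gateway stops the container, calling vram_released(gpu_memory_used_mib()) should return True; if it does not, the container or a leftover process is still holding GPU memory.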

Question 4

What is the difference between vllm-local and using the NVIDIA API catalog?

Accepted Answer

The vllm-local provider sends inference requests to a vLLM server running on your own hardware, so no data leaves the machine. The NVIDIA API catalog (build.nvidia.com) is a cloud endpoint where NVIDIA hosts the model, which means your prompts traverse the network to NVIDIA's infrastructure. For regulated workloads requiring data sovereignty, vllm-local is the only compliant option of the two. For development and non-sensitive workloads, the API catalog is faster to set up because there is no hardware requirement.

Question 5

How do I monitor vLLM performance under NemoClaw agent workloads?

Accepted Answer

vLLM's OpenAI-compatible server exposes Prometheus metrics on /metrics by default. Key metrics to monitor: vllm:num_requests_running (requests currently being processed), vllm:gpu_cache_usage_perc (KV-cache utilization, reported as a fraction where 1 means 100%; alert at 0.9), vllm:avg_generation_throughput_toks_per_s (tokens per second; record a baseline and track degradation), and vllm:num_requests_swapped (requests swapped to CPU; high counts indicate VRAM pressure). Metric names have changed between vLLM releases, so confirm them against the /metrics output of your installed version. Pipe these into Grafana or your existing monitoring stack.
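Before wiring up Grafana, a quick script can confirm that the metrics are present and apply the 90% cache threshold. The following is a minimal sketch, assuming the server is reachable at http://localhost:8000/metrics and that the metric names above match your vLLM version; the function names are illustrative and the parser handles only simple "name{labels} value" lines without timestamps.

```python
import urllib.request


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}.

    Labels are dropped, so if several series share a name the last one wins.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser cannot read
    return samples


def check_alerts(samples: dict[str, float], cache_threshold: float = 0.9) -> list[str]:
    """Return human-readable warnings for the conditions described above."""
    alerts = []
    cache = samples.get("vllm:gpu_cache_usage_perc")
    if cache is not None and cache >= cache_threshold:
        alerts.append(f"KV cache at {cache:.0%}")
    swapped = samples.get("vllm:num_requests_swapped")
    if swapped:
        alerts.append(f"{swapped:.0f} requests swapped to CPU")
    return alerts


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    """Fetch and parse metrics from a running server (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling check_alerts(fetch_metrics()) from a cron job or a shell loop gives a basic health check; a missing metric is silently skipped, which is worth keeping in mind if your vLLM version has renamed it.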

Tag: Local Inference

NemoClaw + vLLM: Local Inference Performance Optimization on RTX and DGX