
Example: Deploying GPT-OSS on Voltage Park

Voltage Park operates a massive fleet of high-performance NVIDIA GPUs, highly optimized for hosting open-source models like GPT-OSS. In this tutorial, we'll deploy GPT-OSS on a Voltage Park GPU server.

GPT-OSS: open-source, step-function improvement

GPT-OSS comes in two highly efficient sizes: 20B and 120B parameters. Both models can use tools (e.g., web browsing) to improve output accuracy.

The 20B parameter model, hosted on Voltage Park's cloud H100 servers with maximum inference optimizations and batch sizes, can cost as little as $0.10 per million input tokens and $0.20 per million output tokens at continuous saturation. We've also seen the 120B parameter model, hosted on Voltage Park's cloud, cost as little as $0.20 per million input tokens and $0.60 per million output tokens at continuous saturation.
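As a rough sanity check on those figures: at $1.99/GPU/hr, a $0.60-per-million output price implies the server sustains about 1.99 / 0.60 ≈ 3.3 million output tokens per hour, or roughly 920 tokens per second aggregated across concurrent requests. That's why these floor prices assume large batch sizes and continuous saturation.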

In the following guide, we'll deploy GPT-OSS 120B using both Ollama and vLLM inference engines. If you run into any issues, please reach out - we're happy to help!

Speak to an engineer

Step 1: Deploy a GPU server

Deploying a server is easy! Register for an account on our dashboard, and configure autopay with a credit/debit card to ensure you'll never run out of credits.

Head to our deploy page, and either select virtual machines or bare metal. Deploy 1x, 2x, or 4x GPU instances for smaller models like GPT-OSS, or host massive inference/training fleets with our dedicated, one-click bare metal clusters.

For this example, we'll deploy GPT-OSS on a single NVIDIA H100 GPU, at just $1.99/GPU/hr.

Drag the slider to the desired number of GPUs. We're deploying 1 GPU to run GPT-OSS 120B.

Name your server & attach your SSH key. For more details on getting started, explore our guide here: Getting Started

Then, deploy! 1x H100 instances should come online near-instantly.

Step 2: Accessing your GPU server

We'll SSH in using the provided commands. Voltage Park's H100 servers come with a preinstalled CUDA environment with all popular ML tools included.
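For example (a sketch; the exact user and address come from your dashboard, and `ubuntu` here is just an assumed default):

```bash
# Connect with the SSH key you attached at deploy time.
# Replace the user and IP with the values shown on your Voltage Park dashboard.
ssh -i ~/.ssh/id_ed25519 ubuntu@203.0.113.17

# Once connected, verify the GPU and the preinstalled CUDA environment.
nvidia-smi
```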

Step 3: Deploying GPT-OSS

There are a few popular ways to serve GPT-OSS:

  • Ollama is the easiest way to spin up an instance of GPT-OSS.

  • vLLM delivers stronger performance, with robust support for a wide range of model architectures.

  • SGLang provides a fast serving framework for LLMs and VLMs, excelling at low-latency multi-turn conversations, structured outputs, and efficient KV-cache reuse.

Now, we download the model weights through the Ollama command-line interface, which pulls from the Ollama model library:
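A minimal sketch, assuming a fresh server (the install script is Ollama's official one; `gpt-oss:120b` is the model's tag in the Ollama library):

```bash
# Install Ollama using its official install script
curl -fsSL https://ollama.com/install.sh | sh

# Pull the 120B model weights (roughly 65 GB)
ollama pull gpt-oss:120b
```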

With Voltage Park's high-speed 100G public internet bandwidth, this should take less than two minutes.

And just like that, we can run our model! Ollama presents an interactive prompt for us to play around with.
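For example:

```bash
# Start an interactive chat session with the model
ollama run gpt-oss:120b
```

Type a message at the `>>>` prompt to chat; `/bye` exits the session.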

Note: we set Ollama's host to 0.0.0.0:8000 to make the service accessible on a port that is proxied by the Voltage Park firewall. For instance, if your virtual machine comes with the following ports enabled...

then Ollama will be accessible externally at port 20155 on the external IP address of your virtual machine.
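A sketch of that configuration, assuming you start the server manually (Ollama's default bind is 127.0.0.1:11434; systemd-based installs read `OLLAMA_HOST` from the service environment instead):

```bash
# Bind Ollama to all interfaces on port 8000
export OLLAMA_HOST=0.0.0.0:8000
ollama serve

# In a second terminal (or from an allowed machine), test through the
# proxied external port. Replace EXTERNAL_IP with your VM's public address.
curl http://EXTERNAL_IP:20155/api/generate \
  -d '{"model": "gpt-oss:120b", "prompt": "Hello!", "stream": false}'
```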

Output 🎉
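To serve the same model with vLLM instead, a minimal sketch (assuming the `openai/gpt-oss-120b` checkpoint on Hugging Face; exact flags and memory headroom depend on your vLLM version):

```bash
# Install vLLM
pip install vllm

# Serve an OpenAI-compatible API on port 8000
vllm serve openai/gpt-oss-120b --host 0.0.0.0 --port 8000

# In a second terminal, query the standard chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Because both engines expose OpenAI-compatible endpoints, existing SDK clients can be pointed at your own server just by changing the base URL.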

When deploying Ollama or vLLM servers for production use, proper network security is crucial to prevent unauthorized access. We can use iptables to block unwanted connections.

Here are some example iptables rules that help secure the vLLM / Ollama environment; a consolidated script follows the list.

  • iptables -P INPUT DROP

    • This drops all incoming connections by default

  • iptables -A INPUT -p tcp --dport 22 -s YOUR_ADMIN_NETWORK/24 -j ACCEPT

    • Allow SSH access from your admin network (the /24 range you administer from)

  • iptables -A INPUT -p tcp --dport 8000 -s YOUR_ALLOWED_SERVERS/24 -j ACCEPT

    • Limit Ollama / vLLM access to your application servers only, to prevent abuse

  • iptables -A INPUT -p tcp --dport 8000 -m limit --limit 25/minute --limit-burst 100 -j ACCEPT

    • Rate limiting, also to prevent abuse

  • iptables -A INPUT -i lo -j ACCEPT

    • Enable loopback traffic for internal processes
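Putting it together, a hedged sketch of the full rule set. Ordering matters: add the ACCEPT rules before switching the default policy to DROP, or you may lock yourself out of an active SSH session. We also combine the source restriction and rate limit into a single rule, and add a conntrack rule (not in the list above) so that connections the server initiates, such as model downloads, receive their replies:

```bash
#!/usr/bin/env bash
# Example hardening script for an Ollama/vLLM host.
# Replace the placeholder networks with your own ranges before running.

ADMIN_NET="YOUR_ADMIN_NETWORK/24"   # where you SSH from
APP_NET="YOUR_ALLOWED_SERVERS/24"   # your application servers

# Enable loopback traffic for internal processes
iptables -A INPUT -i lo -j ACCEPT

# Allow replies to connections this server initiated (e.g. apt, model pulls)
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow SSH from the admin network
iptables -A INPUT -p tcp --dport 22 -s "$ADMIN_NET" -j ACCEPT

# Allow the inference port only from application servers, rate-limited
iptables -A INPUT -p tcp --dport 8000 -s "$APP_NET" \
  -m limit --limit 25/minute --limit-burst 100 -j ACCEPT

# Finally, drop everything else by default
iptables -P INPUT DROP
```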

Voltage Park network engineers can implement rules on the firewall level, too. Just ask!

Summary

In this tutorial, we deployed GPT-OSS, a leading open-source RL model, on both Ollama and vLLM. Voltage Park's H100 cloud servers are available from $1.99/GPU/hr here. You can deploy your own instance with just $10.

Appendix

History of RL models

Large language models (LLMs) have fundamentally transformed how we approach artificial intelligence applications, moving from experimental research tools to production-ready systems that power everything from customer service chatbots to code generation platforms, all within the past half decade.

From 2019-2024, foundational model companies generally focused on scaling pretraining, sharing ever-larger models trained on more and more data. Output distillation to train smaller, more-efficient models grew in popularity, too. However, there are only so many trillions of tokens to train on...

OpenAI's o1 model, launched in September 2024, introduced a new paradigm of compute: reinforcement learning. For the first time, models could "think" by generating tokens in a "scratchpad" before outputting a final result. Notably, models given tools to interact with can produce more informed, accurate outputs with fewer hallucinations.

o1 costs $15.00 per million input tokens and $60.00 per million output tokens.

In December 2024, OpenAI announced o3.

o3 costs $2.00 per million input tokens and $8.00 per million output tokens.

GPT-OSS can cost as little as $0.20 per million input tokens and $0.60 per million output tokens in your own environment, depending on batch size and saturation.
