Kubernetes
Kubernetes installation guide for Voltage Park customers
Voltage Park currently manages Kubernetes deployments via Ansible, which automates the installation and configuration of cluster nodes. These deployments provide a scalable and reproducible infrastructure across both our on-demand and reserved clusters, enabling seamless containerized workloads across multiple bare metal machines.
These playbooks ensure each Kubernetes cluster node is correctly configured, optimized, and supports GPU acceleration where applicable. The deployment is used in conjunction with our dashboard, allowing streamlined monitoring, scaling, and lifecycle management of Kubernetes environments.
Running the Ansible Playbooks
Download each of the files below, retrieve the external IPs from your Voltage Park On Demand dashboard, and update the inventory file (hosts.ini).
Inventory File
hosts.ini
[control_plane]
147.0.0.0
# K8s workers
[workers]
147.0.0.1
147.0.0.2
Ansible config file
ansible.cfg
[defaults]
interpreter_python = auto_silent
inventory = hosts.ini
host_key_checking = False
remote_user = ubuntu
[privilege_escalation]
become = True
become_method = sudo
become_ask_pass = False
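Before running anything, it is worth confirming Ansible can reach every node using the inventory and config above; a minimal sketch from the working directory:
ansible all -m ping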
The top-level main.yaml script (shown under Main Script below) runs all of the playbooks and sets up your K8s deployment.
Variables
Only one variable is passed to the playbooks: the Ansible user. Create the file as follows:
group_vars/all.yaml
ansible_user: ubuntu
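If you prefer to create the variables file from the shell, a minimal sketch (assuming the ubuntu remote user from the inventory above):
mkdir -p group_vars
printf 'ansible_user: ubuntu\n' > group_vars/all.yaml
# Note: the playbooks below also reference containerd_config.toml,
# network-operator-values.yaml, and cuda-deployment.yaml, which should
# sit in the same working directory.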
Main Script
Once all the files have been downloaded/copied to your working directory, you can run this script for a complete deploy:
main.yaml
#!/usr/bin/env ansible-playbook
---
- import_playbook: setup_kubernetes_nodes.yaml
- import_playbook: install_cuda.yaml
- import_playbook: setup_controlplane.yaml
- import_playbook: join_workers.yaml
- import_playbook: setup_operators.yaml
- import_playbook: deploy_prometheus.yaml
- import_playbook: prometheus_port_forwards.yaml
- import_playbook: deploy_kube_proxy.yaml
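Because of the ansible-playbook shebang, main.yaml can be executed directly once it is marked executable; a quick sketch of both invocations (the inventory is picked up from ansible.cfg):
chmod +x main.yaml
./main.yaml
# equivalent:
ansible-playbook main.yaml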
Initial Node setup
This first playbook automates the base setup of a VP Kubernetes (K8s) environment on all nodes, including package installations, networking configurations, essential tools, and some nice-to-haves.
Installs Python 3, Pip, and required Kubernetes Python packages.
Disables swap and configures network settings for K8s.
Opens the necessary Kubernetes ports (80, 443, 6443, 10257, 10249).
Installs the container runtime (containerd) and configures it.
Adds the Docker apt repository and sets up the NVIDIA Container Toolkit for GPU support.
Installs Helm for Kubernetes package management.
Configures Kubernetes apt repositories and installs:
kubectl
kubelet
kubeadm
setup_kubernetes_nodes.yaml
---
- name: Initial K8s System Install
  hosts: workers, control_plane
  any_errors_fatal: true
  become: yes
  gather_facts: false
  vars:
    python_packages:
      - openshift
      - pyyaml
      - kubernetes
  tasks:
    - name: Ensure Python 3 and pip are installed
      package:
        name:
          - python3
          - python3-pip
        state: present
    - name: Install required Python packages
      pip:
        name: "{{ python_packages }}"
        state: present
        executable: pip3
    - name: Disable swap
      command: swapoff -a
      register: output
    - name: Load br_netfilter
      modprobe:
        name: br_netfilter
        state: present
    - name: Bridge network settings
      lineinfile:
        path: /etc/sysctl.conf
        line: "{{ item }}"
        state: present
      with_items:
        - "net.bridge.bridge-nf-call-ip6tables = 1"
        - "net.bridge.bridge-nf-call-iptables = 1"
        - "net.ipv4.ip_forward = 1"
    - name: Open K8s ports
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        destination_ports:
          - "80"
          - "443"
          - "6443"
          - "10257"
          - "10249"
        jump: ACCEPT
    - name: Reload sysctl settings
      command: sysctl --system
    - name: Install necessary packages
      apt:
        name:
          - apt-transport-https
          - ca-certificates
          - curl
          - tmux
          - software-properties-common
          - gpg
          - wget
          - gdebi-core
          - containerd
          - jq
          - rdma-core
          - ibverbs-utils
          - perftest
          - libnl-3-dev
          - libnl-route-3-dev
          - autoconf
          - swig
          - automake
          - libltdl-dev
          - quilt
          - flex
          - bison
          - graphviz
          - gfortran
          - libgfortran5
          - libfuse2
          - tk
          - rootlesskit
        state: present
        update_cache: yes
      retries: 5
      delay: 5
    - name: Create containerd configuration directory
      file:
        path: /etc/containerd
        state: directory
        owner: root
        group: root
        mode: '0755'
    - name: Deploy containerd configuration file
      copy:
        src: containerd_config.toml
        dest: /etc/containerd/config.toml
        owner: root
        group: root
        mode: '0644'
    - name: Start containerd service
      ansible.builtin.service:
        name: containerd
        state: restarted
    - name: Add Docker repo GPG key
      command: curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
    - name: Update Docker key permissions
      command: chmod a+r /etc/apt/keyrings/docker.asc
    - name: Add Docker repo
      lineinfile:
        path: /etc/apt/sources.list.d/docker.list
        line: "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu jammy stable"
        create: yes
    - name: Ensure any old Nvidia container toolkit keys are absent
      ansible.builtin.file:
        path: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
        state: absent
    - name: Add Nvidia container toolkit GPG key
      shell: curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
      ignore_errors: True
    - name: Update Nvidia container toolkit GPG key permissions
      command: chmod a+r /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    - name: Add Nvidia container toolkit sources.list
      lineinfile:
        path: /etc/apt/sources.list.d/nvidia-container-toolkit.list
        line: "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /"
        create: yes
    - name: Update apt package index
      apt:
        update_cache: yes
    - name: Install the Nvidia Container Toolkit
      apt:
        name:
          - nvidia-container-toolkit
        state: present
        allow_unauthenticated: yes
    - name: Add Helm GPG key
      shell: >
        curl https://baltocdn.com/helm/signing.asc | gpg --dearmor -o /usr/share/keyrings/helm.gpg
      args:
        creates: /usr/share/keyrings/helm.gpg
    - name: Add Helm GPG key permissions
      command: chmod a+r /usr/share/keyrings/helm.gpg
    - name: Add Helm apt repository
      lineinfile:
        path: /etc/apt/sources.list.d/helm-stable-debian.list
        line: "deb [arch=amd64 signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main"
        create: yes
    - name: Update apt package index
      apt:
        update_cache: yes
    - name: Install Helm
      apt:
        name:
          - helm
        state: present
        allow_unauthenticated: yes
    - name: Add K8s GPG key
      shell: >
        curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
      args:
        creates: /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    - name: Add Kubernetes apt repository
      lineinfile:
        path: /etc/apt/sources.list.d/kubernetes.list
        line: "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /"
        create: yes
    - name: Update apt package index again
      apt:
        update_cache: yes
    - name: Install Kubernetes packages
      apt:
        name:
          - kubelet
          - kubeadm
          - kubectl
        state: present
        allow_unauthenticated: yes
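After this playbook completes, a quick spot check on any node should succeed (a sketch; version output will vary depending on when you install):
kubeadm version -o short
kubectl version --client
helm version --short
systemctl is-active containerd
nvidia-ctk --version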
CUDA Installation
This playbook installs CUDA 12.4 on Ubuntu 22.04, ensuring a properly configured GPU computing environment that works with the VPOD-deployed NVIDIA drivers.
Download and Configure CUDA Repository
Fetches the CUDA repository pin file for package prioritization.
Downloads the CUDA repository .deb package for version 12.4.
Installs the repository package to enable CUDA package access.
Manage CUDA Keyring
Locates the CUDA keyring file within the installed repository.
Copies the keyring file to /usr/share/keyrings/ for package verification.
Install CUDA Packages
Updates the APT package list to recognize new repositories.
Installs the CUDA Toolkit 12.4 and CUDA Sanitizer 12.4.
Create a Symbolic Link for CUDA
Ensures /usr/local/cuda points to /usr/local/cuda-12.4 for easier compatibility with applications.
install_cuda.yaml
---
- name: Install CUDA 12.4 on Ubuntu 22.04
  hosts: workers
  become: yes
  tasks:
    - name: Download CUDA repository pin file
      ansible.builtin.get_url:
        url: "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin"
        dest: "/etc/apt/preferences.d/cuda-repository-pin-600"
        mode: '0644'
      retries: 5
      delay: 5
      register: result
      until: result is succeeded
    - name: Download CUDA repository deb package
      ansible.builtin.get_url:
        url: "https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb"
        dest: "/tmp/cuda-repo-ubuntu2204-12-4-local.deb"
        mode: '0644'
      retries: 5
      delay: 5
      register: result
      until: result is succeeded
    - name: Install CUDA repository deb package
      ansible.builtin.apt:
        deb: "/tmp/cuda-repo-ubuntu2204-12-4-local.deb"
    - name: Find CUDA keyring file
      ansible.builtin.shell: "ls /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg | head -n 1"
      register: cuda_keyring
      changed_when: false
      failed_when: cuda_keyring.stdout == ""
    - name: Debug CUDA keyring file path
      ansible.builtin.debug:
        msg: "CUDA keyring file found: {{ cuda_keyring.stdout }}"
    - name: Copy CUDA GPG key file if present
      ansible.builtin.copy:
        dest: /usr/share/keyrings/
        src: "{{ cuda_keyring.stdout }}"
        remote_src: yes
    - name: Update apt package list
      ansible.builtin.apt:
        update_cache: yes
    - name: Install CUDA Toolkit 12.4
      ansible.builtin.apt:
        name: "cuda-toolkit-12-4"
        state: present
    - name: Install CUDA Sanitizer 12.4
      ansible.builtin.apt:
        name: "cuda-sanitizer-12-4"
        state: present
    - name: Create symbolic link for CUDA
      ansible.builtin.file:
        src: "/usr/local/cuda-12.4"
        dest: "/usr/local/cuda"
        state: link
        force: yes
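To verify the CUDA install on a worker, something like the following should work (the NVIDIA driver itself comes from the VPOD image rather than this playbook):
ls -l /usr/local/cuda
/usr/local/cuda/bin/nvcc --version
nvidia-smi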
Setup Control Plane
This Ansible playbook automates the setup of a Kubernetes control plane, including networking with Calico and preparation for NVIDIA GPU support via Helm.
Initialize the Kubernetes Control Plane
Runs kubeadm init with a Pod network CIDR of 192.168.0.0/16.
Ensures the cluster is only initialized if /etc/kubernetes/admin.conf does not already exist.
Configure Kubernetes Admin Access
Creates a .kube directory for the admin user.
Copies the admin.conf file to allow kubectl access.
Sets appropriate file ownership and permissions.
Install and Configure Calico (CNI Plugin)
Adds the Project Calico Helm repository.
Creates the tigera-operator namespace for the Calico installation.
Deploys Calico v3.29.1 using Helm.
Prepare NVIDIA GPU Support
Adds the NVIDIA Helm repository (https://helm.ngc.nvidia.com/nvidia).
Updates all Helm repositories to ensure the latest versions are available.
setup_controlplane.yaml
---
- name: Setup K8s Control Plane
  hosts: control_plane
  become: yes
  collections:
    - community.kubernetes
  tasks:
    - name: Initialize Kubernetes cluster
      command: kubeadm init --pod-network-cidr=192.168.0.0/16
      args:
        creates: /etc/kubernetes/admin.conf
      register: kubeadm_init
      environment:
        KUBECONFIG: /etc/kubernetes/admin.conf
    - name: Create .kube directory for admin user
      file:
        path: "/home/{{ ansible_user }}/.kube"
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0755'
    - name: Copy admin.conf to user's kube config
      copy:
        src: /etc/kubernetes/admin.conf
        dest: "/home/{{ ansible_user }}/.kube/config"
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0644'
        remote_src: yes
    - name: Set ownership of .kube directory
      file:
        path: "/home/{{ ansible_user }}/.kube"
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        recurse: yes
    - name: Ensure Helm is installed
      become_user: "{{ ansible_user }}"
      command: helm version --short
      register: helm_version
    - name: Add Project Calico Helm repository
      become_user: "{{ ansible_user }}"
      kubernetes.core.helm_repository:
        name: projectcalico
        repo_url: "https://docs.tigera.io/calico/charts"
    - name: Create tigera-operator namespace
      become_user: "{{ ansible_user }}"
      kubernetes.core.k8s:
        name: tigera-operator
        api_version: v1
        kind: Namespace
        state: present
        kubeconfig: "/home/{{ ansible_user }}/.kube/config"
    - name: Ensure Calico is installed via Helm
      become_user: "{{ ansible_user }}"
      kubernetes.core.helm:
        name: calico
        chart_ref: projectcalico/tigera-operator
        release_namespace: tigera-operator
        chart_version: v3.29.1
        state: present  # Ensures the release is installed or upgraded
        kubeconfig: "/home/{{ ansible_user }}/.kube/config"
    - name: Add Nvidia Helm repo
      become_user: "{{ ansible_user }}"
      command: helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    - name: Update Helm repo
      become: yes
      become_user: "{{ ansible_user }}"
      command: helm repo update
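A quick sanity check from the control plane, run as the ubuntu user (the calico-system namespace name assumes the default tigera-operator layout):
kubectl get nodes
kubectl get pods -n tigera-operator
kubectl get pods -n calico-system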
Join workers
This Ansible playbook automates the process of joining worker nodes to an existing Kubernetes cluster by retrieving and executing the kubeadm join command.
Retrieve kubeadm join Command from the Control Plane
Runs kubeadm token create --print-join-command on a control plane node to generate a join command.
Stores the generated command as an Ansible fact for later use.
Join Worker Nodes to the Cluster
Executes the stored join command on all worker nodes.
Ensures the worker nodes only run the command if /etc/kubernetes/kubelet.conf does not exist (to prevent rejoining).
join_workers.yaml
---
- name: Get kubeadm join command from control-plane and join worker nodes
  hosts: control_plane
  become: yes
  gather_facts: no
  tasks:
    - name: Generate kubeadm join command
      command: kubeadm token create --print-join-command
      register: join_command_raw
    - name: Set join command as a fact
      set_fact:
        kubeadm_join_cmd: "{{ join_command_raw.stdout }}"
      delegate_to: "{{ inventory_hostname }}"
      delegate_facts: true

- name: Join worker nodes to Kubernetes cluster
  hosts: workers
  become: yes
  gather_facts: no
  tasks:
    - name: Run kubeadm join command
      shell: "{{ hostvars[groups['control_plane'][0]]['kubeadm_join_cmd'] }}"
      args:
        creates: /etc/kubernetes/kubelet.conf
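From the control plane you can confirm the workers joined; they may take a minute or two to reach Ready while Calico comes up:
kubectl get nodes -o wide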
Add Nvidia Operators
This Ansible playbook automates the installation of NVIDIA GPU and Network Operators in a Kubernetes cluster, specifically targeting the control plane nodes.
Key Actions:
Copy Configuration Files
Transfers Network Operator Helm values (network-operator-values.yaml) to the control plane.
Copies a CUDA test deployment manifest (cuda-deployment.yaml) for GPU workload testing.
Check Helm Repository List
Runs helm repo list to verify Helm repository configurations.
Logs the repository list for debugging purposes.
Install NVIDIA Network Operator
Deploys the NVIDIA Network Operator v24.1.0 using Helm, creating the nvidia-network-operator namespace and applying the copied values file.
Install NVIDIA GPU Operator
Deploys the NVIDIA GPU Operator using Helm with version v24.6.2.
Creates the nvidia-gpu-operator namespace if it does not exist.
Disables the installation of GPU drivers and toolkit (driver.enabled=false, toolkit.enabled=false).
Waits for the installation to complete before proceeding.
setup_operators.yaml
---
- name: Install GPU and Network operators
  hosts: control_plane
  gather_facts: no
  become_user: ubuntu
  tasks:
    - name: Copy over Network Operator values.yaml
      copy:
        src: network-operator-values.yaml
        dest: /tmp/network-operator-values.yaml
        mode: '0755'
    - name: Copy over test deployment
      copy:
        src: cuda-deployment.yaml
        dest: /tmp/cuda-deployment.yaml
        mode: '0755'
    - name: Check Helm repo list
      command: helm repo list
      register: helm_repo_list
      ignore_errors: true
    - debug:
        var: helm_repo_list.stdout_lines
    - name: Install NVIDIA Network Operator
      shell: >
        helm install network-operator nvidia/network-operator
        -n nvidia-network-operator
        --create-namespace
        --version v24.1.0
        -f /tmp/network-operator-values.yaml
        --wait
    - name: Install NVIDIA GPU Operator
      shell: >
        helm install gpu-operator nvidia/gpu-operator
        --version v24.6.2
        -n nvidia-gpu-operator
        --create-namespace
        --set driver.enabled=false
        --set toolkit.enabled=false
        --wait
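A hedged way to confirm both operators came up and that GPUs are being advertised to the scheduler (replace <worker-node> with one of your worker hostnames):
kubectl get pods -n nvidia-network-operator
kubectl get pods -n nvidia-gpu-operator
kubectl describe node <worker-node> | grep -i 'nvidia.com/gpu'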
Deploy Prometheus and Grafana
This Ansible playbook automates the installation and configuration of GPU and Network monitoring components in a Kubernetes cluster, specifically targeting the control plane nodes. The key steps are:
Add Prometheus Helm Repository
Adds the Prometheus Community Helm Chart Repository to the system.
Install Prometheus
Creates the monitoring namespace if it does not exist.
Deploys Prometheus and related monitoring components using the kube-prometheus-stack Helm chart.
Deploy DCGM Exporter ServiceMonitor
Defines a ServiceMonitor resource for NVIDIA's DCGM Exporter, which collects GPU metrics.
Ensures the Prometheus Operator can scrape GPU metrics from the exporter every 30 seconds.
deploy_prometheus.yaml
---
- name: Install Prometheus
  hosts: control_plane
  gather_facts: no
  become_user: ubuntu
  tasks:
    - name: Add Prometheus Helm repository
      become_user: "{{ ansible_user }}"
      kubernetes.core.helm_repository:
        name: prometheus-community
        repo_url: https://prometheus-community.github.io/helm-charts
    - name: Install Prometheus
      shell: >
        kubectl create namespace monitoring || true &&
        helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring
    - name: Deploy DCGM Exporter ServiceMonitor
      kubernetes.core.k8s:
        state: present
        namespace: monitoring
        definition:
          apiVersion: monitoring.coreos.com/v1
          kind: ServiceMonitor
          metadata:
            name: dcgm-exporter
            namespace: monitoring
            labels:
              release: prometheus
          spec:
            selector:
              matchLabels:
                app.kubernetes.io/name: dcgm-exporter
            endpoints:
              - port: metrics
                path: /metrics
                interval: 30s
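To check that the monitoring stack is up and the ServiceMonitor exists, something along these lines:
kubectl get pods -n monitoring
kubectl get servicemonitor dcgm-exporter -n monitoring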
Add Port Forwards
This Ansible playbook configures systemd services that keep persistent kubectl port-forwards running on the control plane for Prometheus (port 9090) and Grafana (port 3000).
prometheus_port_forwards.yaml
- name: Setup Kubernetes Port Forwarding Services
  hosts: control_plane
  become: yes
  tasks:
    - name: Ensure kubectl is installed
      command: which kubectl
      register: kubectl_installed
      failed_when: kubectl_installed.rc != 0
    - name: Create Prometheus Port Forwarding service
      copy:
        dest: /etc/systemd/system/prometheus-port-forward.service
        content: |
          [Unit]
          Description=Kubernetes Port Forward for Prometheus
          After=network.target
          [Service]
          ExecStart=/usr/bin/kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090
          Restart=always
          User=ubuntu
          StandardOutput=syslog
          StandardError=syslog
          SyslogIdentifier=prometheus-port-forward
          [Install]
          WantedBy=multi-user.target
    - name: Create Grafana Port Forwarding service
      copy:
        dest: /etc/systemd/system/grafana-port-forward.service
        content: |
          [Unit]
          Description=Kubernetes Port Forward for Grafana
          After=network.target
          [Service]
          ExecStart=/usr/bin/kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
          Restart=always
          User=ubuntu
          StandardOutput=syslog
          StandardError=syslog
          SyslogIdentifier=grafana-port-forward
          [Install]
          WantedBy=multi-user.target
    - name: Reload systemd daemon
      command: systemctl daemon-reload
    - name: Enable and start Prometheus Port Forwarding service
      systemd:
        name: prometheus-port-forward
        enabled: yes
        state: started
    - name: Enable and start Grafana Port Forwarding service
      systemd:
        name: grafana-port-forward
        enabled: yes
        state: started
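On the control plane, you can confirm both services are running and responding (the /-/healthy path is Prometheus's built-in health endpoint):
systemctl status prometheus-port-forward grafana-port-forward
curl -s http://localhost:9090/-/healthy
curl -sI http://localhost:3000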
Setup Kube Proxy
Add the config file:
kube-proxy-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    bindAddress: 0.0.0.0
    metricsBindAddress: "0.0.0.0:10249"
And then the playbook:
deploy_kube_proxy.yaml
---
- name: Deploy kube-proxy
  hosts: control_plane
  gather_facts: no
  become_user: ubuntu
  tasks:
    - name: Copy kube-proxy config to the target machine
      copy:
        src: kube-proxy-config.yaml
        dest: /tmp/kube-proxy-config.yaml
        mode: '0644'
    - name: Apply the new kube-proxy config
      command: kubectl apply -f /tmp/kube-proxy-config.yaml
      register: kube_proxy_result
      changed_when: "'configured' in kube_proxy_result.stdout"
    - name: Restart kube-proxy DaemonSet
      command: kubectl rollout restart daemonset/kube-proxy -n kube-system
      when: kube_proxy_result.changed
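To verify the rollout and the new metrics bind address, something like this from the control plane:
kubectl -n kube-system rollout status daemonset/kube-proxy
curl -s http://localhost:10249/metrics | head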
Deploy your cluster
Once the files above have been copied, you should be able to simply run the main.yaml script to deploy:
./main.yaml
Once your playbooks have run, you can use the script below from your local machine to open an SSH tunnel to the control plane and retrieve the auto-generated Grafana admin password:
tunnel.sh
#!/bin/bash
INV_FILE="hosts.ini"
HOST=$(awk '/\[control_plane\]/ {getline; print $1}' "$INV_FILE")
ssh -fN -L 9090:localhost:9090 -L 3000:localhost:3000 ubuntu@${HOST}
pass_cmd="kubectl get secret -n monitoring prometheus-grafana -o jsonpath='{.data.admin-password}' | base64 --decode"
pass=$(ssh ubuntu@${HOST} "$pass_cmd")
echo "Grafana username: admin"
echo "Grafana password: ${pass}"
You can then navigate in a browser to localhost:9090 to access the Prometheus API, and to localhost:3000 for the Grafana server.