Kubernetes installation guide for Voltage Park customers
Voltage Park currently manages Kubernetes deployments via Ansible, which automates the installation and configuration of cluster nodes. These deployments provide a scalable and reproducible infrastructure across both our on-demand and reserved clusters, enabling seamless containerized workloads across multiple bare metal machines.
These playbooks ensure that each Kubernetes cluster node is correctly configured, optimized, and able to support GPU acceleration where applicable. The deployment is used in conjunction with our dashboard, allowing streamlined monitoring, scaling, and lifecycle management of Kubernetes environments.
Interested in a fully-managed solution? Our Kubernetes experts are ready around-the-clock to deliver a world-class experience. Chat with a solutions engineer!
Running the Ansible Playbooks
Download each of the files below, retrieve the external IPs from your Voltage Park On Demand dashboard, and update the inventory file (hosts.ini).
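The playbooks below target two inventory groups, control_plane and workers. A minimal hosts.ini might look like the sketch below; the IP addresses and the ansible_user value are placeholders, so substitute the external IPs (and SSH user) shown in your dashboard.

[control_plane]
203.0.113.10 ansible_user=ubuntu

[workers]
203.0.113.11 ansible_user=ubuntu
203.0.113.12 ansible_user=ubuntu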
This first playbook automates the base setup of a VP Kubernetes (K8s) environment on all nodes, covering package installation, networking configuration, essential tools, and a few convenience extras. It performs the following (a brief sketch of two of these steps appears after the list):
Installs Python 3, Pip, and required Kubernetes Python packages.
Disables swap and configures network settings for K8s.
Opens necessary Kubernetes ports (80, 443, 6443).
Installs container runtime (Containerd) and configures it.
Sets up Docker and NVIDIA Container Toolkit for GPU support.
Installs Helm for Kubernetes package management.
Configures the Kubernetes apt repositories and installs the core Kubernetes packages (kubelet, kubeadm, and kubectl).
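For orientation, here is a minimal sketch of what the swap and port-opening steps typically look like. The task names are illustrative, and the firewall mechanism is an assumption (the real playbook may use iptables or another module rather than ufw); treat the base playbook in your bundle as authoritative.

# Illustrative excerpt only -- not the shipped playbook.
- name: Disable swap immediately
  ansible.builtin.command: swapoff -a
  changed_when: false

- name: Keep swap disabled across reboots
  ansible.builtin.replace:
    path: /etc/fstab
    regexp: '^([^#].*\sswap\s.*)$'
    replace: '# \1'

- name: Open the Kubernetes ports (80, 443, 6443)
  community.general.ufw:              # assumed firewall module, for illustration only
    rule: allow
    port: "{{ item }}"
    proto: tcp
  loop: ["80", "443", "6443"]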
This playbook installs CUDA 12.4 on Ubuntu 22.04, ensuring a properly configured GPU computing environment that works with the NVIDIA drivers already deployed on VPOD machines.
Download and Configure CUDA Repository
Fetches the CUDA repository pin file for package prioritization.
Downloads the CUDA repository .deb package for version 12.4.
Installs the repository package to enable CUDA package access.
Manage CUDA Keyring
Locates the CUDA keyring file within the installed repository.
Copies the keyring file to /usr/share/keyrings/ for package verification.
Install CUDA Packages
Updates the APT package list to recognize new repositories.
Installs the CUDA Toolkit 12.4 and CUDA Sanitizer 12.4.
Create a Symbolic Link for CUDA
Ensures /usr/local/cuda points to /usr/local/cuda-12.4 for easier compatibility with applications.
install_cuda.yaml
---
- name: Install CUDA 12.4 on Ubuntu 22.04
  hosts: workers
  become: yes
  tasks:
    - name: Download CUDA repository pin file
      ansible.builtin.get_url:
        url: "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin"
        dest: "/etc/apt/preferences.d/cuda-repository-pin-600"
        mode: '0644'
      register: result
      retries: 5
      delay: 5
      until: result is succeeded
    - name: Download CUDA repository deb package
      ansible.builtin.get_url:
        url: "https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb"
        dest: "/tmp/cuda-repo-ubuntu2204-12-4-local.deb"
        mode: '0644'
      register: result
      retries: 5
      delay: 5
      until: result is succeeded
    - name: Install CUDA repository deb package
      ansible.builtin.apt:
        deb: "/tmp/cuda-repo-ubuntu2204-12-4-local.deb"
    - name: Find CUDA keyring file
      ansible.builtin.shell: "ls /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg | head -n 1"
      register: cuda_keyring
      changed_when: false
      failed_when: cuda_keyring.stdout == ""
    - name: Debug CUDA keyring file path
      ansible.builtin.debug:
        msg: "CUDA keyring file found: {{ cuda_keyring.stdout }}"
    - name: Copy CUDA GPG key file if present
      ansible.builtin.copy:
        dest: /usr/share/keyrings/
        src: "{{ cuda_keyring.stdout }}"
        remote_src: yes
    - name: Update apt package list
      ansible.builtin.apt:
        update_cache: yes
    - name: Install CUDA Toolkit 12.4
      ansible.builtin.apt:
        name: "cuda-toolkit-12-4"
        state: present
    - name: Install CUDA Sanitizer 12.4
      ansible.builtin.apt:
        name: "cuda-sanitizer-12-4"
        state: present
    - name: Create symbolic link for CUDA
      ansible.builtin.file:
        src: "/usr/local/cuda-12.4"
        dest: "/usr/local/cuda"
        state: link
        force: yes
Setup Control Plane
This Ansible playbook automates the setup of a Kubernetes control plane, including networking with Calico and preparation for NVIDIA GPU support via Helm. A condensed sketch of these steps appears after the list below.
Initialize the Kubernetes Control Plane
Runs kubeadm init with a Pod network CIDR of 192.168.0.0/16.
Ensures the cluster is only initialized if /etc/kubernetes/admin.conf does not already exist.
Configure Kubernetes Admin Access
Creates a .kube directory for the admin user.
Copies the admin.conf file to allow kubectl access.
Sets appropriate file ownership and permissions.
Install and Configure Calico (CNI Plugin)
Adds the Project Calico Helm repository.
Creates the tigera-operator namespace for the Calico installation.
Deploys Calico v3.29.1 using Helm.
Prepare NVIDIA GPU Support
Adds the NVIDIA Helm repository (https://helm.ngc.nvidia.com/nvidia).
Updates all Helm repositories to ensure the latest versions are available.
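The control-plane playbook itself is distributed with the rest of the bundle; the condensed sketch below shows roughly how the steps above fit together. The task names, the admin user (ubuntu), the Calico repository URL, and the use of the helm CLI via command tasks are assumptions for illustration, not the exact shipped playbook.

- name: Setup Kubernetes control plane (illustrative sketch)
  hosts: control_plane
  become: yes
  tasks:
    - name: Initialize the cluster with kubeadm
      command: kubeadm init --pod-network-cidr=192.168.0.0/16
      args:
        creates: /etc/kubernetes/admin.conf
    - name: Create the .kube directory for the admin user
      file:
        path: /home/ubuntu/.kube
        state: directory
        owner: ubuntu
        group: ubuntu
        mode: '0755'
    - name: Copy admin.conf so kubectl works for the admin user
      copy:
        src: /etc/kubernetes/admin.conf
        dest: /home/ubuntu/.kube/config
        remote_src: yes
        owner: ubuntu
        group: ubuntu
        mode: '0600'
    - name: Add the Project Calico Helm repository
      command: helm repo add projectcalico https://docs.tigera.io/calico/charts --force-update
    - name: Install Calico v3.29.1 (creates the tigera-operator namespace)
      command: >
        helm upgrade --install calico projectcalico/tigera-operator
        --version v3.29.1 --namespace tigera-operator --create-namespace
        --kubeconfig /etc/kubernetes/admin.conf
    - name: Add the NVIDIA Helm repository
      command: helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
    - name: Update Helm repositories
      command: helm repo update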
This Ansible playbook automates the process of joining worker nodes to an existing Kubernetes cluster by retrieving and executing the kubeadm join command.
Retrieve kubeadm join Command from the Control Plane
Runs kubeadm token create --print-join-command on a control plane node to generate a join command.
Stores the generated command as an Ansible fact for later use.
Join Worker Nodes to the Cluster
Executes the stored join command on all worker nodes.
Ensures the worker nodes only run the command if /etc/kubernetes/kubelet.conf does not exist (to prevent rejoining).
join_workers.yaml
---
- name: Get kubeadm join command from control-plane and join worker nodes
  hosts: control_plane
  become: yes
  gather_facts: no
  tasks:
    - name: Generate kubeadm join command
      command: kubeadm token create --print-join-command
      register: join_command_raw
    - name: Set join command as a fact
      set_fact:
        kubeadm_join_cmd: "{{ join_command_raw.stdout }}"
      delegate_to: "{{ inventory_hostname }}"
      delegate_facts: true

- name: Join worker nodes to Kubernetes cluster
  hosts: workers
  become: yes
  gather_facts: no
  tasks:
    - name: Run kubeadm join command
      shell: "{{ hostvars[groups['control_plane'][0]]['kubeadm_join_cmd'] }}"
      args:
        creates: /etc/kubernetes/kubelet.conf
Add Nvidia Operators
This Ansible playbook automates the installation of the NVIDIA GPU and Network Operators in a Kubernetes cluster, specifically targeting the control plane nodes. An illustrative Helm invocation for the GPU Operator step appears after the list below.
Key Actions:
Copy Configuration Files
Transfers Network Operator Helm values (network-operator-values.yaml) to the control plane.
Copies a CUDA test deployment manifest (cuda-deployment.yaml) for GPU workload testing.
Check Helm Repository List
Runs helm repo list to verify Helm repository configurations.
Logs the repository list for debugging purposes.
Install NVIDIA GPU Operator
Deploys the NVIDIA GPU Operator using Helm with version v24.6.2.
Creates the nvidia-gpu-operator namespace if it does not exist.
Disables the installation of GPU drivers and toolkit (driver.enabled=false, toolkit.enabled=false).
Waits for the installation to complete before proceeding.
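As a reference point, the GPU Operator installation described above amounts to roughly the following Helm invocation. The chart version, namespace, and --set flags come from the list above; wrapping it in a command task and pointing it at /etc/kubernetes/admin.conf are assumptions for illustration.

- name: Install NVIDIA GPU Operator via Helm (illustrative sketch)
  command: >
    helm upgrade --install gpu-operator nvidia/gpu-operator
    --version v24.6.2
    --namespace nvidia-gpu-operator --create-namespace
    --set driver.enabled=false
    --set toolkit.enabled=false
    --wait
    --kubeconfig /etc/kubernetes/admin.conf

Drivers and the container toolkit are disabled in the chart because they are already provided by the VPOD machines and the base setup playbook.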
This Ansible playbook automates the installation and configuration of GPU and network monitoring components in a Kubernetes cluster, specifically targeting the control plane nodes. The key steps are listed below; an illustrative ServiceMonitor manifest follows the list.
Add Prometheus Helm Repository
Adds the Prometheus Community Helm Chart Repository to the system.
Install Prometheus
Creates the monitoring namespace if it does not exist.
Deploys Prometheus and related monitoring components using the kube-prometheus-stack Helm chart.
Deploy DCGM Exporter ServiceMonitor
Defines a ServiceMonitor resource for NVIDIA's DCGM Exporter, which collects GPU metrics.
Ensures the Prometheus Operator can scrape GPU metrics from the exporter every 30 seconds.
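For context, a ServiceMonitor for the DCGM exporter generally looks like the sketch below. The 30-second interval and the monitoring namespace come from the steps above; the release label, the exporter's service label, the target namespace, and the port name are assumptions based on kube-prometheus-stack and GPU Operator defaults, so check the values shipped in your bundle.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus              # assumed Helm release name, so the Prometheus Operator selects it
spec:
  namespaceSelector:
    matchNames:
      - nvidia-gpu-operator          # assumed namespace where the DCGM exporter runs
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter      # assumed label on the exporter Service
  endpoints:
    - port: gpu-metrics              # assumed Service port name
      interval: 30s                  # scrape GPU metrics every 30 seconds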
This Ansible playbook automates the configuration of systemd services to establish persistent port forwarding for Prometheus (port 9090) and Grafana (port 3000) in a Kubernetes cluster. The full playbook is shown below.
prometheus_port_forwards.yaml
---
- name: Setup Kubernetes Port Forwarding Services
  hosts: control_plane
  become: yes
  tasks:
    - name: Ensure kubectl is installed
      command: which kubectl
      register: kubectl_installed
      failed_when: kubectl_installed.rc != 0
    - name: Create Prometheus Port Forwarding service
      copy:
        dest: /etc/systemd/system/prometheus-port-forward.service
        content: |
          [Unit]
          Description=Kubernetes Port Forward for Prometheus
          After=network.target
          [Service]
          ExecStart=/usr/bin/kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090
          Restart=always
          User=ubuntu
          StandardOutput=syslog
          StandardError=syslog
          SyslogIdentifier=prometheus-port-forward
          [Install]
          WantedBy=multi-user.target
    - name: Create Grafana Port Forwarding service
      copy:
        dest: /etc/systemd/system/grafana-port-forward.service
        content: |
          [Unit]
          Description=Kubernetes Port Forward for Grafana
          After=network.target
          [Service]
          ExecStart=/usr/bin/kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
          Restart=always
          User=ubuntu
          StandardOutput=syslog
          StandardError=syslog
          SyslogIdentifier=grafana-port-forward
          [Install]
          WantedBy=multi-user.target
    - name: Reload systemd daemon
      command: systemctl daemon-reload
    - name: Enable and start Prometheus Port Forwarding service
      systemd:
        name: prometheus-port-forward
        enabled: yes
        state: started
    - name: Enable and start Grafana Port Forwarding service
      systemd:
        name: grafana-port-forward
        enabled: yes
        state: started
---
- name: Deploy kube-proxy
  hosts: control_plane
  gather_facts: no
  become_user: ubuntu
  tasks:
    - name: Copy kube-proxy config to the target machine
      copy:
        src: kube-proxy-config.yaml
        dest: /tmp/kube-proxy-config.yaml
        mode: '0644'
    - name: Apply the new kube-proxy config
      command: kubectl apply -f /tmp/kube-proxy-config.yaml
      register: kube_proxy_result
      changed_when: "'configured' in kube_proxy_result.stdout"
    - name: Restart kube-proxy DaemonSet
      command: kubectl rollout restart daemonset/kube-proxy -n kube-system
      when: kube_proxy_result.changed
Deploy your cluster
Once the files above have been copied and the inventory updated, you can deploy by running the main.yaml playbook against your inventory:
ansible-playbook -i hosts.ini main.yaml
Then, once the playbooks have finished, you can log in to the control plane and run this command to create an SSH tunnel and retrieve the auto-generated Grafana admin password:
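With the kube-prometheus-stack defaults used above (Helm release prometheus, namespace monitoring), that typically amounts to something like the following; the secret name, the SSH user, and the control-plane address placeholder are assumptions, so defer to the exact command provided with your deployment.

# From your workstation: tunnel local port 3000 to the Grafana port-forward on the control plane
ssh -L 3000:localhost:3000 ubuntu@<control-plane-external-ip>

# On the control plane: print the auto-generated Grafana admin password
kubectl get secret prometheus-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 --decode; echo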