Kubernetes
Kubernetes installation guide for Voltage Park customers
Voltage Park currently manages Kubernetes deployments via Ansible, which automates the installation and configuration of cluster nodes. These deployments provide a scalable and reproducible infrastructure across both our on-demand and reserved clusters, enabling seamless containerized workloads across multiple bare metal machines.
These playbooks ensure each Kubernetes cluster node is correctly configured, optimized, and supports GPU acceleration where applicable. The deployment is used in conjunction with our dashboard, allowing streamlined monitoring, scaling, and lifecycle management of Kubernetes environments.
Running the Ansible Playbooks
Download each of the files below, retrieve the external IPs from your Voltage Park On Demand dashboard, and update the inventory file (hosts.ini).
Inventory File
hosts.ini
[control_plane]
147.0.0.0
# K8s workers
[workers]
147.0.0.1
147.0.0.2
Ansible config file
ansible.cfg
[defaults]
interpreter_python = auto_silent
inventory = hosts.ini
host_key_checking = False
remote_user = ubuntu
[privilege_escalation]
become = True
become_method = sudo
become_ask_pass = False
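Before running anything, it is worth confirming Ansible can reach every node using the inventory and config above; a minimal sketch from the working directory:
ansible all -m ping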
The top-level main.yaml script (shown under Main Script below) runs all of the playbooks and sets up your K8s deployment.
Variables
Only one variable is passed to the playbooks: the Ansible user. Create the file as follows:
group_vars/all.yaml
ansible_user: ubuntu
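If you prefer to create the variables file from the shell, a minimal sketch (assuming the ubuntu remote user from the inventory above):
mkdir -p group_vars
printf 'ansible_user: ubuntu\n' > group_vars/all.yaml
# Note: the playbooks below also reference containerd_config.toml,
# network-operator-values.yaml, and cuda-deployment.yaml, which should
# sit in the same working directory.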
Main Script
Once all the files have been downloaded/copied to your working directory, you can run this script for a complete deploy:
main.yaml
#!/usr/bin/env ansible-playbook
---
- import_playbook: setup_kubernetes_nodes.yaml
- import_playbook: install_cuda.yaml
- import_playbook: setup_controlplane.yaml
- import_playbook: join_workers.yaml
- import_playbook: setup_operators.yaml
- import_playbook: deploy_prometheus.yaml
- import_playbook: prometheus_port_forwards.yaml
- import_playbook: deploy_kube_proxy.yaml
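Because of the ansible-playbook shebang, main.yaml can be executed directly once it is marked executable; a quick sketch of both invocations (the inventory is picked up from ansible.cfg):
chmod +x main.yaml
./main.yaml
# equivalent:
ansible-playbook main.yaml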
Initial Node setup
This first playbook automates the base setup of a VP Kubernetes (K8s) environment on all nodes, including package installations, networking configurations, essential tools, and some nice-to-haves.
Installs Python 3, Pip, and required Kubernetes Python packages.
Disables swap and configures network settings for K8s.
Opens the necessary Kubernetes ports (80, 443, 6443, 10257, 10249).
Installs the container runtime (containerd) and configures it.
Adds the Docker apt repository and sets up the NVIDIA Container Toolkit for GPU support.
Installs Helm for Kubernetes package management.
Configures Kubernetes apt repositories and installs:
kubectl
kubelet
kubeadm
setup_kubernetes_nodes.yaml
---
- name: Initial K8s System Install
  hosts: workers, control_plane
  any_errors_fatal: true
  become: yes
  gather_facts: false
  vars:
    python_packages:
      - openshift
      - pyyaml
      - kubernetes
  tasks:
    - name: Ensure Python 3 and pip are installed
      package:
        name:
          - python3
          - python3-pip
        state: present
    - name: Install required Python packages
      pip:
        name: "{{ python_packages }}"
        state: present
        executable: pip3
    - name: Disable swap
      command: swapoff -a
      register: output
    - name: Load br_netfilter
      modprobe:
        name: br_netfilter
        state: present
    - name: Bridge network settings
      lineinfile:
        path: /etc/sysctl.conf
        line: "{{ item }}"
        state: present
      with_items:
        - "net.bridge.bridge-nf-call-ip6tables = 1"
        - "net.bridge.bridge-nf-call-iptables = 1"
        - "net.ipv4.ip_forward = 1"
    - name: Open K8s ports
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        destination_ports:
          - "80"
          - "443"
          - "6443"
          - "10257"
          - "10249"
        jump: ACCEPT
    - name: Reload sysctl settings
      command: sysctl --system
    - name: Install necessary packages
      apt:
        name:
          - apt-transport-https
          - ca-certificates
          - curl
          - tmux
          - software-properties-common
          - gpg
          - wget
          - gdebi-core
          - containerd
          - jq
          - rdma-core
          - ibverbs-utils
          - perftest
          - libnl-3-dev
          - libnl-route-3-dev
          - autoconf
          - swig
          - automake
          - libltdl-dev
          - quilt
          - flex
          - bison
          - graphviz
          - gfortran
          - libgfortran5
          - libfuse2
          - tk
          - rootlesskit
        state: present
        update_cache: yes
      retries: 5
      delay: 5
    - name: Create containerd configuration directory
      file:
        path: /etc/containerd
        state: directory
        owner: root
        group: root
        mode: '0755'
    - name: Deploy containerd configuration file
      copy:
        src: containerd_config.toml
        dest: /etc/containerd/config.toml
        owner: root
        group: root
        mode: '0644'
    - name: Start containerd service
      ansible.builtin.service:
        name: containerd
        state: restarted
    - name: Add Docker repo GPG key
      command: curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
    - name: Update Docker key permissions
      command: chmod a+r /etc/apt/keyrings/docker.asc
    - name: Add Docker repo
      lineinfile:
        path: /etc/apt/sources.list.d/docker.list
        line: "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu jammy stable"
        create: yes
    - name: Ensure any old Nvidia container toolkit keys are absent
      ansible.builtin.file:
        path: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
        state: absent
    - name: Add Nvidia container toolkit GPG key
      shell: curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
      ignore_errors: True
    - name: Update Nvidia container toolkit GPG key permissions
      command: chmod a+r /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    - name: Add Nvidia container toolkit sources.list
      lineinfile:
        path: /etc/apt/sources.list.d/nvidia-container-toolkit.list
        line: "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /"
        create: yes
    - name: Update apt package index
      apt:
        update_cache: yes
    - name: Install the Nvidia Container Toolkit
      apt:
        name:
          - nvidia-container-toolkit
        state: present
        allow_unauthenticated: yes
    - name: Add Helm GPG key
      shell: >
        curl https://baltocdn.com/helm/signing.asc | gpg --dearmor -o /usr/share/keyrings/helm.gpg
      args:
        creates: /usr/share/keyrings/helm.gpg
    - name: Add Helm GPG key permissions
      command: chmod a+r /usr/share/keyrings/helm.gpg
    - name: Add Helm apt repository
      lineinfile:
        path: /etc/apt/sources.list.d/helm-stable-debian.list
        line: "deb [arch=amd64 signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main"
        create: yes
    - name: Update apt package index
      apt:
        update_cache: yes
    - name: Install Helm
      apt:
        name:
          - helm
        state: present
        allow_unauthenticated: yes
    - name: Add K8s GPG key
      shell: >
        curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
      args:
        creates: /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    - name: Add Kubernetes apt repository
      lineinfile:
        path: /etc/apt/sources.list.d/kubernetes.list
        line: "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /"
        create: yes
    - name: Update apt package index again
      apt:
        update_cache: yes
    - name: Install Kubernetes packages
      apt:
        name:
          - kubelet
          - kubeadm
          - kubectl
        state: present
        allow_unauthenticated: yes
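After this playbook completes, a quick spot check on any node should succeed (a sketch; version output will vary depending on when you install):
kubeadm version -o short
kubectl version --client
helm version --short
systemctl is-active containerd
nvidia-ctk --version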
CUDA Installation
This playbook installs CUDA 12.4 on Ubuntu 22.04, ensuring a properly configured GPU computing environment that works with the VPOD-deployed NVIDIA drivers.
Download and Configure CUDA Repository
Fetches the CUDA repository pin file for package prioritization.
Downloads the CUDA repository .deb package for version 12.4.
Installs the repository package to enable CUDA package access.
Manage CUDA Keyring
Locates the CUDA keyring file within the installed repository.
Copies the keyring file to /usr/share/keyrings/ for package verification.
Install CUDA Packages
Updates the APT package list to recognize new repositories.
Installs the CUDA Toolkit 12.4 and CUDA Sanitizer 12.4.
Create a Symbolic Link for CUDA
Ensures /usr/local/cuda points to /usr/local/cuda-12.4 for easier compatibility with applications.
install_cuda.yaml
---
- name: Install CUDA 12.4 on Ubuntu 22.04
  hosts: workers
  become: yes
  tasks:
    - name: Download CUDA repository pin file
      ansible.builtin.get_url:
        url: "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin"
        dest: "/etc/apt/preferences.d/cuda-repository-pin-600"
        mode: '0644'
      retries: 5
      delay: 5
      register: result
      until: result is succeeded
    - name: Download CUDA repository deb package
      ansible.builtin.get_url:
        url: "https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb"
        dest: "/tmp/cuda-repo-ubuntu2204-12-4-local.deb"
        mode: '0644'
      retries: 5
      delay: 5
      register: result
      until: result is succeeded
    - name: Install CUDA repository deb package
      ansible.builtin.apt:
        deb: "/tmp/cuda-repo-ubuntu2204-12-4-local.deb"
    - name: Find CUDA keyring file
      ansible.builtin.shell: "ls /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg | head -n 1"
      register: cuda_keyring
      changed_when: false
      failed_when: cuda_keyring.stdout == ""
    - name: Debug CUDA keyring file path
      ansible.builtin.debug:
        msg: "CUDA keyring file found: {{ cuda_keyring.stdout }}"
    - name: Copy CUDA GPG key file if present
      ansible.builtin.copy:
        dest: /usr/share/keyrings/
        src: "{{ cuda_keyring.stdout }}"
        remote_src: yes
    - name: Update apt package list
      ansible.builtin.apt:
        update_cache: yes
    - name: Install CUDA Toolkit 12.4
      ansible.builtin.apt:
        name: "cuda-toolkit-12-4"
        state: present
    - name: Install CUDA Sanitizer 12.4
      ansible.builtin.apt:
        name: "cuda-sanitizer-12-4"
        state: present
    - name: Create symbolic link for CUDA
      ansible.builtin.file:
        src: "/usr/local/cuda-12.4"
        dest: "/usr/local/cuda"
        state: link
        force: yes
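To verify the CUDA install on a worker, something like the following should work (the NVIDIA driver itself comes from the VPOD image rather than this playbook):
ls -l /usr/local/cuda
/usr/local/cuda/bin/nvcc --version
nvidia-smi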
Setup Control Plane
This Ansible playbook automates the setup of a Kubernetes control plane, including networking with Calico and preparation for NVIDIA GPU support via Helm.
Initialize the Kubernetes Control Plane
Runs kubeadm init with a Pod network CIDR of 192.168.0.0/16.
Ensures the cluster is only initialized if /etc/kubernetes/admin.conf does not already exist.
Configure Kubernetes Admin Access
Creates a .kube directory for the admin user.
Copies the admin.conf file to allow kubectl access.
Sets appropriate file ownership and permissions.
Install and Configure Calico (CNI Plugin)
Adds the Project Calico Helm repository.
Creates the tigera-operator namespace for the Calico installation.
Deploys Calico v3.29.1 using Helm.
Prepare NVIDIA GPU Support
Adds the NVIDIA Helm repository (https://helm.ngc.nvidia.com/nvidia).
Updates all Helm repositories to ensure the latest versions are available.
setup_controlplane.yaml
---
- name: Setup K8s Control Plane
  hosts: control_plane
  become: yes
  collections:
    - community.kubernetes
  tasks:
    - name: Initialize Kubernetes cluster
      command: kubeadm init --pod-network-cidr=192.168.0.0/16
      args:
        creates: /etc/kubernetes/admin.conf
      register: kubeadm_init
      environment:
        KUBECONFIG: /etc/kubernetes/admin.conf
    - name: Create .kube directory for admin user
      file:
        path: "/home/{{ ansible_user }}/.kube"
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0755'
    - name: Copy admin.conf to user's kube config
      copy:
        src: /etc/kubernetes/admin.conf
        dest: "/home/{{ ansible_user }}/.kube/config"
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0644'
        remote_src: yes
    - name: Set ownership of .kube directory
      file:
        path: "/home/{{ ansible_user }}/.kube"
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        recurse: yes
    - name: Ensure Helm is installed
      become_user: "{{ ansible_user }}"
      command: helm version --short
      register: helm_version
    - name: Add Project Calico Helm repository
      become_user: "{{ ansible_user }}"
      kubernetes.core.helm_repository:
        name: projectcalico
        repo_url: "https://docs.tigera.io/calico/charts"
    - name: Create tigera-operator namespace
      become_user: "{{ ansible_user }}"
      kubernetes.core.k8s:
        name: tigera-operator
        api_version: v1
        kind: Namespace
        state: present
        kubeconfig: "/home/{{ ansible_user }}/.kube/config"
    - name: Ensure Calico is installed via Helm
      become_user: "{{ ansible_user }}"
      kubernetes.core.helm:
        name: calico
        chart_ref: projectcalico/tigera-operator
        release_namespace: tigera-operator
        chart_version: v3.29.1
        state: present  # Ensures the release is installed or upgraded
        kubeconfig: "/home/{{ ansible_user }}/.kube/config"
    - name: Add Nvidia Helm repo
      become_user: "{{ ansible_user }}"
      command: helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    - name: Update Helm repo
      become: yes
      become_user: "{{ ansible_user }}"
      command: helm repo update
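A quick sanity check from the control plane, run as the ubuntu user (the calico-system namespace name assumes the default tigera-operator layout):
kubectl get nodes
kubectl get pods -n tigera-operator
kubectl get pods -n calico-system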
Join workers
This Ansible playbook automates the process of joining worker nodes to an existing Kubernetes cluster by retrieving and executing the kubeadm join command.
Retrieve kubeadm join Command from the Control Plane
Runs kubeadm token create --print-join-command on a control plane node to generate a join command.
Stores the generated command as an Ansible fact for later use.
Join Worker Nodes to the Cluster
Executes the stored join command on all worker nodes.
Ensures the worker nodes only run the command if /etc/kubernetes/kubelet.conf does not exist (to prevent rejoining).
join_workers.yaml
---
- name: Get kubeadm join command from control-plane and join worker nodes
  hosts: control_plane
  become: yes
  gather_facts: no
  tasks:
    - name: Generate kubeadm join command
      command: kubeadm token create --print-join-command
      register: join_command_raw
    - name: Set join command as a fact
      set_fact:
        kubeadm_join_cmd: "{{ join_command_raw.stdout }}"
      delegate_to: "{{ inventory_hostname }}"
      delegate_facts: true

- name: Join worker nodes to Kubernetes cluster
  hosts: workers
  become: yes
  gather_facts: no
  tasks:
    - name: Run kubeadm join command
      shell: "{{ hostvars[groups['control_plane'][0]]['kubeadm_join_cmd'] }}"
      args:
        creates: /etc/kubernetes/kubelet.conf
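From the control plane you can confirm the workers joined; they may take a minute or two to reach Ready while Calico comes up:
kubectl get nodes -o wide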
Add Nvidia Operators
This Ansible playbook automates the installation of NVIDIA GPU and Network Operators in a Kubernetes cluster, specifically targeting the control plane nodes.
Key Actions:
Copy Configuration Files
Transfers Network Operator Helm values (network-operator-values.yaml) to the control plane.
Copies a CUDA test deployment manifest (cuda-deployment.yaml) for GPU workload testing.
Check Helm Repository List
Runs helm repo list to verify Helm repository configurations.
Logs the repository list for debugging purposes.
Install NVIDIA Network Operator
Deploys the NVIDIA Network Operator v24.1.0 using Helm, creating the nvidia-network-operator namespace and applying the copied values file.
Install NVIDIA GPU Operator
Deploys the NVIDIA GPU Operator using Helm with version v24.6.2.
Creates the nvidia-gpu-operator namespace if it does not exist.
Disables the installation of GPU drivers and toolkit (driver.enabled=false, toolkit.enabled=false).
Waits for the installation to complete before proceeding.
setup_operators.yaml
---
- name: Install GPU and Network operators
  hosts: control_plane
  gather_facts: no
  become_user: ubuntu
  tasks:
    - name: Copy over Network Operator values.yaml
      copy:
        src: network-operator-values.yaml
        dest: /tmp/network-operator-values.yaml
        mode: '0755'
    - name: Copy over test deployment
      copy:
        src: cuda-deployment.yaml
        dest: /tmp/cuda-deployment.yaml
        mode: '0755'
    - name: Check Helm repo list
      command: helm repo list
      register: helm_repo_list
      ignore_errors: true
    - debug:
        var: helm_repo_list.stdout_lines
    - name: Install NVIDIA Network Operator
      shell: >
        helm install network-operator nvidia/network-operator
        -n nvidia-network-operator
        --create-namespace
        --version v24.1.0
        -f /tmp/network-operator-values.yaml
        --wait
    - name: Install NVIDIA GPU Operator
      shell: >
        helm install gpu-operator nvidia/gpu-operator
        --version v24.6.2
        -n nvidia-gpu-operator
        --create-namespace
        --set driver.enabled=false
        --set toolkit.enabled=false
        --wait
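A hedged way to confirm both operators came up and that GPUs are being advertised to the scheduler (replace <worker-node> with one of your worker hostnames):
kubectl get pods -n nvidia-network-operator
kubectl get pods -n nvidia-gpu-operator
kubectl describe node <worker-node> | grep -i 'nvidia.com/gpu'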
Deploy Prometheus and Grafana
This Ansible playbook automates the installation and configuration of GPU and Network monitoring components in a Kubernetes cluster, specifically targeting the control plane nodes. The key steps are:
Add Prometheus Helm Repository
Adds the Prometheus Community Helm Chart Repository to the system.
Install Prometheus
Creates the monitoring namespace if it does not exist.
Deploys Prometheus and related monitoring components using the kube-prometheus-stack Helm chart.
Deploy DCGM Exporter ServiceMonitor
Defines a ServiceMonitor resource for NVIDIA's DCGM Exporter, which collects GPU metrics.
Ensures the Prometheus Operator can scrape GPU metrics from the exporter every 30 seconds.
deploy_prometheus.yaml
---
- name: Install Prometheus
  hosts: control_plane
  gather_facts: no
  become_user: ubuntu
  tasks:
    - name: Add Prometheus Helm repository
      become_user: "{{ ansible_user }}"
      kubernetes.core.helm_repository:
        name: prometheus-community
        repo_url: https://prometheus-community.github.io/helm-charts
    - name: Install Prometheus
      shell: >
        kubectl create namespace monitoring || true &&
        helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring
    - name: Deploy DCGM Exporter ServiceMonitor
      kubernetes.core.k8s:
        state: present
        namespace: monitoring
        definition:
          apiVersion: monitoring.coreos.com/v1
          kind: ServiceMonitor
          metadata:
            name: dcgm-exporter
            namespace: monitoring
            labels:
              release: prometheus
          spec:
            selector:
              matchLabels:
                app.kubernetes.io/name: dcgm-exporter
            endpoints:
              - port: metrics
                path: /metrics
                interval: 30s
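To check that the monitoring stack is up and the ServiceMonitor exists, something along these lines:
kubectl get pods -n monitoring
kubectl get servicemonitor dcgm-exporter -n monitoring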
Add Port Forwards
This Ansible playbook configures systemd services that keep persistent kubectl port-forwards running on the control plane for Prometheus (port 9090) and Grafana (port 3000).
prometheus_port_forwards.yaml
- name: Setup Kubernetes Port Forwarding Services
  hosts: control_plane
  become: yes
  tasks:
    - name: Ensure kubectl is installed
      command: which kubectl
      register: kubectl_installed
      failed_when: kubectl_installed.rc != 0
    - name: Create Prometheus Port Forwarding service
      copy:
        dest: /etc/systemd/system/prometheus-port-forward.service
        content: |
          [Unit]
          Description=Kubernetes Port Forward for Prometheus
          After=network.target
          [Service]
          ExecStart=/usr/bin/kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090
          Restart=always
          User=ubuntu
          StandardOutput=syslog
          StandardError=syslog
          SyslogIdentifier=prometheus-port-forward
          [Install]
          WantedBy=multi-user.target
    - name: Create Grafana Port Forwarding service
      copy:
        dest: /etc/systemd/system/grafana-port-forward.service
        content: |
          [Unit]
          Description=Kubernetes Port Forward for Grafana
          After=network.target
          [Service]
          ExecStart=/usr/bin/kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
          Restart=always
          User=ubuntu
          StandardOutput=syslog
          StandardError=syslog
          SyslogIdentifier=grafana-port-forward
          [Install]
          WantedBy=multi-user.target
    - name: Reload systemd daemon
      command: systemctl daemon-reload
    - name: Enable and start Prometheus Port Forwarding service
      systemd:
        name: prometheus-port-forward
        enabled: yes
        state: started
    - name: Enable and start Grafana Port Forwarding service
      systemd:
        name: grafana-port-forward
        enabled: yes
        state: started
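On the control plane, you can confirm both services are running and responding (the /-/healthy path is Prometheus's built-in health endpoint):
systemctl status prometheus-port-forward grafana-port-forward
curl -s http://localhost:9090/-/healthy
curl -sI http://localhost:3000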
Setup Kube Proxy
Add the config file:
kube-proxy-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    bindAddress: 0.0.0.0
    metricsBindAddress: "0.0.0.0:10249"
And then the playbook:
deploy_kube_proxy.yaml
---
- name: Deploy kube-proxy
  hosts: control_plane
  gather_facts: no
  become_user: ubuntu
  tasks:
    - name: Copy kube-proxy config to the target machine
      copy:
        src: kube-proxy-config.yaml
        dest: /tmp/kube-proxy-config.yaml
        mode: '0644'
    - name: Apply the new kube-proxy config
      command: kubectl apply -f /tmp/kube-proxy-config.yaml
      register: kube_proxy_result
      changed_when: "'configured' in kube_proxy_result.stdout"
    - name: Restart kube-proxy DaemonSet
      command: kubectl rollout restart daemonset/kube-proxy -n kube-system
      when: kube_proxy_result.changed
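To verify the rollout and the new metrics bind address, something like this from the control plane:
kubectl -n kube-system rollout status daemonset/kube-proxy
curl -s http://localhost:10249/metrics | head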
Deploy your cluster
Once the files above have been copied, you should be able to simply run the main.yaml script to deploy:
./main.yaml
Once your playbooks have run, you can use the script below from your local machine to open an SSH tunnel to the control plane and retrieve the auto-generated Grafana admin password:
tunnel.sh
#!/bin/bash
INV_FILE="hosts.ini"
HOST=$(awk '/\[control_plane\]/ {getline; print $1}' "$INV_FILE")
ssh -fN -L 9090:localhost:9090 -L 3000:localhost:3000 ubuntu@${HOST}
pass_cmd="kubectl get secret -n monitoring prometheus-grafana -o jsonpath='{.data.admin-password}' | base64 --decode"
pass=$(ssh ubuntu@${HOST} "$pass_cmd")
echo "Grafana username: admin"
echo "Grafana password: ${pass}"
You can then navigate in a browser to localhost:9090 to access the Prometheus API, and to localhost:3000 for the Grafana server.