.. index:: Microsoft Azure/AKS, Azure

.. _azure:

Microsoft Azure/AKS
===================================

The following walks you through my journey of setting up a CoCalc cluster on
`Microsoft Azure AKS`_. This guide was written in November 2023.
Feel free to deviate at any point from this guide to fit your needs.

.. _Microsoft Azure AKS: https://azure.microsoft.com/en-us/services/kubernetes-service/

Resource groups
------------------------------------

I've created a new one called "cocalc-onprem".

Kubernetes Cluster
--------------------------------------

To get started, we set up a :term:`Kubernetes` cluster:
"Kubernetes services" → "Create" → "Create Kubernetes cluster".

The overall goal is to set up two node pools:

* "services": 2 small nodes with 2 CPU cores and ~16 GB RAM each, for Kubernetes itself and the CoCalc services,
* "projects": 1 or more nodes for the CoCalc projects, with 4 CPU cores and more memory.

.. note:: All nodes must run Linux and have x86 CPUs.

Basics
^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Parameter
     - Value
   * - Subscription/Resource Group
     - the usual, and "cocalc-onprem"
   * - Cluster preset
     - "Dev/Test" (:term:`YMMV`)
   * - Cluster name
     - ``cocalc1``
   * - Region
     - ``East US``
   * - Availability
     - ``1, 2, 3``
   * - AKS pricing
     - ``Free``
   * - Kubernetes version
     - ``1.26.6`` (as of 2023-11-06, that's the default)
   * - Automatic upgrade
     - I picked the recommended one, i.e. ``"Enabled with patches"``
   * - Authentication and Authorization
     - Local accounts with Kubernetes RBAC

Node Pools
^^^^^^^^^^^^^

**Service node pool**

That's the first, "default" pool of nodes, where the system services for
Kubernetes will run as well. I clicked on "agentpool" and renamed it to
"services".

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Parameter
     - Value
   * - Mode
     - ``System``
   * - OS SKU
     - ``Ubuntu Linux``. I don't know if NFS mounts work with ``Azure Linux`` –
       something to experiment with later on.
   * - Availability
     - ``1, 2, 3``
   * - VM Size
     - Filtered for 2 vCPU (x86/64) and 16 GB RAM; the cheapest one was
       ``A2m_v2``. A better choice might be 4 vCPU and 16 GB RAM.
   * - Scale method
     - ``manual``
   * - Node count
     - ``2``
   * - Max pods per node
     - 110 (default)
   * - Public IP per node
     - ``No``
   * - Kubernetes Label
     - ``cocalc-role=services``

**Project node pool**

Clicking on "+ add node pool":

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Parameter
     - Value
   * - Name
     - ``projects``
   * - Mode
     - ``User``
   * - OS SKU
     - ``Ubuntu Linux``
   * - Availability
     - ``1, 2, 3``
   * - Spot Instances
     - ``Yes`` (cheaper, but randomly interrupts projects, and you have to make
       sure you have enough quota)
   * - Spot Type
     - ``Capacity only``
   * - Spot Policy
     - ``Delete``
   * - Spot VM Size
     - ``D4s_v3`` (4 vCPU (x86/64) and 16 GB RAM). A better choice might be
       4 vCPU and 32 GB RAM.
   * - Public IP per node
     - ``No``
   * - Scale method
     - ``manual``
   * - Node count
     - ``1``
   * - Node drain timeout
     - ``5`` (rather impatient)
   * - Kubernetes Label
     - ``cocalc-role=projects``
   * - Kubernetes Taints
     - .. list-table::

          * - **Key**
            - **Value**
            - **Effect**
          * - cocalc-projects-init
            - false
            - NoExecute
          * - cocalc-projects
            - init
            - NoSchedule

Networking
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Parameter
     - Value
   * - Private cluster
     - no (but maybe you know how to set this up)
   * - Network configuration
     - ``kubenet`` (default)
   * - Bring your own virtual network
     - ``Yes`` (which opened up a few defaults)

       - Virtual network: ``cocalc-onprem-vnet``
       - Subnet: ``default`` (a ``/16``)
   * - DNS name prefix
     - ``cocalc1-dns`` (default)
   * - Network policy
     - ``Calico`` (default)

Integrations
^^^^^^^^^^^^^

- I disabled "Defender".
- No container registry (images are external, but maybe you'll mirror them internally).
- Azure Monitor: ``Off`` (I don't know the pricing of this, and I bet this can be enabled later on).
- Alerting: kept the defaults.
- Azure policy: Disabled, since I don't know what this is.

Advanced
^^^^^^^^^^^^^

I kept the defaults. In particular, it tells me it will set up a specifically
named infrastructure resource group. Ok.

Tags
^^^^^^^^^^^^^

none

Validation/Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

"Validation in progress" took a minute or so. Creation as well.

Connecting
^^^^^^^^^^^^^

I was then able to open the cloud shell in Azure's web interface (bash) and run::

   az aks get-credentials --resource-group cocalc-onprem --name cocalc1

This added credentials to my ``~/.kube/config`` file, and I was able to run
``kubectl get nodes`` to see the three nodes, and ``kubectl get pods -A`` to
see the system pods running. Success!

.. code:: bash

   [ ~ ]$ kubectl get nodes
   NAME                               STATUS   ROLES   AGE   VERSION
   aks-projects-24080581-vmss000000   Ready    agent   34m   v1.26.6
   aks-services-47947981-vmss000002   Ready    agent   49m   v1.26.6
   aks-services-47947981-vmss000003   Ready    agent   49m   v1.26.6

   [ ~ ]$ k get pods -A
   NAMESPACE       NAME                                       READY   STATUS      RESTARTS   AGE
   aks-command     command-8a584164af9543f6ae7da87fde59b667   0/1     Completed   0          4m3s
   calico-system   calico-kube-controllers-679bc4d8d7-8g2sw   1/1     Running     0          26h
   calico-system   calico-node-7csgp                          1/1     Running     0          34m
   calico-system   calico-node-f9jhp                          1/1     Running     0          49m
   calico-system   calico-node-kbsww                          1/1     Running     0          49m
   calico-system   calico-typha-77669b8d96-6pf4r              1/1     Running     0          34m
   calico-system   calico-typha-77669b8d96-dqmcq              1/1     Running     0          26h
   kube-system     cloud-node-manager-4dhz4                   1/1     Running     0          34m
   kube-system     cloud-node-manager-bbfhv                   1/1     Running     0          49m
   kube-system     cloud-node-manager-gzkfk                   1/1     Running     0          49m
   [...]

Ref: `Connect to an AKS cluster `_

.. _aks-namespace-cocalc:

Namespace "cocalc"
^^^^^^^^^^^^^^^^^^^^^^^^^

The :term:`Namespace` throughout this documentation is ``cocalc``.
Here, we create it and switch to it.

.. code::

   kubectl create namespace cocalc
   kubectl config set-context --current --namespace=cocalc

Tweaking AKS behavior
--------------------------------------

.. warning::

   For reasons I don't understand, AKS disallows changing the node taints of
   the project pool. This renders the :ref:`prepull` feature useless, since it
   relies on changing taints. A "trick" is to disable a hook via::

      kubectl get ValidatingWebhookConfiguration aks-node-validating-webhook -o yaml | sed -e 's/\(objectSelector: \){}/\1{"matchLabels": {"disable":"true"}}/g' | kubectl apply -f -

   Ref.: `GitHub issue 2934 `_

PostgreSQL
--------------------------------------

We use the "Azure Database for PostgreSQL Flexible Server" service to create a
small database. The "flexible server" variant seems to be the newer one, while
the non-flexible one is deprecated.

Below are just the bare minimum parameters I chose for testing.
For production, a small default choice is probably fine.

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Parameter
     - Value
   * - Subscription and Resource group
     - same as above: "cocalc-onprem"
   * - Name
     - ``cocalc1``
   * - Region
     - ``East US`` (same as the K8S cluster)
   * - Version
     - ``15`` (default)
   * - Workload
     - ``Development``. In general, the load should not exceed 1 core and
       should fit within a few GB of RAM.
   * - High availability
     - ``No``, though :term:`YMMV`
   * - Authentication
     - ``PostgreSQL only (or both)``

       - Username: ``cocalc``
       - Password: ``[secret]``
   * - Networking
     - - Private access (more secure, and you do not need a public IP)
       - Virtual network: the existing one, same as the cluster above
       - Subnet: created a new one, ``db``, ``10.225.0.0/24``.
         I wasn't able to do this from here, but clicked on
         "Manage selected Virtual Network" and added that subnet from there.
         There, I also delegated it to
         ``Microsoft.DBforPostgreSQL/flexibleServers``.
   * - Security
     - defaults
   * - Tags
     - empty
   * - Review + Create
     - Price is about $16 per month

Once it has been created, "Settings/Connect" told me that ``PGHOST`` is
``cocalc1.postgres.database.azure.com`` and so on. Set those parameters in the
:ref:`deployment configuration` under ``global.database`` to tell CoCalc how
to connect to the database.

.. note::

   You might wonder why I created a new subnet. I tried to set up a private
   link to the database, but I just got an error that this kind of
   sub-resource is not supported. In turn, this ``db`` subnet is not used
   elsewhere. The "magic" seems to be the subnet delegation.

.. warning::

   There is no SSL encryption, hence you have to disable the requirement for
   it. To do so, open the just-created database, go to
   Settings/**Server Parameters**, flip ``require_secure_transport`` to
   ``Off``, and save.

Ref:

* `Microsoft: PostgreSQL flexible server parameters `_
* `StackOverflow: no-pg-hba-conf-entry-for-host `_

Storage/Files
--------------------------------------

The goal of this aspect is to create a managed shared file system, which we'll
mount into the Kubernetes pods as a :term:`ReadWriteMany` :term:`NFS`
file system. Any storage solution that supports this should work. This guide
uses the `Azure Files`_ solution to set up `Azure NFS`_.

.. note::

   As an alternative to the setup below, you can also run your own NFS server:
   `Use a Linux NFS Server with AKS`_.
   If you go down this route, continue where the PV/PVC setup is explained:
   :ref:`aks-storage-pv-pvc`.

Basics
^^^^^^^^^^^^^

To get started: "Storage accounts" → "Create"

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Parameter
     - Value
   * - Subscription and Resource group
     - same as above: "cocalc-onprem"
   * - Name
     - ``cocalc1`` (globally unique)
   * - Region
     - ``East US`` (same as the cluster and DB)
   * - Performance
     - ``Premium`` (required for NFS)
   * - Premium account type
     - ``File shares``
   * - Redundancy
     - ``Locally-redundant storage (LRS)``

Advanced
^^^^^^^^^^^^^

I don't know much about these options, hence I kept them as they are.

Networking
^^^^^^^^^^^^^

* Network connectivity: disable public access / enable private access
* Add private endpoint:

  - Name: ``cocalccloud1files``
  - Storage sub-resource: ``file``
  - Virtual network: picked the one of the K8S cluster
  - Subnet: ``default``, i.e. the same as the K8S cluster above
  - Integrated with private DNS zone: ``yes``

* Microsoft network routing (the default)

Data protection and Encryption
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Kept as is. However, later, once deployed, "Secure transfer required" must be
disabled, because NFS does not support it. (You can change this later in
Settings → Configuration.)

Tags: none

File share
^^^^^^^^^^^^^

After this was deployed, there is a menu entry "File share" to create a new one.

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Parameter
     - Value
   * - Name
     - ``cocalc-onprem-1``
   * - Provisioned capacity
     - ``100 GB`` (minimum)
   * - Protocol
     - ``NFS``
   * - Root Squash
     - ``No``

Ref: `Mount NFS Azure file share on Linux `_

I got an error about "Secure transfer required" and disabled it. Once that was
done, the NFS file share panel showed me instructions on how to install the
``nfs-common`` package on Linux, and how to mount this share. Looks good!

Network access
^^^^^^^^^^^^^^^^^^^

Make sure there is a private endpoint for the file share. If there is none,
open the ``cocalc-onprem-1`` file share and add a private endpoint, which
allows you to set up access:

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Parameter
     - Value
   * - Name
     - ``cocalc-onprem-files``
   * - Interface
     - same as above with ``-nic`` appended
   * - Resource
     - ``file``
   * - Virtual Network
     - the one of the cluster, and the subnet of the AKS cluster
   * - Dynamically allocate IP address
     - ``Yes`` (although static might be better)
   * - Private DNS
     - ``Yes``
   * - Tags
     - none

I opened that new "Private endpoint", and under Settings/DNS configuration I
saw that there is an internal IP address, and under FQDN there is
``cocalccloud1.file.core.windows.net``. In the next step, that's the name of
the NFS server.
When this doesn't work: I got this error when trying to mount the share from
within a pod in the Kubernetes cluster::

   mount.nfs: access denied by server while mounting cocalccloud1.file.core.windows.net:/cocalccloud1/cocalc-onprem-1

What helped me, essentially:

- `Mount NFS in a Virtual Machine `_
- `Private Endpoints `_

.. _aks-storage-pv-pvc:

Kubernetes PV/PVC
^^^^^^^^^^^^^^^^^^^^^^^^^

Once we have the NFS server running, we create three
:term:`PersistentVolumes ` and corresponding :term:`PersistentVolumeClaims `.
We expand on the second part of `Use a Linux NFS Server with AKS`_.

Specific to CoCalc, we need two PVCs: one for the data of all projects and one
for the globally shared software. We mount them from subdirectories of the NFS
file share (this doesn't need to be the case, but why not...), which will be
``/data`` and ``/software``. The names used here are the default values –
:ref:`otherwise specify them `.

.. warning::

   Since we set up the PVCs on our own, you have to tell CoCalc not to create
   them. That's the ``storage.create: false`` setting in :ref:`config-storage`.

The way I created this is by setting up a third PV/PVC pair called
``root-data``. Then I run a small setup :term:`Job `, which creates these
subdirectories and fixes the ownership and permissions. Make sure you deploy
this in :ref:`your namespace `.

.. note:: You have to change the NFS server name and path to match your setup.

We start by defining the :term:`PVs `: download
:download:`pv-nfs.yaml <../_files/pv-nfs.yaml>`, edit it, and then add it via
``kubectl apply -f pv-nfs.yaml``.

To figure out the server and path, open the file share and look at the
"Connect from Linux" information. There is a line for the mount command, which
is composed of::

   [FQDN of server]:/[storage account name]/[file share name]

In ``pv-nfs.yaml`` below, the ``server`` is the FQDN of the server. The
``path`` is the remainder, i.e. ``/[storage account name]/[file share name]``
for the "root" PV, while the others have additional subdirectories appended,
i.e. ``.../data`` and ``.../software``.

.. literalinclude:: ../_files/pv-nfs.yaml
   :language: yaml
   :linenos:

Next up, we define the corresponding :term:`PVCs `: download
:download:`pvc-cocalc.yaml <../_files/pvc-cocalc.yaml>`, edit it, and then add
it via ``kubectl apply -f pvc-cocalc.yaml``. Both names match the default
values – :ref:`otherwise specify them `.

.. literalinclude:: ../_files/pvc-cocalc.yaml
   :language: yaml
   :linenos:

Finally, we run the following :term:`Kubernetes Job ` to create these
subdirectories and fix the permissions. Download
:download:`storage-setup-job.yaml <../_files/storage-setup-job.yaml>` and run
it via ``kubectl apply -f storage-setup-job.yaml``. Once it has worked – check
this via ``kubectl describe job storage-setup-job`` – you can delete the job
via ``kubectl delete -f storage-setup-job.yaml``. You can also check its log
via ``kubectl logs [job's pod name]``: it should say something like
``setting up data and software done``. This also confirms that the NFS server
is working and can be mounted from within the cluster.

.. literalinclude:: ../_files/storage-setup-job.yaml
   :language: yaml
   :linenos:

.. note::

   If you ever need to perform manual actions on the files, comment/uncomment
   the command lines to run ``sleep infinity``, and uncomment the
   ``securityContext`` section to run under a specific user id. Deploy the job
   – it will continue to run – then get the pod and exec into it via
   ``kubectl exec -it [pod name] -- /bin/bash``. The root mount point is
   ``/nfs``. Afterwards, delete the job again.

Ingress
--------------------------------------

We also need a way to route internet traffic via a :term:`LoadBalancer` to the
corresponding service in our Kubernetes cluster.
This follows the documentation for an `unmanaged NGINX Ingress controller `_
and loads the configuration from the ``ingress-nginx/values.yaml`` file,
included in this repository.

To deploy this, make sure you're in :ref:`your namespace `, then install the
:term:`PriorityClass ` as explained in ``ingress/README.md``, and then deploy
it. Regarding the Helm chart version number below: please double-check it
against what is listed in the :doc:`../setup` notes – it might be outdated.

.. code:: bash

   kubectl apply -f ingress-nginx/priority-class.yaml
   helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
   helm repo update
   helm install ingress-nginx ingress-nginx/ingress-nginx \
     --version 4.8.3 \
     --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
     -f ingress-nginx/values.yaml

Then I checked that the controller pods are running (``kubectl get pods -n cocalc``)
and confirmed that there is a :term:`LoadBalancer` with an external IP:

.. code::

   $ kubectl get svc ingress-nginx-controller
   NAME                       TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)                      AGE
   ingress-nginx-controller   LoadBalancer   10.0.193.93   XX.XXX.XXX.XX   80:30323/TCP,443:30228/TCP   5m35s

... and I also checked that the two ingress controller pods are running:

.. code::

   $ kubectl get deploy ingress-nginx-controller
   NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
   ingress-nginx-controller   2/2     2            2           72s

Certificate Manager
--------------------------------------

Afterwards, install the certificate manager according to the notes in
``letsencrypt/README.md``, or do your own setup. More details in
`Azure/TLS/cert-manager `_.

The final step is to register that external IP address in your DNS server.
That DNS name is then used in the :ref:`deployment configuration` under
``global.dns``.

(I also found notes about "tagging" that IP address and then adding a CNAME
record at your DNS provider. That's more robust in case the LoadBalancer is
re-created and the IP address changes.)

Next steps ...
---------------

Ok. At this point we have a cluster, a database, and an NFS file share with
two pre-configured subdirectories. The next steps are:

* :ref:`private-docker-registry` setup, to let the cluster know about the
  credentials for the Docker registry,
* setting up the :ref:`database password as a secret `,
* and following the :doc:`../deployment` notes to configure everything else
  and to actually install CoCalc itself.

Here is a starting point for your :ref:`my-values.yaml ` configuration.
Download :download:`azurecalc.yaml <../_files/azurecalc.yaml>` and edit it.

.. literalinclude:: ../_files/azurecalc.yaml
   :language: yaml
   :linenos:

.. _Azure Files: https://azure.microsoft.com/en-us/products/storage/files/
.. _Azure NFS: https://learn.microsoft.com/en-us/azure/storage/files/files-nfs-protocol
.. _Use a Linux NFS Server with AKS: https://learn.microsoft.com/en-us/azure/aks/azure-nfs-volume