Microsoft Azure/AKS
The following walks you through my journey of setting up a CoCalc cluster on Microsoft Azure AKS. This guide was written in November 2023. Feel free to deviate at any point from this guide to fit your needs.
Resource groups
I’ve created a new one called “cocalc-onprem”.
Kubernetes Cluster
To get started, we set up a Kubernetes cluster: in “Kubernetes services” → “Create” → “Create Kubernetes cluster”.
The overall goal is to set up two node pools:
“services”: 2x small nodes with 2 CPU cores and ~16 GB RAM each, for Kubernetes itself and CoCalc services,
“projects”: 1x or more for the CoCalc projects, with 4 CPU cores and more memory.
Note
All nodes must run Linux and have x86 CPUs.
Basics
Parameter | Value |
---|---|
Subscription/Resource Group | the usual, and “cocalc-onprem” |
Cluster preset | “Dev/Test” (YMMV) |
Cluster name | |
Region | |
Availability | |
AKS pricing | |
Kubernetes version | |
Automatic upgrade | I picked the recommended one, i.e. |
Authentication and Authorization | Local accounts with Kubernetes RBAC |
Node Pools
Service node pool
This is the first “default” pool of nodes, where the system services for Kubernetes will also run. I clicked on “agentpool” and renamed it to “services”.
Parameter | Value |
---|---|
Mode | |
OS SKU | |
Availability | |
VM Size | Filtered for 2 vCPU (x86/64), 16 GB RAM and The cheapest one was |
Scale method | |
Node count | |
Max pods per node | 110 (default) |
Public IP per node | |
Kubernetes Label | |
Project node pool
Clicking on “+ add node pool”
Parameter | Value |
---|---|
Name | |
Mode | |
OS SKU | |
Availability | |
Spot Instances | |
Spot Type | |
Spot Policy | |
Spot VM Size | A better choice might be 4 vCPU and 32 GB RAM. |
Public IP per node | |
Scale method | |
Node count | |
Node drain timeout | |
Kubernetes Label | |
Kubernetes Taints | |
Networking
Parameter | Value |
---|---|
Private cluster | no (but maybe you know how to set this up) |
Network configuration | |
Bring your own virtual network | |
DNS name prefix | |
Network policy | |
Integrations
I disabled “Defender”
No container registry (images are external, but maybe you’ll mirror them internally)
Azure Monitor: Off (I don’t know the pricing of this, and I bet this can be enabled later on)
Alerting: kept the defaults
Azure policy: Disabled, since I don’t know what this is
Advanced
I kept the defaults. In particular, it tells me to set up a specifically named infrastructure resource group. Ok.
Validation/Creation
“Validation in progress”: took a minute or so. Creation as well.
Connecting
I was then able to open the cloud shell in Azure’s web interface (bash) and run:
az aks get-credentials --resource-group cocalc-onprem --name cocalc1
This added credentials to my ~/.kube/config file, and I was able to run kubectl get nodes to see the three nodes, and kubectl get pods -A to see the system pods running. Success!
[ ~ ]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-projects-24080581-vmss000000 Ready agent 34m v1.26.6
aks-services-47947981-vmss000002 Ready agent 49m v1.26.6
aks-services-47947981-vmss000003 Ready agent 49m v1.26.6
[ ~ ]$ k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
aks-command command-8a584164af9543f6ae7da87fde59b667 0/1 Completed 0 4m3s
calico-system calico-kube-controllers-679bc4d8d7-8g2sw 1/1 Running 0 26h
calico-system calico-node-7csgp 1/1 Running 0 34m
calico-system calico-node-f9jhp 1/1 Running 0 49m
calico-system calico-node-kbsww 1/1 Running 0 49m
calico-system calico-typha-77669b8d96-6pf4r 1/1 Running 0 34m
calico-system calico-typha-77669b8d96-dqmcq 1/1 Running 0 26h
kube-system cloud-node-manager-4dhz4 1/1 Running 0 34m
kube-system cloud-node-manager-bbfhv 1/1 Running 0 49m
kube-system cloud-node-manager-gzkfk 1/1 Running 0 49m
[...]
Namespace “cocalc”
The namespace used throughout this documentation is cocalc. Here, we create it and switch to it.
kubectl create namespace cocalc
kubectl config set-context --current --namespace=cocalc
Tweaking AKS behavior
Warning
For reasons I don’t understand, AKS disallows changing node taints of the project pool. This renders the Prepull feature useless, since it relies on changing taints.
A “trick” is to disable a hook via:
kubectl get ValidatingWebhookConfiguration aks-node-validating-webhook -o yaml | sed -e 's/\(objectSelector: \){}/\1{"matchLabels": {"disable":"true"}}/g' | kubectl apply -f -
Ref.: GitHub issue 2934
PostgreSQL
We use the “Azure Database for PostgreSQL Flexible Server” service to create a small database. The “flexible server” variant seems to be the newer one, while the non-flexible one is deprecated. Below are just the bare minimum parameters I chose for testing. For production, a small default choice is probably fine.
Parameter | Value |
---|---|
Subscription and Resource group | same as above: “cocalc-onprem” |
Name | |
Region | |
Version | |
Workload | |
High availability | |
Authentication | |
Networking | |
Security | defaults |
Tags | empty |
Review + Create | Price is about $16 per month |
Once it had been created, “Settings/Connect” told me that PGHOST is cocalc1.postgres.database.azure.com and so on. Set those parameters in the deployment configuration under global.database to tell CoCalc how to connect to the database.
Note
You might wonder why I created a new subnet. I tried to set up a private link to the database, but I just got an error saying that this kind of sub-resource is not supported. In turn, this db subnet is not used elsewhere. The “magic” seems to be this subnet delegation.
Warning
There is no SSL encryption, hence you have to disable the server-side requirement for it. For that, open the newly created database, go to Settings/Server Parameters, flip require_secure_transport to Off, and save it.
Ref:
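The same parameter can most likely be changed from the cloud shell instead of the portal. The following is only a sketch: it assumes the resource group cocalc-onprem and a server called cocalc1 (derived from the PGHOST shown above), and the function name pg_disable_secure_transport is made up for this example.

```shell
# Hypothetical helper (not part of CoCalc): turn off the SSL requirement
# of an Azure PostgreSQL flexible server via the Azure CLI.
# Arguments: resource group, server name (both are assumptions, adjust them).
pg_disable_secure_transport() {
  local resource_group="$1" server_name="$2"
  az postgres flexible-server parameter set \
    --resource-group "$resource_group" \
    --server-name "$server_name" \
    --name require_secure_transport \
    --value off
}
```

Call it as pg_disable_secure_transport cocalc-onprem cocalc1 and double check the result in Settings/Server Parameters afterwards.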
Storage/Files
The goal of this step is to create a managed shared file-system, which we’ll mount into the Kubernetes pods as a ReadWriteMany NFS file-system. Any storage solution that supports this should work. This guide uses Azure Files to set up Azure NFS.
Note
As an alternative to the setup below, you can also run your own NFS server: Use a Linux NFS Server with AKS.
If you go down this route, continue at Kubernetes PV/PVC, where the PV/PVC setup is explained.
Basics
To get started: “Storage accounts” → “Create”
Parameter | Value |
---|---|
Subscription and Resource group | same as above: “cocalc-onprem” |
Name | |
Region | |
Performance | |
Premium account type | |
Redundancy | |
Advanced
I don’t know much about the options, hence I kept them as they are.
Networking
Network connectivity: Disable public access / enable private access
Add private endpoint:
Name: cocalccloud1files
Storage sub-resource: file
Virtual network: picked the one of the K8S cluster
Subnet: default, i.e. the same as the K8S cluster above
Integrated with private DNS zone: yes
Microsoft network routing (the default)
Data protection and Encryption
Kept as it is.
Although, later once deployed, “Secure transfer required” must be disabled, because NFS does not support that. (You can change this later in Settings → Configuration).
Tags: none
Network access
Make sure there is a private endpoint for the file share.
Otherwise, I opened that cocalc-onprem-1 file share and added a private endpoint, where it allowed me to set up access:
Parameter | Value |
---|---|
Name | |
Interface | same as above with |
Resource | |
Virtual Network | the one of the cluster, and the subnet of the aks cluster |
Dynamically allocate IP address | |
Private DNS | |
Tags | none |
I opened that new “Private endpoint” and under Settings/DNS configuration, I saw an internal IP address and, under FQDN, cocalccloud1.file.core.windows.net. In the next step, that’s the name of the NFS server.
When this didn’t work, I got the following error when trying to mount from within a pod in the Kubernetes cluster:
mount.nfs: access denied by server while mounting cocalccloud1.file.core.windows.net:/cocalccloud1/cocalc-onprem-1
What helped me, essentially:
Kubernetes PV/PVC
Once we have the NFS server running, we create three PersistentVolumes and corresponding PersistentVolumeClaims. We expand on the second part of Use a Linux NFS Server with AKS.
Specific to CoCalc, we need two PVCs: one for the data of all projects and one for the globally shared software. We mount them from subdirectories of the NFS file share (doesn’t need to be the case, but why not…), which will be /data and /software. The names used here are the default values – otherwise specify them.
Warning
Since we set up the PVCs on our own, you have to tell CoCalc not to create them. That’s the storage.create: false setting in Storage.
The way I created this is by setting up a third PV/PVC pair called root-data. Then, I ran a small setup Job, which creates these subdirectories and fixes the ownership and permissions. Make sure you deploy this in your namespace.
Note
You have to change the NFS server name and path to match your setup.
We start by defining the PVs: download pv-nfs.yaml, edit it, and then add it via kubectl apply -f pv-nfs.yaml.
To figure out server and path, open the file share and look at the “Connect from Linux” information. There is a line for the mount command. This is composed of:
[FQDN of server]:/[storage account name]/[file share name]
Below in pv-nfs.yaml, the server is the FQDN of the server. The path is the remainder, i.e. /[storage account name]/[file share name] for the “root” PV, and the others have additional subdirectories appended, i.e. .../data and .../software.
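To make the composition above concrete, here is a small sketch in plain shell that splits such a mount source into the server and path values. The example value and the variable names are assumptions for illustration; copy the real string from the “Connect from Linux” information of your file share.

```shell
# Split "[FQDN of server]:/[storage account name]/[file share name]"
# into the two values needed in pv-nfs.yaml.
# The value below is only an example, replace it with your own.
mount_source="cocalccloud1.file.core.windows.net:/cocalccloud1/cocalccloud1"

nfs_server="${mount_source%%:*}"   # everything before the first ":"
nfs_path="${mount_source#*:}"      # everything after the first ":"

echo "server:        $nfs_server"
echo "root path:     $nfs_path"
echo "data path:     $nfs_path/data"
echo "software path: $nfs_path/software"
```

The three printed paths correspond to the “root”, “data” and “software” PVs, in that order.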
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cocalccloud1-root
  labels:
    type: nfs
    aspect: root
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: "cocalccloud1.file.core.windows.net" # your NFS server
    path: "/cocalccloud1/cocalccloud1" # file share without a subdirectory
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cocalccloud1-data
  labels:
    type: nfs
    aspect: data
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: "cocalccloud1.file.core.windows.net" # your NFS server
    path: "/cocalccloud1/cocalccloud1/data" # the file share name and subdirectory
  mountOptions:
    - noacl
    - noatime
    - nodiratime
    - acregmin=30 # bit of a tradeoff between performance and consistency
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cocalccloud1-software
  labels:
    type: nfs
    aspect: software
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: "cocalccloud1.file.core.windows.net" # your NFS server
    path: "/cocalccloud1/cocalccloud1/software" # the file share name and subdirectory
  mountOptions:
    - noacl
    - noatime
    - nodiratime
    - acregmin=600 # we only expect rare changes
Next up, we define the corresponding PVCs: download pvc-cocalc.yaml, edit it, and then add it via kubectl apply -f pvc-cocalc.yaml. Both names match the default values – otherwise specify them.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: root-data # only used by the setup Job
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      type: nfs
      aspect: root
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: projects-data # default name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      type: nfs
      aspect: data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: projects-software # default name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      type: nfs
      aspect: software
Finally, we run the following Kubernetes Job to create these subdirectories and fix the permissions. Download storage-setup-job.yaml and run it via kubectl apply -f storage-setup-job.yaml. Once it has worked – check this via kubectl describe job storage-setup-job – you can delete the job via kubectl delete -f storage-setup-job.yaml. You can also check its log via kubectl logs [job's pod name]: it should say something like setting up data and software done. This also confirms that the NFS server is working and can be mounted from within the cluster.
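Instead of polling kubectl describe by hand, the check can be scripted. This is a minimal sketch, assuming the Job name storage-setup-job from the file below and the cocalc namespace; the function name and the 180 second timeout are arbitrary choices for this example.

```shell
# Hypothetical helper: block until the setup Job has completed,
# then print its log. The namespace defaults to "cocalc".
wait_for_storage_setup() {
  local namespace="${1:-cocalc}"
  kubectl wait --namespace "$namespace" \
    --for=condition=complete job/storage-setup-job --timeout=180s &&
  kubectl logs --namespace "$namespace" job/storage-setup-job
}
```

Run wait_for_storage_setup after applying the Job. If the wait times out, the NFS mount is the first thing to look at.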
apiVersion: batch/v1
kind: Job
metadata:
  name: "storage-setup-job"
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: "storage-setup-job"
          image: "ubuntu:24.04"
          command:
            - bash
            - "-c"
            # make sure the directories exist and set the permissions to 2001:2001 – the UID:GID of CoCalc users operating on these files
            - "cd /nfs && mkdir -p data software && chown -R 2001:2001 data software && chmod a+rwx data software && echo 'setting up data and software done'"
            #- "sleep infinity"
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
          volumeMounts:
            - mountPath: /nfs
              name: root-data

          ## uncomment this to run exactly as the user of CoCalc projects and associated services
          #securityContext:
          #  runAsGroup: 2001
          #  runAsUser: 2001

      dnsPolicy: ClusterFirst
      terminationGracePeriodSeconds: 30
      volumes:
        - name: root-data
          persistentVolumeClaim:
            claimName: root-data
Note
If you ever need to perform manual actions on the files, comment/uncomment the command lines to run sleep infinity and uncomment the securityContext section to run under a specific user id. Deploy the job (it will continue to run), get the pod, and exec into it via:
kubectl exec -it [pod name] -- /bin/bash
The root mount point is /nfs. Afterwards, delete the job again.
Ingress
We also need a way to route internet traffic via a LoadBalancer to the corresponding service in our Kubernetes cluster. This follows the documentation for an unmanaged NGINX Ingress controller and loads the configuration from the ingress-nginx/values.yaml file, included in this repository.
To deploy this, make sure you’re in your namespace, then install the PriorityClass as explained in the ingress/README.md, and then deploy it. Regarding the Helm chart version number below: please double check with what is listed in the Setup notes, since it might be outdated.
kubectl apply -f ingress-nginx/priority-class.yaml
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
--version 4.8.3 \
--set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
-f ingress-nginx/values.yaml
Then, I checked that two controller pods are running via kubectl get pods -n cocalc, and also confirmed there is a LoadBalancer with an external IP:
$ kubectl get svc ingress-nginx-controller
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.0.193.93 XX.XXX.XXX.XX 80:30323/TCP,443:30228/TCP 5m35s
… and I also checked that two ingress controller pods are running:
$ kubectl get deploy ingress-nginx-controller
NAME READY UP-TO-DATE AVAILABLE AGE
ingress-nginx-controller 2/2 2 2 72s
Certificate Manager
Afterwards, install the certificate manager according to the notes in letsencrypt/README.md or do your own setup. More details in Azure/TLS/cert-manager.
The final step is to register that external IP address in your DNS server. That DNS name is then used in the deployment configuration, under global.dns.
(I also found notes about “tagging” that IP address and then adding a CNAME record at your DNS provider. That’s more robust in case the LoadBalancer is re-created and the IP address changes.)
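Once the DNS record exists, it is worth confirming that it really points at the LoadBalancer before requesting certificates. The following is a sketch under a few assumptions: dig is installed, the service is named ingress-nginx-controller as shown above, the current namespace is the one it was installed into, and the LoadBalancer reports an IP rather than a hostname. The function name is made up for this example.

```shell
# Hypothetical helper: compare the LoadBalancer IP with what DNS returns.
# Argument: the DNS name you configured (the value of global.dns).
check_dns_points_to_ingress() {
  local dns_name="$1" lb_ip resolved
  lb_ip="$(kubectl get svc ingress-nginx-controller \
    --output jsonpath='{.status.loadBalancer.ingress[0].ip}')"
  # With a CNAME, dig prints the alias first; the last line is the address.
  resolved="$(dig +short "$dns_name" | tail -n 1)"
  if [ -n "$lb_ip" ] && [ "$lb_ip" = "$resolved" ]; then
    echo "ok: $dns_name resolves to $lb_ip"
  else
    echo "mismatch: LoadBalancer has '$lb_ip', DNS returns '$resolved'"
    return 1
  fi
}
```

Call it as check_dns_points_to_ingress your.domain.tld. DNS changes can take a while to propagate, so a mismatch right after creating the record is not necessarily an error.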
Next steps …
Ok. At that point we have a cluster, a database, and an NFS file share with two pre-configured subdirectories.
The next steps are:
Docker registry setup, to let the cluster know about the credentials for the Docker registry,
Set up the database password as a secret,
and follow the Deployment notes to configure everything else and to actually install CoCalc itself.
Here is a starting point for your my-values.yaml configuration. Download azurecalc.yaml and edit it.
global:
  dns: &DNS "your.domain.tld" # EDIT THIS

  kubectl: "1.28" # enter it as a string, not a floating point number

  imagePullSecrets:
    - name: regcred

  database:
    host: "cocalc1.postgres.database.azure.com"
    user: "cocalc"
    database: "cocalc"

  setup_admin:
    email: "[email protected]"
    # password: "[secret]" # pass in the real password via $ helm [...] --set global.setup_admin.password=[password], you can change it later
    name: "Your Name"

  setup_registration_token: "[secret token]" # pass in the real token via $ helm [...] --set global.setup_registration_token=[token]

  ingress:
    class: "nginx"
    cert_manager:
      issuer: "letsencrypt-prod"
    tls:
      - hosts:
          - *DNS
        secretName: cocalc-tls

  ssh_gateway:
    enabled: false # Note: on the very first helm deployment, it must be disabled. Then you can enable it.

  # All settings have to match with the keys in the site settings config, see
  # https://github.com/sagemathinc/cocalc/blob/master/src/packages/util/db-schema/site-defaults.ts
  settings:
    site_name: "AzureCalc"
    site_description: "I live in an Azure Datacenter!"
    organization_name: "[your organization]"
    organization_email: &EMAIL "[email protected]"
    organization_url: ""
    terms_of_service_url: ""
    help_email: *EMAIL
    splash_image: ""
    logo_square: "[URL to a png or jpeg]"
    logo_rectangular: "[URL to a png or jpeg]"
    # This activates sharing files (public or semi-public)
    share_server: "yes"
    index_info_html: |
      ## Welcome to Azure Calc

      This is a test instance of CoCalc running in an Azure Datacenter.

    imprint: |
      # Imprint
    policies: |
      # Policies
    pii_retention: "3 month"
    anonymous_signup: "no"
    email_enabled: "no"
    #verify_emails: "yes"
    #email_backend: "smtp"
    #email_smtp_server: "[EMAIL SERVER]"
    #email_smtp_from: "[email protected]"
    #email_smtp_login: "[EMAIL LOGIN NAME]"
    # set the SMTP password either via the var below or via the admin UI
    #email_smtp_password: "[secret]"
    email_smtp_secure: "yes" # usually yes, and with port 465
    email_smtp_port: "465"

    # CGroup quotas for a project, out of the box
    # e.g. '{"internet":true,"idle_timeout":3600,"mem":1000,"cpu":1,"cpu_oc":10,"mem_oc":5}'
    default_quotas: '{"internet":true,"idle_timeout":1800,"mem":2000,"cpu":1,"cpu_oc":20,"mem_oc":10}'

storage:
  create: false

manage:
  prepull:
    enabled: true

  timeout_pending_projects_min: 15

  resources:
    requests:
      cpu: 100m
      memory: 256Mi

  project:
    dedicatedProjectNodesTaint: "cocalc-projects"
    dedicatedProjectNodesLabel: "cocalc-role"

    # if projects are on a spot instance, AKS adds its own taint. We have to ignore it.
    # kubernetes.azure.com/scalesetpriority=spot:NoSchedule
    extraTaintTolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        value: "spot"
        effect: "NoSchedule"

static:
  replicaCount: 2

hub:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi

  multiple_replicas:
    websocket: 2
    proxy: 2
    next: 2
    api: 1