Amazon AWS/EKS

This guide helps you set up CoCalc OnPrem on AWS, using its EKS Kubernetes service to run CoCalc OnPrem.

Note

As of 2022-07-03, there is no out-of-the-box support for EKS. The following are notes based on the experience of setting everything up, and they assume you have experience with AWS and Kubernetes. Some details could be out of date, but the general idea should still be valid.

There is also a guide for setting up CoCalc OnPrem on Google GCP/GKE.

This also assumes you have looked into the general documentation for the CoCalc OnPrem HELM deployment: in particular, how to set up your own values.yaml file to override configuration values, how to set up a secret storing the PostgreSQL database password, etc.

For more details look into Setup.

Note

All settings are mainly recommendations – feel free to look into them in more detail, adjust them to your needs, etc. If something is actually required, it is explicitly mentioned. Often, you can change the settings later on as well.

The specific parameters are meant to get you started with a small cluster. You can scale up later by choosing larger node types and adjusting CoCalc’s configuration parameters in the HELM charts.

EKS configuration

Set up your EKS cluster and make sure you can communicate with it via your local kubectl client. E.g. run this to get started:

aws eks --region [your region] update-kubeconfig --name [name of cluster]
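To double-check that kubectl now points at the right cluster, a quick sanity check like the following should work (the context name is typically the cluster’s ARN):

    kubectl config current-context   # should show the EKS cluster's context/ARN
    kubectl get nodes -o wide        # the EKS worker nodes should be listed as Ready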

Node Groups

It’s not strictly necessary, but it makes a lot of sense to configure EKS to run two groups of nodes: one for CoCalc’s services and one for the users’ projects.

Here is a minimal example to get started (an eksctl sketch covering both pools follows after the taint discussion below):

Pool: Service

This pool runs the service nodes, which run the hubs, manage, static, etc. Initially, two small nodes should be fine. If you have to scale up the number of services to accommodate many users, you might need larger nodes or more of them. Read Scaling for more details.

• Instance type: t3.medium (or t3a.medium), optionally as spot instances. The CPU must have at least 2 cores, x86/64. NOTE: “t3” might be a bad choice, because there is a low limit of IPs per node. Also, https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html is not supported for t3 nodes (not used here at all, but something to explore later on).

• Disk: 50GiB

• Kubernetes label: cocalc-role=services (that’s key=value)

• Scaling: 2/2/2 (i.e. you have two such nodes running, and down the road this won’t change)

Pool: Project

These project nodes accommodate the projects of your users. They should be configured with a certain taint and labels right when they’re created. The main idea is to keep them separate from the service nodes, and to be able to scale them to match demand.

• Instance type: at least t3.medium, optionally as spot instances. The CPU must have at least 4 cores, x86/64. NOTE: talk to your users about how much CPU and memory they need and select something suitable. You’ll probably need more memory than CPU, since CPU is elastic and memory is not. Also, don’t mix differently sized nodes in the same project pool, to avoid uneven resource usage.

• Disk: 100GiB (the project image is large, and we might have to store more than one at the same time!)

• Scaling: 1/2/1 or whatever you need

• Kubernetes label: cocalc-role=projects (that’s key=value)

Kubernetes Taints

To make the prepull service work, set these taints (in key=value:effect notation):

• cocalc-projects-init=false:NoExecute

• cocalc-projects=init:NoSchedule

NOTE: these taints are only necessary if you run the prepull service. If you don’t, remove them – otherwise no project will be scheduled on these nodes!

Note

Regarding the prepull service: it is not strictly necessary. You have to activate it by setting it to true in your my-values.yaml config file.

The taints above signal to the prepull service that the node has not yet been initialized (the daemon set will start pods on such nodes). Once the prepull pod is done, it changes the taints to allow regular projects to run on that node and also removes itself from it. If you need to audit what prepull does (which might be wise, since it needs cluster-wide permissions to change node taints), please check the included prepull.py script.
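As a minimal sketch, both node pools could be defined via an eksctl config file. The cluster name, region, instance types and scaling numbers below are placeholders – adjust them to the sizing notes above, and drop the taints if you don’t run the prepull service:

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: cocalc-onprem              # placeholder cluster name
      region: eu-central-1             # placeholder region
    managedNodeGroups:
      - name: services
        instanceType: t3.medium
        volumeSize: 50                 # GiB
        minSize: 2
        desiredCapacity: 2
        maxSize: 2
        labels:
          cocalc-role: services
      - name: projects
        instanceType: t3.xlarge        # placeholder: at least 4 cores, enough memory
        volumeSize: 100                # GiB, the project image is large
        minSize: 1
        desiredCapacity: 1
        maxSize: 2
        labels:
          cocalc-role: projects
        taints:                        # only needed for the prepull service
          - key: cocalc-projects-init
            value: "false"
            effect: NoExecute
          - key: cocalc-projects
            value: "init"
            effect: NoSchedule

Apply it with eksctl create cluster -f cluster.yaml, or eksctl create nodegroup -f cluster.yaml if the cluster already exists.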

Storage/EFS

The projects and some services need access to a storage volume that allows ReadWriteMany. Commonly, this could be done via an NFS server, but AWS offers EFS – much better at a comparable price! To get EFS working in your EKS cluster, follow the instructions for the EFS CSI driver. In particular, I had to install eksctl, set up an OIDC identity provider, then create a service account, etc.

The next step was to install the EFS CSI driver via HELM, and to actually create an EFS filesystem, give it access to all subnets (in my case there were 3), create mount targets, etc.
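A rough sketch of these steps – the HELM repository is the upstream aws-efs-csi-driver chart, and the subnet/security group IDs are placeholders; check the driver’s documentation for current chart values:

    # install the EFS CSI driver
    helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
    helm repo update
    helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
        --namespace kube-system

    # create the filesystem and one mount target per subnet used by the EKS nodes
    aws efs create-file-system --encrypted --tags Key=Name,Value=cocalc-efs
    aws efs create-mount-target --file-system-id fs-[INSERT ID] \
        --subnet-id subnet-[INSERT ID] --security-groups sg-[INSERT ID]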

Now the important part: by default, this EFS filesystem’s “access point” is only for the root user. To make this work with CoCalc’s services, it must be for the user/group with ID 2001:2001. To accomplish this, create a new StorageClass (you can choose the basePath as you wish; it keeps this instance of CoCalc separate from other instances or other data you have on EFS):

  1. Create a file sc-2001.yaml with the following content:

    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: "efs-2001"                    # customizable
    provisioner: efs.csi.aws.com
    parameters:
      provisioningMode: efs-ap
      fileSystemId: fs-[INSERT ID]        # from the EFS console
      directoryPerms: "700"
      uid: "2001"
      gid: "2001"
      basePath: "/cocalc1"                # customizable, should be an empty directory
    
  2. Apply: kubectl apply -f sc-2001.yaml.

  3. Check: kubectl get sc should list efs-2001.

  4. Edit your my-values.yaml file: in the section for storage, enter this to reference the new StorageClass:

storage:
  class: "efs-2001"                    # references the name from above
  size:
    software: 10Gi
    data: 10Gi

This in turn creates the two required PersistentVolumes and PersistentVolumeClaims. The size values don’t matter much, since EFS is elastic and effectively unlimited.
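Assuming the charts are deployed into a namespace named cocalc (adjust to yours), a quick sanity check could look like this:

    kubectl get pvc -n cocalc        # both claims should end up in state "Bound"
    kubectl get pv                   # the corresponding EFS-backed volumes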

Additional hints:

  1. You can change the Reclaim Policy to Retain, such that files aren’t accidentally deleted if these PVs are removed. See https://kubernetes.io/docs/tasks/administer-cluster/change-pv-reclaim-policy/

  2. Set up EFS lifecycle management to move unused files to long-term (cheaper) storage and back when they’re accessed again, e.g.:

    • Transition into IA: 60 days since last access

    • Transition out of IA: On first access

Note: completely independent of the above, you can use other storage solutions as well. For that, you have to create the PVCs yourself, and they must expose a ReadWriteMany filesystem. In the CoCalc deployment, you then configure the names of these PVCs under global: {storage: {...}} and disable creating them automatically via storage: {create: false}. See ../cocalc/values.yaml for more information.
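For illustration, a self-managed PVC could look like the following sketch – the name cocalc-data, the namespace, and the storage class are assumptions, and the exact keys for referencing it under global: {storage: {...}} are documented in ../cocalc/values.yaml:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: cocalc-data              # hypothetical name, reference it in global.storage
      namespace: cocalc              # the namespace of the CoCalc deployment
    spec:
      accessModes:
        - ReadWriteMany              # required by CoCalc's projects and services
      storageClassName: efs-2001     # or any other RWX-capable storage class
      resources:
        requests:
          storage: 10Gi              # effectively ignored by EFS, but required by the API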

Database / RDS PostgreSQL

You can either run your own PostgreSQL server or use the managed one from AWS: RDS for PostgreSQL. Version 13 should be OK; you can also go ahead and use version 14.

Basically, the EKS cluster must be able to access the database (networking setup, security groups), and the database password will be stored in a Kubernetes secret (see global.database.secretName in cocalc/values.yaml).

Refer to the general database instructions for how to do this, i.e. kubectl create secret generic postgresql-password --from-literal=postgresql-password=$PASSWORD should do the trick (note that Kubernetes secret names must be lowercase).
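To check that the cluster can actually reach the database, a throwaway pod with psql is handy – the RDS endpoint, user and database name below are placeholders:

    kubectl run -it --rm pg-test --image=postgres:14 --restart=Never -- \
        psql "postgresql://[user]:$PASSWORD@[rds-endpoint]:5432/[database]" -c '\conninfo'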


AWS Security Groups

At this point, your deployment consists of the database, the EKS cluster (with its nodes and its own VPC network), and the EFS filesystem. By default, AWS isolates all of these from each other. You have to make sure there is a suitable setup of Security Groups that allows the EKS nodes to access both the database and the EFS filesystem. This guide doesn’t contain a full description of how to do this, since it depends on your overall usage of AWS. The common symptom is that pods in EKS can’t reach the database or the EFS filesystem, so you see timeout errors when trying to connect. For EFS, this manifests as pods failing to initialize because their volumes can’t be attached; for the database, check the logs of the “hub-websocket” pods (this hub is responsible for setting up all tables/schemas in the database, hence it is the one to check first).
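As a sketch of the required rules – all security group IDs below are placeholders, and the security group of the EKS nodes can be found on the node group’s details page in the console:

    # allow the EKS nodes to reach EFS (NFS, port 2049) and RDS (PostgreSQL, port 5432)
    aws ec2 authorize-security-group-ingress --group-id sg-[EFS] \
        --protocol tcp --port 2049 --source-group sg-[EKS-NODES]
    aws ec2 authorize-security-group-ingress --group-id sg-[RDS] \
        --protocol tcp --port 5432 --source-group sg-[EKS-NODES]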


Ingress/Networking

In the CoCalc HELM deployment, there are two ingress.yaml configurations, which are designed for Kubernetes’ NGINX ingress controller. The directory ingress-nginx/ has more details.

But just deploying it is not enough: the NGINX ingress controller exposes itself via a Service of type LoadBalancer, and something has to provision the actual load balancer on AWS. That’s done via the AWS Load Balancer Controller.
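For reference, that controller is commonly installed from AWS’s eks-charts HELM repository – the cluster name is a placeholder, and the IAM policy plus the IRSA service account it needs are described in the AWS documentation:

    helm repo add eks https://aws.github.io/eks-charts
    helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
        --namespace kube-system \
        --set clusterName=[name of cluster] \
        --set serviceAccount.create=false \
        --set serviceAccount.name=aws-load-balancer-controller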

Once everything is running, you can check up on the load balancer via the AWS console: EC2 → Load Balancing → Load Balancers.

There, in the Basic Configuration, you see the DNS name – that’s the same name you get via kubectl get svc -A (the EXTERNAL-IP column of the ingress controller’s LoadBalancer service).

Once you have that (lengthy) automatically generated DNS name, copy it and set up your own sub-domain at your DNS provider, basically by adding a CNAME record that points to this DNS name.

What was unclear to me: this created a “classic” (deprecated) load balancer instead of a more modern L4 Network Load Balancer. Which type you end up with depends on which controller actually reconciles the LoadBalancer Service and on its annotations.
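If you prefer an NLB, the usual approach is to annotate the ingress controller’s Service – shown here as a fragment of the ingress-nginx HELM values, which is an assumption about how you deploy the controller:

    controller:
      service:
        annotations:
          # with the AWS Load Balancer Controller, "external" plus a target-type
          # annotation is used instead; check its documentation for details
          service.beta.kubernetes.io/aws-load-balancer-type: "nlb"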
