Amazon AWS/EKS¶
This guide helps you set up CoCalc OnPrem on AWS, using its EKS Kubernetes service to run CoCalc OnPrem.
Note
As of 2022-07-03, there is no out-of-the-box support for EKS. The following are notes based on the experience of setting everything up, and they assume you have experience with AWS and Kubernetes. Some details could be out of date, but the general idea should still be valid.
There is also a guide for setting up CoCalc OnPrem on Google GCP/GKE.
This also assumes you have looked into the general documentation for the CoCalc OnPrem HELM deployment: in particular, how to set up your own values.yaml file to overwrite configuration values, how to set up a secret storing the PostgreSQL database password, etc.
For more details, look into Setup.
Note
All settings are mainly recommendations – feel free to look into them in more detail, adjust them to your needs, etc. If something is actually required, it is explicitly mentioned. Often, you can change the settings later on as well.
The specific parameters are meant to get you started with a small cluster. You can scale up later by switching to larger node types and adjusting CoCalc's configuration parameters for the HELM charts.
EKS configuration¶
Set up your EKS cluster and make sure you can communicate with it via your local kubectl client, etc. E.g. run this to get started:
aws eks --region [your region] update-kubeconfig --name [name of cluster]
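To confirm that kubectl is now talking to the right cluster, a quick sanity check could look like this:

```bash
# Confirm the current kubectl context points at the EKS cluster
kubectl config current-context

# List the nodes; once node groups exist, they should show up as "Ready"
kubectl get nodes -o wide
```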
Node Groups¶
It’s not strictly necessary, but it makes a lot of sense to configure EKS to run two groups of nodes: one for CoCalc’s services and one for the users’ projects.
Here is a minimal example to get started:
Pool: Service¶
This pool runs the service nodes, which host the hubs, manage, static, etc. Initially, two small nodes should be fine. If you have to scale up the number of services to accommodate many users, you might need larger nodes or more of them. Read Scaling for more details.
| Parameter | Value |
|---|---|
| Instance type | CPU: must be at least 2 cores, x86/64. NOTE: "t3" might be a bad choice, because there is a low limit of IPs per node. Also, https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html is not supported for t3 nodes (not used here at all, but something to explore later on). |
| Disk | 50 GiB |
| Kubernetes label | |
| Scaling | 2/2/2 (i.e. you have two such nodes running, and down the road this won't change) |
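If you manage node groups with eksctl, a sketch along these lines could create such a pool. The cluster name, region, instance type, and the `cocalc-role=services` label are placeholders here, not values prescribed by CoCalc; use whatever label your HELM configuration actually expects.

```bash
# Sketch: create the "services" node group (names and label are placeholders)
eksctl create nodegroup \
  --cluster my-cocalc-cluster \
  --region us-east-1 \
  --name services \
  --node-type m5.large \
  --node-volume-size 50 \
  --nodes 2 --nodes-min 2 --nodes-max 2 \
  --node-labels "cocalc-role=services"
```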
Pool: Project¶
These project nodes accommodate the projects of your users. They should be configured to have a certain taint and labels, right when they’re created. The main idea is to keep them separate from the service nodes, and to be able to scale them to match demand.
| Parameter | Value |
|---|---|
| Instance type | CPU: must be at least 4 cores, x86/64. NOTE: talk to your users about how much CPU and memory they need and select something suitable. You'll probably need more memory than CPU, since CPU is elastic and memory is not. Also, don't mix differently sized nodes in the same project pool, to avoid uneven resource usage. |
| Disk | 100 GiB (the project image is large, and you might have to store more than one at the same time!) |
| Scaling | 1/2/1 or whatever you need |
| Kubernetes label | |
| Kubernetes taints | To make the prepull service work, set the prepull taints. NOTE: these taints are only necessary if you run the prepull service. Otherwise, remove those taints – projects will not run on these nodes as long as the taints are present! |
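For the project pool, taints are easier to express in an eksctl config file than on the command line. The following is only a sketch: the cluster name, instance type, label, and the taint key/value are placeholders, since the exact label and taints expected by the prepull service are defined by the CoCalc HELM charts, not by this example.

```yaml
# Sketch of an eksctl config for the project node group (label/taint are placeholders)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cocalc-cluster
  region: us-east-1
managedNodeGroups:
  - name: projects
    instanceType: m5.xlarge
    volumeSize: 100
    minSize: 1
    desiredCapacity: 1
    maxSize: 2
    labels:
      cocalc-role: projects        # placeholder: use the label your charts expect
    taints:
      - key: cocalc-prepull        # placeholder: use the taint the prepull service expects
        value: "init"
        effect: NoSchedule
```

You would then create the group with `eksctl create nodegroup --config-file=projects.yaml`.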
Note
Regarding the prepull service: it is not strictly necessary.
You have to activate it by setting it to true
in your my-values.yaml config file.
The taints above signal to the prepull service that the node has not
yet been initialized (the daemon set will start pods on such nodes).
Once the prepull pod is done, it changes the taints to allow
regular projects to run on the node and also removes
itself from that node. If you need to audit what prepull does
(which might be wise, since it needs cluster-wide permissions to change
the node taints), please check the included prepull.py
script.
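To watch what prepull actually does to the node taints (for example while auditing it), you can list the taints of all nodes:

```bash
# Show each node together with its current taints
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# Or, more verbosely, for a single node
kubectl describe node <node-name> | grep -A 3 Taints
```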
Storage/EFS¶
The projects and some services need access to a storage volume
that allows ReadWriteMany.
Commonly, this could be done via an NFS server,
but with AWS there is EFS – much better at a comparable price!
To get EFS working in your EKS cluster,
follow the instructions.
In particular, I had to install eksctl,
install an "OIDC" provider, then create a service account, etc.
The next step was to install the EFS driver via HELM, and actually create an EFS filesystem, give it access to all subnets (in my case there were 3), create a mount target, etc.
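For reference, the OIDC and HELM steps looked roughly like the following at the time of writing; treat this as a sketch (chart locations and flags may have changed, and the IAM service account setup is omitted here), and follow the current AWS/eksctl documentation.

```bash
# Associate an IAM OIDC provider with the cluster (needed for IAM roles for service accounts)
eksctl utils associate-iam-oidc-provider --cluster my-cocalc-cluster --approve

# Install the EFS CSI driver via HELM
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system
```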
Now the important part: by default, this EFS filesystem's "access point" is only for the root user.
To make this work with CoCalc's services,
it must be for the user/group with ID 2001:2001.
To accomplish this, create a new StorageClass
(you can choose the basePath
as you wish; it keeps this instance of CoCalc separate from other instances or other data you have on EFS):
1. Create a file `sc-2001.yaml` with the following content:

   ```yaml
   kind: StorageClass
   apiVersion: storage.k8s.io/v1
   metadata:
     name: "efs-2001"              # customizable
   provisioner: efs.csi.aws.com
   parameters:
     provisioningMode: efs-ap
     fileSystemId: fs-[INSERT ID]  # from the EFS console
     directoryPerms: "700"
     uid: "2001"
     gid: "2001"
     basePath: "/cocalc1"          # customizable, should be an empty directory
   ```

2. Apply: `kubectl apply -f sc-2001.yaml`.

3. Check: `kubectl get sc` should list `efs-2001`.

4. Edit your my-values.yaml file: in the section for storage, enter this to reference the new `StorageClass`:
```yaml
storage:
  class: "efs-2001"   # references the name from above
  size:
    software: 10Gi
    data: 10Gi
```

This in turn will create the two `PersistentVolume`s and `PersistentVolumeClaim`s as required. The size values don't really matter, since EFS is effectively unlimited.
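After deploying, it is worth checking that the claims actually bind (the exact PVC names depend on your deployment):

```bash
# The CoCalc volumes should end up in status "Bound" with storage class "efs-2001"
kubectl get pvc
kubectl get pv
```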
Additional hints:
- You can change the Reclaim Policy to `Retain`, such that files aren't accidentally deleted if these PVs are removed (see the patch sketch after this list, and https://kubernetes.io/docs/tasks/administer-cluster/change-pv-reclaim-policy/).
- Set the life-cycle management of EFS to move unused files to long-term (cheaper) storage and back if they're accessed again, e.g.:
  - Transition into IA: 60 days since last access
  - Transition out of IA: On first access
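For the reclaim-policy hint above, the standard Kubernetes approach is to patch the bound PV (replace the PV name with the one `kubectl get pv` shows for your claim):

```bash
# Switch an existing PersistentVolume's reclaim policy from "Delete" to "Retain"
kubectl patch pv <pv-name> \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```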
Note: completely independent of the above, you can use other storage
solutions as well. For that, you have to create the PVCs yourself, and
they must expose a ReadWriteMany filesystem.
In the CoCalc deployment, you then have to configure the names
of these PVCs under global: {storage: {...}}
and disable creating
them automatically via storage: {create: false}
.
See ../cocalc/values.yaml
for more information.
Database / RDS PostgreSQL¶
You can either run your own PostgreSQL server or use the managed one from AWS: RDS PostgreSQL. Version 13 should be OK; you can also go ahead and use version 14.
Basically, the EKS cluster must be able to access the database
(networking setup, security groups), and the database password will be
stored in a Kubernetes secret (see cocalc/values.yaml
→
global.database.secretName
).
Refer to the general instructions for the database on how to do this,
i.e. kubectl create secret generic postgresql-password --from-literal=postgresql-password=$PASSWORD
should do the trick (note that Kubernetes secret names must be lowercase).
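To check that pods in the cluster can actually reach the RDS instance, you can run a temporary PostgreSQL client pod; the endpoint, user, and database names below are placeholders:

```bash
# Temporary psql client pod; it is removed again after you exit
kubectl run -it --rm psql-test --image=postgres:14 --restart=Never -- \
  psql -h <rds-endpoint>.rds.amazonaws.com -U <db-user> -d <db-name>
```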
Docs that might help:
AWS Security Groups¶
At this point, your service consists of a database, the EKS cluster (with its nodes and its own VPC network), and the EFS filesystem. However, by default AWS isolates everything from everything else. You have to make sure that there is a suitable setup of Security Groups that allows the EKS nodes to access the database and the EFS filesystem. This guide doesn't contain a full description of how to do this, since it depends on your overall usage of AWS. The common symptom is that pods in EKS can't access the database or the EFS filesystem, so you see timeout errors trying to connect, etc. EFS problems manifest as pods that cannot initialize because their volumes can't be attached, while database problems manifest in the logs of the "hub-websocket" pod (it is responsible for setting up all tables/schemas in the database, hence it is the one to check first).
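To narrow down which of the two is failing, checking the hub-websocket logs and the pod events is usually the quickest way (namespace and pod names below are placeholders depending on your deployment):

```bash
# Find the hub-websocket pod and tail its logs (database connection errors show up here)
kubectl get pods -n <namespace> | grep hub-websocket
kubectl logs -n <namespace> <hub-websocket-pod-name> --tail=100 -f

# For EFS mount problems, the pod events are usually more telling
kubectl describe pod -n <namespace> <stuck-pod-name>
```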
Notes:
EKS vs. EFS security groups: this is from a workshop, maybe it helps
Ingress/Networking¶
In the CoCalc HELM deployment, there are two ingress.yaml
configurations, which are designed for K8S’s nginx ingress controller.
The directory ingress-nginx/
has more details.
But just deploying them is not enough: the nginx ingress controller needs to be able to provision a LoadBalancer
.
That's done via an AWS Load Balancer Controller.
Once everything is running, you can check up on the Load Balancer via the AWS console: EC2 (new experience) → Load Balancing → Load balancer.
There, in the Basic Configuration, you see the DNS name; that's the
same one you get via kubectl get -A svc
Once you have that (lengthy) automatically generated DNS name, copy it
and set up your own sub-domain at your DNS provider: basically, add a
CNAME
entry pointing to this DNS name.
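Once the CNAME record has propagated, you can verify that it resolves to the load balancer (cocalc.example.com is a placeholder for your sub-domain):

```bash
# The CNAME should point at the load balancer's DNS name
dig +short cocalc.example.com CNAME

# And the name should resolve to the load balancer's addresses
dig +short cocalc.example.com
```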
What's unclear to me: this created a "classic" (deprecated) load balancer rather than a more modern L4 network load balancer. Presumably this is caused by whatever the load balancer controller does by default.
Ref: