.. index:: Amazon AWS/EKS

.. _aws-eks:

Amazon AWS/EKS
==============

This guide helps you set up CoCalc OnPrem on :term:`AWS`.
It uses AWS's `EKS`_ Kubernetes service to run CoCalc OnPrem.

.. note::

   As of 2022-07-03, there is currently no out-of-the-box support for `EKS`_.
   The following are notes based on the experience of setting everything up,
   and they certainly assume you have experience with :term:`AWS` and :term:`Kubernetes`.
   Some details could be out of date, but the general idea should still be valid.

   There is also a guide for setting up CoCalc OnPrem on :doc:`gke`.

This guide also assumes you have looked into the general documentation for the CoCalc OnPrem HELM deployment:
in particular, how to set up your :ref:`own values.yaml` file to override configuration values,
how to set up a secret storing the PostgreSQL database password, etc.
For more details, look into :doc:`../setup`.

.. include:: ../_shared/settings-recommendations.rst

EKS configuration
-----------------

Set up your EKS cluster and make sure you can communicate with the cluster via your local kubectl client, etc.
E.g. run this to get started::

    aws eks --region [your region] update-kubeconfig --name [name of cluster]

Node Groups
-----------

It's not strictly necessary, but it makes a lot of sense to configure EKS to run :ref:`two groups of nodes`.
Here is a minimal example to get started:

Pool: **Service**
~~~~~~~~~~~~~~~~~~

This pool contains the :ref:`service nodes`, which run the hubs, manage, static, etc.
Initially, two small nodes should be fine.
If you have to scale up the number of services to accommodate many users,
you might need larger nodes or more of them.
Read :doc:`../ops/scaling` for more details.

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Parameter
     - Value
   * - Instance type
     - ``t3.medium`` (or ``t3a.medium``), spot price (?)

       CPU: must be at least 2 cores, x86/64

       NOTE: "t3" might be a bad choice, because there is a low limit of IPs per node.
       Also, https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html
       is not supported for t3 nodes (but that feature is not used at all; something to explore later on).
   * - Disk
     - 50 GiB
   * - Kubernetes label
     - ``cocalc-role=services`` (that's ``key=value``)
   * - Scaling
     - 2/2/2 (i.e. you have two such nodes running, and down the road this won't change)
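One way to create such a pool is with ``eksctl``.
The following is only a sketch: the cluster name, region, and the node group name ``cocalc-services``
are placeholders of my choosing, and you can just as well create the node group through the AWS console::

    eksctl create nodegroup \
      --cluster=[name of cluster] --region=[your region] \
      --name=cocalc-services \
      --node-type=t3.medium \
      --nodes=2 --nodes-min=2 --nodes-max=2 \
      --node-volume-size=50 \
      --node-labels="cocalc-role=services"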
Pool: **Project**
~~~~~~~~~~~~~~~~~~

These project nodes accommodate the :ref:`projects of your users`.
They should be configured with certain taints and labels right when they're created.
The main idea is to keep them separate from the service nodes,
and to be able to scale them to match demand.

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Parameter
     - Value
   * - Instance type
     - min. ``t3.medium``, spot price

       CPU: must be at least 4 cores, x86/64

       NOTE: talk to your users about how much CPU and memory they need and select something suitable.
       You'll probably need more memory than CPU, since CPU is elastic and memory is not.
       Also, don't mix differently sized nodes in the same project pool, to avoid uneven resource usage.
   * - Disk
     - 100 GiB (the project image is large,
       and we might have to store more than one at the same time!)
   * - Scaling
     - 1/2/1 or whatever you need
   * - Kubernetes label
     - ``cocalc-role=projects`` (that's ``key=value``)
   * - Kubernetes Taints
     - To make the prepull service work, set these taints:

       .. list-table::

          * - **Key**
            - **Value**
            - **Effect**
          * - cocalc-projects-init
            - false
            - NoExecute
          * - cocalc-projects
            - init
            - NoSchedule

       NOTE: these taints are only necessary if you run the :ref:`prepull service`.
       Otherwise, remove those taints – with the taints in place and no prepull service,
       projects will never be scheduled on these nodes!

.. note::

   Regarding the :ref:`prepull service`: it is not strictly necessary.
   You have to activate it by setting it to ``true`` in your :ref:`my-values.yaml` config file.

   The taints above signal to the :ref:`prepull service` that the node has not been initialized yet
   (the DaemonSet will start pods on such nodes).
   Once the prepull pod is done, it changes the taints to allow regular projects to run on the node
   and also removes itself from that node.

   If you need to audit what prepull does
   (which might be wise, since it needs cluster-wide permissions to change the node taints),
   please check the included ``prepull.py`` script.
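With ``eksctl``, labels and taints are easiest to set at creation time through a config file.
The following is a sketch only: the cluster name, region, node group name, and sizes
are placeholders or examples, not requirements of CoCalc itself::

    # projects-nodegroup.yaml -- adjust names, region, and sizes to your setup
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: my-cocalc-cluster     # placeholder: your cluster name
      region: eu-central-1        # placeholder: your region
    managedNodeGroups:
      - name: cocalc-projects     # arbitrary node group name
        instanceType: t3.medium
        volumeSize: 100           # GiB, see the table above
        minSize: 1
        desiredCapacity: 1
        maxSize: 2
        labels:
          cocalc-role: projects
        taints:                   # only needed if you run the prepull service
          - key: cocalc-projects-init
            value: "false"
            effect: NoExecute
          - key: cocalc-projects
            value: "init"
            effect: NoSchedule

Create it with ``eksctl create nodegroup --config-file=projects-nodegroup.yaml``.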
.. _eks-efs:

Storage/EFS
-----------

The projects and some services need access to a storage volume which allows :term:`ReadWriteMany`.
Commonly, this could be done via an :term:`NFS` server,
but on AWS there is `EFS`_ – much better, at a `comparable price <https://aws.amazon.com/efs/pricing/>`_!

To get EFS working in your EKS cluster,
`follow the instructions <https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html>`_.
In particular, I had to install ``eksctl``, install an "OIDC" provider, then create a service account, etc.
The next step was to install the EFS driver via HELM, and actually create an EFS filesystem,
give it access to all subnets (in my case there were 3), create a mount target, etc.

**Now the important part**: by default, this EFS filesystem's "access point" is only for the root user.
To make this work with :ref:`CoCalc's services`, **it must be for user/group with ID 2001:2001**.

To accomplish this, create a new ``StorageClass``
(you can choose the ``basePath`` as you wish; it keeps this instance of CoCalc
separate from other instances or other data you have on EFS):

1. Create a file ``sc-2001.yaml`` with the following content::

       kind: StorageClass
       apiVersion: storage.k8s.io/v1
       metadata:
         name: "efs-2001"              # customizable
       provisioner: efs.csi.aws.com
       parameters:
         provisioningMode: efs-ap
         fileSystemId: fs-[INSERT ID]  # from the EFS console
         directoryPerms: "700"
         uid: "2001"
         gid: "2001"
         basePath: "/cocalc1"          # customizable, should be an empty directory

2. Apply: ``kubectl apply -f sc-2001.yaml``.

3. Check: ``kubectl get sc`` should list ``efs-2001``.

4. Edit your :ref:`my-values.yaml` file:
   in the section for storage, enter this to reference the new ``StorageClass``::

       storage:
         class: "efs-2001"   # references the name from above
         size:
           software: 10Gi
           data: 10Gi

   This in turn will create the two required ``PersistentVolume`` + ``PersistentVolumeClaim`` pairs.
   The size values don't matter much, since EFS is effectively unlimited.

Additional hints:

1. You can change the ``Reclaim Policy`` to ``Retain``,
   such that files aren't accidentally deleted if these PVs are removed.
   See https://kubernetes.io/docs/tasks/administer-cluster/change-pv-reclaim-policy/

2. Set the lifecycle management of EFS to move unused files to long-term (cheaper) storage
   and back if they're accessed again. E.g.:

   - Transition into IA: 60 days since last access
   - Transition out of IA: On first access

.. include:: ../_shared/custom-pvc.rst

Database / RDS PostgreSQL
-------------------------

You could either run your own PostgreSQL server, or use the one from AWS: `RDS PostgreSQL`_.
Version 13 should be OK; you can also go ahead and use version 14.

Basically, the EKS cluster must be able to access the database (networking setup, security groups),
and the database password will be stored in a Kubernetes secret
(see ``cocalc/values.yaml`` → ``global.database.secretName``).
Refer to the general instructions for the database on how to do this, i.e.
``kubectl create secret generic postgresql-password --from-literal=postgresql-password=$PASSWORD``
should do the trick.

Docs that might help:

- https://dev.to/bensooraj/accessing-amazon-rds-from-aws-eks-2pc3

AWS Security Groups
-------------------

At this point, your service consists of a database, the EKS cluster (with its nodes and own VPC network),
and the EFS filesystem.
However, by default AWS isolates everything from everything else.
You have to make sure that there is a suitable setup of `Security Groups`_
that allows the EKS nodes to access the database and the EFS filesystem.
This guide doesn't contain a full description of how to do this,
since it certainly depends on your overall usage of AWS.

The common symptom is that pods in EKS can't access the database or the EFS filesystem,
hence you see timeout errors trying to connect, etc.
EFS problems manifest as pods that fail to initialize because their volumes can't be attached,
while database problems show up in the logs of the "hub-websocket" pods
(that service is responsible for setting up all tables/schemas in the database,
hence it is the one to check first).

Notes:

- EKS vs. EFS security groups: this is covered in an AWS workshop, maybe it helps.
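As a concrete sketch of what the rules usually boil down to:
the security group IDs below are placeholders (look them up in the AWS console
or via ``aws ec2 describe-security-groups``), and the sketch assumes the worker nodes,
the RDS instance, and the EFS mount targets each have their own security group::

    CLUSTER_SG=sg-aaaaaaaa   # security group attached to the EKS worker nodes
    DB_SG=sg-bbbbbbbb        # security group of the RDS instance
    EFS_SG=sg-cccccccc       # security group of the EFS mount targets

    # allow PostgreSQL traffic from the cluster nodes to RDS
    aws ec2 authorize-security-group-ingress --group-id "$DB_SG" \
        --protocol tcp --port 5432 --source-group "$CLUSTER_SG"

    # allow NFS traffic from the cluster nodes to the EFS mount targets
    aws ec2 authorize-security-group-ingress --group-id "$EFS_SG" \
        --protocol tcp --port 2049 --source-group "$CLUSTER_SG"

Port 5432 is PostgreSQL's default, and EFS mount targets speak plain NFS, hence port 2049.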
Ingress/Networking
------------------

In the CoCalc HELM deployment, there are two ``ingress.yaml`` configurations,
which are designed for `K8S's nginx ingress controller <https://kubernetes.github.io/ingress-nginx/>`__.
The directory ``ingress-nginx/`` has more details.
But just deploying it is not enough:
the nginx ingress controller needs to be able to install a ``LoadBalancer``.
That's done via an `AWS Load Balancer Controller <https://kubernetes-sigs.github.io/aws-load-balancer-controller/>`_.

Once everything is running, you can check up on the load balancer via the AWS console:
EC2 (new experience) → Load Balancing → Load balancers.
There, in the Basic Configuration, you see the DNS name;
it's the same one you get via ``kubectl get -A svc``.

Once you have that (lengthy) automatically generated DNS name,
copy it and set up your own sub-domain at your DNS provider:
basically, add a ``CNAME`` record pointing to this DNS name.

What's unclear to me: this did create a "classic" (deprecated) load balancer.
Why not a more modern L4 network load balancer?
This must be caused by whatever the load balancer controller is configured to do.

Ref:

- https://kubernetes.github.io/ingress-nginx/deploy/#aws
- https://aws.amazon.com/blogs/opensource/network-load-balancer-nginx-ingress-controller-eks/

.. _EKS: https://aws.amazon.com/eks/
.. _EFS: https://aws.amazon.com/efs/
.. _RDS PostgreSQL: https://aws.amazon.com/rds/postgresql/
.. _Security Groups: https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html
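To grab just that hostname from the command line instead of the console, something like the following works.
This assumes the ingress controller was installed with the official chart's default names
(``ingress-nginx`` namespace and ``ingress-nginx-controller`` service); adjust if yours differ::

    kubectl get svc --namespace ingress-nginx ingress-nginx-controller \
        --output jsonpath='{.status.loadBalancer.ingress[0].hostname}'

That hostname is what the ``CNAME`` record mentioned above should point to.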