Troubleshooting¶
Cannot “Connect”¶
When working with CoCalc’s Single Page Application, there is a “Connectivity indicator” at the top right, looking like a WiFi symbol. It surfaces the status of an underlying WebSocket connection. Feel free to click on it to open a dialog with more details. If there are problems, this indicator will turn red or even say “disconnected”.
There are various reasons why someone might not be able to connect. The most common causes are a firewall blocking the connection or a proxy (or VPN) that is not configured correctly.
Here is an extensive collection of connectivity issues. They range from a flaky local router, to a browser (or even computer) that needs to be restarted, up to plugins/extensions interfering with the website itself. In particular, it is important to figure out whether WebSocket connections work at all.
Besides that, it might be a good idea to just refresh the page. The UI should open up again, and all opened projects and files should reappear.
You can also open the browser’s developer console (F12) to see if there are any errors in the JavaScript console.
On the server side, the logs of hub-websocket (see Hubs) could reveal issues as well.
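For example, assuming the deployment carries the same name as the hub-websocket service, recent log output can be shown like this (adjust the time window as needed):
# show the last hour of log output of the hub-websocket deployment
kubectl logs deploy/hub-websocket --since=1h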
Project does not “Start”¶
There are various reasons why a project fails to start.
Note: [UUID] refers to the UUID identifying each project.
Check if there exists a pod named project-[UUID] in your namespace. If not, the manage-action service (see Manage) did not create it. What does its log say?
kubectl logs manage-action-...
Filter by the UUID to see log lines related to the project.
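For example (the pod name suffix of the manage-action pod is deployment-specific, as indicated by the ellipsis above):
# does the project pod exist at all?
kubectl get pod project-[UUID]
# filter the manage-action log for lines mentioning the project
kubectl logs manage-action-... | grep [UUID]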
Maybe it’s a good idea to just kill all manage services:
kubectl rollout restart deploy -l group=manage
They will restart; then check the logs again. The “start” request for the given project will probably have timed out, hence you have to use the UI to start the project again.
Connect to the database and reset the state/status entries of the project(s):
UPDATE projects SET status=null, state=null WHERE project_id='[UUID]';
and then try to start the project again.
If the project’s pod exists and is not in state “Running”, check what
kubectl describe pod project-[UUID]
has to say. At the very bottom is a list of actions/events taken by the kubelet on the given node to set up the pod.
Did the “home” filesystem not mount? Check which mount command it tried to run, what errors occurred, etc. This is the part where a sub-directory of the data-project PVC must be mounted on the node and made available to the project’s pod.
Is the node where the project tries to start healthy? It could very well be that some projects run fine while there is a problem specific to one node. So, check your node setup and permissions, or maybe kill the node and create a new one (see the checks sketched below).
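A few checks that often narrow this down (the data-project PVC name is taken from this documentation; the node name is a placeholder):
# is the PVC holding the project home directories bound?
kubectl get pvc data-project
# which node was the project pod scheduled on?
kubectl get pod project-[UUID] -o wide
# is that node healthy? (look at Conditions and Events)
kubectl describe node [node-name]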
Of course, there could be other issues. Most of them are probably Leaky Abstractions of the underlying node, networking, etc. So, check the logs of the kubelet, the node’s logs in /var/log/*, dmesg -T for the Linux kernel, and so on.
If the project pod is in state “Running”:
Check the project’s log. It starts with an initialization sequence (a Bash script /cocalc/init/init.sh is run) and then starts the actual project “server” (a Node.js process). If the initialization worked, the hub tries to connect to the project server to establish communication. So, the first important log message is that incoming connection. If it never appears, there might be a networking issue inside the cluster.
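To inspect that log (where [UUID] again is the project’s id):
# show the project pod’s log; the initialization output comes first
kubectl logs project-[UUID] --timestamps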
There could also be a problem with your web browser (web client) connecting to the service. That would look as if it is attempting to connect but never succeeds. Check the browser’s developer console for errors, and also check the logs of hub-websocket and hub-proxy.
Project’s home dir permission issues¶
Due to the issue fsGroup securityContext does not apply to nfs mount, it might happen that the mounted project directories do not have the proper ownership (UID:GID must be 2001:2001).
It’s possible to enable an Init Container for each project pod, such that the ownership is fixed upon each start of a project.
See note about fixing permissions for more details.
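To verify the ownership from inside a running project pod (the home directory path /home/user is an assumption; adjust it if your image differs):
# print UID:GID of the project’s home directory; it should be 2001:2001
kubectl exec project-[UUID] -- stat -c '%u:%g' /home/user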
Project pod “hangs” upon termination¶
The first step is to check what
kubectl describe pod project-[UUID]
has to say. If there is a Datastore sidecar, leftover FUSE mountpoints might be a source of problems. Although there is a termination task to unmount and
fusermount -u
all mountpoints, this task might fail due to a faulty tool or other circumstances. The only solution is to force-delete the project pod, because Kubernetes on its own will not be able to fix that mount issue.
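A force delete looks like this:
kubectl delete pod project-[UUID] --grace-period=0 --force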
Jupyter Kernel does not start¶
A Python-based kernel stores its history in an SQLite database. This can lead to issues with an underlying NFS file-system. See Custom Jupyter Kernels for how to disable this by adding a config parameter to the kernel’s argv: [...] array in the kernel.json file.
In a Terminal, run jupyter console --kernel=[kernelname] to see if the kernel starts. You might see more logging details by adding --debug as well.
There might also be a fluke with how the project started the kernel, with no proper cleanup after it terminated or crashed. Looking in the project’s log might help, but in general it might be a good idea to just restart the project (Project Settings → Project Control) and refresh CoCalc’s frontend page in the web browser.
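To locate the kernel.json mentioned above (the path is illustrative and depends on where the kernel is installed):
# list installed kernels together with the directories containing their kernel.json
jupyter kernelspec list
# inspect the argv array of a particular kernel
cat /path/to/kernels/[kernelname]/kernel.json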
Database¶
To connect to the database, you can either connect directly or use the included /database/db-shell.sh script.
This script deploys a pod in the cluster’s namespace,
and then connects to the database.
You need to know the host IP or name, user, and database.
The password is parsed from the secret.
That way, you also confirm that accessing the database from
within the cluster works.
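A direct connection could look like this (host, user, database, and the secret name and key are placeholders that depend on your deployment):
# read the database password from the corresponding Kubernetes secret
kubectl get secret [db-secret] -o jsonpath='{.data.password}' | base64 -d
# connect with psql
psql -h [db-host] -U [db-user] -d [db-name]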
Interruptions¶
Right after starting the cluster, or any time the database was stopped, the hubs try to reconnect. This can fail or take a long time to resolve (a health check should eventually fail, causing the pod to restart).
However, it’s completely fine to just restart all hub- and manage-related pods:
kubectl rollout restart deploy -l group=manage
kubectl rollout restart deploy -l group=hub
Lockups¶
In rare circumstances, or when services were disrupted, it might happen that the database locks up. This means there are hanging queries that want to modify or delete the same rows or tables and cannot continue because they are waiting on each other.
To resolve this in a safe way, you can create a view of all active locks and then cancel/terminate the offending connection.
CREATE OR REPLACE VIEW public.active_locks AS
SELECT t.schemaname,
t.relname,
l.locktype,
l.page,
l.virtualtransaction,
l.pid,
l.mode,
l.granted
FROM pg_locks l
JOIN pg_stat_all_tables t ON l.relation = t.relid
WHERE t.schemaname <> 'pg_toast'::name AND t.schemaname <> 'pg_catalog'::name
ORDER BY t.schemaname, t.relname;
Then run:
SELECT * FROM active_locks;
and you will see a table like the following. Note that all PIDs are the same!
schemaname | relname | locktype | page | virtualtransaction | pid | mode | granted
------------+----------------------+----------+------+--------------------+-------+------------------+---------
public | accounts | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | compute_images | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | hub_servers | relation | | 10/7853190 | 37038 | RowExclusiveLock | t
public | hub_servers | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | registration_tokens | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | registration_tokens | relation | | 10/7853190 | 37038 | RowShareLock | t
public | server_settings | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | stats | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | system_notifications | relation | | 10/7853190 | 37038 | AccessShareLock | t
(9 rows)
Then try to cancel – or if necessary, more forcefully terminate – the connection:
SELECT pg_cancel_backend('37038');
SELECT pg_terminate_backend('37038');
Run the view again, and you should see that the locks are gone.
Kubernetes Errors¶
HELM upgrade timeout¶
No problem. The default timeout is 5 minutes. Add the --timeout parameter to increase it, for example to 15 minutes:
helm upgrade --timeout 15m ...
and run it again. HELM upgrades are idempotent and it’s safe to run them multiple times.
You still have to check the status of the K8S workloads, though. Most likely, pulling the new images simply takes longer than the default timeout.
The pod state ContainerCreating is a good indicator of that, and if you check the details, the events say something about pulling the image.
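To check (the pod name is a placeholder):
kubectl get pods
# the Events section at the bottom mentions the image pull
kubectl describe pod [pod-name]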
selfLink was empty¶
If projects fail with selfLink was empty, can't make reference, your kubectl version and your cluster’s version are probably not compatible. Check what kubectl version has to say.
Evicted due to DiskPressure¶
Since the project images are large, you might encounter evicted project or Prepull pods due to DiskPressure. Kubernetes will clean up its local image cache, but as long as there are running pods still using those images, it will not be able to free up enough space.
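To confirm that DiskPressure is the cause (a sketch):
# which nodes report disk pressure? (see the Conditions section)
kubectl describe nodes | grep -i -B1 -A1 diskpressure
# evicted pods show up as Evicted and can be deleted safely
kubectl get pods | grep -i evicted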
A strategy to fix this:
Edit the project-image ConfigMap to point to the new project image tag:
kubectl edit cm project-image
and then set data.tag to the new project-[timestamp] value.
value.Delete all running projects:
kubectl delete pod --wait=false -l run=project
The idea is that the next project that starts will pull the new image, and the old images can then be garbage collected.
Long term, you need more disk space on your nodes.