Troubleshooting

Introduction to Troubleshooting

Troubleshooting the Element Installer comes down to knowing a little about Kubernetes and how to check the status of its various resources. This guide walks you through the initial steps to take when things go wrong.

install.sh problems

Sometimes the ansible-playbook portion of the installer fails. When this happens, you can increase the verbosity of Ansible's logging by editing .ansible.rc in the installer directory and setting:

export ANSIBLE_DEBUG=true
export ANSIBLE_VERBOSITY=4

and re-running the installer. The output will be very verbose, but it typically helps pinpoint the actual problem with the installer.
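
If you prefer to script this change, the two settings above can be appended to the file without duplicating them on repeated runs. This is only a minimal sketch: the function name enable_ansible_debug is hypothetical, and it assumes .ansible.rc is a plain shell file of export lines that ends with a newline.

```shell
# Append the Ansible debug settings to an rc file, skipping any line
# that is already present. (Hypothetical helper, not part of the installer.)
# Usage: enable_ansible_debug /path/to/installer/.ansible.rc
enable_ansible_debug() {
  rc_file="$1"
  for setting in 'export ANSIBLE_DEBUG=true' 'export ANSIBLE_VERBOSITY=4'; do
    # -x matches the whole line, -F treats the pattern as a literal string
    if ! grep -qxF "$setting" "$rc_file" 2>/dev/null; then
      printf '%s\n' "$setting" >> "$rc_file"
    fi
  done
}
```

If the file already sets ANSIBLE_VERBOSITY to a different value, this sketch does not remove that line; the appended line simply comes later in the file.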

Problems post-installation

Checking Pod Status and Getting Logs

In general, a well-functioning Element stack has, at a minimum, the following containers (or pods, in Kubernetes terminology) running:

[user@element2 ~]$ kubectl get pods -n element-onprem
NAME                                        READY   STATUS    RESTARTS      AGE
instance-synapse-main-0                     1/1     Running   4 (27h ago)   6d21h
postgres-0                                  1/1     Running   2 (27h ago)   6d21h
app-element-web-688489b777-v7l2m            1/1     Running   6 (27h ago)   6d22h
server-well-known-55bdb6b66-m8px6           1/1     Running   2 (27h ago)   6d21h
instance-synapse-haproxy-554bd57975-z2ppv   1/1     Running   3 (27h ago)   6d21h

The kubectl get pods -n element-onprem command above is the first place to start. In this output, all of the pods are in the Running status, which indicates that all should be well. If the status is anything other than "Running" or "Creating" (which kubectl reports as ContainerCreating), you'll want to grab logs for those pods. To grab the logs for a pod, run:

kubectl logs -n element-onprem <pod name>

replacing <pod name> with the actual pod name. To get the logs from Synapse, the specific syntax would be:

kubectl logs -n element-onprem instance-synapse-main-0

and this would generate logs similar to:

2022-05-03 17:46:33,333 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-2887 - Dropped 0 items from caches
2022-05-03 17:46:33,375 - synapse.storage.databases.main.metrics - 471 - INFO - generate_user_daily_visits-289 - Calling _generate_user_daily_visits
2022-05-03 17:46:58,424 - synapse.metrics._gc - 118 - INFO - sentinel - Collecting gc 1
2022-05-03 17:47:03,334 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-2888 - Dropped 0 items from caches
2022-05-03 17:47:33,333 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-2889 - Dropped 0 items from caches
2022-05-03 17:48:03,333 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-2890 - Dropped 0 items from caches

Again, for every pod not in the Running or Creating status, you'll want to use the above procedure to get the logs for Element to look at.
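
When several pods are affected, the procedure above can be scripted. The following is a minimal sketch, not part of the installer: the function name unhealthy_pods is hypothetical, and it assumes the column layout shown in the kubectl get pods output above, with the creating state printed as ContainerCreating.

```shell
# Print the names of pods whose STATUS is neither Running nor
# ContainerCreating. Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "ContainerCreating" { print $1 }'
}

# Example (requires access to the cluster): save one log file per pod.
#   kubectl get pods -n element-onprem --no-headers | unhealthy_pods |
#     while read -r pod; do
#       kubectl logs -n element-onprem "$pod" > "$pod.log"
#     done
```

The filter only inspects the text that kubectl prints, so you can check it against saved output before running it on a live system.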

If the above command shows no pods in the element-onprem namespace, run:

[user@element2 ~]$ kubectl get pods -A
NAMESPACE            NAME                                         READY   STATUS    RESTARTS       AGE
container-registry   registry-5f697bb7df-dbzpq                    1/1     Running   6 (27h ago)    6d22h
kube-system          dashboard-metrics-scraper-69d9497b54-hdrdq   1/1     Running   6 (27h ago)    6d22h
kube-system          hostpath-provisioner-7764447d7c-jckkc        1/1     Running   11 (17h ago)   6d22h
element-onprem       instance-synapse-main-0                      1/1     Running   4 (27h ago)    6d22h
element-onprem       postgres-0                                   1/1     Running   2 (27h ago)    6d22h
element-onprem       app-element-web-688489b777-v7l2m             1/1     Running   6 (27h ago)    6d22h
element-onprem       server-well-known-55bdb6b66-m8px6            1/1     Running   2 (27h ago)    6d21h
kube-system          calico-kube-controllers-6966456d6b-x4scn     1/1     Running   6 (27h ago)    6d22h
element-onprem       instance-synapse-haproxy-554bd57975-z2ppv    1/1     Running   3 (27h ago)    6d21h
kube-system          calico-node-l28tp                            1/1     Running   6 (27h ago)    6d22h
kube-system          coredns-64c6478b6c-h5jp4                     1/1     Running   6 (27h ago)    6d22h
ingress              nginx-ingress-microk8s-controller-n6wmk      1/1     Running   6 (27h ago)    6d22h
operator-onprem      osdk-controller-manager-5f9d86f765-t2kn9     2/2     Running   9 (17h ago)    6d22h
kube-system          metrics-server-679c5f986d-msfc5              1/1     Running   6 (27h ago)    6d22h
kube-system          kubernetes-dashboard-585bdb5648-vrn42        1/1     Running   10 (17h ago)   6d22h

This is the output from a healthy system. If any of these pods is not in the Running or Creating state, please gather its logs using the following syntax:
```
kubectl logs -n <namespace> <pod name>
```

So to gather logs for the Kubernetes ingress, you would run:

kubectl logs -n ingress nginx-ingress-microk8s-controller-n6wmk

and you would see logs similar to:

I0502 14:15:08.467258       6 leaderelection.go:248] attempting to acquire leader lease ingress/ingress-controller-leader...
I0502 14:15:08.467587       6 controller.go:155] "Configuration changes detected, backend reload required"
I0502 14:15:08.481539       6 leaderelection.go:258] successfully acquired lease ingress/ingress-controller-leader
I0502 14:15:08.481656       6 status.go:84] "New leader elected" identity="nginx-ingress-microk8s-controller-n6wmk"
I0502 14:15:08.515623       6 controller.go:172] "Backend successfully reloaded"
I0502 14:15:08.515681       6 controller.go:183] "Initial sync, sleeping for 1 second"
I0502 14:15:08.515705       6 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress", Name:"nginx-ingress-microk8s-controller-n6wmk", UID:"548d9478-094e-4a19-ba61-284b60152b85", APIVersion:"v1", ResourceVersion:"524688", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration

Again, for all pods not in the Running or Creating state, please use the above method to get log data to send to Element.
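
To cover every namespace in one pass, the listing from kubectl get pods -A can be turned into the matching kubectl logs commands. This is a sketch under assumptions: the function name log_commands and the output file naming are hypothetical, and the column positions are taken from the -A output shown above.

```shell
# Turn `kubectl get pods -A --no-headers` output (stdin) into one
# `kubectl logs` command per pod that is not Running or ContainerCreating.
# With -A the columns are: NAMESPACE NAME READY STATUS RESTARTS AGE.
log_commands() {
  awk '$4 != "Running" && $4 != "ContainerCreating" {
    printf "kubectl logs -n %s %s > %s_%s.log\n", $1, $2, $1, $2
  }'
}

# Example (requires access to the cluster): print the commands for review.
#   kubectl get pods -A --no-headers | log_commands
```

The function only prints commands, so you can review them before running them. For pods with more than one container, such as osdk-controller-manager (2/2) above, you may need to add --all-containers to the kubectl logs command.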

Other Commands of Interest

Other commands that may yield useful data while troubleshooting are:

Show all persistent volumes and the persistent volume claims bound to them (persistent volumes are cluster-scoped, so the listing is not limited to the element-onprem namespace):

kubectl get pv -n element-onprem

This will give you output similar to:

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS        REASON   AGE
pvc-9fc3bc29-2e5d-4b88-a9cd-a4c855352404   20Gi       RWX            Delete           Bound    container-registry/registry-claim   microk8s-hostpath            55d
synapse-media                              50Gi       RWO            Delete           Bound    element-onprem/synapse-media        microk8s-hostpath            7d
postgres                                   5Gi        RWO            Delete           Bound    element-onprem/postgres             microk8s-hostpath            7d

Show the Synapse configuration:

kubectl describe cm -n element-onprem instance-synapse-shared

and this will return output similar to:

send_federation: True
start_pushers: True
turn_allow_guests: true
turn_shared_secret: n0t4ctuAllymatr1Xd0TorgSshar3d5ecret4obvIousreAsons
turn_uris:
- turns:turn.matrix.org?transport=udp
- turns:turn.matrix.org?transport=tcp
turn_user_lifetime: 86400000

Show the Element Web configuration:

kubectl describe cm -n element-onprem app-element-web

and this will return output similar to:

config.json:
----
{
  "default_server_config": {
    "m.homeserver": {
      "base_url": "https://synapse2.local",
      "server_name": "local"
    }
  },
  "dummy_end": "placeholder",
  "integrations_jitsi_widget_url": "https://dimension.element2.local/widgets/jitsi",
  "integrations_rest_url": "https://dimension.element2.local/api/v1/scalar",
  "integrations_ui_url": "https://dimension.element2.local/element",
  "integrations_widgets_urls": [
      "https://dimension.element2.local/widgets"
  ]
}

Show the nginx configuration for Element Web (applicable if you use nginx as your ingress controller in production or are using the PoC installer):

kubectl describe cm -n element-onprem app-element-web-nginx

and this will return output similar to:

  server {
      listen       8080;

      add_header X-Frame-Options SAMEORIGIN;
      add_header X-Content-Type-Options nosniff;
      add_header X-XSS-Protection "1; mode=block";
      add_header Content-Security-Policy "frame-ancestors 'self'";
      add_header X-Robots-Tag "noindex, nofollow, noarchive, noimageindex";

      location / {
          root   /usr/share/nginx/html;
          index  index.html index.htm;

          charset utf-8;
      }
  }

Check the list of active Kubernetes events:
```
kubectl get events -A
```
You will see a list of events or the message No resources found.
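
On a busy cluster the event list can be long, and the Warning events are usually the interesting ones. The following sketch filters them out of the listing. The function name warning_events is hypothetical, and the column layout (NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE) is an assumption about how your kubectl version prints events with -A.

```shell
# Keep only Warning events from `kubectl get events -A --no-headers` (stdin).
# Assumed columns with -A: NAMESPACE LAST-SEEN TYPE REASON OBJECT MESSAGE.
warning_events() {
  awk '$3 == "Warning"'
}

# Example (requires access to the cluster):
#   kubectl get events -A --no-headers | warning_events
```

kubectl should also be able to do this filtering itself with --field-selector type=Warning; the text filter above is handy when you are working from saved output.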

Show the state of services in the element-onprem namespace:

kubectl get services -n element-onprem

This should return output similar to:

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                    AGE
postgres                         ClusterIP   10.152.183.47    <none>        5432/TCP                   6d23h
app-element-web                  ClusterIP   10.152.183.60    <none>        80/TCP                     6d23h
server-well-known                ClusterIP   10.152.183.185   <none>        80/TCP                     6d23h
instance-synapse-main-headless   ClusterIP   None             <none>        80/TCP                     6d23h
instance-synapse-main-0          ClusterIP   10.152.183.105   <none>        80/TCP,9093/TCP,9001/TCP   6d23h
instance-synapse-haproxy         ClusterIP   10.152.183.78    <none>        80/TCP                     6d23h

Show the status of the stateful sets in the element-onprem namespace:

kubectl get sts -n element-onprem

This should return output similar to:

NAME                    READY   AGE
postgres                1/1     6d23h
instance-synapse-main   1/1     6d23h

Show deployments in the element-onprem namespace:

kubectl get deploy -n element-onprem

This will return output similar to:

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
app-element-web            1/1     1            1           6d23h
server-well-known          1/1     1            1           6d23h
instance-synapse-haproxy   1/1     1            1           6d23h
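
In both the stateful set and deployment listings above, the READY column shows ready replicas over desired replicas, so anything other than matching numbers deserves a closer look. As a rough sketch (the function name not_ready is hypothetical), the check can be automated like this:

```shell
# Print workloads whose READY column (ready/desired) does not match.
# Reads `kubectl get sts` or `kubectl get deploy` output produced with
# --no-headers on stdin; in both, column 1 is NAME and column 2 is READY.
not_ready() {
  awk '{
    split($2, r, "/")            # r[1] = ready replicas, r[2] = desired
    if (r[1] != r[2]) print $1, $2
  }'
}

# Example (requires access to the cluster):
#   kubectl get deploy -n element-onprem --no-headers | not_ready
#   kubectl get sts -n element-onprem --no-headers | not_ready
```

No output means every listed workload has all of its replicas ready.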

Show the status of all namespaces:

kubectl get namespaces

which will return output similar to:

NAME                 STATUS   AGE
kube-system          Active   20d
kube-public          Active   20d
kube-node-lease      Active   20d
default              Active   20d
ingress              Active   6d23h
container-registry   Active   6d23h
operator-onprem      Active   6d23h
element-onprem       Active   6d23h

Destroy the microk8s setup

If you wish to start over, you can reset the microk8s setup by running:
```
microk8s.reset --destroy-storage
```
WARNING: This will destroy all of your microk8s containers and storage. Use with caution.