EMS Knowledge Base

The knowledge base for all Element Matrix Services provided products.

I can't upload files after updating to 0.6.1

Issue

After updating to 0.6.1, file uploads fail with a permission error.

Environment

Element Enterprise On-Premise 0.6.1

Resolution

To resolve this issue, recursively change the ownership of the directory configured in parameters.yml as media_host_data_path. For this example, in parameters.yml, we have:

media_host_data_path: "/mnt/data"

and a quick ls on this path shows the 991 ownership:

$ ls -l /mnt/
total 4
drwxr-xr-x 3 991 991 4096 Apr 27 13:20 data

To fix this, run:

sudo chown 10991:991 -R /mnt/data

afterwards, ls should show the 10991 ownership:

$ ls -l /mnt/
total 4
drwxr-xr-x 3 10991 991 4096 Apr 27 13:20 data

and now you should be able to upload files again.
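If you would rather script the fix, here is a minimal sketch that reads the configured path out of parameters.yml and fixes the ownership in one step. It assumes parameters.yml is in your current directory and that the value is double-quoted, as in the example above:

# Pull the media_host_data_path value out of parameters.yml and fix ownership.
MEDIA_PATH=$(awk -F'"' '/media_host_data_path/ {print $2}' parameters.yml)
sudo chown -R 10991:991 "$MEDIA_PATH"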

Root Cause

In this case, the installation started with 0.5.3, and in 0.6.0 we changed the UID that synapse runs as in order to avoid conflicting with any potential system UID. Previously the UID was 991, but we moved to 10991. As such, this breaks permissions on the existing synapse_media directory.
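To confirm which UID your synapse container is actually running as, you can exec into the pod. This assumes the image ships the standard id utility; on 0.6.0 and later you should see 10991:

# Print the UID/GID the synapse container runs as; expect 10991 on 0.6.0+.
kubectl exec -n element-onprem instance-synapse-main-0 -- id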

You may see an error similar to this one in your synapse logs, which can be obtained by running kubectl logs -n element-onprem instance-synapse-main-0:

2022-04-27 13:28:02,521 - synapse.http.server - 100 - ERROR - POST-59388 - Failed handle request via 'UploadResource': <XForwardedForRequest at 0x7f9aa49f9e20 method='POST' uri='/_matrix/media/r0/upload' clientproto='HTTP/1.1' site='8008'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 269, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 297, in _async_render
    callback_return = await raw_callback_return
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/upload_resource.py", line 96, in _async_render_POST
    content_uri = await self.media_repo.create_content(
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_repository.py", line 178, in create_content
    fname = await self.media_storage.store_file(content, file_info)
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_storage.py", line 92, in store_file
    with self.store_into_file(file_info) as (f, fname, finish_cb):
  File "/usr/local/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_storage.py", line 135, in store_into_file
    os.makedirs(dirname, exist_ok=True)
  File "/usr/local/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/local/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/media/media_store/local_content/PQ'

synapse-haproxy container in CrashLoopBackOff state

Issue

We are seeing:

[karl1@element ~]$ kubectl get pods -n element-onprem
NAME                                        READY   STATUS             RESTARTS   AGE
server-well-known-8c6bd8447-fts78           1/1     Running            2          39h
app-element-web-c5bd87777-745gh             1/1     Running            2          39h
postgres-0                                  1/1     Running            2          39h
instance-synapse-haproxy-5b4b55fc9c-jv7pp   0/1     CrashLoopBackOff   40         39h
instance-synapse-main-0                     1/1     Running            6          39h

and the synapse-haproxy container never leaves the CrashLoopBackOff state.

Environment

Resolution

Add the following lines to /etc/security/limits.conf:

*              soft    nofile  4096
*              hard    nofile  10240

and reboot the box. After a reboot, the microk8s environment will come back up and the synapse-haproxy container should run without error.
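After the reboot, you can confirm that the new limits took effect from any shell on the box:

# Soft and hard open-file limits; expect 4096 and 10240 respectively.
ulimit -Sn
ulimit -Hn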

Root Cause

Check the logs of synapse-haproxy with this command:

kubectl logs -n element-onprem instance-synapse-haproxy-5b4b55fc9c-jv7pp

You will want to replace the instance name with your specific instance. See if you have this message:

'[haproxy.main()] Cannot raise FD limit to 80034, limit 65536.'

If so, you have run out of open file descriptors and as such the container cannot start.
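If the container has been restarting for a while, a quick way to check for this message is to grep both the current and the previous container logs (substituting your instance name as above):

# Search current and previous container logs for the FD limit error.
kubectl logs -n element-onprem instance-synapse-haproxy-5b4b55fc9c-jv7pp | grep -i 'FD limit'
kubectl logs -n element-onprem --previous instance-synapse-haproxy-5b4b55fc9c-jv7pp | grep -i 'FD limit'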

How do I register a new admin user after install?

Issue

I've just installed Element Enterprise On-Premise and would like to set up an admin user.

Environment

Resolution

Run the following command, substituting <registration_shared_secret> with what you specified in secrets.yml:

kubectl exec -n element-onprem -it pods/instance-synapse-main-0 -- register_new_matrix_user -u <username> -p <password> -a -c /config/homeserver.yaml -k <registration_shared_secret>

Alternatively, you can browse to Element Web and register your first user there.
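If you want to verify that the new account actually has admin rights, one option is to query the Synapse admin API. This is a sketch: substitute your own homeserver URL (matrix.example.com here is a placeholder), username, domain, and the access token returned by the login call:

# Log in as the new user to obtain an access token.
curl -s -X POST https://matrix.example.com/_matrix/client/r0/login \
  -d '{"type":"m.login.password","user":"<username>","password":"<password>"}'

# Look the account up with the returned access_token; "admin": true confirms the -a flag worked.
curl -s -H "Authorization: Bearer <access_token>" \
  https://matrix.example.com/_synapse/admin/v2/users/@<username>:<domain>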

Root Cause

Element Enterprise does not automatically create any users for you, so you may register your first user using the command in the resolution.

After upgrading to 1.0.0, postgres-0 is in CrashLoopBackOff state

Issue

After upgrading to 1.0.0, the postgres-0 pod is stuck in the CrashLoopBackOff state.

Environment

Element Enterprise On-Premise 1.0.0

Resolution

To fix this issue, first read the Root Cause and Issue sections and double-check that this is your issue. The resolution is to delete the StatefulSet (sts), PersistentVolumeClaim (pvc), and PersistentVolume (pv) for postgres along with the empty data directory, and then re-run the installer. These steps WILL destroy any existing PostgreSQL data, which in the ephemeral case (that this issue describes) is none.

To find where the data directory is, run:

kubectl describe pv postgres | grep -i path

This will show output similar to:

StorageClass:      microk8s-hostpath
    Type:          HostPath (bare host directory volume)
    Path:          /mnt/data/synapse-postgres
    HostPathType:  

From here, we can see that /mnt/data/synapse-postgres is where postgres is trying to initialize the database. Let's take a look at that directory:

[user@element2 element-enterprise-installer-1.0.0]$ sudo ls -l /mnt/data/synapse-postgres/
total 0
drwx------. 2 systemd-coredump input 6 Apr 26 15:13 data
[user@element2 element-enterprise-installer-1.0.0]$ sudo ls -l /mnt/data/synapse-postgres/data
total 0

As you can see, we have the data directory and it is empty. Make a note of this directory for later.

Now we need to remove the pvc and the pv. If you really do have just an empty data directory, there is no need to make a backup. If you have anything more than an empty data directory in your postgres pv path, you will want to STOP AND MAKE A BACKUP OF THAT PATH'S CONTENTS.

Now, to delete the PVC, you will need two terminals. In one terminal, you will run:

kubectl delete pvc -n element-onprem postgres

You will notice that this command just sits there waiting once run; the PVC is protected by a finalizer and cannot be removed while the postgres-0 pod is still using it. In another terminal, run this command:

kubectl delete pod -n element-onprem postgres-0

As soon as the pod is deleted, you should notice that the kubectl delete pvc command also completes. At this point, we now need to delete the pv (PVs are cluster-scoped, so no namespace flag is needed):

kubectl delete pv postgres

Now it is time to remove the sts for postgres:

kubectl delete sts -n element-onprem postgres

Remove the data directory:

sudo rm -r /mnt/data/synapse-postgres/data
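Before re-running the installer, it is worth double-checking that all three objects are really gone; each of these should return a NotFound error:

# All three should report NotFound at this point.
kubectl get sts -n element-onprem postgres
kubectl get pvc -n element-onprem postgres
kubectl get pv postgres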

Now re-run the installer. Once the installer has been re-run, you should have a working PostgreSQL. You should see a running pod in kubectl get pods -n element-onprem:

postgres-0                                  1/1     Running   0              2m11s

and your /mnt/data/synapse-postgres directory should have entries similar to:

drwx------. 6 systemd-coredump input    54 May  6 10:14 base
drwx------. 2 systemd-coredump input  4096 May  6 10:15 global
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_commit_ts
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_dynshmem
-rw-------. 1 systemd-coredump input  4782 May  6 10:14 pg_hba.conf
-rw-------. 1 systemd-coredump input  1636 May  6 10:14 pg_ident.conf
drwx------. 4 systemd-coredump input    68 May  6 10:14 pg_logical
drwx------. 4 systemd-coredump input    36 May  6 10:14 pg_multixact
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_notify
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_replslot
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_serial
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_snapshots
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_stat
drwx------. 2 systemd-coredump input    63 May  6 10:15 pg_stat_tmp
drwx------. 2 systemd-coredump input    18 May  6 10:14 pg_subtrans
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_tblspc
drwx------. 2 systemd-coredump input     6 May  6 10:14 pg_twophase
-rw-------. 1 systemd-coredump input     3 May  6 10:14 PG_VERSION
drwx------. 3 systemd-coredump input    60 May  6 10:14 pg_wal
drwx------. 2 systemd-coredump input    18 May  6 10:14 pg_xact
-rw-------. 1 systemd-coredump input    88 May  6 10:14 postgresql.auto.conf
-rw-------. 1 systemd-coredump input 28156 May  6 10:14 postgresql.conf
-rw-------. 1 systemd-coredump input    36 May  6 10:14 postmaster.opts
-rw-------. 1 systemd-coredump input    94 May  6 10:14 postmaster.pid

Finally, restart the synapse pod by doing:

kubectl delete pod -n element-onprem instance-synapse-main-0

Wait for that pod to restart and be completely running again. Verify with kubectl get pods -n element-onprem that you have a line similar to:

instance-synapse-main-0                     1/1     Running   0              2m36s

Root Cause

In 0.6.1, we had a bug that caused the included PostgreSQL database to not be written to disk, and thus it did not survive restarts. The bug has been fixed in 1.0.0; however, prior versions of the installer did get as far as writing a data directory into the PostgreSQL storage set up by microk8s. As such, postgres finds this directory on start-up and fails to initialize a new database with the specific log mentioned in the Issue section.

If you do not have this specific error, please do not run the steps in the Resolution section of this knowledge base solution.

After an install, I only have the postgres-0 pod!

Issue

After running the installer, kubectl get pods -n element-onprem shows only the postgres-0 pod; the rest of the stack never comes up.

Environment

Resolution

There is no permanent fix yet; see the Root Cause section below for details and a possible workaround.

Root Cause

The reason this happens is that, under certain scenarios, microk8s fails to load the br_netfilter kernel module. This allows the calico networking to fall back to user-space routing, which fails to work in this environment and causes the calico-kube-controllers pod to not start, which in turn cascades into the rest of the stack not coming up. More on this specific issue can be seen here: https://github.com/canonical/microk8s/issues/3085. The microk8s team does expect to release a fix, and we will work to incorporate it in the future.
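Until that fix lands, you can check whether this is what happened on your host and, as a possible workaround (an educated guess based on the root cause above, not an officially tested fix), load the module by hand and restart microk8s:

# No output from lsmod means the module is not loaded.
lsmod | grep br_netfilter

# Load the module manually, then restart microk8s so calico picks it up.
sudo modprobe br_netfilter
sudo microk8s stop
sudo microk8s start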

Getting a 502 Bad Gateway Error When Accessing Element Web

Issue

Attempting to access Element Web returns a 502 Bad Gateway error.

Environment

Resolution

Allow HTTP and HTTPS traffic and enable masquerading in firewalld, then reload the firewall:

sudo firewall-cmd --add-service={http,https} --permanent
sudo firewall-cmd --add-masquerade --permanent
sudo firewall-cmd --reload
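Once the reload completes, you can verify that the rules are active; http and https should be listed under services, and masquerade should read yes:

# Show the active configuration of the default zone.
sudo firewall-cmd --list-all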

Root Cause

By default, firewalld does not allow masquerading (Network Address Translation, NAT) through the firewall. This breaks the NAT required to reach the pods in microk8s, which results in the 502 Bad Gateway error.

url.js:354 error starting dimension

Issue

Starting matrix-dimension
url.js:354
      this.auth = decodeURIComponent(rest.slice(0, atSign));
                  ^

URIError: URI malformed
    at decodeURIComponent (<anonymous>)
    at Url.parse (url.js:354:19)
    at Object.urlParse [as parse] (url.js:157:13)
    at new Sequelize (/home/node/matrix-dimension/node_modules/sequelize/dist/lib/sequelize.js:1:1292)
    at new Sequelize (/home/node/matrix-dimension/node_modules/sequelize-typescript/dist/sequelize/sequelize/sequelize.js:16:9)
    at new _DimensionStore (/home/node/matrix-dimension/build/app/db/DimensionStore.js:42:30)
    at Object.<anonymous> (/home/node/matrix-dimension/build/app/db/DimensionStore.js:106:26)
    at Module._compile (internal/modules/cjs/loader.js:1072:14)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1101:10)
    at Module.load (internal/modules/cjs/loader.js:937:32)

Environment

Resolution

Ensure that your PostgreSQL password does not contain any % characters. Once you have removed them, update your configuration files and re-run the installer.

Root Cause

Dimension does not properly encode the % character in its PostgreSQL connection URL, and this triggers the above error.
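For illustration, a % inside a URL has to be written as %25. The sketch below uses python3's standard urllib purely to demonstrate what a correctly encoded password would look like; since Dimension does not do this encoding for you, the supported fix remains removing % from the password:

# Show the URL-encoded form of a password containing % (hypothetical password, illustration only).
python3 -c 'import urllib.parse; print(urllib.parse.quote("p%ssw0rd", safe=""))'
# prints: p%25ssw0rd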

Installer fails on enabling addons

Issue

The installer states that it has failed, and I'm seeing messages like:

skipping: [localhost] => (item=host-access) 
changed: [localhost] => (item=ingress)
FAILED - RETRYING: [localhost]: enable addons (3 retries left).
FAILED - RETRYING: [localhost]: enable addons (2 retries left).
FAILED - RETRYING: [localhost]: enable addons (1 retries left).
failed: [localhost] (item=metrics-server) => {"ansible_loop_var": "item", "attempts": 3, "changed": true, "cmd": ["/snap/bin/microk8s.enable", "metrics-server"], "delta": "0:00:09.568390", "end": "2022-04-13 12:08:41.833858", "item": {"enabled": true, "name": "metrics-server"}, "msg": "non-zero return code", "rc": -15, "start": "2022-04-13 12:08:32.265468", "stderr": "Warning: apiregistration.k8s.io/v1beta1 APIService is deprecated in v1.19+, unavailable in v1.22+; use apiregistration.k8s.io/v1 APIService", "stderr_lines": ["Warning: apiregistration.k8s.io/v1beta1 APIService is deprecated in v1.19+, unavailable in v1.22+; use apiregistration.k8s.io/v1 APIService"], "stdout": "Enabling Metrics-Server\nclusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader unchanged\nclusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged\nrolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged\napiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged\nserviceaccount/metrics-server unchanged\ndeployment.apps/metrics-server unchanged\nservice/metrics-server unchanged\nclusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged\nclusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged\nclusterrolebinding.rbac.authorization.k8s.io/microk8s-admin unchanged", "stdout_lines": ["Enabling Metrics-Server", "clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader unchanged", "clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged", "rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged", "apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged", "serviceaccount/metrics-server unchanged", "deployment.apps/metrics-server unchanged", "service/metrics-server unchanged", "clusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged", "clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged", "clusterrolebinding.rbac.authorization.k8s.io/microk8s-admin unchanged"]}
skipping: [localhost] => (item=rbac) 
changed: [localhost] => (item=registry)

Environment

Resolution

Re-run the installer until these errors clear and all of the microk8s addons are enabled.
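If the retries keep failing on the same addon, it can also help to enable that addon by hand and wait for microk8s to settle before re-running the installer. A workaround sketch for the metrics-server case from the output above:

# Enable the failing addon manually, then wait for microk8s to become ready.
sudo /snap/bin/microk8s.enable metrics-server
sudo /snap/bin/microk8s.status --wait-ready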

Root Cause

There is a microk8s timing issue that we have not yet fully tracked down; re-running the installer allows the addons to enable successfully.