Synapse database troubleshooting

Room Retention policy enabled causes Synapse database to consume a lot of disk space

  1. Run the following command against synapse postgres database : \d+. On an installer-managed postgresql, you can access psql command using : kubectl exec -it -n element-onprem synapse-postgres-0 -- bash -c 'psql "dbname=$POSTGRES_DB user=$POSTGRES_USER password=$POSTGRES_PASSWORD"'
  2. Check the space taken by the table state_groups_state. For example, here it's consuming 540 GB : public | state_groups_state | table | synapse_user | permanent | 244 GB |
  3. If you have Room retention policy enabled, there's a bug which causes some state groups to be orphaned, and as a consequence they are not cleaned up from the database automatically.
  4. Follow the instruction from the page synapse-find-unreferenced-state-groups. The tool is available for download in the following link rust-synapse-find-unreferenced-state-groups.
  5. On Standalone and Installer-managed postgresql database, you can use the following script to do it automatically. It's going to involve a downtime because Synapse is stopped before cleaning up orphaned state groups. Please make sure that you have appropriate disk space before running the script because the script generates a backup of the database before cleaning up the tables.. If you need to restore, the command will be kubectl exec -it pods/synapse-postgres-0 -n element-onprem -- psql "dbname=synapse user=synapse_user password=$POSTGRES_PASSWORD" < /path/to/dump.sql :
#!/bin/sh
set -e

echo "Stopping Operator..."
kubectl scale deploy/element-operator-controller-manager -n operator-onprem --replicas=0
echo "Stopping Synapse..."
kubectl delete synapse/first-element-deployment -n element-onprem

while kubectl get statefulsets -n element-onprem --no-headers -o custom-columns=":metadata.labels" | grep -q "matrix-server"; do
  echo "Waiting for synapse StatefulSets to be deleted..."
  sleep 2
done

echo "Forwarding postgresql port..."
kubectl port-forward pods/synapse-postgres-0 -n element-onprem   15432:5432 &
port_forward_pid=$!
sleep 1s

POSTGRES_PASSWORD=`echo $(kubectl get secrets/first-element-deployment-synapse-secrets -n element-onprem -o yaml | grep postgresPassword | cut -d ':' -f2) | base64 -d`
POSTGRES_USER=synapse_user
POSTGRES_DB=synapse

echo "Find unreferenced state groups..."
./rust-synapse-find-unreferenced-state-groups -p postgres://$POSTGRES_USER:$POSTGRES_PASSWORD@localhost:15432/$POSTGRES_DB -o ./sgs.txt
kill -9 $port_forward_pid

echo "Copy unreferenced state groups list to postgres pod..."
kubectl cp ./sgs.txt -n element-onprem synapse-postgres-0:/tmp/sgs.txt

echo "Backing up postgres database..."
kubectl exec -it pods/synapse-postgres-0 -n element-onprem -- bash -c 'pg_dump "dbname=$POSTGRES_DB user=$POSTGRES_USER password=$POSTGRES_PASSWORD"' > backup-unreferenced-state-groups-$(date '+%Y-%m-%d-%H:%M:%S').sql

echo "Cleanup postgres database..."
kubectl exec -it pods/synapse-postgres-0 -n element-onprem -- psql "dbname=$POSTGRES_DB user=$POSTGRES_USER password=$POSTGRES_PASSWORD" -c "CREATE TEMPORARY TABLE unreffed(id BIGINT PRIMARY KEY); COPY unreffed FROM '/tmp/sgs.txt' WITH (FORMAT 'csv'); DELETE FROM state_groups_state WHERE state_group IN (SELECT id FROM unreffed); DELETE FROM state_group_edges WHERE state_group IN (SELECT id FROM unreffed); DELETE FROM state_groups WHERE id IN (SELECT id FROM unreffed);"
echo "Starting Operator..."
kubectl scale deploy/element-operator-controller-manager -n operator-onprem --replicas=1


Running the script should look like this :

 bash cleanup-sgs.sh 
Stopping Operator...
deployment.apps/element-operator-controller-manager scaled
Stopping Synapse...
synapse.matrix.element.io "first-element-deployment" deleted
Waiting for synapse StatefulSets to be deleted...
Waiting for synapse StatefulSets to be deleted...
Waiting for synapse StatefulSets to be deleted...
Waiting for synapse StatefulSets to be deleted...
Waiting for synapse StatefulSets to be deleted...
Waiting for synapse StatefulSets to be deleted...
Waiting for synapse StatefulSets to be deleted...
Forwarding postgresql port...
Forwarding from 127.0.0.1:15432 -> 5432
Forwarding from [::1]:15432 -> 5432
Find unreferenced state groups...
Handling connection for 15432
  [0s] 741 rows retrieved
Fetched 725 state groups from DB
Total state groups: 725
Found 2 unreferenced groups
Copy unreferenced state groups list to postgres pod...
Defaulted container "postgres" out of: postgres, postgres-init-password, postgres-exporter
cleanup-sgs.sh: line 28: 428645 Killed                  kubectl port-forward pods/synapse-postgres-0 -n element-onprem 15432:5432
Backing up postgres database...
Cleanup postgres database...
Defaulted container "postgres" out of: postgres, postgres-init-password, postgres-exporter
DELETE 2
Starting Operator...
deployment.apps/element-operator-controller-manager scaled

Revision #9
Created 5 July 2024 14:37:00 by Gaël Goinvic
Updated 5 July 2024 16:14:44 by Gaël Goinvic