Virtuozzo Storage critical error: Unable connect to cluster, timeout (30 sec) expired – Virtuozzo Technical Support

Symptoms

The following error appears when running 'vstorage top':

Unable connect to cluster, timeout (30 sec) expired

Treat this error with caution: improper actions on a Virtuozzo Storage cluster that is experiencing it may cause data loss.

Cause

This is a standard error that appears under either of these conditions:

MDS quorum is lost.
The node where you are running 'vstorage top' is completely cut off from the Storage network and cannot reach the Master MDS of the Virtuozzo Storage cluster.

When there is no quorum of MDS servers, the cluster considers the situation critical: all Virtuozzo Storage clients (mountpoints) are suspended, and all read and write operations are frozen. Make sure that enough MDS servers are running without issues: at least half plus one of all registered MDS servers, that is, a majority.
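The quorum requirement can be worked out with simple arithmetic. The sketch below assumes that "half plus one" means a strict majority of the registered MDS servers (integer division). The function names are hypothetical helpers for illustration, not part of the vstorage tooling.

```shell
# Minimum number of healthy MDS servers needed for quorum,
# assuming "half plus one" means a strict majority of all registered servers.
mds_quorum_needed() {
    total=$1
    echo $(( total / 2 + 1 ))
}

# Succeeds (exit status 0) when the number of healthy MDS servers
# reaches the majority computed above.
mds_has_quorum() {
    total=$1
    healthy=$2
    [ "$healthy" -ge "$(mds_quorum_needed "$total")" ]
}
```

Under this assumption, a cluster with 3 registered MDS servers needs 2 of them running, and a cluster with 5 needs 3, so 'mds_has_quorum 3 1' fails while 'mds_has_quorum 3 2' succeeds.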

Resolution

1. On nodes with MDSes, check that the connections to the other MDSes are in the ESTABLISHED state. MDSes use port 2510:

# netstat -an |grep 2510
tcp        0      0 10.0.100.11:2510        0.0.0.0:*               LISTEN
tcp        0      0 10.0.100.11:2510        10.0.100.11:37004       ESTABLISHED
tcp        0     48 10.0.100.11:2510        10.0.100.13:38502       ESTABLISHED
tcp        0      0 10.0.100.11:37004       10.0.100.11:2510        ESTABLISHED
tcp        0     48 10.0.100.11:2510        10.0.100.12:39572       ESTABLISHED

Make sure that the Storage network is operational and that the VHS7 nodes in the Storage cluster can ping each other.
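To read the netstat output faster, the remote addresses connected to the local MDS port can be extracted with a small filter. This is a minimal sketch: the function name is hypothetical, it assumes the Linux 'netstat -an' column layout shown above (local address in column 4, remote address in column 5, state in column 6), and it handles IPv4 addresses only.

```shell
# Reads `netstat -an` output on stdin and prints the distinct remote IPv4
# addresses that have an ESTABLISHED connection to the local MDS port.
# The port defaults to 2510; pass another value as the first argument.
# Example (not run here): netstat -an | mds_established_peers
mds_established_peers() {
    awk -v port="${1:-2510}" '
        $6 == "ESTABLISHED" && $4 ~ (":" port "$") {
            split($5, remote, ":")   # remote[1] = IP, remote[2] = port
            print remote[1]
        }' | sort -u
}
```

If an address of a node that runs an MDS is missing from the result, check the Storage network between the two nodes first.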

2. On nodes with MDSes, check that the 'vstorage-mdsd' systemd service is running:

# systemctl status vstorage-mdsd*
● vstorage-mdsd.vhs-democluster.1.service - vstorage-mdsd(/vstorage/mds)
   Loaded: loaded (/usr/lib/systemd/system/vstorage-mdsd.vhs-democluster.1.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-04-20 18:49:39 +03; 3 months 26 days ago
 Main PID: 579867 (mds1)
   CGroup: /vstorage.slice/vstorage-services.slice/vstorage-mdsd.vhs-democluster.1.service
           └─579867 /usr/bin/mdsd -r /vstorage/mds --log-dir /vstorage/mds/logs --log-prefix mds --log-compre...

Apr 20 18:49:34 av.vhs-node1 systemd[1]: Starting vstorage-mdsd(/vstorage/mds)...
Apr 20 18:49:34 av.vhs-node1 vstorage-service[579798]: Starting mds service vhs-democluster:/vstorage/mds
Apr 20 18:49:34 av.vhs-node1 vstorage-service[579798]: mds started with pid 579867
Apr 20 18:49:39 av.vhs-node1 systemd[1]: Started vstorage-mdsd(/vstorage/mds).
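When a node hosts several MDS units, only the unit name and the "Active:" line matter for a first check. The filter below is a sketch that reduces 'systemctl status' output to one line per unit. The function name is hypothetical, and the parsing assumes the status layout shown above.

```shell
# Reads `systemctl status` output on stdin and prints "<unit> <state>"
# for every unit, where <state> is the first word after "Active:".
# Example (not run here): systemctl status 'vstorage-mdsd*' | mds_unit_states
mds_unit_states() {
    awk '
        # Unit header line: "<bullet> <name>.service - <description>"
        $2 ~ /\.service$/ && $3 == "-" { unit = $2 }
        # State line: "Active: active (running) since ..."
        $1 == "Active:"                { print unit, $2 }
    '
}
```

Any state other than "active" for a 'vstorage-mdsd' unit means that MDS does not count towards the quorum.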

One of the most common reasons why the 'vstorage-mdsd' service is down and cannot be started is a lack of free disk space, for example on the root partition:

# df -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              30G   30G     0 100% /
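The same check can be scripted by extracting the usage percentage from the df output. This is a minimal sketch with a hypothetical function name. It assumes single-line df records, which 'df -P' (POSIX output format) provides, with the usage percentage in the fifth column.

```shell
# Reads the output of `df -P <path>` (or `df -h <path>`) on stdin and prints
# the usage percentage of the first data line, without the "%" sign.
# Example (not run here): df -P / | df_use_percent
df_use_percent() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}
```

A value of 100, as in the example above, means the partition has no free space left, and the MDS service cannot write its logs or journal there.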

Examples of error messages that can be found at the end of the log file of the MDS server:

Failure to bind to a socket or to replay the journal when (re)starting the service:

06-06-13 20:06:41.099 Replaying journal ...
06-06-13 20:06:41.108 rjournal: replaying snapshot /pstorage/PCLUSTER-mds/journal.1.sn
06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Default lease timeout is set to 60000 msec
06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Default replication level is set to 1..1/3 QoS=0 Gr=1
06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Minimum replicas set to 1, which means that cluster can be unable to survive loss of a single machine!
06-06-13 20:06:41.108 added paxos node 1, addr 10.13.96.62:2510
06-06-13 20:06:41.108 paxos node 1 became active
06-06-13 20:06:41.867 Fatal: can't set local address 10.13.96.62:2510, err 98 (Address already in use)

Lack of free disk space when creating files on (re)starting the service:

# tail /var/log/pstorage/PCLUSTER/mds-3/fatal.log 
01-09-13 13:01:03.376 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
01-09-13 13:01:03.388 mdsd #7 reports hard error (134 / SIGABRT)
01-09-13 13:01:46.105 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
01-09-13 13:01:46.105 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz:
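The two failure signatures shown above can be searched for automatically. The filter below is a sketch with a hypothetical function name and hypothetical labels. It matches only the two message fragments quoted in this article ("Address already in use" and "No space left on device"), so other MDS failures are not detected.

```shell
# Reads MDS log lines on stdin and prints a short label for each known
# failure signature found: "port-in-use" and/or "disk-full".
# Prints nothing when neither message is present.
# Example (not run here):
#   tail -n 50 /var/log/pstorage/PCLUSTER/mds-3/fatal.log | mds_log_causes
mds_log_causes() {
    awk '
        /Address already in use/  { port = 1 }
        /No space left on device/ { space = 1 }
        END {
            if (port)  print "port-in-use"
            if (space) print "disk-full"
        }
    '
}
```

A "port-in-use" result suggests that another process is still bound to the MDS address, while "disk-full" points back to the free space check in step 2.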
