Symptoms
The following error occurs when running 'vstorage top':
Unable connect to cluster, timeout (30 sec) expired
Treat this error with caution: improper actions on a Virtuozzo Storage cluster that is experiencing this error may cause data loss.
Cause
This is a generic error that appears under either of the following conditions:
- MDS quorum is lost.
- The particular node where you are running 'vstorage top' is completely cut off from the Storage network and cannot reach the Master MDS of the Virtuozzo Storage cluster.
When there is no quorum of MDS servers, the cluster considers this situation critical. All Virtuozzo Storage clients (mountpoints) become suspended, with all read and write operations frozen. It is important to make sure that more than half of all registered MDS servers are running without issues (for example, at least 2 out of 3, or 3 out of 5).
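The quorum requirement described above can be sketched as simple integer arithmetic. This is only an illustration of the "more than half" rule; the function names mds_quorum and has_quorum are hypothetical helpers, not part of the vstorage tooling.

```shell
#!/bin/sh
# mds_quorum TOTAL
# Print the smallest number of MDS servers that is more than half of TOTAL
# registered servers (shell integer division rounds down).
mds_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum RUNNING TOTAL
# Succeed (exit status 0) if RUNNING servers are enough for a quorum of TOTAL.
has_quorum() {
    [ "$1" -ge "$(mds_quorum "$2")" ]
}

# Show the requirement for a few common cluster sizes.
for total in 1 3 5; do
    echo "registered=$total required=$(mds_quorum "$total")"
done
```

For example, 'has_quorum 2 5' fails, because 2 running MDS servers out of 5 registered is less than the required 3.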
Resolution
1. On nodes with MDSes, check that the connections to the other MDSes are in the ESTABLISHED state. MDSes listen on port 2510:
# netstat -an | grep 2510
tcp        0      0 10.0.100.11:2510        0.0.0.0:*               LISTEN
tcp        0      0 10.0.100.11:2510        10.0.100.11:37004       ESTABLISHED
tcp        0     48 10.0.100.11:2510        10.0.100.13:38502       ESTABLISHED
tcp        0      0 10.0.100.11:37004       10.0.100.11:2510        ESTABLISHED
tcp        0     48 10.0.100.11:2510        10.0.100.12:39572       ESTABLISHED
Make sure that the Storage network is operational and that the VHS7 nodes in the Storage cluster can ping each other.
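To make the netstat check easier to read on a busy node, the output can be reduced to the list of peers connected to the local MDS port. The following is a minimal sketch, assuming the usual 'netstat -an' column layout (protocol, queues, local address, foreign address, state) and IPv4 addresses; established_mds_peers is a hypothetical helper name, and the sample input is the output shown in step 1.

```shell
#!/bin/sh
# established_mds_peers
# Read `netstat -an` output on stdin and print the distinct remote IP
# addresses that have an ESTABLISHED TCP connection to local port 2510.
established_mds_peers() {
    awk '$1 == "tcp" && $6 == "ESTABLISHED" && $4 ~ /:2510$/ {
        sub(/:[0-9]+$/, "", $5)   # strip the remote port, keep the address
        print $5
    }' | sort -u
}

# Demonstration on the sample output from this article.
established_mds_peers <<'EOF'
tcp 0 0 10.0.100.11:2510 0.0.0.0:* LISTEN
tcp 0 0 10.0.100.11:2510 10.0.100.11:37004 ESTABLISHED
tcp 0 48 10.0.100.11:2510 10.0.100.13:38502 ESTABLISHED
tcp 0 0 10.0.100.11:37004 10.0.100.11:2510 ESTABLISHED
tcp 0 48 10.0.100.11:2510 10.0.100.12:39572 ESTABLISHED
EOF
```

On a live node the same function would be used as 'netstat -an | established_mds_peers'. An MDS address that is missing from the result is a candidate for a network or service problem.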
2. On nodes with MDSes, check that the 'vstorage-mdsd' systemd service is running:
# systemctl status vstorage-mdsd*
● vstorage-mdsd.vhs-democluster.1.service - vstorage-mdsd(/vstorage/mds)
   Loaded: loaded (/usr/lib/systemd/system/vstorage-mdsd.vhs-democluster.1.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-04-20 18:49:39 +03; 3 months 26 days ago
 Main PID: 579867 (mds1)
   CGroup: /vstorage.slice/vstorage-services.slice/vstorage-mdsd.vhs-democluster.1.service
           └─579867 /usr/bin/mdsd -r /vstorage/mds --log-dir /vstorage/mds/logs --log-prefix mds --log-compre...

Apr 20 18:49:34 av.vhs-node1 systemd[1]: Starting vstorage-mdsd(/vstorage/mds)...
Apr 20 18:49:34 av.vhs-node1 vstorage-service[579798]: Starting mds service vhs-democluster:/vstorage/mds
Apr 20 18:49:34 av.vhs-node1 vstorage-service[579798]: mds started with pid 579867
Apr 20 18:49:39 av.vhs-node1 systemd[1]: Started vstorage-mdsd(/vstorage/mds).
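When several MDS units exist on a node, the status output can be condensed to one line per unit. This is a rough sketch that only parses the text format shown above (a unit header line followed by an "Active:" line); mds_unit_states is a hypothetical helper name.

```shell
#!/bin/sh
# mds_unit_states
# Read `systemctl status` output on stdin and print, for every
# vstorage-mdsd unit, the unit name followed by its state and sub-state.
mds_unit_states() {
    awk '
        # Header line: a bullet character, then the unit name.
        $2 ~ /^vstorage-mdsd.*\.service$/ { unit = $2 }
        # State line: "Active: <state> (<sub-state>) since ..."
        $1 == "Active:" { print unit, $2, $3 }
    '
}

# Demonstration on the relevant lines of the sample output from this article.
mds_unit_states <<'EOF'
● vstorage-mdsd.vhs-democluster.1.service - vstorage-mdsd(/vstorage/mds)
   Loaded: loaded (/usr/lib/systemd/system/vstorage-mdsd.vhs-democluster.1.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-04-20 18:49:39 +03; 3 months 26 days ago
EOF
```

On a live node this would be used as "systemctl status 'vstorage-mdsd*' | mds_unit_states". Quoting the pattern keeps the shell from expanding it against file names in the current directory. Any unit whose state is not "active (running)" needs attention.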
One of the most common reasons why the 'vstorage-mdsd' service is down and cannot be started is a lack of free disk space, for example on the root partition:
# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        30G   30G     0 100% /
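The disk space check can be repeated across mount points with a small filter over the df output. This is a minimal sketch, assuming the six-column df layout shown above and mount points without spaces; full_filesystems is a hypothetical helper name, and the threshold is whatever value you choose to treat as "full".

```shell
#!/bin/sh
# full_filesystems LIMIT
# Read `df` output on stdin and print the mount point and usage of every
# file system whose Use% is greater than or equal to LIMIT (a percentage).
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)                 # "100%" -> "100"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Demonstration on the sample output from this article.
full_filesystems 95 <<'EOF'
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        30G   30G     0 100% /
EOF
```

On a live node this would be used as 'df -h | full_filesystems 95'.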
Examples of error messages that can be found at the end of the log file of the MDS server:
- Inability to bind to a socket or to replay the journal on (re)starting the service:
06-06-13 20:06:41.099 Replaying journal ...
06-06-13 20:06:41.108 rjournal: replaying snapshot /pstorage/PCLUSTER-mds/journal.1.sn
06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Default lease timeout is set to 60000 msec
06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Default replication level is set to 1..1/3 QoS=0 Gr=1
06-06-13 20:06:41.108 Global: 06-06-13 20:06:41 MDS Minimum replicas set to 1, which means that cluster can be unable to survive loss of a single machine!
06-06-13 20:06:41.108 added paxos node 1, addr 10.13.96.62:2510
06-06-13 20:06:41.108 paxos node 1 became active
06-06-13 20:06:41.867 Fatal: can't set local address 10.13.96.62:2510, err 98 (Address already in use)
- Lack of free disk space when creating files on (re)starting the service:
# tail /var/log/pstorage/PCLUSTER/mds-3/fatal.log
01-09-13 13:01:03.376 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
01-09-13 13:01:03.388 mdsd #7 reports hard error (134 / SIGABRT)
01-09-13 13:01:46.105 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
01-09-13 13:01:46.105 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz:
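The two error signatures listed above can be searched for in a log excerpt with a short filter. This is only a sketch that matches the literal strings from the examples in this article; classify_mds_errors and the labels it prints are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# classify_mds_errors
# Read MDS log lines on stdin and print one label for each known error
# signature that occurs at least once in the input.
classify_mds_errors() {
    awk '
        /Address already in use/  { bind = 1 }
        /No space left on device/ { space = 1 }
        END {
            if (bind)  print "bind-failure"
            if (space) print "no-disk-space"
        }
    '
}

# Demonstration on two lines taken from the examples in this article.
classify_mds_errors <<'EOF'
01-09-13 13:01:03.376 failed to close compressed log file /var/log/pstorage/PCLUSTER/mds-3/mds.log.gz: No space left on device
01-09-13 13:01:03.388 mdsd #7 reports hard error (134 / SIGABRT)
EOF
```

On a live node the input would come from the end of the MDS log, for example 'tail -n 100 <path to the MDS log file> | classify_mds_errors'.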