Discussion:
[ovirt-users] Storage latency message
Chris Adams
2017-04-11 12:57:03 UTC
I've been getting an occasional message like:

Storage domain hosted_storage experienced a high latency of
5.26121 seconds from host node3.

I'm not sure what is causing them though. I look at my storage
(EqualLogic iSCSI SAN) and storage network switches and don't see any
issues.

When the above message was logged, node3 was not hosting the engine
(doesn't even have engine HA installed), nor was it the SPM, so why
would it have even been accessing the hosted_storage domain?

This is with oVirt 4.1.
--
Chris Adams <***@cmadams.net>
Yaniv Kaul
2017-04-12 06:48:30 UTC
Post by Chris Adams
Storage domain hosted_storage experienced a high latency of
5.26121 seconds from host node3.
I'm not sure what is causing them though. I look at my storage
(EqualLogic iSCSI SAN) and storage network switches and don't see any
issues.
When the above message was logged, node3 was not hosting the engine
(doesn't even have engine HA installed), nor was it the SPM, so why
would it have even been accessing the hosted_storage domain?
All hosts are monitoring their access to all storage domains in the data
center.
Y.
Post by Chris Adams
This is with oVirt 4.1.
Chris Adams
2017-04-13 13:03:20 UTC
Post by Yaniv Kaul
Post by Chris Adams
Storage domain hosted_storage experienced a high latency of
5.26121 seconds from host node3.
I'm not sure what is causing them though. I look at my storage
(EqualLogic iSCSI SAN) and storage network switches and don't see any
issues.
When the above message was logged, node3 was not hosting the engine
(doesn't even have engine HA installed), nor was it the SPM, so why
would it have even been accessing the hosted_storage domain?
All hosts are monitoring their access to all storage domains in the data
center.
Okay. Is there any more information about what this message actually
means though? Is it read latency, write latency, a particular VM, etc.?

I can't find any issue at the network or SAN level, nor any load events
that correlate with the times oVirt logs the latency messages.
--
Chris Adams <***@cmadams.net>
Nir Soffer
2017-04-13 13:21:23 UTC
Post by Chris Adams
Post by Yaniv Kaul
Post by Chris Adams
Storage domain hosted_storage experienced a high latency of
5.26121 seconds from host node3.
I'm not sure what is causing them though. I look at my storage
(EqualLogic iSCSI SAN) and storage network switches and don't see any
issues.
When the above message was logged, node3 was not hosting the engine
(doesn't even have engine HA installed), nor was it the SPM, so why
would it have even been accessing the hosted_storage domain?
All hosts are monitoring their access to all storage domains in the data
center.
Okay. Is there any more information about what this message actually
means though? Is it read latency, write latency, a particular VM, etc.?
I can't find any issue at the network or SAN level, nor any load events
that correlate with the times oVirt logs the latency messages.
oVirt reads 4 KiB from the metadata special volume every 10 seconds. If
the read takes more than 5 seconds, you will see this warning in the
engine event log.

Maybe your storage or the host was overloaded at that time (e.g., a VM backup)?
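
For illustration only (not VDSM's actual implementation), here is a minimal
Python sketch of that kind of check, assuming a placeholder path for the
domain's metadata volume and the 10-second interval / 5-second threshold
described above:

#!/usr/bin/env python3
# Hypothetical sketch of the monitoring check: time an O_DIRECT 4 KiB read
# from the metadata volume every 10 seconds and warn when it exceeds 5 s.
# The volume path below is a placeholder, not taken from this thread.
import mmap
import os
import time

METADATA_VOLUME = "/dev/VG_PLACEHOLDER/metadata"  # placeholder path
CHECK_INTERVAL = 10       # seconds between checks
LATENCY_THRESHOLD = 5.0   # seconds before a warning is raised

def timed_direct_read(path, size=4096):
    """Time a single O_DIRECT read of `size` bytes from the start of `path`."""
    buf = mmap.mmap(-1, size)   # anonymous mmap is page-aligned, as O_DIRECT requires
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        start = time.monotonic()
        os.preadv(fd, [buf], 0)  # read the first 4 KiB, bypassing the page cache
        return time.monotonic() - start
    finally:
        os.close(fd)

while True:
    delay = timed_direct_read(METADATA_VOLUME)
    if delay > LATENCY_THRESHOLD:
        print(f"WARNING: high latency of {delay:.5f} seconds")
    time.sleep(CHECK_INTERVAL)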

Nir


Chris Adams
2017-04-18 14:42:01 UTC
Post by Nir Soffer
oVirt reads 4 KiB from the metadata special volume every 10 seconds. If
the read takes more than 5 seconds, you will see this warning in the
engine event log.
Maybe your storage or the host was overloaded at that time (e.g., a VM backup)?
I don't see any evidence that the storage was having any problem. The
times the message gets logged don't correspond to high-load periods
either (scheduled backups or just high demand).

I wrote a Perl script to replicate the check and ran it on a node in
maintenance mode (so no other traffic on the node). The script opens a
block device with O_DIRECT, reads the first 4 KiB, closes it, and reports
the time. I do see some latency jumps with that check, but not on the
raw block device, just the LV.

By that I mean I'm running it on two devices: the multipath device that
is the PV, and the metadata LV. The multipath device latency is pretty
stable, around 0.3 to 0.5 ms. The LV latency is normally only a little
higher, but it is more variable and spikes to 50-125 ms (at the same
times that reading the multipath device took under 0.5 ms).
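
The Perl script itself wasn't posted; the following is a minimal Python
sketch of the same kind of probe, with placeholder device paths (the real
multipath and LV names aren't given in this thread):

#!/usr/bin/env python3
# Hypothetical probe: time one O_DIRECT 4 KiB read from each device and
# print both latencies, so a spike on the LV can be compared against the
# underlying multipath device at (nearly) the same moment.
import mmap
import os
import time

DEVICES = [
    "/dev/mapper/MULTIPATH_PLACEHOLDER",  # the PV (raw multipath device)
    "/dev/VG_PLACEHOLDER/metadata",       # the metadata LV on top of it
]

def timed_direct_read(path, size=4096):
    buf = mmap.mmap(-1, size)             # page-aligned buffer for O_DIRECT
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        start = time.monotonic()
        os.preadv(fd, [buf], 0)
        return time.monotonic() - start
    finally:
        os.close(fd)

while True:
    line = "  ".join(f"{dev}: {timed_direct_read(dev) * 1000:7.3f} ms"
                     for dev in DEVICES)
    print(time.strftime("%H:%M:%S"), line)
    time.sleep(10)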

Seems like this might be a problem somewhere in the Linux logical volume
layer rather than in the block or network layers (or with the
network/storage itself).
--
Chris Adams <***@cmadams.net>