add linux pressure stall metrics #125
90cfd0a
to
2dd921d
Compare
Would the maximum expected value of `name,sys.pressure.full,:eq,:sum` for a given instance be 1 second/second? If so, then this makes sense, as we should be able to reason about the value as a percentage of time stalled. Otherwise I'm not sure how we would reason about it.
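To make the seconds/second unit concrete, here is a minimal sketch of how two samples of a PSI `total` counter (in microseconds) could be turned into a rate. The function name `stall_rate` and the sample numbers are hypothetical and are not taken from this PR's implementation. It assumes that a single `total` counter cannot advance faster than wall-clock time, in which case each per-resource rate falls between 0 and 1 and can be read as the fraction of the interval spent stalled.

```python
MICROS_PER_SECOND = 1_000_000


def stall_rate(prev_total_usec, curr_total_usec, interval_seconds):
    """Convert two samples of a PSI `total` counter into seconds stalled per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_usec = curr_total_usec - prev_total_usec
    return (delta_usec / MICROS_PER_SECOND) / interval_seconds


# Hypothetical samples taken 60 seconds apart: the counter advanced by 30 seconds.
rate = stall_rate(1_000_000, 31_000_000, 60)
print(f"{rate:.2f} seconds/second, i.e. {rate:.0%} of the interval stalled")
```

Under that assumption, a value of 0.5 means tasks were stalled on the resource for half of the sampling interval.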
lib/pressure_stall_test.cc
Outdated
{"sys.pressure.some|count|cpu", 10},
{"sys.pressure.some|count|io", 10},
{"sys.pressure.some|count|memory", 10},
{"sys.pressure.full|count|io", 20},
On real data should full always be less than or equal to some?
Yes, this should be the common case. The test data was artificially picked just to ensure that it is parsed correctly.
I will update these values so that `some` goes to 1 and `full` goes to 0.5, so it's less of a surprise.
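As an illustration of the relationship discussed here, the following sketch parses text in the PSI file format described in the kernel documentation and checks that the `full` total does not exceed the `some` total. The sample text and the helper name `parse_totals` are made up for this example (1,000,000 us and 500,000 us correspond to the 1 and 0.5 second values mentioned above); it is not the parser from this PR.

```python
# Hypothetical sample in the format of /proc/pressure/io.
SAMPLE = """\
some avg10=0.00 avg60=0.00 avg300=0.00 total=1000000
full avg10=0.00 avg60=0.00 avg300=0.00 total=500000
"""


def parse_totals(text):
    """Return a dict mapping 'some'/'full' to the total stall time in microseconds."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind = fields[0]
        # Remaining fields look like key=value, e.g. avg10=0.00 or total=1000000.
        values = dict(field.split("=", 1) for field in fields[1:])
        totals[kind] = int(values["total"])
    return totals


totals = parse_totals(SAMPLE)
# On real data, full is expected to be less than or equal to some.
assert totals["full"] <= totals["some"]
print(totals)
```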
https://docs.kernel.org/accounting/psi.html#pressure-interface
To the best of my understanding, I believe that this is the case. We take the |
I wrote a small Python script to monitor pressure stall values, as a way to preview data values. On a few EC2 systems, the

#!/usr/bin/env python
# purpose: calculate the pressure stall time in seconds, every minute, so that we can understand
# the behavior of these values as they are recorded into metrics.
#
# See https://docs.kernel.org/accounting/psi.html for more details on Pressure Stall Information (PSI)

import argparse
import json
import time
from threading import Thread

MICROS = 1000 * 1000


def parse_args():
    parser = argparse.ArgumentParser('Monitor pressure stall statistics')
    parser.add_argument('-c', '--container', action='store_true', help='/sys/fs/cgroup')
    parser.add_argument('-i', '--instance', action='store_true', help='/proc/pressure')
    args = parser.parse_args()
    # exactly one of the two modes must be selected
    if not (args.instance or args.container) or (args.instance and args.container):
        parser.error('Must choose either --container or --instance')
    return args


def parse_pressure_stall(lines):
    # each line looks like: some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    result = {'some': None, 'full': None}
    for line in lines:
        line = line.split(' ')
        # the last field is total=<usec>; convert it to seconds
        usec = int(line[-1].split('=')[-1])
        result[line[0]] = usec / MICROS
    return result


def monotonic_delta(iteration, category, parsed, last_value, stall):
    # the first iteration only records a baseline; later ones report the change since the last read
    if iteration == 0:
        last_value[category] = parsed
    else:
        stall[category]['some'] = round(parsed['some'] - last_value[category]['some'], 4)
        stall[category]['full'] = round(parsed['full'] - last_value[category]['full'], 4)
        last_value[category]['some'] = parsed['some']
        last_value[category]['full'] = parsed['full']


def read_instance_pressure_stall(iteration, last_value, stall):
    # only io and memory are read at the instance level
    for category in ['io', 'memory']:
        with open(f'/proc/pressure/{category}', 'r') as f:
            parsed = parse_pressure_stall(f.readlines())
        monotonic_delta(iteration, category, parsed, last_value, stall)


def read_container_pressure_stall(iteration, last_value, stall):
    for category in ['cpu', 'io', 'memory']:
        with open(f'/sys/fs/cgroup/{category}.pressure', 'r') as f:
            parsed = parse_pressure_stall(f.readlines())
        monotonic_delta(iteration, category, parsed, last_value, stall)


def monitor_pressure_stall(args):
    iteration = 0
    last_value = {
        'cpu': {'some': None, 'full': None},
        'io': {'some': None, 'full': None},
        'memory': {'some': None, 'full': None}
    }
    stall = {
        'cpu': {'some': None, 'full': None},
        'io': {'some': None, 'full': None},
        'memory': {'some': None, 'full': None}
    }
    while True:
        print(f'---- iteration {iteration} ----')
        if args.instance:
            read_instance_pressure_stall(iteration, last_value, stall)
            if iteration != 0:
                print(f'instance stall={json.dumps(stall, indent=2)}')
        if args.container:
            read_container_pressure_stall(iteration, last_value, stall)
            if iteration != 0:
                print(f'container stall={json.dumps(stall, indent=2)}')
        iteration += 1
        time.sleep(60)


if __name__ == '__main__':
    args = parse_args()
    print('BEGIN monitoring pressure stall statistics')
    # run the monitor in a daemon thread so that Ctrl-C in the main thread ends the program
    t = Thread(daemon=True, target=monitor_pressure_stall, args=(args,))
    t.start()
    try:
        t.join()
    except KeyboardInterrupt:
        print('\nEND monitoring pressure stall statistics')
This change adds the following new metrics, which can be used to provide feedback on where a system is currently constrained. The metrics are collected for both EC2 instances and Titus containers, except the `full:cpu` metric, which is meaningless on EC2 instances.

EC2 instances:

```
name=sys.pressure.some,id=[cpu|io|memory] counter unit=seconds/second
name=sys.pressure.full,id=[io|memory] counter unit=seconds/second
```

Titus containers:

```
name=sys.pressure.some,id=[cpu|io|memory] counter unit=seconds/second
name=sys.pressure.full,id=[cpu|io|memory] counter unit=seconds/second
```

https://docs.kernel.org/accounting/psi.html#pressure-interface

> The "some" line indicates the share of time in which at least some tasks are
> stalled on a given resource.

> The "full" line indicates the share of time in which all non-idle tasks are
> stalled on a given resource simultaneously. In this state actual CPU cycles
> are going to waste, and a workload that spends extended time in this state
> is considered to be thrashing.

> The total absolute stall time (in us) is tracked and exported as well, to
> allow detection of latency spikes which wouldn't necessarily make a dent in
> the time averages, or to average trends over custom time frames.

The `total` stall time is a monotonic counter which is collected, transformed into a base unit of seconds, and reported to the backend as a rate-per-second.
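The metric matrix described above can be sketched as a small helper that lists the expected name and `id` pairs per environment. The function name `expected_metrics` and the environment labels `"ec2"` and `"titus"` are hypothetical and exist only for this illustration; they are not identifiers from the PR.

```python
RESOURCES = ("cpu", "io", "memory")


def expected_metrics(environment):
    """Return the (name, id) pairs described in the change for 'ec2' or 'titus'."""
    if environment not in ("ec2", "titus"):
        raise ValueError(f"unknown environment: {environment}")
    # sys.pressure.some is reported for every resource in both environments.
    metrics = [("sys.pressure.some", resource) for resource in RESOURCES]
    for resource in RESOURCES:
        # full:cpu is skipped on EC2 instances, per the description above.
        if environment == "ec2" and resource == "cpu":
            continue
        metrics.append(("sys.pressure.full", resource))
    return metrics


for env in ("ec2", "titus"):
    print(env, expected_metrics(env))
```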
00e3b03
to
bf33b0b
Compare