# Graviton Performance Runbook
This section describes optimizations to try on Graviton2 instances to attain higher performance from your service. Each sub-section covers recommendations that can help improve performance if you see a particular signature after measuring performance using the previous checklists.
## Optimizing for a large instruction footprint

- On C/C++ applications, `-flto`, `-Os`, and Feedback Directed Optimization (FDO) can help with code layout using GCC (see the sketch after this list).
- On Java, `-XX:-TieredCompilation`, `-XX:ReservedCodeCacheSize`, and `-XX:InitialCodeCacheSize` can be tuned to reduce the pressure the JIT places on the instruction footprint. The JDK defaults to a 256MB region for the code-cache, which over time can fill, become fragmented, and leave live code sparse.
  - We recommend setting the code cache initially to `-XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M` and then tuning the size up or down as required (see the launch example after this list).
  - Experiment with setting `-XX:+TieredCompilation` to gain faster start-up time and better optimized code.
  - When tuning the JVM code cache, watch for `code cache full` error messages in the logs, which indicate the cache has been set too small. A full code cache can lead to worse performance.
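As a concrete illustration of the C/C++ flags above, here is a minimal sketch of an FDO build with GCC; the file name `app.c` and the training run are placeholders for your own build and workload.

```bash
# Step 1: build an instrumented binary that records an execution profile.
gcc -O2 -flto -fprofile-generate -o app app.c

# Step 2: run a representative workload to produce profile data (.gcda files).
./app

# Step 3: rebuild with the profile so GCC can lay out hot code paths together.
gcc -O2 -flto -fprofile-use -o app app.c
```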
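Similarly, a sketch of launching a JVM with the recommended initial code-cache settings; `my-service.jar` is a placeholder for your application. On recent JDKs, `jcmd` can report code-cache occupancy to guide sizing up or down.

```bash
# Start with tiered compilation off and a 64MB code cache, per the recommendation above.
java -XX:-TieredCompilation \
     -XX:ReservedCodeCacheSize=64M \
     -XX:InitialCodeCacheSize=64M \
     -jar my-service.jar

# Inspect code-cache occupancy on a running JVM (<pid> is a placeholder).
jcmd <pid> Compiler.codecache
```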
## Optimizing for high TLB miss rates

A TLB (translation lookaside buffer) is a cache that holds recent virtual-to-physical address translations for the CPU to use. Reducing misses in this cache can improve application performance.
- Enable Transparent Huge Pages (THP):

  ```bash
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  ```
- If your application can use pinned hugepages because it uses `mmap` directly, try reserving huge pages directly via the OS. This can be done by two methods (see the verification sketch after this list):
  - At runtime: `sysctl -w vm.nr_hugepages=X`
  - At boot time, by specifying on the kernel command line in `/etc/default/grub`: `hugepagesz=2M hugepages=512`
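A quick sketch to verify the settings above took effect, using standard procfs/sysfs locations:

```bash
# Confirm THP is enabled; "always" should appear in brackets in the output.
cat /sys/kernel/mm/transparent_hugepage/enabled

# See how much anonymous memory is currently backed by transparent huge pages.
grep AnonHugePages /proc/meminfo

# See how many pinned huge pages are reserved and still free.
grep -E 'HugePages_(Total|Free)' /proc/meminfo
```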
## Porting architecture-specific code

- If you need to port an optimized routine that uses x86 vector-instruction intrinsics to Graviton's vector instructions (called NEON instructions), you can use the SSE2NEON library to assist in the porting. While SSE2NEON won't produce optimal code, it generally gets close enough to reduce the performance penalty of not using the vector intrinsics.
- For additional information on the vector instructions used on Graviton, see Arm's NEON intrinsics documentation.
- Look for specialized back-off routines in custom locks tuned using the x86 `PAUSE` instruction or the equivalent x86 `rep; nop` sequence. Graviton2 should use a single `ISB` instruction as a drop-in replacement; for an example and explanation, see a recent commit to the WiredTiger storage layer.
- If a locking routine tries to acquire a lock in a fast path before forcing the thread to sleep via the OS to wait, try experimenting with attempting the fast path a few additional times before executing down the slow path. There is an example of this in the Finagle code-base, where on Graviton2 the code spins longer for a lock before sleeping.
- If you do not intend to run your application on Graviton1, try compiling your code on GCC using `-march=armv8.2-a` instead of `-moutline-atomics` to reduce the overhead of the synchronization builtins.
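For example, a sketch of such a build (`app.c` is a placeholder). Targeting `armv8.2-a` lets GCC inline the LSE atomic instructions directly instead of dispatching through the out-of-line helpers that `-moutline-atomics` generates.

```bash
# Graviton2 implements Armv8.2, so LSE atomics can be emitted inline.
gcc -O2 -march=armv8.2-a -o app app.c
```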
## Network and IO tuning

- Check ENA device tunings with `ethtool -c ethN`, where `N` is the device number, and check the `Adaptive RX` setting. By default on instances without extra ENIs this will be `eth0`.
  - Set `ethtool -C ethN adaptive-rx off` for a latency-sensitive workload (see the sketch after this list).
  - ENA tunings made via `ethtool` can be made permanent by editing the `/etc/sysconfig/network-scripts/ifcfg-ethN` files.
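A sketch of checking and then disabling adaptive RX coalescing, assuming the default `eth0` device:

```bash
# Show the current interrupt coalescing settings, including Adaptive RX.
ethtool -c eth0

# Disable adaptive RX coalescing for latency-sensitive workloads.
sudo ethtool -C eth0 adaptive-rx off
```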
- Disable `irqbalance` from dynamically moving IRQ processing between vCPUs, and set dedicated cores to process each IRQ. Example script below, with a verification sketch after it:
```bash
# Stop irqbalance so it does not override the manual affinities set below.
systemctl stop irqbalance

# Assign each eth0 ENA Tx/Rx interrupt to its own core, starting from core 0.
irqs=$(grep "eth0-Tx-Rx" /proc/interrupts | awk -F':' '{print $1}')
cpu=0
for i in $irqs; do
  echo $cpu > /proc/irq/$i/smp_affinity_list
  cpu=$((cpu + 1))
done
```
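To confirm the pinning took effect, a quick check of each IRQ's affinity (again assuming `eth0`):

```bash
# Print the CPU each eth0 Tx/Rx interrupt is now pinned to.
for i in $(grep "eth0-Tx-Rx" /proc/interrupts | awk -F':' '{print $1}'); do
  echo "IRQ $i -> CPU $(cat /proc/irq/$i/smp_affinity_list)"
done
```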
- Disable Receive Packet Steering (RPS) to avoid contention and extra IPIs. Check `cat /sys/class/net/ethN/queues/rx-N/rps_cpus` and verify the masks are set to `0` (see the check after this list). In general, RPS is not needed on Graviton2.
  - You can try using RPS if your situation is unique. Read the kernel documentation on RPS to understand further how it might help. Also refer to Optimizing network intensive workloads on Amazon EC2 A1 Instances for concrete examples.
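A sketch of that check, assuming `eth0`:

```bash
# Verify the RPS mask is 0 on every RX queue of eth0.
for f in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
  echo "$f: $(cat $f)"
done

# Clear the RPS mask on a queue if it is non-zero.
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
```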
- If on Graviton2 metal instances, try disabling the System MMU (SMMU) to speed up IO handling:

```bash
%> cd ~/aws-graviton-getting-started/perfrunbook/utilities
# Configure the SMMU to be off on metal, which is the default behavior on x86.
# Leave the SMMU on if you require the additional security protections it offers.
# Virtualized instances do not expose an SMMU to the guest.
%> sudo ./configure_graviton_metal_iommu.sh off
%> sudo shutdown -r now
```