While browsing a glossary of memory management terms, I stumbled upon the definition of "Pig in the Python" (a bit like the Chinese saying about the greedy snake that tries to swallow an elephant), and this article was born. On the surface, the term refers to the GC repeatedly promoting large objects from one generation to the next. Doing so is like a python swallowing its prey whole: while digesting, it cannot move at all.
For the next 24 hours my head was filled with images of this suffocating python that I could not shake off. As psychiatrists say, the best way to relieve a fear is to talk about it. Hence this article. But the story that follows is not about pythons, it is about GC tuning. I swear.
Everyone knows that GC pauses can easily become a performance bottleneck. Modern JVMs ship with advanced garbage collectors, but in my experience it is extremely hard to find the optimal configuration for a given application. Manual tuning may still stand a chance, but only if you understand the exact mechanics of the GC algorithms. This article should help you with that: below I use an example to show how a small change in JVM configuration can affect the throughput of your application.
Example
The application we use to demonstrate the impact of GC on throughput is a simple program containing two threads:
PigEater – simulates the giant python eating one fat pig after another. It does this by appending a 32 MB byte array to a java.util.List and sleeping for 100 ms after each bite.
PigDigester – simulates the asynchronous digestion. Digestion is implemented simply by replacing the list of pigs with a new empty one. Since this is rather tiring work, the thread sleeps for 2,000 ms after each time it clears the reference.
Both threads run in a while loop, eating and digesting until the snake is full, which takes roughly 5,000 pigs.
```java
package eu.plumbr.demo;

import java.util.ArrayList;
import java.util.List;

public class PigInThePython {
  static volatile List<byte[]> pigs = new ArrayList<>();
  static volatile int pigsEaten = 0;
  static final int ENOUGH_PIGS = 5000;

  public static void main(String[] args) throws InterruptedException {
    new PigEater().start();
    new PigDigester().start();
  }

  static class PigEater extends Thread {
    @Override
    public void run() {
      while (true) {
        pigs.add(new byte[32 * 1024 * 1024]); // 32 MB per pig
        if (pigsEaten > ENOUGH_PIGS) return;
        takeANap(100);
      }
    }
  }

  static class PigDigester extends Thread {
    @Override
    public void run() {
      long start = System.currentTimeMillis();
      while (true) {
        takeANap(2000);
        pigsEaten += pigs.size();
        pigs = new ArrayList<>();
        if (pigsEaten > ENOUGH_PIGS) {
          System.out.format("Digested %d pigs in %d ms.%n",
              pigsEaten, System.currentTimeMillis() - start);
          return;
        }
      }
    }
  }

  static void takeANap(int ms) {
    try {
      Thread.sleep(ms);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
```
Now let us define the throughput of this system as the number of pigs digested per second. Since a pig is stuffed into the python every 100 ms, the theoretical maximum throughput of the system is 10 pigs/second.
GC Configuration Example
Let’s look at how the system performs under two different configurations. Regardless of the configuration, the application ran on a dual-core Mac (OS X 10.9.3) with 8 GB of RAM.
First configuration:
1. 4 GB heap (-Xms4g -Xmx4g)
2. CMS to clean the old generation (-XX:+UseConcMarkSweepGC) and the parallel collector to clean the young generation (-XX:+UseParNewGC)
3. 12.5% of the heap allocated to the young generation (-Xmn512m), with the sizes of the Eden and Survivor spaces restricted accordingly
The second configuration is slightly different:
1. 2 GB heap (-Xms2g -Xmx2g)
2. Parallel GC for both the young and the old generation (-XX:+UseParallelGC)
3. 75% of the heap allocated to the young generation (-Xmn1536m)
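Put together, the two runs can be launched roughly as follows (a sketch assuming the class above has been compiled to the current directory; note that CMS was deprecated in JDK 9 and removed in JDK 14, so the first command requires an older JDK):

```shell
# Configuration 1: 4 GB heap, 512 MB young generation, CMS + ParNew
java -Xms4g -Xmx4g -Xmn512m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     eu.plumbr.demo.PigInThePython

# Configuration 2: 2 GB heap, 1536 MB young generation, Parallel GC
java -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC \
     eu.plumbr.demo.PigInThePython
```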
Now it’s time to place your bets: which configuration performs better, that is, eats more pigs per second? Those who put their chips on the first configuration will be disappointed. The results are exactly the opposite:
1. The first configuration (large heap, large old generation, CMS GC) digests 8.2 pigs per second
2. The second configuration (small heap, large young generation, Parallel GC) digests 9.2 pigs per second
Now let’s interpret this result objectively. With half the allocated resources, throughput went up by 12%. This runs against common sense, so it is worth analyzing what is actually going on.
Analyzing GC results
The reason is actually not complicated: you just have to look closely at what the GC was doing while the test was running. This is where you pick your tooling; with the help of jstat I uncovered the secret. The command looked roughly like this:
jstat -gc -t -h20 PID 1s
By analyzing the data, I noticed that configuration 1 went through 1,129 GC cycles (YGC + FGC), spending a total of 63.723 seconds on GC:
Timestamp S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
594.0 174720.0 174720.0 163844.1 0.0 174848.0 131074.1 3670016.0 2621693.5 21248.0 2580.9 1006 63.182 116 0.236 63.419
595.0 174720.0 174720.0 163842.1 0.0 174848.0 65538.0 3670016.0 3047677.9 21248.0 2580.9 1008 63.310 117 0.236 63.546
596.1 174720.0 174720.0 98308.0 163842.1 174848.0 163844.2 3670016.0 491772.9 21248.0 2580.9 1010 63.354 118 0.240 63.595
597.0 174720.0 174720.0 0.0 163840.1 174848.0 131074.1 3670016.0 688380.1 21248.0 2580.9 1011 63.482 118 0.240 63.723
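Those totals can be recomputed from the last jstat line. A minimal sketch, assuming the column order printed above (with -t adding the leading Timestamp column, YGC/YGCT/FGC/FGCT/GCT end up at indices 11 through 15):

```java
// Recompute total GC cycles and pause time from the last jstat -gc -t line.
public class JstatTotals {
    public static void main(String[] args) {
        String lastLine = "597.0 174720.0 174720.0 0.0 163840.1 174848.0 "
                + "131074.1 3670016.0 688380.1 21248.0 2580.9 1011 63.482 "
                + "118 0.240 63.723";
        String[] cols = lastLine.trim().split("\\s+");
        int ygc = Integer.parseInt(cols[11]);      // young collection count
        int fgc = Integer.parseInt(cols[13]);      // full collection count
        double gct = Double.parseDouble(cols[15]); // total GC time, seconds
        System.out.printf("%d GC cycles, %.3f seconds paused%n", ygc + fgc, gct);
    }
}
```

Running it against the last row above yields 1,129 cycles and 63.723 seconds, matching the totals quoted.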
The second configuration paused only 168 times (YGC + FGC) in total, taking just 11.409 seconds:
Timestamp S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
539.3 164352.0 164352.0 0.0 0.0 1211904.0 98306.0 524288.0 164352.2 21504.0 2579.2 27 2.969 141 8.441 11.409
540.3 164352.0 164352.0 0.0 0.0 1211904.0 425986.2 524288.0 164352.2 21504.0 2579.2 27 2.969 141 8.441 11.409
541.4 164352.0 164352.0 0.0 0.0 1211904.0 720900.4 524288.0 164352.2 21504.0 2579.2 27 2.969 141 8.441 11.409
542.3 164352.0 164352.0 0.0 0.0 1211904.0 1015812.6 524288.0 164352.2 21504.0 2579.2 27 2.969 141 8.441 11.409
Considering that the workload is identical in both cases, the conclusion is that in this pig-eating experiment the GC can clean up garbage much faster when objects do not live long enough to be promoted to the old generation. With the first configuration, the GC ran roughly 6 to 7 times as often (1,129 vs. 168 cycles), and the total pause time was 5 to 6 times as long (63.7 s vs. 11.4 s).
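A back-of-the-envelope calculation makes the gap concrete: dividing the roughly 5,000 pigs by the measured throughput approximates each run's wall-clock time, and the jstat totals above then give the share of that time spent paused in GC. The numbers below are taken from the figures in this article, so the result is only an estimate:

```java
// Estimate what fraction of wall-clock time each configuration spent
// paused in GC, using the throughput and pause figures measured above.
public class PauseOverhead {
    public static void main(String[] args) {
        double pigs = 5000;
        double[] throughput = {8.2, 9.2};    // pigs per second
        double[] gcPause = {63.723, 11.409}; // total GC pause, seconds
        String[] name = {"CMS, small young gen", "Parallel, large young gen"};
        for (int i = 0; i < name.length; i++) {
            double wall = pigs / throughput[i];     // approx. run time
            double pct = 100.0 * gcPause[i] / wall; // pause share
            System.out.printf("%s: ~%.0f s run, %.1f%% in GC pauses%n",
                    name[i], wall, pct);
        }
    }
}
```

This works out to roughly 610 seconds with about 10.5% of the time paused for the first configuration, versus roughly 543 seconds with only about 2.1% paused for the second, which is consistent with the 12% throughput difference.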
Telling this story had two purposes. First and foremost, I wanted to get that convulsing python out of my head. The more obvious takeaway is that GC tuning is a demanding craft that requires a thorough understanding of the underlying concepts. Even with the utterly trivial application used in this article, the configuration you pick has a big impact on throughput and capacity planning; in real-world applications the differences are even larger. So the choice is yours: you can master these concepts, or you can focus on your daily work and let Plumbr figure out the GC configuration that suits your needs.