Java Garbage Collection Tuning Strategy - An Example

Akash_Agarwal · August 11, 2017, 10:51am

Java garbage collection is a much-discussed topic. In my experience, tuning it depends heavily on the use case and run-time conditions. In this article, I walk through a specific use case and also cover a few basics of the Java heap.

1. Hotspot Heap Structure ^[3]

Let us briefly go through the Java heap structure. Oracle's whitepaper on Hotspot JVM Memory Management is an excellent starting point to become familiar with GC algorithms in Hotspot JVM.

The Java heap is made up of the Perm, Old, and New (sometimes called Young) generations. The New generation is further made up of Eden space where objects are created and Survivor spaces S0 and S1 where they are kept later for a limited number of New generation garbage collection cycles.

Figure: Hotspot Heap Structure

The Young Generation is where all new objects are allocated and aged. When the young generation fills up, this causes a minor garbage collection. Minor collections can be optimized assuming a high object mortality rate. A young generation full of dead objects is collected very quickly. Some surviving objects are aged and eventually move to the old generation.

Stop the World Event - All minor garbage collections are "Stop the World" events. This means that all application threads are stopped until the operation completes.

The Old Generation is used to store long-surviving objects. Typically, an age threshold is set for young generation objects, and when an object reaches that age, it is moved to the old generation. Eventually, the old generation needs to be collected. This event is called a major garbage collection.

Major garbage collections are also Stop the World events. A major collection is often much slower because it involves all live objects, so for responsive applications, major garbage collections should be minimized. Also, note that the length of the Stop the World event for a major garbage collection depends on the kind of garbage collector used for the old generation space.

The Permanent generation contains metadata required by the JVM to describe the classes and methods used in the application. The permanent generation is populated by the JVM at runtime based on classes in use by the application. In addition, Java SE library classes and methods may be stored here.

Classes may get collected (unloaded) if the JVM finds they are no longer needed and space may be needed for other classes. The permanent generation is included in a full garbage collection.

That was the JVM heap structure and the related garbage collection concepts in brief. These concepts may change somewhat between Java versions; for example, Java 8 replaced the Permanent generation with Metaspace, which lives in native memory.

2. Use Case

High-throughput, large-message use cases put a lot of load on the JVM heap. The JVM often adjusts the sizes of the different heap areas dynamically at runtime based on heuristics. Sometimes these heuristics work well, and sometimes they fail badly.

One case I encountered was a messaging scenario in which the throughput (transactions per second) for message sizes above 100 KB became unpredictable. On some runs it was X and on others it was 4X, where X is a number. To find the problem we looked into many things and finally turned our focus to the GC activity of the load-generating Java client. Closely observing GC behavior with the Visual GC plugin of JVisualVM, we saw the following: when we got X throughput, the total young GC time was 60 seconds for a 300-second test; when we got 3X to 4X throughput, it was 1 to 3 seconds. So, what was happening?
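The measurements above were taken with Visual GC. As a complementary, minimal sketch, accumulated GC time can also be read from inside a JVM through the standard `java.lang.management` API. The class name `GcTimeProbe` and its helper methods are hypothetical and only illustrate the arithmetic on the figures quoted above (60 seconds versus 3 seconds of young GC in a 300-second test).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the original test harness.
final class GcTimeProbe {

    /** Sums the accumulated collection time (ms) reported by all collectors of this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** GC time expressed as a percentage of the elapsed test time. */
    static double gcOverheadPercent(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return 100.0 * gcMillis / elapsedMillis;
    }

    public static void main(String[] args) {
        // Per-collector view of the JVM running this program.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // Figures quoted in the article for a 300-second test.
        System.out.printf("slow run: %.1f%% of the test in young GC%n",
                gcOverheadPercent(60_000, 300_000));
        System.out.printf("fast run: %.1f%% of the test in young GC%n",
                gcOverheadPercent(3_000, 300_000));
    }
}
```

In the slow runs, about a fifth of the wall-clock time went to young collections, which is why the investigation concentrated on the young generation.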

2.1 Optimizing Young Generation: Eden Space and Survivor Space

Let us take a look at a part of the journey of an object:

allocated in Eden space
copied from Eden space to survivor space due to young GC
copied from survivor to (other) survivor space due to young GC (this could happen a few times)
promoted from survivor (or possibly Eden) to old space due to young GC (or full GC)

As this was a high-load scenario, Java objects were produced at a rate of a few GB per second, so we could expect a lot of object movement between sections of the JVM heap. In the middle of the test, the survivor size (of both survivor spaces) kept increasing, and this was consistent across the low-performing runs. It was caused by the JVM auto-adjusting the sizes of the heap sections: the JVM judged the survivor spaces to be undersized, which had resulted in a lot of overflow (promotion activity) to the old generation. As the survivor size increased, it took a long time to copy objects from Eden space to survivor space and then between the two survivor spaces. This copying was very frequent and accounted for most of the minor (young) GC time.

I looked into the existing GC tuning parameters. I had used 8 GB for both the minimum and maximum heap size and NewRatio=1, which makes the young generation equal in size to the old generation. Within the young generation, we have Eden space and two survivor spaces. Both survivor spaces are the same size, and at any time only one of them is in use. In our use case, the JVM adjusted the sizes of Eden space and the survivor spaces based on heuristics, and this worked against performance. The messaging scenario created a lot of short-lived objects in memory.
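The effect of NewRatio on the generation sizes can be sketched with a little arithmetic. This is a simplified model that assumes the old-to-young ratio is exactly NewRatio:1 and ignores the alignment rounding a real JVM applies; the class and method names are hypothetical.

```java
// Simplified model of how NewRatio splits a fixed-size heap (hypothetical helper).
final class GenerationSplit {
    static final long MIB = 1024L * 1024;
    static final long GIB = 1024L * MIB;

    /** Young generation size when old:young = newRatio:1 (alignment ignored). */
    static long youngBytes(long heapBytes, int newRatio) {
        if (newRatio < 1) {
            throw new IllegalArgumentException("newRatio must be >= 1");
        }
        return heapBytes / (newRatio + 1);
    }

    /** Whatever is not young generation is old generation in this model. */
    static long oldBytes(long heapBytes, int newRatio) {
        return heapBytes - youngBytes(heapBytes, newRatio);
    }

    public static void main(String[] args) {
        long heap = 8 * GIB; // min = max heap used at the start of the exercise
        System.out.printf("NewRatio=1: young %d MiB, old %d MiB%n",
                youngBytes(heap, 1) / MIB, oldBytes(heap, 1) / MIB);
        // prints "NewRatio=1: young 4096 MiB, old 4096 MiB"
    }
}
```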

A few things became clear from these observations:

The size of the Eden space was inadequate, as short-lived objects were copied to survivor space very often.
The GC processing in the Eden space was slow.
It took a lot of time to copy objects to survivor space. This could be a result of the first observation and of the larger survivor space.

The task at hand was to determine:

How much Eden space is required and what Survivor Ratio is adequate, which also requires an adequately sized young generation.
How to prevent unnecessary promotion to the old generation through the right combination and sizes of Eden and survivor spaces.
How to make young generation GC efficient and fast enough.
How to make old generation GC efficient and fast enough.
What old generation size is adequate.

As mentioned above, I started with an 8 GB minimum and maximum JVM heap. Using the NewSize parameter, 5 GB was allocated to the young generation, leaving 3 GB for the old generation. I tried to increase Eden space and decrease survivor space by changing the survivor ratio, but promotion to the old generation kept increasing full GC activity. When the old generation was shrunk further and the young generation enlarged, the frequency of old generation collections increased. This hinted that the heap was not large enough. Most of the objects were short-lived and needed to be collected in the young generation, but when promotion to the old generation is unavoidable, the frequency of full GC should still be acceptable.

We decided to increase the JVM heap to a 12 GB minimum and maximum. 9 GB was allocated to the young generation and 3 GB to the old generation.

Application behavior guaranteed that most of the objects were short-lived. Even so, we observed that objects lingered in survivor space across multiple young GC cycles.

We need a brief explanation here.

There are two survivor spaces, say S0 and S1. New objects are allocated in Eden space. When Eden is full, a GC is needed; it discards dead objects and moves live ones to a survivor space, where they mature for a while before being promoted to the old generation. The next time we run out of Eden space, the next GC clears out some space in both Eden and the survivor space, but the freed spaces aren't contiguous. So the following happens:

The survivors from Eden would have to fit into the gaps that the GC cleared in the survivor space. Instead, to eliminate the fragmentation, the JVM moves all live objects from both Eden and the occupied survivor space into a completely separate space (the second survivor space), leaving a clean Eden and a clean survivor space so that the sequence can be repeated on the next GC.

Survivor size can be calculated as follows. For example, -XX:SurvivorRatio=6 sets the ratio between a survivor space and eden to 1:6. In other words, each survivor space will be one-sixth the size of eden, and thus one-eighth the size of the young generation (not one-seventh, because there are two survivor spaces).
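The calculation above can be written out as a short sketch. It assumes the young generation is split exactly into eden plus two equal survivor spaces and ignores any alignment rounding done by the JVM; the class and method names are hypothetical, and the 9 GB young generation is the size chosen earlier in this article.

```java
// Simplified model of the eden/survivor split implied by SurvivorRatio (hypothetical helper).
final class SurvivorSizing {
    static final long MIB = 1024L * 1024;
    static final long GIB = 1024L * MIB;

    /** Size of ONE survivor space: young = eden + 2 * survivor, eden = ratio * survivor. */
    static long survivorBytes(long youngBytes, int survivorRatio) {
        if (survivorRatio < 1) {
            throw new IllegalArgumentException("survivorRatio must be >= 1");
        }
        return youngBytes / (survivorRatio + 2);
    }

    /** Eden is what remains after both survivor spaces are carved out. */
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorBytes(youngBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long young = 9 * GIB;
        System.out.printf("SurvivorRatio=6: eden %d MiB, each survivor %d MiB%n",
                edenBytes(young, 6) / MIB, survivorBytes(young, 6) / MIB);
        // prints "SurvivorRatio=6: eden 6912 MiB, each survivor 1152 MiB"
    }
}
```

Raising the ratio shrinks both survivor spaces and hands the difference to eden, which is the lever used in the tuning steps that follow.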

Survivor space was tuned as follows:

Survivor ratio was adjusted to 6. A smaller survivor space than this resulted in increased promotion to the old generation of objects that, being short-lived, should have died in survivor space after a few GC cycles.
When the survivor ratio was changed to 6 (which allocates one-eighth of the young generation to each survivor space), Visual GC showed that objects survived multiple young GC cycles, moving back and forth between the two survivor spaces. Why did they need to stay so long and then move to the old generation? This led to another tuning parameter, called MaxTenuringThreshold. What is this?

During young generation GC, every live object is copied. An object may be copied to one of the survivor spaces (the one that is empty before the young GC) or to the old generation space. For each object being copied, the GC algorithm increases its age (the number of collections survived), and if the age is above the current tenuring threshold, the object is copied (promoted) to old space. An object can also be copied directly to old space if the survivor space fills up (overflow).

So to summarize, the journey of object follows the following pattern:

allocated in Eden space
copied from Eden space to survivor space due to young GC
copied from survivor to (other) survivor space due to young GC (this could happen a few times)
promoted from survivor (or possibly Eden) to old space due to young GC (or full GC)
The actual tenuring threshold is dynamically adjusted by the JVM, but MaxTenuringThreshold sets an upper limit on it.
If you set MaxTenuringThreshold=0, all objects will be promoted immediately.
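The aging rule summarized above can be illustrated with a toy model. This is not how HotSpot is implemented: it follows a single object, treats the tenuring threshold as fixed, and ignores survivor overflow. The class name, method names, and the sample thresholds in `main` are illustrative assumptions.

```java
// Toy model of object aging across young collections (hypothetical, single object, no overflow).
final class TenuringModel {

    /** What happened to one object that stayed reachable for a number of young GCs. */
    static final class Outcome {
        final int survivorCopies; // times the object was copied into a survivor space
        final boolean promoted;   // whether it ended up in the old generation

        Outcome(int survivorCopies, boolean promoted) {
            this.survivorCopies = survivorCopies;
            this.promoted = promoted;
        }
    }

    static Outcome simulate(int youngGcsWhileLive, int tenuringThreshold) {
        int age = 0;
        int survivorCopies = 0;
        for (int gc = 0; gc < youngGcsWhileLive; gc++) {
            age++; // the object survived one more collection
            if (age > tenuringThreshold) {
                return new Outcome(survivorCopies, true); // copied to old space instead
            }
            survivorCopies++; // copied to the empty survivor space
        }
        return new Outcome(survivorCopies, false); // died before reaching the threshold
    }

    public static void main(String[] args) {
        int[] sampleThresholds = {0, 2, 8}; // illustrative values
        for (int threshold : sampleThresholds) {
            Outcome o = simulate(10, threshold);
            System.out.printf("threshold=%d: %d survivor copies, promoted=%b%n",
                    threshold, o.survivorCopies, o.promoted);
        }
    }
}
```

In this model, an object that is going to be promoted anyway costs one survivor copy per unit of threshold, so a lower threshold saves copying work only when few objects die in the survivor spaces at the higher ages.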

^[2] We logged tenuring statistics using -XX:+PrintTenuringDistribution (along with the other GC log parameters -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps), which showed a tenuring threshold of 8. We observed that much of our Eden space was evacuated in the young generation collection and that very few objects died in the survivor spaces between ages three and eight. So we reduced the tenuring threshold from 8 to 2 (with the option -XX:MaxTenuringThreshold=2) to reduce the time spent copying data in young generation collections.

On further observing Visual GC during performance tests, we found that the survivor space requirement had decreased further after setting MaxTenuringThreshold, so more space could be shifted from survivor space to Eden space. This would also reduce copying time into survivor space and give objects much more time to die in the young generation. I tried a few values and finally settled on:

-XX:SurvivorRatio=18

The next step was to make young and old generation collections more efficient, as most of the objects were short-lived.

Figure: A look at VisualGC plugin from JVisualVM

2.2 Shorten the pause times of Old Generation (and Young generation)

I decided to use the Concurrent Mark Sweep (CMS) collector, which is designed for applications that prefer shorter garbage collection pauses and that can afford to share processor resources with the garbage collector while the application is running. ^[4] As indicated by its name, the CMS collector uses a concurrent approach in which most of the work is done by a GC thread that runs concurrently with the worker threads processing user requests. A single normal old generation stop-the-world GC run is split up into two much shorter stop-the-world pauses plus 5 concurrent phases during which worker threads are allowed to go on with their work. The test machine had multiple physical CPU cores, so this wasn’t a problem. A more detailed description of CMS can be found in the article “Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning”.

The CMS collector is activated by

-XX:+UseConcMarkSweepGC

Similar to the other available collectors, the concurrent collector is generational; thus both minor and major collections occur. By default, the CMS collector uses the ParNew collector to execute the New generation collections. ParNewGC uses multiple GC threads for young generation collections. It can be activated by

-XX:+UseParNewGC

2.3 Optimizing Young Generation Collection further

^[1] ^[2] To reduce the young generation pause duration even further, we looked into options that optimize how tasks are bound to GC threads. The parallel copy collector (ParNew), responsible for young collection in CMS, uses the ParGCCardsPerStrideChunk value to control the granularity of tasks given to GC worker threads; tuning it helps get the best performance out of a patch written to optimize the card table scan time spent in young generation collection. Old space is broken into strides of equal size, and each worker is responsible for processing (find dirty pages, find old-to-young references, copy young objects, etc.) a subset of strides. The time to process each stride may vary greatly, so workers may steal work from each other. For that reason, the number of strides should be greater than the number of workers. By default ParGCCardsPerStrideChunk=256 (a card is 512 bytes, so that is 128 KiB of heap space per stride), which means that a 28 GiB heap would be broken into 224K (about 229,000) strides. Given that the number of parallel GC threads is usually 4 orders of magnitude smaller, this is probably too many. We decided to continue experiments with a stride size of 2K (2048) cards, as it showed the most consistent improvement for 12 GB of heap space in tests across a wide range of message (data) sizes. The ParGCCardsPerStrideChunk option is available in all Java 7 HotSpot JVMs and the most recent Java 6 JVMs, but it is classified as diagnostic, so you must unlock diagnostic options to use it:

-XX:+UnlockDiagnosticVMOptions

-XX:ParGCCardsPerStrideChunk=2048
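The stride arithmetic behind this option can be checked with a short sketch. It assumes the 512-byte card size stated above and simply divides a space into strides; the class and method names are hypothetical, and the 3 GB figure is the old generation size used in this article.

```java
// Back-of-the-envelope stride arithmetic for ParGCCardsPerStrideChunk (hypothetical helper).
final class StrideMath {
    static final long CARD_BYTES = 512; // card size stated in the article
    static final long KIB = 1024L;
    static final long GIB = 1024L * 1024 * 1024;

    /** Heap bytes covered by one stride. */
    static long strideBytes(int cardsPerStride) {
        return cardsPerStride * CARD_BYTES;
    }

    /** Number of strides needed to cover a space, rounding up. */
    static long strideCount(long spaceBytes, int cardsPerStride) {
        long stride = strideBytes(cardsPerStride);
        return (spaceBytes + stride - 1) / stride;
    }

    public static void main(String[] args) {
        System.out.printf("default 256 cards: %d KiB per stride, %d strides in 28 GiB%n",
                strideBytes(256) / KIB, strideCount(28 * GIB, 256));
        // prints "default 256 cards: 128 KiB per stride, 229376 strides in 28 GiB"
        System.out.printf("tuned 2048 cards: %d KiB per stride, %d strides in 3 GiB%n",
                strideBytes(2048) / KIB, strideCount(3 * GIB, 2048));
        // prints "tuned 2048 cards: 1024 KiB per stride, 3072 strides in 3 GiB"
    }
}
```

With 2048 cards per stride, the stride count stays far above a typical number of GC worker threads, so work stealing still has room to balance the load, while the per-stride overhead is paid eight times less often than with the default.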

The lesson here was that young GC time increases as the size of the old generation increases.

This was the end of the tuning exercise and we were able to get 300ms of total young GC pause time for a 300s test. Old Generation pause time was also acceptable.
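For reference, the options discussed in this article can be gathered in one place and compared with what a JVM was actually started with. The sketch below uses the standard `RuntimeMXBean.getInputArguments()` call. The exact spellings `-Xms12g`, `-Xmx12g`, and `-XX:NewSize=9g` are assumptions about how the sizes described above would be written on a command line, and the class name is hypothetical. CMS has been removed from newer JDKs, so the CMS-related flags apply only to JVMs that still ship it.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical checker: reports which of the article's options a JVM was NOT started with.
final class GcFlagCheck {

    /** Options from the article; heap and NewSize spellings are assumed, not quoted. */
    static final List<String> TUNED_FLAGS = Arrays.asList(
            "-Xms12g",                          // assumed spelling: 12 GB minimum heap
            "-Xmx12g",                          // assumed spelling: 12 GB maximum heap
            "-XX:NewSize=9g",                   // assumed spelling: 9 GB young generation
            "-XX:SurvivorRatio=18",
            "-XX:MaxTenuringThreshold=2",
            "-XX:+UseConcMarkSweepGC",
            "-XX:+UseParNewGC",
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ParGCCardsPerStrideChunk=2048");

    /** Returns the wanted flags that do not appear verbatim in the JVM arguments. */
    static List<String> missingFlags(List<String> jvmArgs, List<String> wanted) {
        List<String> missing = new ArrayList<>();
        for (String flag : wanted) {
            if (!jvmArgs.contains(flag)) {
                missing.add(flag);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("Not set: " + missingFlags(jvmArgs, TUNED_FLAGS));
    }
}
```

The comparison is a plain string match, so an equivalent option written differently (for example `-Xmx12288m`) would be reported as not set.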

3. Conclusion

It is very interesting to see how ‘circumstances/objective driven GC tuning’ can help achieve our performance goals. The next step in GC tuning always depends on how the previous steps have been carried out. So, we should be careful with false signals which may take our tuning exercise to a completely different path. Sometimes, increased demand of JVM heap space turns out to be a myth and on many occasions, it can be the application’s need for a specific use case. We need to look into the characteristics of the application to be able to tune a particular region of the JVM heap or all the regions in a step-by-step manner.

4. References

[1] Alexey Ragozin, “Secret HotSpot option improving GC pauses on large heaps”, Mar. 28, 2012, http://blog.ragozin.info/2012/03/secret-hotspot-option-improving-gc.html

[2] Swapnil Ghike, “Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications”, Apr. 8, 2014, https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications

[3] Oracle Documentation, http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html

[4] Dr. Andreas Müller, “Tuning Garbage Collection for Mission-Critical Java Applications”, Mar. 27, 2013, http://blog.mgm-tp.com/2013/03/garbage-collection-tuning/

Further read:
Check out this list of some of the efficient ways in which you could write your code to minimize memory leaks and improve the way your application performs: Best Coding Practices

Read about various kinds of dumps we need to get from the application or hardware environment to do the root cause of the performance issue in the article: Performance Issues - how to take various dumps for analysis
