Thursday, November 18, 2010

Batch Fetching, part II - Journey to Big Data

In my last post I investigated different ways to optimize the loading of objects and their relationships. I presented some performance results, but from a very small database with small query result sets. In this post I will investigate how those same query optimization strategies scale to larger data sets.

The previous post only had a database of 12 rows. So for this run I will increase the database size by 5,000 to 60,000 rows. Still not a huge database, but should be big enough to highlight any performance differences in the results. I will also increase the size of the query result set from 6 employees, to 5,000 employees.

The first thing that I noticed in this run was that Oracle has a limit of 1,000 parameters per statement. Since the IN batch fetching binds a large array, and I'm reading 5,000 objects, this limit was exceeded and the run blew up with a database error. The BatchFetchPolicy in EclipseLink accounts for this and defines a size for the max number of ids to include the an IN. The default size limit in EclipseLink was suppose to be 500, but I think I remember increasing it to 100,000 to test something when I was developing the feature, and, well..., I guess never set it back, oops, I will fix this...

EclipseLink defines a JPA Query hint "eclipselink.batch.size" that allows the size to be set. So I will set this to 500 for the test. This means that to read in all of the 5,000 objects, the IN batch fetch will need to execute 10 queries per batch fetched relationship. It will be interesting to see how it compares to the other query optimization techniques.

The run time was also increased to 10 minutes from 1 minute because reading 5,000 objects obviously takes longer than reading 6. Also, since the last run our lab got a new database machine. The old database was running on Linux on an old server machine, and the new database is running on a new Oracle Sun server machine running Linux on a virtualized environment. The client machine is the same, my old desktop running Oracle Sun JDK 1.6 on Windows XP. Both databases were Oracle 11g.

Big Data Results, run 1, simple (fetch address, phoneNumbers)

QueryAverage (queries/10 minutes)%STD%DIF (of standard)
standard274.5%0%
join fetch3070.1%+1037%
batch fetch (JOIN)3100.2%+1048%
batch fetch (EXISTS)3090.1%+1044%
batch fetch (IN)2610.1%+866%

The results show that join fetching and batch fetching have basically equivalent performance, and about 10x better performance than the non-optimize query. IN batch fetching does not perform as well as the others with this larger result set. It performs better than I expected, given it has huge IN statement and has to execute 10 queries per relationship. Note that these results differ from the previous post that showed IN batch fetching performing the best for queries with small result sets.

The second run uses the complex query which fetches 9 different relationships.

Big Data Results, run 2, complex (fetch address, phoneNumbers, projects, manager, managedEmployees, emailAddresses, responsibilities, jobTitle, degrees)

QueryAverage (queries/10 minutes)%STD%DIF (of standard)
standard60.0%0%
join fetch591.5%+383%
batch fetch (JOIN)1250.3%+1983%
batch fetch (EXISTS)1240.3%+1966%
batch fetch (IN)803.2%+1455%

The results show batch fetching having about 2x the performance of join fetching, and 20x the performance of the non-optimize query. Join fetching still performs 10x faster than the non-optimize case, which is different than the small result set run which gave it worse performance than the non-optimized query. IN batch fetching again did not perform as well as JOIN and EXISTS batch fetching, but still out performed join fetching and was 15x faster than the non-optimized query.

Note that these results are not universal. Expect that every database, every machine, every environment, every query, and every object model will give different results. The basics should be the same though, batch fetch should have better performance than non-optimized queries, and better performance than join fetching for objects with complex relationships. IN batch fetching will perform worse for large result sets, but have similar performance for small result sets. Join fetching will perform well for objects with a small number of relationships.

To see how the results differ in different environments, I did a few more runs on different databases. The next run is for a local Oracle 10g database installed on my desktop. This database is slower than the new server, but will not require a network trip since it is on the same machine as the Java client.

Big Data Results, run 3, simple (fetch address, phoneNumbers)

QueryAverage (queries/10 minutes)%STD%DIF (of standard)
standard110.0%0%
join fetch3460.7%+2718%
batch fetch (JOIN)3100.0%+2718%
batch fetch (EXISTS)3430.6%+3018%
batch fetch (IN)2600.1%+2263%

Big Data Results, run 4, complex (fetch address, phoneNumbers, projects, manager, managedEmployees, emailAddresses, responsibilities, jobTitle, degrees)

QueryAverage (queries/10 minutes)%STD%DIF (of standard)
standard10.0%0%
join fetch391.1%+3800%
batch fetch (JOIN)1060.8%+10500%
batch fetch (EXISTS)1130.0%+11200%
batch fetch (IN)361.2%+3500%

These results show similar results for the previous simple run, but about a 30x improvement over the non-optimize query, which is a bigger difference. The JOIN batch fetch did not seem to perform as well as the EXISTS or join fetch.

The complex run only completed a single run in the 10 minutes for the non-optimized query. The JOIN and EXISTS batch fetching performed the best, over 100x faster than the non-optimized query. Join fetching and IN batch fetching did not perform as well, but were both still over 30x faster than the non-optimized query.