Tuesday, 8 October 2013

Comparing MongoDB vs CouchDB vs Riak




I have recently been trying to understand different NoSQL databases. I looked at MongoDB, CouchDB, and Riak and compared their features. Below is my understanding.



| Features | MongoDB | CouchDB | Riak |
|---|---|---|---|
| Client API support | Yes | Yes (community support only) | Yes |
| Open-source client libraries | C, C++, C#, Erlang, Java, JavaScript, Node.js, Perl, PHP, Python, Ruby, Scala | Clojure, Common Lisp, Erlang, Java, JavaScript, Lua, .Net, Node.js, Perl, PHP, PL/SQL, Python, Ruby, Scala | Erlang, Java, PHP, Python, Ruby |
| Clustering | Master-slave | Master-master | Masterless |
| Sharding | Yes | - | Yes |
| Environments supported | Windows, Linux, Mac OS X, Solaris | Windows, Mac OS X, Linux | Debian, Ubuntu, FreeBSD, Mac, Red Hat Enterprise, Fedora, SmartOS, Solaris |
| Windows installation | Yes | Yes | No |
| Replication | Yes | Yes | Yes |
| Writes allowed on all nodes | No | Yes | Yes |
| Clustering/sharding built in | Yes | No | Yes |
| Replication built in | Yes | Yes | Yes |
| CAP theorem | CP | AP | AP |
| License | AGPL (drivers: Apache) | Apache | Apache |
| Commercial/enterprise licenses | Yes | No | Yes |
| Storage type | Document-oriented (BSON) | Document-oriented (JSON) | Bucket, key/value |





5 Simple Tips To Improve Performance in Java


This whitepaper throws light on areas of programming where simple techniques can be used to improve performance on the Java platform.

Performance depends on how data is processed and with which API. String and collection objects are mainly used to process data, and a database is used to store it. This whitepaper discusses a few techniques for handling Strings, collections, and database access that can improve performance in terms of CPU usage, memory usage, and execution time.

The performance differences are demonstrated through sample programs. The programs were executed in the Eclipse IDE and monitored using the jvisualvm tool shipped with the Oracle JDK, on the Windows 7 platform.

A simple operation which executes in milliseconds may not perform well inside a loop or across multiple threads. The techniques below throw some light on such operations and suggest alternatives.

1. String.split() versus String.indexOf():
         The split API of String is very commonly used, but when used in a multithreaded environment or inside a loop it can cause performance bottlenecks. The test below shows that the indexOf API is faster than the split API. It was performed over a 5 MB String containing 50000 comma-separated values. Test 1 uses the split API to fetch the values from the string. Test 2 uses a precompiled regex Pattern for the split. Test 3 uses the indexOf and substring APIs to fetch the values.

Test 1: split API                                  Execution time: 24256 ms

Test 2: split API with precompiled regex pattern   Execution time: 23345 ms

Test 3: indexOf API                                Execution time: 5678 ms

These results show that String.indexOf() has a huge performance benefit.
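
As an illustration of the indexOf/substring approach, here is a minimal sketch; the class and method names are my own and are not the exact test harness used above:

import java.util.ArrayList;
import java.util.List;

public class CsvParseSketch {

    // Fetch comma-separated values with indexOf/substring instead of split()
    static List<String> parseWithIndexOf(String csv) {
        List<String> values = new ArrayList<String>();
        int start = 0;
        int comma;
        while ((comma = csv.indexOf(',', start)) != -1) {
            values.add(csv.substring(start, comma));
            start = comma + 1;
        }
        values.add(csv.substring(start)); // the value after the last comma
        return values;
    }

    public static void main(String[] args) {
        String csv = "one,two,three";
        System.out.println(parseWithIndexOf(csv)); // prints [one, two, three]
    }
}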


2. Calculating the byte length of a large String:
          The available Java API to calculate the byte length of a String is:

                   stringObj.getBytes(charsetName).length

           The getBytes API generates the byte array of the string, and the array's length attribute gives the byte length. There is no direct way to get the byte length without converting the string to a byte array. If the intention is only to find the byte length, the string can be processed in chunks: each chunk is converted to a byte array to get its length, and the chunk lengths are summed to get the byte length of the whole string.
The single-threaded test below finds the byte length of a 75 MB String. Test 1 converts the whole string to a byte array and takes its length. Test 2 converts the string in small chunks and sums the chunk byte lengths to get the total.

Test 1: whole string converted at once     Execution time: 642 ms

Test 2: string converted in chunks         Execution time: 468 ms


          Test 2 performs better and uses less memory. Below is the utility method used in Test 2 to calculate the byte length of a string.


static private int getByteLength(String dataString, String charsetName)
        throws UnsupportedEncodingException {
    int dataLength = dataString.length();
    int index = 0;
    int byteLength = 0;
    int strChunkSize = 10000;
    while (index < dataLength) {
        int end = Math.min(index + strChunkSize, dataLength);
        // Avoid splitting a surrogate pair across two chunks; otherwise the
        // byte count would be wrong for multi-byte charsets such as UTF-8.
        if (end < dataLength && Character.isHighSurrogate(dataString.charAt(end - 1))) {
            end--;
        }
        // Convert one small chunk at a time so only a small byte[] is live
        byteLength += dataString.substring(index, end).getBytes(charsetName).length;
        index = end;
    }
    return byteLength;
}
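
A small driver (my own sketch, placed in the same class as getByteLength above) that exercises both approaches against the same string could look like this:

public static void main(String[] args) throws Exception {
    // Build a large throwaway string just for the comparison
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000000; i++) {
        sb.append("value-").append(i).append(',');
    }
    String data = sb.toString();
    int direct = data.getBytes("UTF-8").length;   // Test 1 style: whole string at once
    int chunked = getByteLength(data, "UTF-8");   // Test 2 style: chunked
    System.out.println(direct + " == " + chunked);
}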


This is addressed by a performance improvement planned for the JDK 8 release (see the JEP linked below). Until then, the above trick might help.

http://openjdk.java.net/jeps/112 

                "Implement the sun.nio.cs.ArrayDecoder/Encoder API for the most frequently used double-byte charsets to enhance new String(byte[]) and String.getBytes() performance."



3. Using CLOB fields in Oracle DB
          CLOB fields in Oracle DB have to be used very carefully in a multithreaded environment. Retrieving and updating these fields becomes a performance bottleneck when multiple threads access them.
          In Test 1, 100 threads read a CLOB field holding 8 MB of data. In Test 2, 100 threads read a non-CLOB (VARCHAR) field holding less data. Execution time, CPU usage, and memory usage are all higher when the CLOB field is accessed.


Test 1: accessing CLOB field      Execution time: 58741 ms (100 threads)

Test 2: accessing VARCHAR field   Execution time: 469 ms (100 threads)

          Oracle forums note that if the data in a CLOB field is bigger than 32 KB, retrieval takes noticeably longer. Consider not storing huge data in a CLOB field if it must be accessed from multiple threads with low execution time requirements.
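
If a large CLOB cannot be avoided, one mitigation is to stream the column rather than fetch it as a single value. A minimal JDBC sketch follows; the table and column names here are hypothetical:

import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ClobReadSketch {

    // Reads a CLOB column through a character stream so that only one
    // buffer's worth of data is held per read, keeping per-thread memory bounded.
    static String readClob(Connection conn, int id) throws Exception {
        String sql = "SELECT payload FROM documents WHERE id = ?"; // hypothetical table/column
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                StringBuilder sb = new StringBuilder();
                try (Reader reader = rs.getClob(1).getCharacterStream()) {
                    char[] buf = new char[8192];
                    int n;
                    while ((n = reader.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                }
                return sb.toString();
            }
        }
    }
}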

4. Comparing LinkedHashSet, ArrayList and LinkedList
          When we handle a very large amount of data, we often need to hold it in a collection, and choosing the right collection definitely adds to performance. Let's take a scenario where we must maintain insertion order and must not allow duplicates. After filling the collection, we pick elements one by one for processing, and remove each element after processing it. To do this we need four operations:
    1. Add to the collection.
    2. Retrieve from the collection.
    3. Check whether the collection already contains the object before adding it (to avoid duplication).
    4. Delete from the collection.

          The collection must also maintain insertion order. Let's compare the collections that can do the above operations: ArrayList, LinkedList, and LinkedHashSet. The test below runs 100 threads, each thread adding 10000 items to its own collection and checking before each add whether the item is already present (a sketch of this per-thread work appears after the results).


Test 1: ArrayList       Execution Time: 122714 milliseconds

Test 2: LinkedList      Execution Time: 180217 milliseconds

Test 3: LinkedHashSet      Execution Time: 7836 milliseconds


         Looking at the results above, LinkedHashSet performs far better in terms of execution time, memory usage, and CPU usage.
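
Here is a minimal single-threaded sketch of the per-thread work (my own illustration of the four operations, not the exact load-test harness):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.LinkedList;

public class CollectionChoiceSketch {

    // Duplicate check, add, ordered retrieval, and removal on one collection
    static long process(Collection<Integer> c, int count) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            // contains() is O(n) for ArrayList/LinkedList, O(1) for LinkedHashSet
            if (!c.contains(i)) {
                c.add(i);
            }
        }
        Iterator<Integer> it = c.iterator();
        while (it.hasNext()) {
            it.next();   // retrieve in insertion order
            it.remove(); // delete after processing
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        int n = 10000;
        System.out.println("ArrayList:     " + process(new ArrayList<Integer>(), n) + " ms");
        System.out.println("LinkedList:    " + process(new LinkedList<Integer>(), n) + " ms");
        System.out.println("LinkedHashSet: " + process(new LinkedHashSet<Integer>(), n) + " ms");
    }
}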

The results of all the above tests are tabulated below:

| Utility Method | Memory (MB) | CPU (avg %) | Execution Time (sec) |
|---|---|---|---|
| String split | 800 | 66 | 24 |
| String split with precompiled regex pattern | 450 | 56 | 23 |
| String indexOf | 35 | 45 | 5 |
| String.getBytes().length | 500 | 1 | 0.6 |
| Custom getByteLength | 180 | 20 | 0.5 |
| Retrieve CLOB field data | 35 | 15 | 58 |
| Retrieve non-CLOB field data | 22 | 8 | 0.4 |
| Load test on ArrayList | 80 | 76 | 122 |
| Load test on LinkedList | 100 | 73 | 180 |
| Load test on LinkedHashSet | 30 | 60 | 7 |


5. StringBuilder, StringBuffer, String
          Everyone knows that heavy use of String concatenation degrades performance. I still deliberately include it in my list because it makes a big difference when string objects are used inside a loop or across multiple threads. We habitually reach for String where a StringBuilder (or a StringBuffer, when thread safety is needed) should be used instead.
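
A minimal sketch of the difference (timings will vary by machine):

public class ConcatSketch {

    public static void main(String[] args) {
        int n = 20000;

        // String concatenation in a loop copies the whole accumulated
        // string on every iteration: O(n^2) work overall.
        long start = System.currentTimeMillis();
        String s = "";
        for (int i = 0; i < n; i++) {
            s += i;
        }
        System.out.println("String:        " + (System.currentTimeMillis() - start) + " ms");

        // StringBuilder appends into a growable buffer: O(n) overall.
        // Use StringBuffer instead when the builder is shared across threads.
        start = System.currentTimeMillis();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        String result = sb.toString();
        System.out.println("StringBuilder: " + (System.currentTimeMillis() - start) + " ms");
        System.out.println(s.length() + " == " + result.length());
    }
}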

Below are a few other useful guidelines which might help performance.

Checksum Calculation:
          While transferring files, we need to make sure the receiver gets the complete content. For that we can use checksum algorithms such as MD5 or CRC32. The checksum calculation can be embedded in the input stream itself; the code below does that.

MessageDigest algorithm = MessageDigest.getInstance("MD5");
FileInputStream fis = new FileInputStream("/data/file/100mbFile.txt");
BufferedInputStream bis = new BufferedInputStream(fis);
// Wrapping the stream: every byte read also updates the MD5 digest
DigestInputStream dis = new DigestInputStream(bis, algorithm);

          The DigestInputStream object can then be used for normal data processing. Note that the skip method does not update the checksum for the bytes it skips.
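
For completeness, here is a runnable sketch (the file path is just a placeholder) that reads a file through the digest stream and prints the resulting MD5 checksum:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class ChecksumSketch {

    public static void main(String[] args) throws Exception {
        MessageDigest algorithm = MessageDigest.getInstance("MD5");
        DigestInputStream dis = new DigestInputStream(
                new BufferedInputStream(new FileInputStream("/data/file/100mbFile.txt")),
                algorithm);
        try {
            byte[] buffer = new byte[8192];
            // Every byte read through the stream also updates the digest
            while (dis.read(buffer) != -1) {
                // normal data processing would happen here
            }
        } finally {
            dis.close();
        }
        // digest() returns the MD5 checksum of everything that was read
        StringBuilder hex = new StringBuilder();
        for (byte b : algorithm.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("MD5: " + hex);
    }
}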


Use UTF-8 charset while converting bytes to string:
          To convert bytes to a String, always specify the desired charset; the most widely used is UTF-8. You can use the code below to do that. It is a simple thing, but it avoids subtle issues with special characters when the platform default charset differs.
  
new String(bytes, "UTF-8");
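
From Java 7 onwards, java.nio.charset.StandardCharsets can be passed instead of the charset name, which avoids the checked UnsupportedEncodingException:

import java.nio.charset.StandardCharsets;

public class Utf8Sketch {

    public static void main(String[] args) {
        byte[] bytes = "héllo".getBytes(StandardCharsets.UTF_8);
        // The Charset overload cannot throw UnsupportedEncodingException
        String text = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(text);
    }
}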



Conclusion: 
          
             Choosing the right API for data processing yields better performance and keeps the overall memory footprint, especially the old generation, as small as possible. This makes the application more scalable and reliable.