This whitepaper highlights areas of programming where simple techniques can be used to improve performance on the Java platform.
Performance depends largely on how data is processed and which API is used to process it. String and Collections objects are most commonly used to process data, and a database is used to store it. This whitepaper discusses a few techniques for handling Strings, Collections and database access that can improve performance in terms of CPU usage, memory usage and execution time.
The performance gains are demonstrated through illustrative programs. The programs were executed in the Eclipse IDE and monitored using the jvisualvm tool provided with the Oracle JDK on the Windows 7 platform.
A simple operation that executes in milliseconds on its own may not perform well inside a loop or across multiple threads. The techniques below shed some light on such operations and on alternatives that can be used instead.
1. String.split() versus String.indexOf():
The split() API of String is very commonly used, but in a multithreaded environment it can become a performance bottleneck. The test below shows that the indexOf() API performs better than split(). It is performed over a 5 MB String containing 50000 comma-separated values. Test 1 uses the split() API to fetch the values from the string. Test 2 uses a precompiled regex Pattern for the split. Test 3 uses the indexOf() and substring() APIs to fetch the values.
Test 1: split() API - Execution Time: 24256 milliseconds
Test 2: split() with precompiled regex Pattern - Execution Time: 23345 milliseconds
Test 3: indexOf() API - Execution Time: 5678 milliseconds
The above results show that String.indexOf() has a huge performance benefit.
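For reference, a minimal sketch of the indexOf()/substring() approach used in Test 3 might look like the following (the class and method names are illustrative, and the input here is a small stand-in for the 5 MB test string):

import java.util.ArrayList;
import java.util.List;

public class IndexOfSplit {

    // Walks the string once, cutting out each value between delimiters,
    // without any regex machinery.
    static List<String> splitByIndexOf(String data, char delimiter) {
        List<String> values = new ArrayList<String>();
        int start = 0;
        int end;
        while ((end = data.indexOf(delimiter, start)) != -1) {
            values.add(data.substring(start, end));
            start = end + 1;
        }
        values.add(data.substring(start)); // value after the last delimiter
        return values;
    }

    public static void main(String[] args) {
        System.out.println(splitByIndexOf("a,b,c", ',')); // prints [a, b, c]
    }
}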
2. Calculating the byte length of a large String:
The Java API available to calculate the byte length of a String is as below:
stringObj.getBytes(charsetName).length
The getBytes() API generates the byte array of the string, and the length field of the array gives the byte length. There is no other way to identify the byte length without converting the string to a byte array. If the intention is only to identify the byte length, the string can be processed in chunks, each chunk converted to a byte array to get its length. The byte lengths of the chunks can then be summed up to get the byte length of the whole string.
The single-threaded test below identifies the byte length of a 75 MB String. Test 1 converts the whole string to a byte array and reads its length. Test 2 converts the string to small chunks, identifies the byte length of each and sums them up to get the total byte length.
Test 1: Execution Time - 642 milliseconds
Test 2: Execution Time - 468 milliseconds
Test 2 performs better and uses less memory. Below is the utility method used to calculate the byte length of a string in Test 2:
private static int getByteLength(String dataString, String charsetName)
        throws UnsupportedEncodingException {
    int dataLength = dataString.length();
    int index = 0;
    int byteLength = 0;
    int strChunkSize = 10000;
    while (index < dataLength) {
        // Shrink the final chunk so it does not run past the end of the string.
        if (index + strChunkSize > dataLength) {
            strChunkSize = dataLength - index;
        }
        // Encode one chunk at a time, so the whole string is never
        // materialized as a single huge byte array.
        byteLength += dataString.substring(index, index + strChunkSize)
                .getBytes(charsetName).length;
        index += strChunkSize;
    }
    // Note: for variable-width charsets, a chunk boundary that splits a
    // surrogate pair can slightly distort the count for supplementary characters.
    return byteLength;
}
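For example, a call to this utility (largeString here is a placeholder for the 75 MB test string) would look like:

int byteLength = getByteLength(largeString, "UTF-8");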
A performance improvement in this area is planned for the JDK 8 release (see the JEP linked below). Until then, the above technique may help.
http://openjdk.java.net/jeps/112
"Implement the sun.nio.cs.ArrayDecoder/Encoder API for the most frequently used double-byte charsets to enhance new String(byte[])
and String.getBytes()
performance."
3. Using CLOB fields in Oracle DB
CLOB fields in Oracle DB have to be used very carefully in a multithreaded environment. Retrieving and updating these fields becomes a performance bottleneck when multiple threads access them.
In Test 1, 100 threads read a CLOB field holding 8 MB of data. In Test 2, 100 threads read a non-CLOB (VARCHAR) field holding less data. Execution time, CPU usage and memory usage are all higher when accessing the CLOB field.
Test 1: accessing the CLOB field - Execution Time: 58741 milliseconds (100 threads)
Test 2: accessing the VARCHAR field - Execution Time: 469 milliseconds (100 threads)
Discussions on Oracle forums suggest that if the data in a CLOB field is larger than 32 KB, retrieval takes noticeably longer. Consider not storing huge data in a CLOB field if it must be accessed from multiple threads with low execution time.
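For context, the per-thread CLOB read in a test like the one above might look like the following sketch using standard JDBC (the table and column names are assumptions, not from the original tests):

import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ClobReadSketch {

    // Reads one CLOB row; each test thread would run code like this
    // against its own Connection.
    static String readClob(Connection con, long id) throws Exception {
        String sql = "SELECT payload FROM documents WHERE id = ?"; // hypothetical table/column
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                StringBuilder sb = new StringBuilder();
                // Stream the CLOB instead of materializing it in one call.
                try (Reader reader = rs.getCharacterStream("payload")) {
                    char[] buf = new char[8192];
                    int n;
                    while ((n = reader.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                }
                return sb.toString();
            }
        }
    }
}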
4. Comparing LinkedHashSet, ArrayList and LinkedList
When we handle a very large amount of data, we often need to hold it in a collection, and choosing the right collection definitely adds to performance. Let's take a scenario where we need to maintain insertion order and must not allow duplicates. After filling the collection with data, we pick entries one by one for processing, and after processing we remove them from the collection. To do this we need four operations:
1. Add to the collection.
2. Retrieve from the collection.
3. Check whether the collection already contains the object before adding it (to avoid duplicates).
4. Delete from the collection.
Moreover, the collection should maintain insertion order. Let's compare the collections that can support these operations: ArrayList, LinkedList and LinkedHashSet. The test below runs 100 threads, each thread accessing its own collection and adding 10000 entries to it, checking before each add whether the collection already contains the entry (a sketch of the LinkedHashSet test appears below).
Test 1: ArrayList Execution Time: 122714 milliseconds
Test 2: LinkedList Execution Time: 180217 milliseconds
Test 3: LinkedHashSet Execution Time: 7836 milliseconds
Based on the above results, LinkedHashSet performs far better in terms of execution time, memory usage and CPU usage. This is expected: the contains() check is O(n) for ArrayList and LinkedList, but O(1) on average for LinkedHashSet, which is backed by a hash table.
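A minimal sketch of the LinkedHashSet variant of this load test (the thread and element counts are from the description above; the harness details are assumptions):

import java.util.LinkedHashSet;
import java.util.Set;

public class CollectionLoadTest {

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[100];
        long start = System.currentTimeMillis();
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(new Runnable() {
                public void run() {
                    // Each thread fills its own collection, as in the test.
                    Set<String> data = new LinkedHashSet<String>();
                    for (int i = 0; i < 10000; i++) {
                        String value = "value-" + i;
                        if (!data.contains(value)) { // duplicate check before adding
                            data.add(value);
                        }
                    }
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("Elapsed: " + (System.currentTimeMillis() - start) + " ms");
    }
}

Swapping the Set for an ArrayList or LinkedList reproduces the slower variants, since their contains() check must scan the whole list.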
The results of all the above tests are tabulated below.
Utility Method                             Memory (MB)   CPU (avg %)   Execution Time (sec)
String split                               800           66            24
Precompiled regex Pattern String split     450           56            23
String indexOf                             35            45            5
String.getBytes().length                   500           1             0.6
Custom getByteLength method                180           20            0.5
Retrieve CLOB field data                   35            15            58
Retrieve non-CLOB field data               22            8             0.4
Load test on ArrayList                     80            76            122
Load test on LinkedList                    100           73            180
Load test on LinkedHashSet                 30            60            7
5. StringBuilder, StringBuffer, String
Everyone knows that heavy use of String concatenation degrades performance. I still include it in this list because it makes a large difference when string objects are concatenated inside a loop or across multiple threads. Out of habit we reach for String objects, but such code should be changed to StringBuilder, or StringBuffer when thread safety is needed, as the sketch below illustrates.
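For illustration, a small sketch of the difference inside a loop (the iteration count is arbitrary; timings will vary by machine):

public class ConcatComparison {

    public static void main(String[] args) {
        int n = 100000;

        // String concatenation: every += allocates a new String, so the
        // cost grows roughly quadratically with the loop count.
        long start = System.currentTimeMillis();
        String s = "";
        for (int i = 0; i < n; i++) {
            s += "x";
        }
        System.out.println("String:        " + (System.currentTimeMillis() - start) + " ms");

        // StringBuilder appends into one growing buffer: roughly linear cost.
        start = System.currentTimeMillis();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("x");
        }
        String built = sb.toString();
        System.out.println("StringBuilder: " + (System.currentTimeMillis() - start) + " ms");
    }
}

StringBuffer behaves like StringBuilder but with synchronized methods, so prefer it only when the buffer is shared between threads.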
Below are a few other useful guidelines that might help performance.
Checksum Calculation:
While transferring files we need to make sure the receiver gets the complete content. For that we can use checksum algorithms such as MD5 or CRC32 to verify completeness. The checksum calculation can be embedded in the input stream itself, so it happens while the data is read. The code below does that:
MessageDigest algorithm = MessageDigest.getInstance("MD5");
FileInputStream fis = new FileInputStream("/data/file/100mbFile.txt");
BufferedInputStream bis = new BufferedInputStream(fis);
DigestInputStream dis = new DigestInputStream(bis, algorithm);
The DigestInputStream object can then be used for data processing; the digest is updated as the bytes are read. Note that the skip() method does not update the checksum for the bytes that are skipped.
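To complete the checksum, read the stream to the end and then take the digest (continuing the example above):

byte[] buffer = new byte[8192];
while (dis.read(buffer) != -1) {
    // Reading through the DigestInputStream updates the MD5 digest as a side effect.
}
dis.close();
byte[] checksum = algorithm.digest(); // the 16-byte MD5 checksum of the file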
Use the UTF-8 charset while converting bytes to String:
To convert bytes to a String, always specify the desired charset; the most widely used charset is UTF-8. You can use the code below to do that. It is a simple thing, but this awareness avoids issues in representing special characters.
new String(bytes, "UTF-8");
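On Java 7 and later, the constant from java.nio.charset.StandardCharsets can be used instead of the charset name; it also avoids the checked UnsupportedEncodingException that the string-based overload throws:

new String(bytes, java.nio.charset.StandardCharsets.UTF_8);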
Conclusion:
Choosing the right API for data processing yields better performance and also keeps the overall memory footprint, especially the old generation, as small as possible. This makes the application more scalable and reliable.