Tuesday, 8 October 2013

Comparing MongoDB vs CouchDB vs Riak




I have recently been trying to understand different NoSQL databases. I looked at MongoDB, CouchDB, and Riak and compared their features. Below is my understanding.



| Features | MongoDB | CouchDB | Riak |
|---|---|---|---|
| Client API support | Yes | Yes (community support only) | Yes |
| Open-source client libraries | C, C++, C#, Erlang, Java, JavaScript, Node.js, Perl, PHP, Python, Ruby, Scala | Clojure, Common Lisp, Erlang, Java, JavaScript, Lua, .Net, Node.js, Perl, PHP, PL/SQL, Python, Ruby, Scala | Erlang, Java, PHP, Python, Ruby |
| Clustering | Master-slave | Master-master | Masterless |
| Sharding | Yes | - | Yes |
| Environments supported | Windows, Linux, Mac OS X, Solaris | Windows, Mac OS X, Linux | Debian, Ubuntu, FreeBSD, Mac, Red Hat Enterprise, Fedora, SmartOS, Solaris |
| Windows installation | Yes | Yes | No |
| Replication | Yes | Yes | Yes |
| Writes allowed on all nodes | No | Yes | Yes |
| Clustering/sharding built in | Yes | No | Yes |
| Replication built in | Yes | Yes | Yes |
| CAP theorem | CP | AP | AP |
| License | AGPL (drivers: Apache) | Apache | Apache |
| Commercial/enterprise licenses | Yes | No | Yes |
| Storage type | Document-oriented (BSON) | Document-oriented (JSON) | Bucket, key/value |





5 Simple Tips To Improve Performance in Java


This whitepaper throws light on areas of programming where simple techniques can be used to improve performance on the Java platform.

Performance depends on how data is processed and with which API. String and collection objects are mainly used to process data, and a database is used to store it. This whitepaper discusses a few techniques for handling Strings, collections, and database access that can improve performance in terms of CPU usage, memory usage, and execution time.

The performance differences are demonstrated through sample programs. The programs were executed in the Eclipse IDE and monitored using the jvisualvm tool shipped with the Oracle JDK, on the Windows 7 platform.

A simple operation which executes in milliseconds may not perform well inside a loop or across multiple threads. The techniques below throw some light on such operations and suggest alternatives.

1. String.split() versus String.indexOf():
         The split API of String is very commonly used, but when used in a multithreaded environment or inside a loop it can cause performance bottlenecks. The test below shows that the indexOf API is faster than the split API. It was performed over a 5 MB String containing 50000 comma-separated values. Test 1 uses the split API to fetch the values from the string. Test 2 uses a precompiled regex Pattern for the split. Test 3 uses the indexOf and substring APIs to fetch the values.

Test 1: split API                                  Execution time: 24256 ms

Test 2: split API with precompiled regex pattern   Execution time: 23345 ms

Test 3: indexOf API                                Execution time: 5678 ms

These results show that String.indexOf() has a huge performance benefit.
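
As an illustration of the indexOf/substring approach, here is a minimal sketch; the class and method names are my own and are not the exact test harness used above:

import java.util.ArrayList;
import java.util.List;

public class CsvParseSketch {

    // Fetch comma-separated values with indexOf/substring instead of split()
    static List<String> parseWithIndexOf(String csv) {
        List<String> values = new ArrayList<String>();
        int start = 0;
        int comma;
        while ((comma = csv.indexOf(',', start)) != -1) {
            values.add(csv.substring(start, comma));
            start = comma + 1;
        }
        values.add(csv.substring(start)); // the value after the last comma
        return values;
    }

    public static void main(String[] args) {
        String csv = "one,two,three";
        System.out.println(parseWithIndexOf(csv)); // prints [one, two, three]
    }
}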


2. Calculating the byte length of a large String:
          The available Java API to calculate the byte length of a String is:

                   stringObj.getBytes(charsetName).length

           The getBytes API generates the byte array of the string, and the array's length attribute gives the byte length. There is no direct way to get the byte length without converting the string to a byte array. If the intention is only to find the byte length, the string can be processed in chunks: each chunk is converted to a byte array to get its length, and the chunk lengths are summed to get the byte length of the whole string.
The single-threaded test below finds the byte length of a 75 MB String. Test 1 converts the whole string to a byte array and takes its length. Test 2 converts the string in small chunks and sums the chunk byte lengths to get the total.

Test 1: whole string converted at once     Execution time: 642 ms

Test 2: string converted in chunks         Execution time: 468 ms


          Test 2 performs better and uses less memory. Below is the utility method used in Test 2 to calculate the byte length of a string.


static private int getByteLength(String dataString, String charsetName)
        throws UnsupportedEncodingException {
    int dataLength = dataString.length();
    int index = 0;
    int byteLength = 0;
    int strChunkSize = 10000;
    while (index < dataLength) {
        int end = Math.min(index + strChunkSize, dataLength);
        // Avoid splitting a surrogate pair across two chunks; otherwise the
        // byte count would be wrong for multi-byte charsets such as UTF-8.
        if (end < dataLength && Character.isHighSurrogate(dataString.charAt(end - 1))) {
            end--;
        }
        // Convert one small chunk at a time so only a small byte[] is live
        byteLength += dataString.substring(index, end).getBytes(charsetName).length;
        index = end;
    }
    return byteLength;
}
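
A small driver (my own sketch, placed in the same class as getByteLength above) that exercises both approaches against the same string could look like this:

public static void main(String[] args) throws Exception {
    // Build a large throwaway string just for the comparison
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000000; i++) {
        sb.append("value-").append(i).append(',');
    }
    String data = sb.toString();
    int direct = data.getBytes("UTF-8").length;   // Test 1 style: whole string at once
    int chunked = getByteLength(data, "UTF-8");   // Test 2 style: chunked
    System.out.println(direct + " == " + chunked);
}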


This is addressed by a performance improvement planned for the JDK 8 release (see the JEP linked below). Until then, the above trick might help.

http://openjdk.java.net/jeps/112 

                "Implement the sun.nio.cs.ArrayDecoder/Encoder API for the most frequently used double-byte charsets to enhance new String(byte[]) and String.getBytes() performance."



3. Using CLOB fields in Oracle DB
          CLOB fields in Oracle DB have to be used very carefully in a multithreaded environment. Retrieving and updating these fields becomes a performance bottleneck when multiple threads access them.
          In Test 1, 100 threads read a CLOB field holding 8 MB of data. In Test 2, 100 threads read a non-CLOB (VARCHAR) field holding less data. Execution time, CPU usage, and memory usage are all higher when the CLOB field is accessed.


Test 1: accessing CLOB field      Execution time: 58741 ms (100 threads)

Test 2: accessing VARCHAR field   Execution time: 469 ms (100 threads)

          Oracle forums note that if the data in a CLOB field is bigger than 32 KB, retrieval takes noticeably longer. Consider not storing huge data in a CLOB field if it must be accessed from multiple threads with low execution time requirements.
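
If a large CLOB cannot be avoided, one mitigation is to stream the column rather than fetch it as a single value. A minimal JDBC sketch follows; the table and column names here are hypothetical:

import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ClobReadSketch {

    // Reads a CLOB column through a character stream so that only one
    // buffer's worth of data is held per read, keeping per-thread memory bounded.
    static String readClob(Connection conn, int id) throws Exception {
        String sql = "SELECT payload FROM documents WHERE id = ?"; // hypothetical table/column
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                StringBuilder sb = new StringBuilder();
                try (Reader reader = rs.getClob(1).getCharacterStream()) {
                    char[] buf = new char[8192];
                    int n;
                    while ((n = reader.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                }
                return sb.toString();
            }
        }
    }
}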

4. Comparing LinkedHashSet, ArrayList and LinkedList
          When we handle a very large amount of data, we often need to hold it in a collection, and choosing the right collection definitely adds to performance. Let's take a scenario where we must maintain insertion order and must not allow duplicates. After filling the collection, we pick elements one by one for processing, and remove each element after processing it. To do this we need four operations:
    1. Add to the collection.
    2. Retrieve from the collection.
    3. Check whether the collection already contains the object before adding it (to avoid duplication).
    4. Delete from the collection.

          The collection must also maintain insertion order. Let's compare the collections that can do the above operations: ArrayList, LinkedList, and LinkedHashSet. The test below runs 100 threads, each thread adding 10000 items to its own collection and checking before each add whether the item is already present (a sketch of this per-thread work appears after the results).


Test 1: ArrayList       Execution Time: 122714 milliseconds

Test 2: LinkedList      Execution Time: 180217 milliseconds

Test 3: LinkedHashSet      Execution Time: 7836 milliseconds


         Looking at the results above, LinkedHashSet performs far better in terms of execution time, memory usage, and CPU usage.
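
Here is a minimal single-threaded sketch of the per-thread work (my own illustration of the four operations, not the exact load-test harness):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.LinkedList;

public class CollectionChoiceSketch {

    // Duplicate check, add, ordered retrieval, and removal on one collection
    static long process(Collection<Integer> c, int count) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            // contains() is O(n) for ArrayList/LinkedList, O(1) for LinkedHashSet
            if (!c.contains(i)) {
                c.add(i);
            }
        }
        Iterator<Integer> it = c.iterator();
        while (it.hasNext()) {
            it.next();   // retrieve in insertion order
            it.remove(); // delete after processing
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        int n = 10000;
        System.out.println("ArrayList:     " + process(new ArrayList<Integer>(), n) + " ms");
        System.out.println("LinkedList:    " + process(new LinkedList<Integer>(), n) + " ms");
        System.out.println("LinkedHashSet: " + process(new LinkedHashSet<Integer>(), n) + " ms");
    }
}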

The results of all the above tests are tabulated below:

| Utility Method | Memory (MB) | CPU (avg %) | Execution Time (sec) |
|---|---|---|---|
| String split | 800 | 66 | 24 |
| String split with precompiled regex pattern | 450 | 56 | 23 |
| String indexOf | 35 | 45 | 5 |
| String.getBytes().length | 500 | 1 | 0.6 |
| Custom getByteLength | 180 | 20 | 0.5 |
| Retrieve CLOB field data | 35 | 15 | 58 |
| Retrieve non-CLOB field data | 22 | 8 | 0.4 |
| Load test on ArrayList | 80 | 76 | 122 |
| Load test on LinkedList | 100 | 73 | 180 |
| Load test on LinkedHashSet | 30 | 60 | 7 |


5. StringBuilder, StringBuffer, String
          Everyone knows that heavy use of String concatenation degrades performance. I still deliberately include it in my list because it makes a big difference when string objects are used inside a loop or across multiple threads. We habitually reach for String where a StringBuilder (or a StringBuffer, when thread safety is needed) should be used instead.
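
A minimal sketch of the difference (timings will vary by machine):

public class ConcatSketch {

    public static void main(String[] args) {
        int n = 20000;

        // String concatenation in a loop copies the whole accumulated
        // string on every iteration: O(n^2) work overall.
        long start = System.currentTimeMillis();
        String s = "";
        for (int i = 0; i < n; i++) {
            s += i;
        }
        System.out.println("String:        " + (System.currentTimeMillis() - start) + " ms");

        // StringBuilder appends into a growable buffer: O(n) overall.
        // Use StringBuffer instead when the builder is shared across threads.
        start = System.currentTimeMillis();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        String result = sb.toString();
        System.out.println("StringBuilder: " + (System.currentTimeMillis() - start) + " ms");
        System.out.println(s.length() + " == " + result.length());
    }
}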

Below are a few other useful guidelines which might help performance.

Checksum Calculation:
          While transferring files, we need to make sure the receiver gets the complete content. For that we can use checksum algorithms such as MD5 or CRC32. The checksum calculation can be embedded in the input stream itself; the code below does that.

MessageDigest algorithm = MessageDigest.getInstance("MD5");
FileInputStream fis = new FileInputStream("/data/file/100mbFile.txt");
BufferedInputStream bis = new BufferedInputStream(fis);
// Wrapping the stream: every byte read also updates the MD5 digest
DigestInputStream dis = new DigestInputStream(bis, algorithm);

          The DigestInputStream object can then be used for normal data processing. Note that the skip method does not update the checksum for the bytes it skips.
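
For completeness, here is a runnable sketch (the file path is just a placeholder) that reads a file through the digest stream and prints the resulting MD5 checksum:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class ChecksumSketch {

    public static void main(String[] args) throws Exception {
        MessageDigest algorithm = MessageDigest.getInstance("MD5");
        DigestInputStream dis = new DigestInputStream(
                new BufferedInputStream(new FileInputStream("/data/file/100mbFile.txt")),
                algorithm);
        try {
            byte[] buffer = new byte[8192];
            // Every byte read through the stream also updates the digest
            while (dis.read(buffer) != -1) {
                // normal data processing would happen here
            }
        } finally {
            dis.close();
        }
        // digest() returns the MD5 checksum of everything that was read
        StringBuilder hex = new StringBuilder();
        for (byte b : algorithm.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("MD5: " + hex);
    }
}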


Use UTF-8 charset while converting bytes to string:
          To convert bytes to a String, always specify the desired charset; the most widely used is UTF-8. You can use the code below to do that. It is a simple thing, but it avoids subtle issues with special characters when the platform default charset differs.
  
new String(bytes, "UTF-8");
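
From Java 7 onwards, java.nio.charset.StandardCharsets can be passed instead of the charset name, which avoids the checked UnsupportedEncodingException:

import java.nio.charset.StandardCharsets;

public class Utf8Sketch {

    public static void main(String[] args) {
        byte[] bytes = "héllo".getBytes(StandardCharsets.UTF_8);
        // The Charset overload cannot throw UnsupportedEncodingException
        String text = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(text);
    }
}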



Conclusion: 
          
             Choosing the right API for data processing yields better performance and keeps the overall memory footprint, especially the old generation, as small as possible. This makes the application more scalable and reliable.