Wednesday, 9 August 2017

Achieving Performance in Huge Data Transfer via HTTP



This whitepaper highlights programming techniques that speed up the transfer of huge data via HTTP on the Java platform.

HTTP is increasingly used to transfer huge data, so efficient data handling has become essential. The basic idea is to load only a chunk of data into memory, process it, and then let it go so that it becomes eligible for garbage collection as soon as possible. This obviously improves performance. This white paper discusses techniques for handling HTTP request and response data in chunks and quantifies the performance benefit.

The performance figures were obtained by running plain Java HTTP server programs deployed in the Jetty (version 7.4.5) servlet container. The programs were executed from the Eclipse IDE and monitored with the jvisualvm tool bundled with the Oracle JDK on Windows 7.

Reading the HTTP request:

Data is read from the request so that it can be validated and processed. Reading the whole body at once is a bad idea when the data is huge: it consumes a large amount of memory, affects the serving of other requests, and can even cause an OutOfMemoryError. The better technique is to work with streams: read one chunk from the request input stream, process it, then read the next chunk. This avoids keeping the entire payload in memory.
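As a minimal sketch of this technique (the servlet class and target file name are illustrative, not the actual test-case code), the request body can be copied to disk one buffer at a time:

import java.io.*;
import javax.servlet.ServletException;
import javax.servlet.http.*;

public class UploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        byte[] buffer = new byte[5 * 1024 * 1024]; // 5 MB chunks, as in test case 2 below
        int read;
        try (InputStream in = request.getInputStream();
             OutputStream out = new FileOutputStream("upload.dat")) {
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read); // only one chunk is ever held in memory
            }
        }
    }
}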

To illustrate this point, two test cases were executed in the Jetty server to show the difference between reading the whole body and reading it in chunks. The first test case reads the entire request body into memory and then writes it to a file. The second reads the request in chunks and writes each chunk to a file. The same client program was used to send data to both test cases, and the results below show that loading data in chunks performs considerably better.

Test Case 1: Load the full 200 MB into memory for processing. Execution Time: 8845 ms


Test Case 2: Load 5 MB chunks into memory for processing. Execution Time: 2862 ms


Returning the HTTP response:

Returning huge data in the response through chunked operations, rather than returning it as a whole, saves memory. The response OutputStream is used to return data in chunks: the file is read a chunk at a time and each chunk is written to the response output stream. Most mainstream frameworks support this style of chunked output.
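A minimal sketch of this technique (the servlet class and file name are illustrative): the file is read in chunks and each chunk is written straight to the response output stream.

import java.io.*;
import javax.servlet.ServletException;
import javax.servlet.http.*;

public class DownloadServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        byte[] buffer = new byte[5 * 1024 * 1024]; // 5 MB chunks, as in test case 2 below
        int read;
        try (InputStream in = new FileInputStream("bigfile.dat");
             OutputStream out = response.getOutputStream()) {
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read); // the full file is never loaded into memory
            }
        }
    }
}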

Two test cases were executed in the Jetty server to show the performance difference between returning the whole data at once and returning it in chunks. The first test case loads the entire content into memory before returning it; the second writes the data to the response in 5 MB chunks. A client program receives the data in both cases. Each test case returns 200 MB of data, monitored with JVisualVM.

Test Case 1 - Execution Time: 3889 ms


Test Case 2 - Execution Time: 1796 ms


This covers how the HTTP request and response can be handled efficiently on the server side. The following sections look at the client-side performance of sending requests and receiving responses.


Sending the HTTP request:

The next scenario is sending huge data from the client side. The same technique applies: do not hold the whole payload in memory, send it in chunks. HttpURLConnection offers two APIs for this: setChunkedStreamingMode and setFixedLengthStreamingMode. setChunkedStreamingMode enables streaming of the HTTP request body without internal buffering when the content length is not known in advance; setFixedLengthStreamingMode does the same when the content length is known in advance.
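A minimal client sketch using setChunkedStreamingMode (the URL and file name are placeholders for illustration):

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkedUploadClient {
    public static void main(String[] args) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:8080/upload").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setChunkedStreamingMode(5 * 1024 * 1024); // stream the body without internal buffering

        byte[] buffer = new byte[128 * 1024];
        int read;
        try (InputStream in = new FileInputStream("bigfile.dat");
             OutputStream out = conn.getOutputStream()) {
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        System.out.println("Server responded: " + conn.getResponseCode());
    }
}

If the content length is known up front, setFixedLengthStreamingMode(length) can be used in place of setChunkedStreamingMode.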

The first test case loads the whole data into memory and then sends it to the server. The second test case uses chunked streaming mode to send the data. Both test cases talk to the same server, which receives the data in chunks since that performs better. CPU usage, memory usage and execution time were captured with JVisualVM.

Test Case 1 - Execution Time: 3916 ms

Test Case 2 - Execution Time: 2613 ms





Receiving the HTTP response:
The final scenario is the client reading a response. Performance was measured for one test case that reads and holds the whole 200 MB of data in memory before writing it to a file, and for another that reads the data in chunks and writes it to a file in chunks. Reading and writing in chunks performs better.
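A minimal client sketch (URL and file name are placeholders): the response body is copied to disk in chunks instead of being held in memory as a whole.

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkedDownloadClient {
    public static void main(String[] args) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:8080/download").openConnection();
        byte[] buffer = new byte[128 * 1024];
        int read;
        try (InputStream in = conn.getInputStream();
             OutputStream out = new FileOutputStream("response.dat")) {
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read); // each chunk is written to disk and then discarded
            }
        }
    }
}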

Test Case 1 - Execution Time: 8621 ms

Test Case 2 - Execution Time: 1645 ms



The performance test results are tabulated below. Loading data in chunks is obviously faster; the table shows how large the difference actually is.


Operation                                        | Loading data in chunks | Memory (MB) | CPU (avg) | Execution Time (ms)
-------------------------------------------------|------------------------|-------------|-----------|--------------------
Server reading 200 MB of data from the request   |           N            |     500     |    40     |        8845
                                                 |           Y            |      14     |    25     |        2862
Server returning 200 MB of data in the response  |           N            |     505     |    50     |        3889
                                                 |           Y            |      12     |    20     |        1796
Client sending 200 MB of data to the server      |           N            |     500     |    45     |        3916
                                                 |           Y            |      11     |    10     |        2613
Client receiving 200 MB of data from the server  |           N            |     490     |    45     |        8621
                                                 |           Y            |      10     |    35     |        1645

The following guidelines can also be useful for handling input streams efficiently.

readLine API usage:

In huge data transfer scenarios the readLine API (for example on BufferedReader or DataInputStream) can create performance bottlenecks. If the payload contains a very long single line, readLine ends up accumulating many megabytes in memory, which defeats the chunked handling this whitepaper advocates. The better approach is to read the stream with the read method and a buffer of around 100 KB to a few MB, sized according to the application's maximum heap.
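As a minimal sketch (stream and method names are illustrative), reading with a fixed-size buffer instead of readLine keeps memory usage flat even when a single line spans many megabytes:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public final class BoundedRead {

    // Copies the stream with a bounded buffer instead of readLine(), so a
    // single huge "line" can never pull the whole payload into memory.
    static void copyInChunks(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[128 * 1024]; // tune between ~100 KB and a few MB based on the max heap
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }
}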

SequenceInputStream usage:

It is easy to append or prefix data to an input stream without reading the whole stream: SequenceInputStream does the job. A typical use is logging the first few characters of a stream and then putting the read bytes back in front of it, so the stream remains fully readable by later processing without awkward logic in the business layer. Below is an example of prefixing data to a stream.

SequenceInputStream sq = new SequenceInputStream(new ByteArrayInputStream(readBytes), in);
Below is an example of appending data to a stream.

SequenceInputStream sq = new SequenceInputStream(in, new ByteArrayInputStream(bytes));

in = the original InputStream
readBytes / bytes = a byte array that has already been read from (or will be appended to) the stream
sq = the resulting SequenceInputStream

Copying the whole payload into a ByteArrayInputStream in order to reuse the stream is not a good idea for large data, since the ByteArrayInputStream keeps everything in memory and may cause an OutOfMemoryError.
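As a minimal sketch of the prefixing idea (method and variable names are hypothetical), the following peeks at the first bytes of a stream for logging and then rebuilds a full-length stream, buffering only the small prefix rather than the whole payload:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;

public final class StreamPeek {

    // Reads a small prefix for logging, then puts the consumed bytes back in
    // front of the stream so later processing still sees the full data. Only
    // the prefix is buffered in memory, never the whole payload.
    static InputStream logPrefixAndRestore(InputStream in, int prefixLength) throws IOException {
        byte[] prefix = new byte[prefixLength];
        int read = in.read(prefix);
        if (read <= 0) {
            return in; // nothing was read; return the stream untouched
        }
        System.out.println("First bytes: " + new String(prefix, 0, read, "UTF-8"));
        return new SequenceInputStream(new ByteArrayInputStream(prefix, 0, read), in);
    }
}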







Conclusion: 
Processing huge data in chunks and releasing each chunk as soon as possible makes the data available for garbage collection and yields better performance. It keeps the overall memory footprint, and especially the old generation, as small as possible, which in turn makes the application more scalable, reliable, stable and available.

Saturday, 8 April 2017

Custom Stream Framework – Open Source

     Streams are a powerful tool for transferring huge data with a minimal memory footprint. Applications built on streams perform better and scale further. On the flip side, streams are tricky to use and demand more time and effort during development, so they are often overlooked in the initial coding stage and only considered once performance bottlenecks appear.
     Considering how hard streams can be to use during development, I have created a few custom wrapper classes that remove the usual roadblocks and also provide common stream utility functions. This post focuses on how to use these utilities rather than on the implementation details of the wrapper classes. The source code is available for download!

Binary            : StreamFramework.jar

The following utilities are available in the stream framework.
  1. Simple Read - Reading is made simple by the InputStreamReadWrapper class, which has an additional method next() that reads data from the stream and returns a boolean indicating whether data was read. The method getBytes() returns the bytes read by the last call to next().
  2. Limited Read - The InputStreamReadWrapper constructor accepts a long value that caps how much data is read from the stream; the method setReadLimit() does the same job.
  3. Restricted Read Line - InputStreamReadWrapper has a readLine(lineLengthToCheck) method that reads up to the specified length from the stream and checks whether a newline character is present. Any data after the newline is put back into the stream. Restricted read line exists to avoid the memory issues caused by huge data sent in a single line; getBytes() returns the line data read from the stream.
  4. Calculate MD5 - The InputStreamReadWrapper constructor has an option to enable MD5 calculation. The method getMessageDigestValue() returns the MD5 value of the data read from the stream.
  5. Convert Data to Stream - The InputStreamReadWrapper constructor also accepts a DataLoader, which helps convert data into a stream.
  6. Convert InputStream to GZippedStream - GZippedStream extends InputStreamReadWrapper and has a constructor that accepts an InputStream. Reading the GZippedStream object yields the gzipped form of the wrapped InputStream's data.
  7. Generate gzipped data while reading an InputStream - InputStreamGzipReadWrapper extends InputStreamReadWrapper and has a constructor that accepts an InputStream and a custom DataReader. The InputStreamGzipReadWrapper object can be read like a normal InputStream while, at the same time, the gzipped data is delivered to the DataReader implementation.


The sections below walk you through each utility in the Stream Framework.


 1. Converting InputStream to InputStreamReadWrapper:

InputStreamReadWrapper is a subclass of InputStream that wraps extra functionality to make read operations easier to use. Instantiating InputStreamReadWrapper gives the following:
  1.      Read data easily using the next() and getBytes() methods.
  2.      Optionally read only a specified length of data from the stream.
  3.      Get the MD5 value of the data after reading it.

Class Diagram

Read bytes from InputStream:

Below is the simplest code snippet for reading bytes from an InputStream. It requires declaring an int and a byte array, plus a while loop that assigns the number of bytes read and checks that the value is not -1. Once the data is in the byte array, only the bytes up to the read length must be used.


int read;
final byte[] data = new byte[1024 * 128];
while ((read = fileInputStream.read(data)) != -1) {
    fileOutStream.write(data, 0, read);
}

Read bytes using Stream Framework:

         All of the above complexity is simplified by the InputStreamReadWrapper class. It takes care of the variable setup and works much like a database ResultSet's next() method: next() loads the next chunk of data into the byte array and returns a boolean indicating whether data was loaded, and getBytes() returns the bytes read from the stream.


InputStreamReadWrapper insWrap = new InputStreamReadWrapper(fileInputStream);
while (insWrap.next()) {
    fileOutStream.write(insWrap.getBytes());
}

Controlled Read using Stream Framework:

Consider a case where you want to copy the first 1000 bytes to one output stream and the remaining bytes to another output stream.


InputStreamReadWrapper insWrap = new InputStreamReadWrapper(fileInputStream, 1000L);
Util.writeTo(insWrap, file1OutStream);
file1OutStream.close();

insWrap = new InputStreamReadWrapper(fileInputStream);
Util.writeTo(insWrap, file2OutStream);
file2OutStream.close();

insWrap.close();

 2.  Convert Data to InputStream:

     When huge data has to be sent to a server or another process, we usually end up holding all of it in memory before sending starts. With a PipedInputStream and PipedOutputStream, data can be written at one end and read at the other by another consumer. This pipe usage is wrapped inside the InputStreamReadWrapper class, so there is no need to wait for the full data to be generated or to hold it all in memory: each chunk can be written to the OutputStream as soon as it is available.

                                                         Feature description diagram


The code snippet below shows a sample implementation of converting data to a stream. The DataLoader interface declares a load method that must be defined in the implementation class; its OutputStream parameter is used to write data into the stream.


DataLoader dataLoader = new DataLoaderImpl();
InputStreamReadWrapper ins = new InputStreamReadWrapper(dataLoader);
Util.writeTo(ins, ffOutStream);

class DataLoaderImpl implements DataLoader {

    @Override
    public void load(OutputStream out) throws IOException {
        for (int i = 0; i < 1000; i++) {
            out.write(("Test data " + i + "\n").getBytes("UTF-8"));
        }
        out.close();
    }
}


 3. Convert InputStream to GZippedInputStream:

     Converting an InputStream to a gzipped InputStream is a handy feature. It is done by the GZippedStream class, which extends InputStreamReadWrapper so that its functions are available as well. The GZippedStream constructor uses PipedInputStream, PipedOutputStream and GZIPOutputStream to implement this.

Feature description diagram

Class Diagram
                                                                            
          The GZippedStream constructor takes the InputStream as a parameter and creates a PipedOutputStream connected to a PipedInputStream; the PipedOutputStream is passed to the constructor of a GZIPOutputStream. Writing data to the GZIPOutputStream and reading from the PipedInputStream then gives exactly the behaviour we need. As is the general rule with pipe streams, the write and read operations must run in different threads to avoid deadlock.
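As a rough sketch of that piped-stream construction, assuming only JDK classes (this illustrates the technique and is not the framework's actual source):

import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.zip.GZIPOutputStream;

public final class GzipPipeSketch {

    // Returns a stream whose bytes are the gzipped form of the source stream.
    static InputStream gzipOf(final InputStream source) throws IOException {
        final PipedInputStream gzippedIn = new PipedInputStream(64 * 1024);
        final PipedOutputStream pipeOut = new PipedOutputStream(gzippedIn);

        // The copy must run in a separate thread: writing and reading the same
        // pipe from one thread would block once the pipe buffer fills up.
        new Thread(new Runnable() {
            public void run() {
                try (GZIPOutputStream gzipOut = new GZIPOutputStream(pipeOut)) {
                    byte[] buffer = new byte[16 * 1024];
                    int read;
                    while ((read = source.read(buffer)) != -1) {
                        gzipOut.write(buffer, 0, read);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }, "gzip-writer").start();

        return gzippedIn; // reading this yields the gzipped bytes of the source
    }
}

With GZippedStream all of this is hidden behind a single constructor call, as the snippet below shows.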










GZippedStream zs = new GZippedStream(fileInputStream);
Util.writeTo(zs, fileOutStream);










 4. Convert InputStream to GzippedInputStream while reading InputStream:

With this feature, as an InputStream is read we get the gzipped data in parallel with what is being read. This is extremely useful when the data needs to be validated but the content should be stored in gzipped form. The implementation involves pipe streams and threads inside the InputStreamGzipReadWrapper class, which extends InputStreamReadWrapper to inherit the other functions.

             Feature description diagram   
 Class Diagram

The code snippet below shows a sample implementation of obtaining the gzipped data while reading the base InputStream.


DataReader gzipDataReader = new GzipDataReaderImpl();

InputStreamGzipReadWrapper dataInputStream
        = new InputStreamGzipReadWrapper(fileInputStream, gzipDataReader);

while (dataInputStream.next()) {
    dataInputStream.getBytes(); // validate the data read from the stream
}

class GzipDataReaderImpl implements DataReader {

    @Override
    public void readBytes(byte[] bytes, int off, int length) {
        try {
            ffOutStream.write(bytes, off, length);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

 5. Usage of Util.writeTo(InputStream, OutputStream):

     Simply copies the InputStream to the OutputStream.

 6. Usage of the Util.byteIndexOf(byte[] sourceData, byte[] searchData) API:
       Searches for the searchData bytes inside the sourceData byte array and returns the index of the first occurrence, or -1 if searchData is not present. This saves the cost of converting the bytes to a String just to check for the presence of a specific word or phrase.
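A small usage sketch assuming the signature above (the sample bytes are illustrative):

byte[] chunk = "Content-Type: text/plain\r\n\r\nTest data".getBytes(java.nio.charset.StandardCharsets.UTF_8);
byte[] marker = "Content-Type".getBytes(java.nio.charset.StandardCharsets.UTF_8);

int index = Util.byteIndexOf(chunk, marker); // returns -1 when the marker is absent
if (index != -1) {
    System.out.println("Marker found at byte offset " + index);
}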


 Testing the Performance:

Read API:
        A plain read of the data from the stream is compared below. Execution time, memory usage and CPU usage were captured with the JVisualVM tool. Since this is a wrapper class around InputStream, we cannot expect better performance; the goal is simply not to degrade the performance of the underlying InputStream. The test below, reading a 300 MB file with both InputStream and InputStreamReadWrapper, shows no degradation.

Read 300 MB file using InputStream                             Execution Time: 6067 ms

Read 300 MB file using InputStreamReadWrapper                  Execution Time: 6164 ms


Conclusion:

     This stream framework reduces the coding effort of working with streams. Anyone can experiment with streams as they like and take advantage of the features.