Monday, September 15, 2014

Parsing/Loading huge XML & managing memory in Perl XML::Simple

Purpose: 
To parse a "huge" XML file into a Perl data structure using XMLin(). This article does not intend to teach XML or XMLin(). It focuses only on memory management & efficiency when using XMLin() and XML::Twig. The same ideas can be applied in Java or C++ too. 


Here we have taken a 235 MB data file named "prices.xml" as the example, cut down from the original 900 MB file, because the mighty Unix system ran out of memory and terminated the process on the full file, so the 900 MB run cannot be used as a benchmark for comparison. Since XMLin() could not complete parsing the 900 MB file, this study is not just an improvement but an "essential" fine tuning of the code, without which production will fail.

Intro: Perl's XML::Simple is a simple module (as the name suggests) that has a function XMLin() to parse an XML file and convert it into a very easy-to-operate Perl hash data structure. Though the hash is easy to use once returned, XMLin() takes a long time and a lot of memory before returning. So let's get our hands dirty and analyze right away:

When run with this 235 MB prices.xml file, XMLin() did not return. After waiting for a while, instead of interrupting with ^C, I ran memory & system monitoring commands to see whether it was really running or hanging. The 'top' command showed two things. One good. The other bad.
The good: it clearly showed most activity happening in my Perl program (so maybe not hanging):

The bad: the process started at a few MB, reached 320 MB in 1 minute, 630 MB in 2 minutes and so on. In about 6 minutes, XMLin() had consumed about 1.8 GB of resident and virtual memory and was still running for a few more seconds. This is not just a memory problem: it also causes swap-in and swap-out of large amounts of virtual memory (since Unix is a time-sharing system), which is time consuming and can slow down the entire Unix system.

So this article is about how to make this efficient. Interested? Read on..

The prices.xml file has several million rows. Loading it into memory using XMLin() is not easy because of the several tasks performed during loading: 

  1. parsing and splitting each of the millions of data points
  2. defining sibling/child/parent relationships and annotations for them
  3. storing it all in a tree-like structure with no loss of data, with the ability to reproduce the original file when used in XMLout, etc.
So it needed 1.8 GB of virtual memory, as the top command showed. When the function returned, the memory usage dropped to normal. Let us investigate the data in prices.xml:

The file looks like this:
<MktData Date="2014-09-17" … … … … ….. >
<Sym="IBM"  MatDt="2015-01-17" Price="52.80" … … … … ….. >
<Sym="AAPL" MatDt="2015-01-17" Price="101.53" … … … … ….. >
….   Several millions of rows like this.. and then
<Sym="TSLA" MatDt="2015-01-17" Price="282.17" … … … … ….. >
</MktData >

As we see from this snapshot of the data file, it is an almost flat, simple structure. Except for the first and last lines, all lines are linear: each represents exactly one element, and all the data for that single record/element is specified on the same line. This qualifies the file to be processed line by line, or element by element. Let us see how that works out:


one at a time please:

So we observed that, since every line contains one "xml-record", the file can be processed line by line. I was almost sure that line-by-line processing would use less memory even before testing it. Reason: when XMLin() parses the whole file, it needs to store each and every element as full key/value strings (Sym=>"AAPL", MatDt=>"2015-01-17", Price=>"101.53"), along with the other XML bookkeeping discussed in the last paragraph. When we process the same data line by line, the same thing happens, but only for a micro subset of the data, a few KB per line compared to GBs for the whole file. As well, after XMLin() returns for each line, I can copy just the data I need into a Perl hash named %Prices, which stores only values (like "AAPL"=>101.53), a few bytes per line. Once we move on to the next line, the previous line's structures become unreferenced and are released, so the memory does not grow by more than a few bytes per line.

So I changed the code to open prices.xml as a flat file and read it one line at a time in a loop. Inside the loop, I processed that line using the same XMLin() call, picked up the structure returned, stored only the needed values and stepped to the next iteration, printing the number of rows processed once every 1000 lines. This was really memory efficient: from beginning to end, the memory stayed very stable and low (35 MB vs 1.8 GB):
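Here is a minimal sketch of that loop. Since the sample above is only schematic, I assume each data line is a self-contained, well-formed element such as <Sym Name="AAPL" MatDt="2015-01-17" Price="101.53"/>; the Name attribute and the %Prices layout are illustrative assumptions, not the actual record format:

use strict;
use warnings;
use XML::Simple qw(XMLin);

# Assumption: each record line is a well-formed, self-contained element like
#   <Sym Name="AAPL" MatDt="2015-01-17" Price="101.53"/>
my %Prices;                                   # ticker => price, values only
my $count  = 0;

open my $fh, '<', 'prices.xml' or die "Cannot open prices.xml: $!";
while (my $line = <$fh>) {
    next unless $line =~ /<Sym\b/;            # skip the <MktData> wrapper lines
    my $rec = XMLin($line);                   # parse just this one small record
    $Prices{ $rec->{Name} } = $rec->{Price};  # keep only the values we need
    print "processed $count rows\n" if ++$count % 1000 == 0;
}
close $fh;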




However, the program did not finish in 6 minutes. It kept going and took about 17 minutes to finish processing. Apparently we traded speed for memory. Fair enough? Maybe. At least we avoided slowing down the entire system and starving other programs. Maybe not. There might be better ways. I started looking at the code.


somewhere in the middle:

It did not take much time to grasp what was happening. With this new code we call XMLin() more than a million times (once per line), whereas earlier we called it only once. Each XMLin() call requires some basic setup, initialization and stack work, and that per-call overhead added up. So I decided to balance it out with a smaller number of XMLin() calls: I changed the code to read 1000 lines in a loop and call XMLin() once with all 1000 lines. I chose the number 1000 because the XML records are rather small and 1000 of them fit in a few blocks of memory (considering a block is 8 KB). To make 1000 individual XML records into a single valid structure, header and footer tags must be embedded, so I wrapped them in a <Thou> tag. Instead of feeding every line like <Sym="IBM" MatDt="2015-01-17" Price="52.80" … … … … ….. >
to XMLin() individually, 1000 lines were read from the prices file and fed in one call, with first and last lines added:
<Thou >
<Sym="IBM"  MatDt="2015-01-17" Price="52.80" … … … … ….. >
<Sym="AAPL" MatDt="2015-01-17" Price="101.53" … … … … ….. >
….   1000 lines like this
<Sym="TSLA" MatDt="2015-01-17" Price="282.17" … … … … ….. >
</Thou >


And after XMLin() returned, I would run through all the elements in a loop and store them one by one (1000 per call). But would this run faster? I kept my fingers crossed, ran it, and watched the memory growth again with our faithful top command:
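For comparison, here is a sketch of the 1000-lines-at-a-time variant, under the same assumption as the earlier sketch (each data line is a self-contained <Sym .../> element; attribute names are illustrative):

use strict;
use warnings;
use XML::Simple qw(XMLin);

my %Prices;
my @batch;

open my $fh, '<', 'prices.xml' or die "Cannot open prices.xml: $!";
while (my $line = <$fh>) {
    next unless $line =~ /<Sym\b/;            # skip the <MktData> wrapper lines
    push @batch, $line;
    if (@batch == 1000) {
        process_batch(\@batch);
        @batch = ();
    }
}
process_batch(\@batch) if @batch;             # last partial batch
close $fh;

sub process_batch {
    my ($lines) = @_;
    # Wrap the chunk in <Thou>...</Thou> so it is one valid XML document.
    my $doc = XMLin('<Thou>' . join('', @$lines) . '</Thou>',
                    ForceArray => ['Sym'], KeyAttr => []);
    for my $rec (@{ $doc->{Sym} }) {
        $Prices{ $rec->{Name} } = $rec->{Price};   # store only the values we need
    }
}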


This time it took just 6 MB more than the line-by-line version, and returned in 6 minutes 20 seconds: very close to the original run's speed and close to the line-by-line version's memory usage. Awesome. I should have stopped here, but I didn't.


Multi-threading failure:

Carried away by an obsessive compulsion to speed up the program, I spent more than a day implementing a multi-threaded version to make parsing faster. I used threads to parse every thousand lines in parallel, with 10 threads running. It finished in just 2 minutes 27 seconds. With the complicated memory handling and synchronization of threads, all I achieved was about a 2.5x speedup, or a 60% decrease in run time. But it brought memory usage back to the original level (actually exceeded it), since every thread keeps its own memory, variables and data structures, and the code to handle multi-threading had already bloated the program. Surely not worth it. I also observed a few facts and learned a few things from this attempt, which are out of scope for this article. 


The Harmonious end - XML::Twig

The purpose of XML::Twig, as the name suggests, is to let us process the little twigs of a huge XML tree instead of building the entire tree. This straight away eliminates the memory issue, by loading a small tree of a few twigs for each Sym element (instead of loading the million branches of the entire XML in the reproducible manner we discussed earlier), provided we use it correctly and purge the memory after using each XML element. With XML::Twig, we write a function to handle a single XML element and have Twig call that function for every such element in the XML file. The code looks something like this:

use XML::Twig;

# Whenever a 'Sym' element has been parsed, call process_one_symbol_record.
my $twig = XML::Twig->new( twig_handlers => { 'Sym' => \&process_one_symbol_record } );
$twig->parsefile("Prices.xml");
..
sub process_one_symbol_record
{
    my ($tree, $sym) = @_;                    # $tree is the twig, $sym is the current element
    my $expiry_date  = $sym->att('MatDt');
    my $option_price = $sym->att('Price');
    ..
    $tree->purge_up_to($sym);                 # release memory used by elements parsed so far
}

Since Twig processes one element at a time, and our handler purges the memory, there is no memory problem. There is also none of the read-a-line-and-pass-it-to-XMLin() overhead we had in the line-by-line approach. While using low memory, this version also finished faster than every other single-threaded version. It is the only case where both speed and memory improved.

Let me tabulate the results:

Program              | Memory hogged | % Mem vs. whole | Time taken (secs.) | % Time vs. whole
Whole file parsing   | 1.8 GB        | 100.00%         | 370                | 100.00%
Line by line parsing | 56 MB         | 3.11%           | 1125               | 304.40%
1000 lines at a time | 62 MB         | 3.44%           | 380                | 102.71%
Multithreaded        | > 1.9 GB      | 110.00%         | 147                | 39.71%
XML::Twig            | 59 MB         | 3.28%           | 250                | 67.56%


Conclusion:

  • With 1000-lines-at-a-time parsing, we gave up just 2.71% in speed but saved 96.56% of the memory. Very balanced and very satisfactory, since the program runs fast and, by saving memory, lets the system and other programs run fast too.
  • The 1000-lines-at-a-time version is the most balanced overall for XML::Simple, but XML::Twig is THE BEST. Reason: faster, easier and less chance of introducing bugs. 
  • Multi-threading is useful when we gather data from different resources (like parsing two different XMLs), but it is not very suitable when all threads operate on a common data structure / resource. 
Thank you for reading. Please post your comments so I can correct myself, my English and this article.

Monday, July 7, 2014

How I made a Perl program run 10 times faster using just 25% of the disk space it originally used

When I joined the company iTrade (name changed for legal reasons) as a consultant, I was asked to fix a Perl program that stopped running halfway because the data was too large. The program had run for 13 hours before it hiccuped and collapsed, filling a Unix disk partition to 100% and suffocating for lack of disk space. This article describes how I fine-tuned the Perl program to make it both fast and space efficient, while in most tuning exercises you trade one for the other.

When I spoke to the manager I understood that there were TWO problems:


  • It takes him only 10 minutes to download a 1 GB file from the FTP server to his PC. But the same file took 1 1/2 hours to download during the program's execution. There are 6-7 zip files like that, so all of them took close to 10 hours.
  • The program somehow consumed the 27 GB of disk space available and stopped running. It needed more space to make further progress.


Studying the program, I was initially confused, since it seemed to be doing misleading and meaningless things; later I understood that the program was simply very poorly written and I should not read too much into it. It downloads 7 GB of zip files containing several thousand documents. These documents are unzipped (extracted) and placed inside a single folder. The documents are then grouped by various business needs with filtering logic and copied to two different folders. These folders are again zipped and sent to three different destination FTP servers.

 Speaking to the business team, I got this information:

     a) First FTP destination receives all of the documents that were present in the incoming zips.
     b) Second FTP destination gets about 25% of these documents that match specific criteria.
     c) Third FTP destination gets remaining 75% of these documents.

Analysis of Performance / speed:

Why should this Perl program take 1 1/2 hours, about 10 times longer, when run through the Unix system? Was the manager precise when he told me he could download the file in 10 minutes? I tried to download the same file and found that he was right: it indeed took about 11 minutes to download 1 GB from the same server. But the Perl program had taken 1 1/2 hours per GB. Did the Linux/Unix box have some problem? Maybe a network or device issue? Maybe the partition where the downloaded files were kept had some disk I/O issue?

When I downloaded the same big zip file into the same folder the program used, using the unix sftp command, I could rule out that possibility. The sftp command downloaded the file successfully in 10 minutes, i.e. 10% faster than Windows and definitely not 10 times slower. No problem with the folder or the network. So what was the problem?

Now only one dimension was left: the Perl program itself. Looking at it, I found that the FTP tool used was the Net::SFTP module. That module is written entirely in Perl, so it can be slow, especially when the number of packets / blocks of data received is large; for GB-sized files, the number of blocks sent one by one may be close to 100,000. With that in mind, I searched the internet for newer modules; if I found none, I would run the unix sftp command as a system command from Perl and then use the downloaded file. But I found one superb Perl module named Net::SFTP::Foreign, which was exactly what I was searching for. This module is blazing fast, uses the unix system's sftp & ssh (which are foreign to Perl's internal module, hence the name), and is so robust that it provides tons of features. The most relevant feature was the block_size option. By setting block_size to, say, 100,000, we make every FTP round trip (request, receive, acknowledge) fetch 100,000 bytes. That is, in a mere 10 round trips we get as much as 1 MB of data, which would need about 125 round trips with the old Net::SFTP module.
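For reference, a minimal sketch of such a download with Net::SFTP::Foreign and a large block_size; the host, user and file names here are made-up placeholders, not the actual production values:

use strict;
use warnings;
use Net::SFTP::Foreign;

my $sftp = Net::SFTP::Foreign->new(
    'ftp.example.com',              # placeholder host
    user       => 'itrade_user',    # placeholder user
    block_size => 100_000,          # larger blocks => far fewer round trips
);
$sftp->die_on_error("Unable to establish the SFTP connection");

$sftp->get('/outgoing/docs_1.zip', 'docs_1.zip')   # placeholder file names
    or die "Download failed: " . $sftp->error;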

When I used this Net::SFTP::Foreign module with default options, it downloaded the 1 GB file in about 11 minutes. Very good!! Then I added steroids by setting block_size to 100,000 and tried again. The result? It downloaded the 1 GB zip file in 6 minutes, i.e. about 170 MB a minute!! Awesome!!

I did not wait long and went on to write my own flex_ftp.pm, a wrapper around Net::SFTP::Foreign with a few built-in options, flexible block_size negotiation and progress reporting. It was not easy, it took four days, but at the end I had a cool module ready to perform as a plug-in. I replaced the original code to use this flex_ftp.pm (aka Net::SFTP::Foreign) and tried the run. The same 7 GB download that had taken more than 11 hours completed in 43 minutes, i.e. a mere 6% of the time!!

Other areas of the program, some loops and queries, were fine-tuned too. They did improve, but not to a bragging extent :-), so let us move on to the other issue:

Analysis of disk space:

ip60s56m3> df .
Filesystem      1K-blocks       Used  Available  Use%
/dev/sdb1       138881684  131667900      45184  100%


Yes, the above was from the production machine when the program failed. Now, reiterating (a), (b) and (c) that we learnt from the business team:

     a) First FTP destination receives all of the documents that were present in the incoming zips.
     b) Second FTP destination gets about 25% of these documents that match specific criteria.
     c) Third FTP destination gets remaining 75% of these documents.

Writing these down as a table, we can see that the 7 GB of zip files were extracted to individual documents totalling 8 GB, then copied into two different folders of 2 GB and 6 GB during the filtering process. These three folders are then zipped into zip files totalling 7 GB, 1.5 GB and 5 GB respectively and sent to their destinations.

Folder           | Purpose                          | Size
Download folder  | Incoming zip files               | 7 GB (.zips)
All_docs         | To zip & ftp to 1st destination  | 8 GB (unzipped documents)
Harl_print       | To zip & ftp to 2nd destination  | 2 GB (copy from All_docs)
Online_docs      | To zip & ftp to 3rd destination  | 6 GB (copy from All_docs)
All_docs         | Zipped file to 1st destination   | 7 GB (.zips)
Harl_print       | Zipped file to 2nd destination   | 1.5 GB (.zips)
Online_docs      | Zipped file to 3rd destination   | 5 GB (.zips)
Total            |                                  | 36.5 GB


Summing them up, no doubt there was not enough disk space to hold these 36.5 GB, and the program failed.

After this analysis, I again asked questions:
Why should we keep the incoming zip files instead of deleting them after expansion? The answer came from the manager: we cannot afford to lose them, since we spent 11 hours downloading them!!

But I had just implemented the new FTP module that downloads everything in about 40 minutes, so now we could afford it!! So I did further analysis and devised more techniques:


  1. We should download the zip files one by one, immediately unzipping each and deleting it after extracting its contents into the All_docs folder. At the end of this new download-expand-delete process, we will have only 8 GB of extracted documents inside the All_docs folder and no zip files lying around. 7 GB of disk space saved!! (A sketch of this and the next two steps follows the list.)
  2. Instead of copying files to the Harl_print folder, we can create links inside the folder (unix ln command). These links do not create a copy of the file on the hard disk, so they consume no disk space; each just creates a new name that points back to the original file, while behaving like a consistent copy of it. So all 2 GB worth of such links point back to the original files in the All_docs folder. 2 GB of disk space saved!!
  3. For the Online_docs folder, we just repeat the technique in (2), i.e. create links instead of copies. 6 GB saved!!
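A rough sketch of steps (1)-(3), using the folder names from the table above. The filter rule below is a made-up stand-in for the real business criteria, which are not described in this article:

use strict;
use warnings;
use File::Basename qw(basename);
use File::Path qw(make_path);

make_path('All_docs', 'Harl_print', 'Online_docs');   # make sure the folders exist

# Illustrative filter only; the real selection logic is business specific.
sub matches_print_criteria { my ($doc) = @_; return $doc =~ /print/i }

# Step 1: expand each downloaded zip and delete it right away.
for my $zip (glob '*.zip') {
    system('unzip', '-q', $zip, '-d', 'All_docs') == 0
        or die "unzip failed for $zip";
    unlink $zip or warn "Could not delete $zip: $!";
}

# Steps 2 and 3: hard-link (not copy) each document into its destination folder.
for my $doc (glob 'All_docs/*') {
    my $name = basename($doc);
    my $dest = matches_print_criteria($doc) ? 'Harl_print' : 'Online_docs';
    link $doc, "$dest/$name"                  # a hard link uses no extra disk space
        or warn "Could not link $doc into $dest: $!";
}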


Up to this point, we have kept the disk space used by this program to 8 GB :-). The internal business logic is complete; now it is time to zip and send:

Now we create the zips to be transferred one by one, FTP each to its destination and delete it after the transfer. This is the same idea as step (1), but sending instead of downloading. It again ensures that no extra disk space is used, other than about 1.x GB for each zip while it is being created and FTPed.
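A hedged sketch of that zip-send-delete loop, again with made-up host names and reusing Net::SFTP::Foreign directly (the flex_ftp.pm wrapper is not shown in this article). For brevity it makes one zip per folder, whereas the real program kept each piece to roughly 1 GB:

use strict;
use warnings;
use Net::SFTP::Foreign;

my %dest = (                                  # placeholder destination hosts
    All_docs    => 'ftp.dest1.example.com',
    Harl_print  => 'ftp.dest2.example.com',
    Online_docs => 'ftp.dest3.example.com',
);

for my $folder (sort keys %dest) {
    my $zip = "$folder.zip";
    system('zip', '-q', '-r', $zip, $folder) == 0
        or die "zip failed for $folder";

    my $sftp = Net::SFTP::Foreign->new($dest{$folder},
                                       user       => 'itrade_user',   # placeholder user
                                       block_size => 100_000);
    $sftp->die_on_error("Cannot connect to $dest{$folder}");
    $sftp->put($zip, "/incoming/$zip")                                # placeholder remote path
        or die "Upload of $zip failed: " . $sftp->error;

    unlink $zip;            # delete right away so only one zip sits on disk at a time
}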

So, after deleting each zip once it has been sent, we are back to just 8 GB. During the entire process we used about 8 GB (of expanded files) plus about 1.x GB for the zip in flight, so the whole job completes using only around 9 GB of disk space.

This is far better than using 36.5 GB and exhausting the disk, and it uses only about 25% of the disk space originally needed.

P.S: My manager said I should not have wasted 2 days on this new implementation and on testing it; I should instead have efficiently asked the Admin team to add another 40 GB or so to the mount. However, he was satisfied with the faster download and said '6 minutes per GB is acceptable'.

-*-*-
*-*-*-*-*
-*-*-