Monday, September 15, 2014

Parsing/Loading huge XML & managing memory in Perl XML::Simple

Purpose: 
To parse "huge" XML file into perl data structure using XMLIn(). This article does not intend to teach XML or XMLIn(). This focuses only in memory management & efficiency when using XMLIn() and Twig. The same ideas can be applied in Java or C++ too. 


As an example we use a 235 MB data file named "prices.xml", cut down from the original 900 MB file: with the full file the mighty Unix system ran out of memory and killed the process, so it cannot be used as a benchmark for comparison. Since XMLin() could not even complete parsing the 900 MB file, this study is not just an improvement but an "essential" fine tuning of the code, without which production will fail.

Intro: Perl's XML::Simple is a simple module (as the name suggests) whose XMLin() function parses an XML file and converts it into a Perl hash structure that is very easy to work with. But while the returned hash is easy to use, XMLin() takes a long time and a lot of memory before it returns. So let's get our hands dirty and analyze right away:
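In its simplest form, the whole-file version is just a couple of lines (a minimal sketch):

#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Baseline: hand the whole file to XMLin() and let it build one big
# in-memory structure holding every element of the file.
my $mktdata = XMLin('prices.xml');
# ... walk $mktdata and pick out the values we need ...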

When run against this 235 MB prices.xml file, XMLin() did not return. After waiting for a while, instead of interrupting it with ^C, I ran memory and system monitoring commands to see whether it was really working or hanging. The 'top' command showed two things. One good. The other bad.
The good: most of the activity was clearly happening in my Perl program (so perhaps not hanging):

The bad: the process started at a few MB, reached 320 MB in one minute, 630 MB in two minutes, and so on. In about 6 minutes XMLin() had consumed about 1.8 GB of resident and virtual memory and was still running for a few more seconds. This is not just a memory problem: since Unix is a time-sharing system, such a large virtual footprint forces constant swapping in and out, which is time consuming and can slow down the entire system.

So this article is about how to make this efficient. Interested? Read on.

The prices.xml file has several million rows. Loading it into memory with XMLin() is not easy because of the several tasks performed during loading: 

  1. parsing and splitting each of the millions of pieces in the data
  2. recording the sibling/child/parent relationships and the annotations among them
  3. storing everything in a tree-like structure with no loss of data, so that the original file can be reproduced later with XMLout(), etc.

So it needed 1.8 GB of virtual memory, as the top command showed. When the function returned, memory usage dropped back to normal. Let us investigate the data in prices.xml:

The file looks like this:
<MktData Date="2014-09-17" … … … … ….. >
<Sym Name="IBM"  MatDt="2015-01-17" Price="52.80" … … … … ….. />
<Sym Name="AAPL" MatDt="2015-01-17" Price="101.53" … … … … ….. />
….   several million rows like this, and then
<Sym Name="TSLA" MatDt="2015-01-17" Price="282.17" … … … … ….. />
</MktData>

As the snapshot shows, the data is an almost flat, simple structure: except for the first and last lines, every line is linear, representing exactly one element, with all the data of that single record on that same line (the attribute names shown above are only illustrative). That makes it possible to process the file line by line, element by element. Let us see how that works out:


one at a time please:

So we observed that, since every line contains exactly one XML record, the file can be processed line by line. I was almost sure, even before testing, that line-by-line processing would use less memory. The reason: when XMLin() parses, it has to store each and every element as full key/value strings (Sym => "AAPL", MatDt => "2015-01-17", Price => "101.53"), along with the other XML bookkeeping discussed in the last paragraph. Processing line by line does the same work, but on a micro subset of the data: a few KB per line instead of GBs for the whole file. And after each per-line XMLin() call returns, I can copy just the values I need into a Perl hash named %Prices, which stores only the values (like "AAPL" => 101.53) at a cost of a few bytes per line. When we move on to the next line, the previous line's structures become unreferenced and are released, so memory grows by no more than a few bytes per line.

So I changed the code to open prices.xml as a flat file and read it one line at a time in a loop. Inside the loop I parsed that line with the same XMLin() call, picked up the returned structure, stored only the values I needed, and stepped to the next iteration, printing the number of rows processed once every 1000 lines. This was really memory efficient: from beginning to end, memory stayed stable and low (35 MB vs 1.8 GB):
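The loop looked roughly like this (a sketch; the Name attribute and the skipping of non-record lines are illustrative details rather than the exact production code):

#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

open my $fh, '<', 'prices.xml' or die "cannot open prices.xml: $!";

my %Prices;        # keep only symbol => price, a few bytes per record
my $count = 0;

while (my $line = <$fh>) {
    next unless $line =~ /<Sym\b/;    # keep only the one-record lines
    # Parse just this one record; the tiny hash it returns becomes
    # unreferenced (and is freed) on the next iteration.
    my $rec = XMLin($line);
    $Prices{ $rec->{Name} } = $rec->{Price};
    print "$count rows processed\n" if ++$count % 1000 == 0;
}
close $fh;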




However, the program did not finish in 6 minutes. It kept going for about 17 minutes before completing. Apparently we traded speed for memory. Fair enough? Maybe: at least we have stopped slowing down the entire system and starving other programs. Maybe not: there may be better ways. I started looking at the code.


somewhere in the middle:

It did not take much time to grasp what was happening. With this new code we call XMLin() more than a million times (once per line), whereas earlier we called it only once. Each XMLin() call has some fixed cost: basic setup, initialization, stack frames pushed and popped, and over a million calls that time adds up. So I decided to balance it out with fewer XMLin() calls. I changed the code to read 1000 lines in a loop and hand all 1000 to a single XMLin() call. I chose 1000 because the XML records are fairly small and 1000 of them fit in a few blocks of memory (taking a block as 8 KB). To turn 1000 individual records into a single valid XML document, they need an embedding header and footer tag, so I wrapped them in a <Thou> element. Instead of feeding XMLin() one line at a time, like
<Sym Name="IBM" MatDt="2015-01-17" Price="52.80" … … … … ….. />
I read 1000 lines from the prices file and fed them in one go, with a first and last line added:
<Thou>
<Sym Name="IBM"  MatDt="2015-01-17" Price="52.80" … … … … ….. />
<Sym Name="AAPL" MatDt="2015-01-17" Price="101.53" … … … … ….. />
….   1000 lines like this
<Sym Name="TSLA" MatDt="2015-01-17" Price="282.17" … … … … ….. />
</Thou>
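In code, the batching loop looked roughly like this (a sketch; the Name attribute, the <Thou> wrapper handling and the helper name flush_chunk are illustrative, not the exact production code):

#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

my %Prices;

# Parse one <Thou>-wrapped chunk of up to 1000 records with a single XMLin() call.
sub flush_chunk {
    my ($xml) = @_;
    return unless length $xml;
    # ForceArray keeps {Sym} a list even if a chunk held a single record;
    # KeyAttr => [] switches off XML::Simple's key folding.
    my $ref = XMLin("<Thou>\n$xml</Thou>", KeyAttr => [], ForceArray => ['Sym']);
    $Prices{ $_->{Name} } = $_->{Price} for @{ $ref->{Sym} };
}

open my $fh, '<', 'prices.xml' or die "cannot open prices.xml: $!";
my ($chunk, $count) = ('', 0);
while (my $line = <$fh>) {
    next unless $line =~ /<Sym\b/;    # keep only the one-record lines
    $chunk .= $line;
    if (++$count % 1000 == 0) {
        flush_chunk($chunk);
        $chunk = '';
    }
}
flush_chunk($chunk);                  # whatever is left at end of file
close $fh;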


After XMLin() returned, I would loop through all the elements in the result and store them one by one (1000 per call). But would this run faster? I kept my fingers crossed, ran it, and watched the memory growth again with our faithful top command:


This time it took just 6 MB more (than the line-by-line version) and finished in 6 minutes 20 seconds: very close to the original run's speed and close to the line-by-line version's memory usage. Awesome. I should have stopped here, but I didn't.


Multi-threading failure:

Carried away by an obsessive compulsion to speed the program up, I spent more than a day implementing a multi-threaded version to make the parsing faster. I used threads to parse the thousand-line chunks in parallel, with 10 threads running. It finished in just 2 minutes 27 seconds. But with all the complicated memory handling and thread synchronization, all I achieved was roughly a 2.5x speedup (about a 60% decrease in run time), and it brought memory usage back to the original level (actually it exceeded it), since every thread keeps its own memory, variables and data structures, while the code needed to manage the threads had already bloated the program. Surely not worth it. I also observed a few facts and learned a few things from this attempt, which are out of scope for this article. 
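A hypothetical sketch of that approach (10 workers pulling 1000-line chunks off a queue; the Name attribute, the <Thou> wrapper and the queue-based hand-off are illustrative assumptions, not the code I actually ran):

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use XML::Simple;

my $NUM_WORKERS = 10;
my $queue       = Thread::Queue->new();

# Each worker takes <Thou>-wrapped chunks off the queue, parses them with
# XMLin(), and returns its own partial symbol => price hash when done.
sub worker {
    my %partial;
    while (defined(my $chunk = $queue->dequeue())) {
        my $ref = XMLin("<Thou>\n$chunk</Thou>", KeyAttr => [], ForceArray => ['Sym']);
        $partial{ $_->{Name} } = $_->{Price} for @{ $ref->{Sym} };
    }
    return \%partial;
}

my @workers = map { threads->create(\&worker) } 1 .. $NUM_WORKERS;

open my $fh, '<', 'prices.xml' or die "cannot open prices.xml: $!";
my ($chunk, $count) = ('', 0);
while (my $line = <$fh>) {
    next unless $line =~ /<Sym\b/;              # keep only the one-record lines
    $chunk .= $line;
    if (++$count % 1000 == 0) { $queue->enqueue($chunk); $chunk = ''; }
}
$queue->enqueue($chunk) if length $chunk;       # leftover records
close $fh;

$queue->enqueue(undef) for 1 .. $NUM_WORKERS;   # one stop signal per worker

my %Prices;
for my $t (@workers) {
    my $partial = $t->join();                   # copy each thread's results back
    @Prices{ keys %$partial } = values %$partial;
}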


The Harmonious end - XML::Twig

The purpose of XML::Twig, as the name suggests, is to let us process just the little twigs of a huge XML tree instead of building the entire tree. That eliminates the memory issue outright, because it loads only a small tree of a few twigs for each Sym element (instead of loading millions of branches of the entire XML in the reproducible form discussed earlier), provided we use it correctly and purge the memory after using each element. With XML::Twig we write a function that handles a single XML element and have Twig call that function for every such element in the file. The code is something like this:

use XML::Twig;

# whenever a complete 'Sym' element has been parsed, call process_one_symbol_record
my $twig = XML::Twig->new(
    twig_handlers => { Sym => \&process_one_symbol_record },
);
$twig->parsefile("prices.xml");
..
sub process_one_symbol_record
{
    my ($tree, $sym) = @_;                 # the twig object and the current <Sym> element
    my $expiry_date  = $sym->att('MatDt');
    my $option_price = $sym->att('Price');
    ..
    $tree->purge_up_to($sym);              # release the memory used so far
}

Since Twig processes one element at a time, and our handler purges memory as it goes, there is no memory problem. There is also none of the read-a-line-then-call-XMLin() overhead we had in the line-by-line version. While using little memory, this version also finished faster than every other single-threaded approach. It is the only case where both speed and memory improved.

Let me tabulate the results:

Program                 Memory hogged    % Mem vs. whole    Time taken (secs.)    % Time vs. whole
Whole file parsing      1.8 GB           100.00%            370                   100.00%
Line by line parsing    56 MB            3.11%              1125                  304.40%
1000 lines at a time    62 MB            3.44%              380                   102.71%
Multithreaded           > 1.9 GB         110.00%            147                   39.71%
XML::Twig               59 MB            3.28%              250                   67.56%


Conclusion:

  • With 1000-lines-at-a-time parsing we gave up just 2.71% of the speed but saved 96.56% of the memory. Very balanced and very satisfactory: the program runs fast and, by saving memory, lets the system and the other programs run fast too.
  • The 1000-lines-at-once version is the most balanced use of XML::Simple, but XML::Twig is THE BEST: faster, easier, and with less chance of introducing bugs. 
  • Multi-threading is useful when we gather data from different sources (for example parsing two different XML files), but it is not very suitable when all threads operate on a common data structure or resource. 
Thank you for reading. Please post your comments so I can correct myself, my English, and this article.