Monday, July 7, 2014

How I made a Perl program run 10 times faster using just 25% of the disk space it originally used

When I joined the company iTrade (name changed for legal reasons) as a consultant, I was asked to fix a Perl program that had started failing halfway through its run because the data had grown too large. The program would run for about 13 hours before collapsing, filling the Unix disk partition to 100% and suffocating for lack of disk space. This article describes how I tuned the program to make it both fast and space efficient, whereas in most tuning exercises you trade one for the other.

When I spoke to the manager, I understood that there were TWO problems:


  • It takes him only 10 minutes to download a 1 GB file from the FTP server to his PC, but the same file took 1 1/2 hours to download during the program's execution. There are 6-7 zip files like that, so together they took close to 10 hours.
  • The program somehow consumed the 27 GB of disk space available and stopped running. It needed more space to make further progress.


Studying the program, I was initially confused, because it was doing misleading and meaningless things; later I understood that it was simply very poorly written and I should not read too much into it. It downloads 7 GB of zip files containing several thousand documents. These documents are unzipped (extracted) and placed in a single folder. They are then grouped by various business needs using filtering logic and copied to two other folders. These folders are zipped again and sent to three different destination FTP servers.

 Speaking to the business team, I got this information:

     a) First FTP destination receives all of the documents that were present in the incoming zips.
     b) Second FTP destination gets about 25% of these documents that match specific criteria.
     c) Third FTP destination gets the remaining 75% of these documents.

Analysis of Performance / speed:

Why should this Perl program take 1 1/2 hours per GB, nearly 10 times longer, when running on the Unix system? Was the manager accurate when he told me he could download the file in 10 minutes? I tried to download the same file myself and found that he was right: it indeed took about 11 minutes to download 1 GB from the same server, while the Perl program had taken 1 1/2 hours for each GB. Was the Linux/Unix box having some problem? Maybe a network or device issue? Maybe the partition where the downloaded files were kept had some disk I/O issue?

When I downloaded the same big zip file into the same folder the program used, but with the Unix sftp command, I could rule out that possibility. The sftp command downloaded the file successfully in 10 minutes, i.e. about 10% faster than the Windows PC and definitely not 10 times slower. So there was no problem with the folder or the network. What, then, was the problem?

That left only one dimension: the Perl program itself. When I looked at it, I found that the FTP tool used was the Net::SFTP module. That module is written entirely in Perl, which can make it slow, especially when the number of packets / blocks of data received is very large. For GB-sized files, the number of blocks sent one by one can approach 100,000. With that in mind, I searched the internet for newer modules; if I found none, I would fall back to running the Unix sftp command from Perl as a system command and then using the file. But I found a superb Perl module named Net::SFTP::Foreign, which was exactly what I was searching for. This module is blazing fast, uses the Unix system's sftp & ssh (which are "foreign" to Perl's internal module, hence the name), and is robust, with tons of features. The most relevant feature was the block_size option. By setting block_size to, say, 100,000, we make every FTP round trip (request, receive, acknowledge) carry 100,000 bytes. That is, in a mere 10 round trips we can get as much as 1 MB of data, which would need roughly 125 round trips with the old Net::SFTP module and its much smaller default block size.
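As an illustration, here is a minimal sketch (not the production code) of fetching one large zip with Net::SFTP::Foreign and a tuned block_size; the host, credentials and file paths are placeholders:

    #!/usr/bin/perl
    # Minimal sketch, not the production program: fetch one large zip with
    # Net::SFTP::Foreign, asking for 100,000-byte blocks per round trip.
    # Host, user, password and paths below are placeholders.
    use strict;
    use warnings;
    use Net::SFTP::Foreign;

    my $sftp = Net::SFTP::Foreign->new(
        'ftp.example.com',
        user     => 'itrade_user',
        password => 'secret',              # or rely on ssh keys instead
    );
    $sftp->error and die "SFTP connection failed: " . $sftp->error;

    # block_size controls how many bytes each request/receive round trip carries.
    $sftp->get(
        '/outgoing/docs_batch1.zip',             # remote file (placeholder)
        '/data/downloads/docs_batch1.zip',       # local target (placeholder)
        block_size => 100_000,
    ) or die "Download failed: " . $sftp->error;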

When I used this Net::SFTP::Foreign module with default options, it downloaded the 1 GB file in about 11 minutes. Very good!! Then I put it on steroids by setting block_size to 100,000 and tried again. The result? It downloaded the 1 GB zip file in 6 minutes!! That is about 170 MB a minute!! Awesome!!

Without waiting much longer, I went on to write my own flex_ftp.pm, a wrapper around Net::SFTP::Foreign with a few built-in options, flexible block_size negotiation, and progress reporting. It was not easy and took four days, but at the end I had a neat module ready to work as a plug-in. I changed the original code to use this flex_ftp.pm (i.e. Net::SFTP::Foreign under the hood) and tried the run. The same 7 GB download that had taken more than 11 hours completed in 43 minutes, a mere 6% of the time!!
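I cannot reproduce the real flex_ftp.pm here, but a stripped-down sketch of that kind of wrapper might look like this; the interface and names are illustrative only, not the actual module:

    package flex_ftp;
    # Illustrative sketch only: the real flex_ftp.pm had richer option handling,
    # block_size negotiation and progress reporting than shown here.
    use strict;
    use warnings;
    use Net::SFTP::Foreign;

    sub new {
        my ($class, %opt) = @_;
        my $sftp = Net::SFTP::Foreign->new(
            $opt{host},
            user     => $opt{user},
            password => $opt{password},
        );
        die "SFTP connect failed: " . $sftp->error if $sftp->error;
        my $self = {
            sftp       => $sftp,
            block_size => $opt{block_size} || 100_000,   # tuned default
        };
        return bless $self, $class;
    }

    sub download {
        my ($self, $remote, $local) = @_;
        $self->{sftp}->get($remote, $local, block_size => $self->{block_size})
            or die "get($remote) failed: " . $self->{sftp}->error;
        return $local;
    }

    1;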

Other areas of the program, some loops and queries, were fine-tuned too. They did improve, but not to a bragging extent :-), so let us move on to the other issue:

Analysis of disk space:

ip60s56m3> df .

Filesystem      1K-blocks       Used  Available Use%
/dev/sdb1       138881684  131667900      45184 100%


Yes, the above is from the production machine when the program failed. Now, reiterating (a), (b) and (c) that we learnt from the business team:

     a) First FTP destination receives all of the documents that were present in the incoming zips.
     b) Second FTP destination gets about 25% of these documents that match specific criteria.
     c) Third FTP destination gets the remaining 75% of these documents.

Writing these down in tabular form, we can see that the 7 GB of zip files were extracted into individual documents totalling 8 GB, then copied into two other folders of 2 GB and 6 GB during the filtering process. These three folders are then zipped into files totalling 7 GB, 1.5 GB and 5 GB respectively and sent to their destinations.

Folder            Purpose                            Size
Download folder   Incoming zip files                 7 GB (.zip files)
All_docs          To zip & FTP to 1st destination    8 GB (unzipped documents)
Harl_print        To zip & FTP to 2nd destination    2 GB (copied from All_docs)
Online_docs       To zip & FTP to 3rd destination    6 GB (copied from All_docs)
All_docs          Zipped files to 1st destination    7 GB (.zip files)
Harl_print        Zipped file to 2nd destination     1.5 GB (.zip file)
Online_docs       Zipped file to 3rd destination     5 GB (.zip file)
Total                                                36.5 GB


Summing them up (7 + 8 + 2 + 6 + 7 + 1.5 + 5 = 36.5 GB), no doubt there was not enough disk space to hold this 36.5 GB, and the system failed.

After this analysis, I asked more questions:
Why should we keep the incoming zip files instead of deleting them after expansion? The manager's answer: we cannot afford to lose them, since we spent 11 hours downloading them!!

But I had just implemented the new FTP module that downloads everything in about 40 minutes, so now we can afford it!! I did further analysis and devised more techniques:


  1. Download the zip files one by one, unzipping each one immediately and deleting it after extracting its contents into the All_docs folder. At the end of this download-expand-delete process, we will have only the 8 GB of extracted documents inside the All_docs folder and no zip files lying around. 7 GB of disk space saved!! (A rough Perl sketch of this and the next two techniques follows this list.)
  2. Instead of copying files into the Harl_print folder, we can create links inside it (the Unix ln command). These links do not create a copy of the file on disk, so they consume no disk space; each link is just a new name that points back to the original file while behaving like a consistent copy of it. So the 2 GB of such links point back to the original files in the All_docs folder. 2 GB of disk space saved!!
  3. For the Online_docs folder, we simply repeat the technique in (2), i.e. create links instead of copies. 6 GB saved!!
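Here is the rough Perl sketch promised above, combining the download-expand-delete loop with hard links instead of copies. The host, paths, file names and the filtering stub are all placeholders, not the real program:

    use strict;
    use warnings;
    use Net::SFTP::Foreign;

    my $sftp = Net::SFTP::Foreign->new('ftp.example.com', user => 'itrade_user');
    $sftp->error and die "SFTP connection failed: " . $sftp->error;

    # (1) Download, expand and delete the zips one by one.
    my @zips = ('batch1.zip', 'batch2.zip');           # placeholder file names
    for my $zip (@zips) {
        my $local = "/data/downloads/$zip";
        $sftp->get("/outgoing/$zip", $local, block_size => 100_000)
            or die "get($zip) failed: " . $sftp->error;

        system('unzip', '-q', $local, '-d', '/data/All_docs') == 0
            or die "unzip of $local failed";
        unlink $local or warn "could not delete $local: $!";   # zip no longer needed
    }

    # (2) and (3) Hard links instead of copies: each link is just another name
    # for the same data blocks, so Harl_print and Online_docs cost no extra space.
    # Note that hard links require all folders to be on the same filesystem.
    sub matches_print_criteria { return 0 }   # stub: the real business filter is not shown
    opendir my $dh, '/data/All_docs' or die "cannot open All_docs: $!";
    for my $doc (grep { -f "/data/All_docs/$_" } readdir $dh) {
        my $target = matches_print_criteria($doc)
            ? "/data/Harl_print/$doc"
            : "/data/Online_docs/$doc";
        link "/data/All_docs/$doc", $target or warn "link failed for $doc: $!";
    }
    closedir $dh;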


Up to this point, we have kept the disk space used by this program to 8 GB :-). The internal business logic is complete; now it is time to zip and send:

Now we can create the zips to be transferred one by one, FTP each to its destination, and delete it after the transfer. This is just like step (1), except that we are sending instead of downloading. It again ensures that no extra disk space is used beyond roughly 1.x GB for the single zip being created and FTPed at any given time.
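A matching sketch of this zip-send-delete step might look like the following; the destinations, folder names and credentials are again placeholders:

    use strict;
    use warnings;
    use Net::SFTP::Foreign;

    # Placeholder mapping of source folder to destination server.
    my %dest = (
        All_docs    => 'ftp1.example.com',
        Harl_print  => 'ftp2.example.com',
        Online_docs => 'ftp3.example.com',
    );

    for my $folder (sort keys %dest) {
        my $zip = "/data/outbound/$folder.zip";

        # Create the zip only when it is about to be sent...
        system('zip', '-q', '-r', $zip, "/data/$folder") == 0
            or die "zip of $folder failed";

        my $sftp = Net::SFTP::Foreign->new($dest{$folder}, user => 'itrade_user');
        $sftp->error and die "connect to $dest{$folder} failed: " . $sftp->error;
        $sftp->put($zip, "/incoming/$folder.zip", block_size => 100_000)
            or die "put($zip) failed: " . $sftp->error;

        # ...and delete it as soon as the transfer succeeds, so at most one
        # ~1.x GB zip sits on disk at any given time.
        unlink $zip or warn "could not delete $zip: $!";
    }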

So, after deleting each zip once it has been sent, we are left with just the 8 GB. During the entire process we used about 8 GB of expanded files plus roughly 1.x GB for the zip currently in flight; hence the whole job completes using only around 9 GB of disk space.

This is far better than using 36.5 GB and exhausting the disk!! We now use just about 25% of the disk space originally needed.

P.S.: My manager said I should not have wasted two days on this new implementation and its testing; he said I should instead have simply asked the Admin team to add another 40 GB or so of space to the mount. However, he was satisfied with the download speed-up and said '6 minutes per GB is acceptable'.

-*-*-
*-*-*-*-*
-*-*-




1 comment:

vas said...

Good tech points mixed with business-specific stuff and a bit of politics. Yeah... things are harder to do than we think.