Wednesday, September 28, 2005

Current Status

Returning from a short vacation of sorts, I see that my trusty script spidered over 120,000 Blogspot blog pages while I was gone. The total I need to sift through now stands at over 180,000 pages, or about 15 GB of raw data. My handy little Perl script that extracts links from HTML is no longer up to the job. It used to take about a minute to process 2,000 pages, but now that I'm dealing with a much larger data set I'll need to find a more efficient way to process it all.
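
As a rough illustration of the kind of link extraction described above, here is a minimal sketch in Python rather than Perl. It is not the original script: the names `LinkExtractor`, `extract_links`, and `estimate_minutes` are made up for this example, and it assumes each page is already loaded as a string.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags as the page is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html_text):
    """Return the href of every <a> tag in html_text, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


def estimate_minutes(total_pages, pages_per_minute):
    """Back-of-envelope run time, assuming throughput stays constant."""
    return total_pages / pages_per_minute
```

At the quoted rate of 2,000 pages a minute, `estimate_minutes(180000, 2000)` works out to 90 minutes for the full set, and only if the rate holds steady as the data grows.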
