Apache Nutch is an open source web crawler written in Java. We can use it to follow web page hyperlinks in an automated manner and keep a copy of every visited page so that it can be searched later.
Apache Nutch supports Solr out of the box, which greatly simplifies Nutch-Solr integration. It also removes the legacy dependence on Apache Tomcat for running the old Nutch web application and on Apache Lucene for indexing.
Solr is an open source full-text search framework; with Solr we can search the pages that Nutch has visited. I covered Nutch-Solr integration in my previous post.
Nutch is a plugin-driven tool and as such has many modules that have to be triggered independently for the entire process to flow. Each plugin is replaceable, and we can write custom plugins to suit our needs.
"bin/crawl" shell script replaces the crawl command of Nutch and does almost all the tasks necessary for indexing in a single flow.
In this post we will look at what is happening inside the crawl shell script. As a Java developer, understanding and modifying a shell script is not something I find easy, so if you are a Java developer too, this walkthrough should be useful.
You can refer to this link for a flowchart and an explanation of the process:
Here we will go through the script itself.
#1
# initial injection
$bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR
The command formed is something like this:
bin/nutch inject crawldb urls
What this does is take the list of base URLs from seed.txt under the urls folder and add them to the crawldb, which is created under the crawldb folder.
The crawldb holds information about the URLs to be fetched: their fetch status, metadata and so on.
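For example, a minimal urls/seed.txt is just a plain text file with one start URL per line (the URL below is only a placeholder), and after injection the crawldb can be inspected with Nutch's readdb tool:

http://nutch.apache.org/

bin/nutch readdb crawldb -stats

The -stats option prints the total number of URLs and a count per fetch status, which is a quick way to confirm that the seed URLs were injected.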
#2
# Generating a new segment
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
The command formed is something like this:
bin/nutch generate crawldb segments -topN 50000 -numFetchers 1 -noFilter
This generates a new segment containing the top URLs that are due to be fetched. The fetch list can be split into multiple segments to distribute the load, and the size of each fetch list is capped by the -topN value (50,000 URLs in this example).
The segments are created under the segments folder.
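If you want to see which segments have been generated so far, the readseg tool can list them along with their generation time and fetch/parse counts (assuming the segments sit under a local segments folder):

bin/nutch readseg -list -dir segments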
#3
# fetching the segment
SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
echo "Fetching : $SEGMENT"
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -noParsing -threads $numThreads
The command formed would look like:
SEGMENT=`ls segments/ | sort -n | tail -n 1`
echo $SEGMENT
bin/nutch fetch -D fetcher.timelimit.mins=180 segments/$SEGMENT -noParsing -threads 50
This command is for a local installation of Nutch, not a distributed one. It fetches the content of the URLs that were generated in the previous step and stores the fetched content back into the segment.
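The fetcher's politeness and parallelism come from properties in nutch-default.xml / nutch-site.xml, and just like the script does with fetcher.timelimit.mins, they can be overridden on the command line with -D. A rough, purely illustrative example (the values are not recommendations):

bin/nutch fetch -D fetcher.timelimit.mins=180 -D fetcher.server.delay=5.0 -D fetcher.threads.per.queue=1 segments/$SEGMENT -noParsing -threads 50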
#4
# parsing the segment
echo "Parsing : $SEGMENT"
# enable the skipping of records for the parsing so that a dodgy document
# does not fail the full task
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
$bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT
The command formed is:
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
bin/nutch parse $skipRecordsOptions segments/$SEGMENT
This command parses the content fetched in the previous step and extracts the URLs present in those pages; these are used again for future crawling.
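To see what the parser produced, including the outlinks that will feed the next cycle, the segment can be dumped into readable text with readseg; the -nocontent flag below just skips the raw page content to keep the dump small (segdump is an arbitrary output folder name):

bin/nutch readseg -dump segments/$SEGMENT segdump -nocontent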
#5
# updatedb with this segment
echo "CrawlDB update"
$bin/nutch updatedb $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments/$SEGMENT
The command formed is
bin/nutch updatedb crawldb segments/$SEGMENT
This is similar to the inject command: it adds the URLs newly found by the parser and updates the existing crawldb, so the parsed URLs are added back to the initial database of URLs.
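After the update, the crawldb entry for any particular URL can be checked with readdb's -url option, which prints its status, fetch time and metadata (the URL below is only a placeholder):

bin/nutch readdb crawldb -url http://nutch.apache.org/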
#6
echo "Link inversion"$bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
The command formed is
bin/nutch invertlinks linkdb -dir segments
Inverting the links builds an inverted map from each child link/URL to the URLs that lead to it, so URLs can be ranked by the number of places from which they are reached. It works in a similar spirit to Google's PageRank.
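The resulting linkdb can be inspected with the readlinkdb tool, either by dumping the whole inverted map to text or by listing the inlinks of a single URL (the URL and the linkdump folder name below are placeholders):

bin/nutch readlinkdb linkdb -dump linkdump
bin/nutch readlinkdb linkdb -url http://nutch.apache.org/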
#7
echo "Dedup on crawldb"$bin/nutch dedup $CRAWL_PATH/crawldb
The command formed is
bin/nutch dedup crawldb
This is a new command in the flow; the same work was previously done by SolrDedup. It finds duplicates based on their signature: among entries with the same signature, the one with the highest score is retained. If the scores are equal, the fetch time is used, and if those are also the same, the URL lengths are compared. The losing duplicate entries are marked as STATUS_DB_DUPLICATE and are cleaned up by the cleaning and indexing jobs that follow.
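Duplicates only get their special status once this job has run; a readdb -stats on the crawldb afterwards should show them counted under the db_duplicate status:

bin/nutch readdb crawldb -stats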
#8
echo "Indexing $SEGMENT on SOLR index -> $SOLRURL"$bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
The command formed is
bin/nutch index -D solr.server.url=http://localhost:8983/solr/ crawldb -linkdb linkdb segments/$SEGMENT
This is where the actual indexing takes place: the data from the segments is sent to the Solr server, and the rest of the process is left to Solr.
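Once this step has run, you can check that documents actually reached Solr with a plain query; the core name (collection1 here) depends on your Solr setup, and the url and title fields come from the schema shipped with Nutch:

curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&fl=url,title"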
#9
echo "Cleanup on SOLR index -> $SOLRURL"$bin/nutch clean -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb
The command formed is
bin/nutch clean -D solr.server.url=http://localhost:8983/solr crawldb
The command scans the crawldb for entries with a status such as DB_GONE (404) or STATUS_DB_DUPLICATE and sends delete requests to Solr for those documents. Once Solr receives the requests, the documents are duly deleted, which maintains a healthier Solr index.
This entire process is repeated for the number of rounds specified: the first fetch cycle fetches the URLs in the parent seed file, and the subsequent cycles fetch the URLs discovered on those seed pages.
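Stripped of its options and error handling, one round of bin/crawl therefore boils down to something like the sketch below (simplified, and assuming crawldb, segments and linkdb live in the current directory; inject from step 1 runs only once before the loop):

for round in 1 2 3; do
  # steps 2-9, repeated once per round
  bin/nutch generate crawldb segments -topN 50000 -noFilter
  SEGMENT=`ls segments/ | sort -n | tail -n 1`
  bin/nutch fetch -D fetcher.timelimit.mins=180 segments/$SEGMENT -noParsing -threads 50
  bin/nutch parse segments/$SEGMENT
  bin/nutch updatedb crawldb segments/$SEGMENT
  bin/nutch invertlinks linkdb segments/$SEGMENT
  bin/nutch dedup crawldb
  bin/nutch index -D solr.server.url=http://localhost:8983/solr/ crawldb -linkdb linkdb segments/$SEGMENT
  bin/nutch clean -D solr.server.url=http://localhost:8983/solr/ crawldb
done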