
Understanding the Apache Nutch crawl script's process

Apache Nutch is an open source web crawler written in Java. We can use it to follow web page hyperlinks in an automated manner and create a copy of all the visited pages to search over.

Apache Nutch supports Solr out of the box, greatly simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch web application and upon Apache Lucene for indexing.

Solr is an open source full-text search platform; with Solr we can search the pages that Nutch has visited. I covered Nutch-Solr integration in my previous post.

Nutch is a plugin-driven tool and as such has many modules that have to be triggered independently for the entire flow of the process. Each plugin is replaceable, and we can write custom plugins to suit our needs.


"bin/crawl" shell script replaces the crawl command of Nutch and does almost all the tasks necessary for indexing in a single flow.

For now we will look at what is happening inside the crawl shell script. As a Java developer, understanding and modifying a shell script is difficult for me, so if you are a Java developer this should be useful.

You can refer to this link for a flowchart and an explanation of the process: 



Here we will go through the script itself.
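
For reference, the script itself is typically invoked as shown below; the exact usage string can differ slightly between Nutch versions, so treat this as an illustration rather than the definitive syntax:

        # bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
        bin/crawl urls crawl http://localhost:8983/solr/ 2

Here crawlDir corresponds to the $CRAWL_PATH variable you will see in the script snippets below.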


#1
# initial injection
$bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR 
The command formed is something like this:

        bin/nutch inject crawldb urls 

What this does is take the list of seed URLs from seed.txt under the urls folder and add them to the crawldb, which is created under the crawldb folder.

The crawldb holds the information about the URLs to be fetched: their fetch status, metadata and so on.
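
For example, a minimal seed file is just one URL per line, and (assuming a standard local Nutch 1.x install) the injected crawldb can be inspected with the readdb tool:

        # urls/seed.txt - one URL per line (example URL)
        http://nutch.apache.org/

        # print counts of the URLs currently in the crawldb
        bin/nutch readdb crawldb -stats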


#2

#Generating a new segment
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter 

The command formed is something like this:
      
        bin/nutch generate crawldb segments -topN 50000 -numFetchers 1 -noFilter


This generates a segment which contains the URLs that need to be fetched. The fetch list can be split into multiple segments to distribute the load, and each segment is capped at the -topN limit (50,000 URLs in this example).

The segments are created under the segments folder.
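
To check what was generated, the new segment (its name is a timestamp) can be listed with the readseg tool; this is a sketch assuming a standard local Nutch 1.x install:

        # pick the newest segment and list its URL counts
        SEGMENT=`ls segments/ | sort -n | tail -n 1`
        bin/nutch readseg -list segments/$SEGMENT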



#3


# fetching the segment
SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
echo "Fetching : $SEGMENT"
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -noParsing -threads $numThreads

The command formed would look like:

      SEGMENT=`ls segments/ | sort -n | tail -n 1`
      echo "Fetching : $SEGMENT"
      bin/nutch fetch -D fetcher.timelimit.mins=180 segments/$SEGMENT -noParsing -threads 50


This command is for a local installation of Nutch, not a distributed one. It fetches the content of the URLs in the generated fetch list and stores the fetched content back into the segment.
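
To spot-check the fetched content, the segment can be dumped to plain text with readseg; fetch_dump below is just an example output directory name, and the dump layout may vary by version:

        # dump only the raw fetched content of the segment
        bin/nutch readseg -dump segments/$SEGMENT fetch_dump \
            -nogenerate -nofetch -noparse -noparsedata -noparsetext
        less fetch_dump/dump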


#4


# parsing the segment
echo "Parsing : $SEGMENT"
# enable the skipping of records for the parsing so that a dodgy document
# does not fail the full task
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
$bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT 

The command formed is:

      skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
      bin/nutch parse $skipRecordsOptions segments/$SEGMENT


This command parses the content fetched in the previous step and extracts the URLs (outlinks) present in those pages; these are used in future crawl rounds.
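
The extracted outlinks can be verified by dumping just the parse data of the segment, again with readseg; parse_dump is an arbitrary output directory name and the exact dump format may vary by version:

        # dump only the parse data (outlinks, metadata) of the segment
        bin/nutch readseg -dump segments/$SEGMENT parse_dump \
            -nocontent -nofetch -nogenerate -noparse -noparsetext
        less parse_dump/dump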


#5

# updatedb with this segment
echo "CrawlDB update"
$bin/nutch updatedb $commonOptions $CRAWL_PATH/crawldb  $CRAWL_PATH/segments/$SEGMENT 

The command formed is

        bin/nutch updatedb crawldb segments/$SEGMENT

This is similar to the inject command: it adds the newly found URLs from the parse step and updates the status of the existing entries in the crawldb. The parsed URLs are now added back to the initial database of URLs.
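
Running the same readdb inspection as in step 1 should now show a larger total, with the newly discovered URLs in the unfetched state:

        # the total URL count should have grown after the update
        bin/nutch readdb crawldb -stats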


#6

echo "Link inversion"
$bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT 

The command formed is

        bin/nutch invertlinks linkdb segments/$SEGMENT

Inverting the links builds an inverted map from each child URL to the URLs that link to it, so a URL can be ranked by the number of places it is reached from. It works similarly in spirit to Google's PageRank.
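
The resulting linkdb can be inspected with the readlinkdb tool; the URL below is only an example, and linkdb_dump is an arbitrary output directory:

        # dump the whole inverted link database, or look up the inlinks of one URL
        bin/nutch readlinkdb linkdb -dump linkdb_dump
        bin/nutch readlinkdb linkdb -url http://nutch.apache.org/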


#7

echo "Dedup on crawldb"
$bin/nutch dedup $CRAWL_PATH/crawldb 

The command formed is

        bin/nutch dedup crawldb

This is a new command added to the process; this step was previously handled by SolrDedup. It finds duplicates based on the page signature, and among duplicates the entry with the highest score is retained. If the scores are equal, the fetch time is compared, and if those are also the same the URL lengths are compared. The remaining duplicate entries are marked as STATUS_DB_DUPLICATE and are cleaned up by the cleaning and indexing jobs that follow.
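
One way to see the effect is to dump the crawldb as text and count the entries marked as duplicates; this is a rough check, and the part file name under the dump directory may vary:

        # dump the crawldb and count entries marked db_duplicate
        bin/nutch readdb crawldb -dump crawldb_dump
        grep -c "db_duplicate" crawldb_dump/part-00000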


#8


echo "Indexing $SEGMENT on SOLR index -> $SOLRURL"
$bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT 

The command formed is

        bin/nutch index -D solr.server.url=http://localhost:8983/solr/ crawldb -linkdb linkdb segments/$SEGMENT


This is where the actual indexing takes place: the parsed data from the segment is sent to the Solr server, and the rest of the indexing process is left to Solr.
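
A quick way to confirm the documents landed in Solr is to query the index directly; collection1 is assumed here as the core name (the default in the Solr 4.x example setup), so adjust it to your own core:

        # count the documents currently in the Solr index
        curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json"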


#9



echo "Cleanup on SOLR index -> $SOLRURL"
$bin/nutch clean -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb 

The command formed is
        
        bin/nutch clean -D solr.server.url=http://localhost:8983/solr crawldb

The command scans the crawldb for entries with statuses such as DB_GONE (404) and STATUS_DB_DUPLICATE, and sends delete requests to Solr for those documents. Once Solr receives the requests, the documents are duly deleted. This maintains a healthier Solr index.


This entire process is repeated for the number of rounds specified: the first fetch cycle fetches the seed URLs themselves, and the subsequent cycles fetch the URLs discovered in the pages of the seed URLs.
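
Putting it all together, the crawl script essentially wraps steps 2 to 9 in a loop over the requested number of rounds. The following is a simplified sketch of that loop, not the verbatim script; LIMIT and SOLRURL stand in for the script's round-count and Solr URL parameters:

        # simplified sketch of the crawl script's main flow
        bin/nutch inject crawldb urls
        for ((round = 1; round <= LIMIT; round++)); do
            bin/nutch generate crawldb segments -topN 50000 -numFetchers 1 -noFilter
            SEGMENT=`ls segments/ | sort -n | tail -n 1`
            bin/nutch fetch -D fetcher.timelimit.mins=180 segments/$SEGMENT -noParsing -threads 50
            bin/nutch parse segments/$SEGMENT
            bin/nutch updatedb crawldb segments/$SEGMENT
            bin/nutch invertlinks linkdb segments/$SEGMENT
            bin/nutch dedup crawldb
            bin/nutch index -D solr.server.url=$SOLRURL crawldb -linkdb linkdb segments/$SEGMENT
            bin/nutch clean -D solr.server.url=$SOLRURL crawldb
        done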
