Apache Nutch is an open source web crawler written in Java. We can use it to follow web page hyperlinks in an automated manner and keep a copy of every visited page so that it can be searched later.
Apache Nutch supports Solr out of the box, which greatly simplifies Nutch-Solr integration. It also removes the legacy dependence on Apache Tomcat for running the old Nutch web application and on Apache Lucene for indexing.
Solr is an open source full-text search framework; with Solr we can search the pages that Nutch has visited. I covered Nutch-Solr integration in my previous post.
Nutch is a plugin-driven tool and as such has many modules that have to be triggered independently for the entire process to flow. Each plugin is replaceable, and we can write custom plugins to suit our needs.
"bin/crawl" shell script replaces the crawl command of Nutch and does almost all the tasks necessary for indexing in a single flow.
In this post we will look at what is happening inside the crawl shell script. As a Java developer, understanding and modifying a shell script is not something I find easy, so if you are a Java developer too, this walkthrough should be useful.
You can refer to this link for a flowchart and an explanation of the process:
Here we will go through the script itself.
#1
# initial injection
$bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR
The command formed is something like this:
bin/nutch inject crawldb urls
What this does is take the list of base URLs from seed.txt under the urls folder and add them to the crawldb, which is created under the crawldb folder.
The crawldb holds information about the URLs to be fetched: their fetch status, metadata and so on.
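For example, a minimal urls/seed.txt is just a plain text file with one start URL per line (the URL below is only a placeholder), and after injection the crawldb can be inspected with Nutch's readdb tool:

http://nutch.apache.org/

bin/nutch readdb crawldb -stats

The -stats option prints the total number of URLs and a count per fetch status, which is a quick way to confirm that the seed URLs were injected.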
#2
# Generating a new segment
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
The command formed is something like this:
bin/nutch generate crawldb segments -topN 50000 -numFetchers 1 -noFilter
This generates a new segment containing the top URLs that are due to be fetched. The fetch list can be split into multiple segments to distribute the load, and the size of each fetch list is capped by the -topN value (50,000 URLs in this example).
The segments are created under the segments folder.
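If you want to see which segments have been generated so far, the readseg tool can list them along with their generation time and fetch/parse counts (assuming the segments sit under a local segments folder):

bin/nutch readseg -list -dir segments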
#3
# fetching the segment
SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
echo "Fetching : $SEGMENT"
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -noParsing -threads $numThreads
The command formed would look like:
SEGMENT=`ls segments/ | sort -n | tail -n 1`
echo $SEGMENT
bin/nutch fetch -D fetcher.timelimit.mins=180 segments/$SEGMENT -noParsing -threads 50
This command is for a local installation of Nutch, not a distributed one. It fetches the content of the URLs that were generated in the previous step and stores the fetched content back into the segment.
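The fetcher's politeness and parallelism come from properties in nutch-default.xml / nutch-site.xml, and just like the script does with fetcher.timelimit.mins, they can be overridden on the command line with -D. A rough, purely illustrative example (the values are not recommendations):

bin/nutch fetch -D fetcher.timelimit.mins=180 -D fetcher.server.delay=5.0 -D fetcher.threads.per.queue=1 segments/$SEGMENT -noParsing -threads 50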
#4
# parsing the segment
echo "Parsing : $SEGMENT"
# enable the skipping of records for the parsing so that a dodgy document
# does not fail the full task
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
$bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT
The command formed is:
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
bin/nutch parse $skipRecordsOptions segments/$SEGMENT
This command parses the content fetched in the previous step and extracts the URLs present in those pages; these are used again for future crawling.
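To see what the parser produced, including the outlinks that will feed the next cycle, the segment can be dumped into readable text with readseg; the -nocontent flag below just skips the raw page content to keep the dump small (segdump is an arbitrary output folder name):

bin/nutch readseg -dump segments/$SEGMENT segdump -nocontent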
#5
# updatedb with this segment
echo "CrawlDB update"
$bin/nutch updatedb $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments/$SEGMENT
The command formed is
bin/nutch updatedb crawldb segments/$SEGMENT
This is similar to the inject command: it adds the URLs newly found by the parser and updates the existing crawldb, so the parsed URLs are added back to the initial database of URLs.
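After the update, the crawldb entry for any particular URL can be checked with readdb's -url option, which prints its status, fetch time and metadata (the URL below is only a placeholder):

bin/nutch readdb crawldb -url http://nutch.apache.org/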
#6
echo "Link inversion"$bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
The command formed is
bin/nutch invertlinks linkdb -dir segments
Inverting the links builds an inverted map from each child link/URL to the URLs that lead to it, so URLs can be ranked by the number of places from which they are reached. It works in a similar spirit to Google's PageRank.
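The resulting linkdb can be inspected with the readlinkdb tool, either by dumping the whole inverted map to text or by listing the inlinks of a single URL (the URL and the linkdump folder name below are placeholders):

bin/nutch readlinkdb linkdb -dump linkdump
bin/nutch readlinkdb linkdb -url http://nutch.apache.org/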
#7
echo "Dedup on crawldb"$bin/nutch dedup $CRAWL_PATH/crawldb
The command formed is
bin/nutch dedup crawldb
This is a new command in the flow; the same work was previously done by SolrDedup. It finds duplicates based on their signature: among entries with the same signature, the one with the highest score is retained. If the scores are equal, the fetch time is used, and if those are also the same, the URL lengths are compared. The losing duplicate entries are marked as STATUS_DB_DUPLICATE and are cleaned up by the cleaning and indexing jobs that follow.
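Duplicates only get their special status once this job has run; a readdb -stats on the crawldb afterwards should show them counted under the db_duplicate status:

bin/nutch readdb crawldb -stats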
#8
echo "Indexing $SEGMENT on SOLR index -> $SOLRURL"$bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
The command formed is
bin/nutch index -D solr.server.url=http://localhost:8983/solr/ crawldb -linkdb linkdb segments/$SEGMENT
This is where the actual indexing takes place: the data from the segments is sent to the Solr server, and the rest of the process is left to Solr.
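Once this step has run, you can check that documents actually reached Solr with a plain query; the core name (collection1 here) depends on your Solr setup, and the url and title fields come from the schema shipped with Nutch:

curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&fl=url,title"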
#9
echo "Cleanup on SOLR index -> $SOLRURL"$bin/nutch clean -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb
The command formed is
bin/nutch clean -D solr.server.url=http://localhost:8983/solr crawldb
The command scans the crawldb for entries with a status such as DB_GONE (404) or STATUS_DB_DUPLICATE and sends delete requests to Solr for those documents. Once Solr receives the requests, the documents are duly deleted, which maintains a healthier Solr index.
This entire process is repeated for the number of rounds specified: the first fetch cycle fetches the URLs in the parent seed file, and the subsequent cycles fetch the URLs discovered on those seed pages.
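Stripped of its options and error handling, one round of bin/crawl therefore boils down to something like the sketch below (simplified, and assuming crawldb, segments and linkdb live in the current directory; inject from step 1 runs only once before the loop):

for round in 1 2 3; do
  # steps 2-9, repeated once per round
  bin/nutch generate crawldb segments -topN 50000 -noFilter
  SEGMENT=`ls segments/ | sort -n | tail -n 1`
  bin/nutch fetch -D fetcher.timelimit.mins=180 segments/$SEGMENT -noParsing -threads 50
  bin/nutch parse segments/$SEGMENT
  bin/nutch updatedb crawldb segments/$SEGMENT
  bin/nutch invertlinks linkdb segments/$SEGMENT
  bin/nutch dedup crawldb
  bin/nutch index -D solr.server.url=http://localhost:8983/solr/ crawldb -linkdb linkdb segments/$SEGMENT
  bin/nutch clean -D solr.server.url=http://localhost:8983/solr/ crawldb
done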