Apache Nutch-Solr Integration
As of this writing, I am using the Solr 4.8.0 and Nutch 1.8 binaries for the integration. I will catch up to later versions as and when my project requires.
We will cover installation and operation of Solr and Nutch separately and then talk about the integration. The version of Nutch that I am using is built to work closely with Solr, so the integration is very simple.
For simplicity's sake, I'll stick to a Linux environment for both, as Nutch does not run natively on Windows.
Getting Solr to work
1. All you need for Solr to work are the binaries. You can get them from the official download page (version 4.8.0).
2. Extract solr-4.8.0.zip to some location. For this tutorial, let's assume it is in /home/test/Research/solr-4.8.0/
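From the terminal, the extraction might look like this (assuming the zip was downloaded to /home/test/Research/):
cd /home/test/Research/
unzip solr-4.8.0.zip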
3. Open a terminal, navigate to /home/test/Research/solr-4.8.0/example/ and execute the following command to start the Solr server:
java -jar start.jar
4. This command will start up Solr; you can check its status by opening the following URL:
http://localhost:8983/solr
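If you prefer the command line, a quick sanity check against the example collection's ping handler (present in the stock 4.8.0 example config) should also work:
curl "http://localhost:8983/solr/collection1/admin/ping?wt=json"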
5. You can add documents to the index and test the search functionality using the examples provided. Execute the following command to add an XML file to the index.
Navigate to /home/test/Research/solr-4.8.0/example/exampledocs/ and execute:
java -jar post.jar solr.xml
6. The above will have added the XML into Solr's index, and you can now query its content. Search for "solr" from the Query tab of the Solr Admin web client, which you can reach after choosing the default "collection1" core.
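The same search can be run without the Admin UI; a request like the following against the default collection1 core should return the document that was just posted:
curl "http://localhost:8983/solr/collection1/select?q=solr&wt=json&indent=true"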
Configuring Nutch for Solr
Nutch 1.8 is directly compatible with Solr without any code modifications.
1. Get the latest Nutch distribution binaries from their download page
2. Extract the downloaded apache-nutch-1.8.zip file to some folder. Let's assume it is in /home/test/Research/apache-nutch-1.8
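Again from the terminal, assuming the zip was downloaded to /home/test/Research/:
cd /home/test/Research/
unzip apache-nutch-1.8.zip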
3. Add the necessary configuration for the proxy server, the agent name, and the included plugins to conf/nutch-site.xml if there are any changes from the defaults.
The complete nutch-site.xml should look something like this:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch Crawler</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic</value>
  </property>
  <property>
    <name>urlfilter.regex.file</name>
    <value>regex-urlfilter.txt</value>
  </property>
  <property>
    <name>http.proxy.host</name>
    <value>proxy-server</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>1234</value>
  </property>
  <property>
    <name>http.proxy.username</name>
    <value>username</value>
  </property>
  <property>
    <name>http.proxy.password</name>
    <value>password</value>
  </property>
</configuration>
The plugin.includes property above, specifically the indexer-solr entry in its value, is what tells Nutch that indexing is delegated to Solr, and it is available by default in nutch-default.xml.
nutch-default.xml contains all the default configuration required for Nutch to run. If we want to override any of the parameters, we can specify them in nutch-site.xml.
For example, the "plugin.includes" property specified above is no different from the one given in nutch-default.xml and can therefore be removed from nutch-site.xml without any effect on the current setup. It will still work the same way.
4. Copy the necessary field and fieldType definitions from Nutch's conf/schema.xml into Solr's schema.xml (for the example setup, under solr-4.8.0/example/solr/collection1/conf/). This change requires a restart of the Solr server.
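A quick way to see which field definitions Nutch expects, and therefore which ones to merge into Solr's schema.xml, is to list them straight from Nutch's own schema file:
grep "<field name=" /home/test/Research/apache-nutch-1.8/conf/schema.xml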
5. Create a folder "urls" under /home/test/Research/apache-nutch-1.8/ and put a seed.txt file in it with the list of seed URLs that we want crawled.
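For example (the URL below is only a placeholder; list your own seed URLs, one per line):
mkdir -p /home/test/Research/apache-nutch-1.8/urls
echo "http://nutch.apache.org/" > /home/test/Research/apache-nutch-1.8/urls/seed.txt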
6. Now that we have all the configuration out of the way, let's execute the following command to start crawling/indexing.
Navigate to the /home/test/Research/apache-nutch-1.8/ folder in the terminal and execute:
./bin/crawl urls crawl http://localhost:8983/solr 2
The format for using the crawl command is:
crawl <seed.txt folder> <crawl folder> <solr url> <rounds of iteration>
-> <seed.txt folder>: the urls folder created in step 5, containing seed.txt with the list of seed URLs to crawl
-> <crawl folder>: holds the crawled objects which are to be indexed; these files are sent to Solr for indexing
-> <solr url>: the Solr URL for connecting and sending the data for indexing
-> <rounds of iteration>: there is no clear documentation on this; however, from the source we can see that it iterates the crawling process the given number of times, as a fail-safe when certain pages don't respond the first time (correct me if I'm wrong)
Once the crawl process completes, the data will have been synced to the Solr server. Connect to the Solr admin console using the web client and check whether the documents have been populated.
Do an initial search query by choosing the default collection1 core and searching for something from one of the pages on the website that you crawled.
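A quick command-line check is to ask the collection1 core for its total document count; numFound in the response should be greater than zero once the crawl has been indexed:
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json&indent=true"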
More info & references:
Solr Tutorial - 4.8.0
Nutch Tutorial - 1.8
hi,
Thanks for the helpful information.
I'm having a problem with the "java -jar post.jar solr.xml" command.
$ java -jar post.jar solr.xml
SimplePostTool version 5.0.0
SimplePostTool: FATAL: Specifying either url or core/collection is mandatory.
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
$
Can you please help me with this?
Thanks.