Apache Nutch-Solr Integration
As of this writing, I am using the Solr 4.8.0 and Nutch 1.8 binaries for the integration. I will catch up to later versions as and when my project requires.
We will cover installation and operation of Solr and Nutch separately and then talk about the integration. The version of Nutch that I am using is built to work closely with Solr, so the integration is very simple.
For simplicity's sake, I'll stick to a Linux environment for both, as Nutch does not run natively on Windows.
Getting Solr to work
1. All you need for Solr to work are the binaries. You can get them from the official download page (version 4.8.0).
2. Extract solr-4.8.0.zip to some location. For this tutorial, let's assume it is in /home/test/Research/solr-4.8.0/
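From the terminal, the extraction might look like this (assuming the zip was downloaded to /home/test/Research/):
cd /home/test/Research/
unzip solr-4.8.0.zip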
3. Open a terminal, navigate to /home/test/Research/solr-4.8.0/example/ and execute the following command to start the Solr server:
java -jar start.jar
4. This command will start up Solr; you can check its status by opening the following URL:
http://localhost:8983/solr
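If you prefer the command line, a quick sanity check against the example collection's ping handler (present in the stock 4.8.0 example config) should also work:
curl "http://localhost:8983/solr/collection1/admin/ping?wt=json"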
5. You can add documents to the index and test the search functionality using the examples provided. Execute the following command to add an XML file to the index.
Navigate to /home/test/Research/solr-4.8.0/example/exampledocs/ and execute:
java -jar post.jar solr.xml
6. The above will have added the XML into Solr's index, and you can now query its content. Search for "solr" from the Query tab of the Solr Admin web client, which you can reach after choosing the default "collection1" core.
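The same search can be run without the Admin UI; a request like the following against the default collection1 core should return the document that was just posted:
curl "http://localhost:8983/solr/collection1/select?q=solr&wt=json&indent=true"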
Configuring Nutch for Solr
Nutch 1.8 is directly compatible with Solr without any code modifications.
1. Get the latest Nutch distribution binaries from their download page
2. Extract the downloaded apache-nutch-1.8.zip file to some folder. Let's assume it is in /home/test/Research/apache-nutch-1.8
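Again from the terminal, assuming the zip was downloaded to /home/test/Research/:
cd /home/test/Research/
unzip apache-nutch-1.8.zip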
3. Add the necessary configuration for the proxy server, the agent name, and the included plugins to conf/nutch-site.xml if there are any changes from the defaults.
The complete nutch-site.xml should look something like this:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch Crawler</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic</value>
  </property>
  <property>
    <name>urlfilter.regex.file</name>
    <value>regex-urlfilter.txt</value>
  </property>
  <property>
    <name>http.proxy.host</name>
    <value>proxy-server</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>1234</value>
  </property>
  <property>
    <name>http.proxy.username</name>
    <value>username</value>
  </property>
  <property>
    <name>http.proxy.password</name>
    <value>password</value>
  </property>
</configuration>
The plugin.includes property above, specifically the indexer-solr entry in its value, is what tells Nutch that indexing is delegated to Solr, and it is available by default in nutch-default.xml.
nutch-default.xml contains all the default configuration required for Nutch to run. If we want to override any of the parameters, we can specify them in nutch-site.xml.
For example, the "plugin.includes" property specified above is no different from the one given in nutch-default.xml and can therefore be removed from nutch-site.xml without any effect on the current setup. It will still work the same way.
4. Copy the necessary field and fieldType definitions from Nutch's conf/schema.xml into Solr's schema.xml (for the example setup, under solr-4.8.0/example/solr/collection1/conf/). This change requires a restart of the Solr server.
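A quick way to see which field definitions Nutch expects, and therefore which ones to merge into Solr's schema.xml, is to list them straight from Nutch's own schema file:
grep "<field name=" /home/test/Research/apache-nutch-1.8/conf/schema.xml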
5. Create a folder "urls" under /home/test/Research/apache-nutch-1.8/ and put a seed.txt file in it with the list of seed URLs that we want crawled.
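For example (the URL below is only a placeholder; list your own seed URLs, one per line):
mkdir -p /home/test/Research/apache-nutch-1.8/urls
echo "http://nutch.apache.org/" > /home/test/Research/apache-nutch-1.8/urls/seed.txt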
6. Now that we have all the configuration out of the way, let's execute the following command to start crawling/indexing.
Navigate to the /home/test/Research/apache-nutch-1.8/ folder in the terminal and execute:
./bin/crawl urls crawl http://localhost:8983/solr 2
The format for using the crawl command is:
crawl <seed.txt folder> <crawl folder> <solr url> <rounds of iteration>
-> <seed.txt folder>: the urls folder created in step 5, containing seed.txt with the list of seed URLs to crawl
-> <crawl folder>: holds the crawled objects which are to be indexed; these files are sent to Solr for indexing
-> <solr url>: the Solr URL for connecting and sending the data for indexing
-> <rounds of iteration>: there is no clear documentation on this; however, from the source we can see that it iterates the crawling process the given number of times, as a fail-safe when certain pages don't respond the first time (correct me if I'm wrong)
Once the crawl process completes, the data will have been synced to the Solr server. Connect to the Solr admin console using the web client and check whether the documents have been populated.
Do an initial search query by choosing the default collection1 core and searching for something from one of the pages on the website that you crawled.
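A quick command-line check is to ask the collection1 core for its total document count; numFound in the response should be greater than zero once the crawl has been indexed:
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json&indent=true"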
More info & references:
Solr Tutorial - 4.8.0
Nutch Tutorial - 1.8
hi,
Thanks for the helpful information.
I'm having a problem with the "java -jar post.jar solr.xml" command.
$ java -jar post.jar solr.xml
SimplePostTool version 5.0.0
SimplePostTool: FATAL: Specifying either url or core/collection is mandatory.
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
$
Can you please help me with this?
Thanks.