Frequently Asked Questions

  1. What's this all about?
  2. Can the open source wayback be used to render NutchWAX search results?
  3. Where do I go to learn more about WERA and how it works with NutchWAX?
  4. How do I build from source?

Indexing

  1. What does the dedup step do (or why do I only see one version of a page when I know there are more than one in the repository)?
  2. Checksum errors consistently fail my job
  3. How do I sort an index with NutchWAX?
  4. Is it possible to do incremental updates?

Querying

  1. When I try and open the opensearch servlet under tomcat, I get a complaint about missing TransformerFactoryImpl.
  2. Why is encoding of non-ascii characters all messed up?
  3. What fields can I query against?
  4. What fields can I expect to see in results?
  5. I don't seem to be seeing all hits from a site? Why? (Or, what does "Hits 1-3 (out of about 6 total matching pages)" mean?)
  6. How to sort results by date?
  7. How to query for mimetypes?
  8. Tell me more about how scoring is done in nutch/NutchWAX (Or 'explain' the explain page)?
  9. What is this RSS symbol in search results all about?
  10. Why do I get an NPE when I go to access a NutchWAX page in tomcat?
  11. How do I run the distributed searcher?

MapReduce

  1. Where can I learn about mapreduce?
  2. Where can I learn more about setup and operation of hadoop, the mapreduce and distributed filesystem project nutchwax runs atop?

Old NutchWAX (pre-release 0.6.0, pre-move to mapreduce)

  1. How do I set the default parser, the one that is called when no explicit parser is available?
  2. If boost is zero, nothing shows in the search results?
  3. What are the important environment variables?
  4. Which steps can be distributed?
  5. How to approach incremental indexing?
  6. What are these data and index files in nutch segments under data?
What's this all about?

This project is a search engine for web archive collections. Used with the (non-distributable) Internet Archive Wayback Machine or with the freely available WERA or open source wayback applications, it gives you a complete access tool for small to medium web archive collections (up to 500 million documents, or about 150k ARC files).

See Full Text Search of Web Archive Collections for a fuller, if now dated, treatment of the problems this project addresses.

Can the open source wayback be used to render NutchWAX search results?

Yes. See wayback-NutchWAX for instructions on how.

Where do I go to learn more about WERA and how it works with NutchWAX?

See WERA.

How do I build from source?

See Building from source in the javadoc overview.

Indexing

What does the dedup step do (or why do I only see one version of a page when I know there are more than one in the repository)?

It deduplicates content by an MD5 hash of the content. The dedup step runs after the indexing step and adds a '.del' file -- a bit vector of documents to ignore -- into the index just made. Merging, the subsequent step, will skip over documents mentioned in the '.del' file.

Before the move to a mapreduce base -- i.e. NutchWAX 0.6.0 -- dedup would deduplicate by MD5 of content AND by URL. The URL deduplication is now done implicitly by the framework since the Nutch mapreduce jobs use the page's URL as the mapreduce key; this means only one version of a page will prevail, usually the latest crawled. NutchWAX has improved this situation some by adding the collection name to the key used throughout the mapreduce tasks. This makes it so you can have as many versions of a page as you have collections.
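
For reference, here is a sketch of running the dedup step standalone against an indexes directory (the path is hypothetical; the same class and its usage are shown again under the incremental updates question below):

$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.DeleteDuplicates outputs/indexes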

Checksum errors consistently fail my job

If you can't move to better quality hardware -- ECC memory, etc. -- then skip checksum errors by setting the hadoop configuration property io.skip.checksum.errors. You will also need to apply the patch that is in the NutchWAX README to your hadoop install.
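
For example, a minimal sketch of the property as it might be added to your hadoop site configuration (hadoop-site.xml in hadoop versions of this vintage):

    <!-- Skip checksum failures while reading rather than failing the job. -->
    <property>
      <name>io.skip.checksum.errors</name>
      <value>true</value>
    </property>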

How do I sort an index with NutchWAX?

Sorting an index will usually return better quality results in less time. Most of Nutch is built into the NutchWAX jar. To run the nutch indexer sorter, do the following:

$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexerSorter

Is it possible to do incremental updates?

Here is a sketch of how to do it for now. Later we'll add better documentation as we have more experience running incrementals. The outstanding issue is how new versions of a page play with the old versions. If new ARCs are given a collection name on import that is the same as that of a collection already extant in the index, then likely the newer page will replace the older version. There will be some aggregation of page metadata, but only one version of the page will persist in the index. Otherwise, if newer ARCs are given a new collection name, both versions will appear in the index, but anchor text is distributed only within the confines of the collection (pages are keyed by URL+collection name), so the two versions may score very differently. For example, if you are adding a small collection to a big collection, a profusion of inlinks to the page from the big collection may cause it to score much higher than the page from the small collection. There's work to do here still. Meantime, here is a recipe for incremental updates.

Choose a collection name with the above in mind. Run the import step to ingest the newly accumulated ARCs. This will produce a new segment. Note its name.
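
A hypothetical import invocation follows (the argument names and order here are illustrative only; run 'hadoop jar nutchwax.jar help import' to see the exact usage for your version):

$ hadoop jar nutchwax/build/nutchwax.jar import arcs-to-add.txt outputs mycollection
$ ls outputs/segments/     # the newest directory here is the new segment; note its name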

Next update the crawldb -- the 'update' step -- with the content of the new segment. Here is the usage for the update step:

stack@debord:~/workspace$ ./hadoop-0.5.0/bin/hadoop jar nutchwax/build/nutchwax.jar help update
Usage: hadoop jar nutchwax.jar update <output> [<segments>...]
Arguments:
 output    Directory to write crawldb under.
 Options:
  segments  List of segments to update crawldb with. If none supplied, updates
              using latest segment found.

Pass the new segment (or segments) and the new ARC content will be added to the pre-existing crawldb. Do the same for the linkdb, the 'invert' step (be sure to read its usage so you pass the options in the correct order).
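
For example (the segment name and paths are hypothetical, and the invert arguments below are assumed to mirror update's; check 'help invert' for the real order):

$ hadoop jar nutchwax/build/nutchwax.jar update outputs outputs/segments/20070208123456
$ hadoop jar nutchwax/build/nutchwax.jar invert outputs outputs/segments/20070208123456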

Next up is indexing, but let's pause for a moment. The NutchWAX index step takes an output directory and a list of segments, outputting a new index at output/indexes. You probably already have an output/indexes in place with the content of your initial indexing. You could move it aside, but it is also possible to access more indexing options by invoking the NutchwaxIndexer class directly rather than going via the Nutchwax driver class:

$ ./hadoop/bin/hadoop jar nutchwax/build/nutchwax.jar class org.archive.access.nutch.NutchwaxIndexer
Usage: <index> <crawldb> <linkdb> <segment> ...
Now you can pass where you want the index written (at the cost of having to explicitly stipulate locations for crawldb, linkdb, etc.).
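
A sketch under those assumptions, writing the new delta index somewhere other than the existing output/indexes (all paths and the segment name are hypothetical):

$ hadoop jar nutchwax/build/nutchwax.jar class org.archive.access.nutch.NutchwaxIndexer \
    outputs/index-delta outputs/crawldb outputs/linkdb outputs/segments/20070208123456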

Run the (optional) dedup and the merge step. Again you'll need access to more options so you can specify the particular index you want deduped or merged:

$ hadoop jar nutchwax/build/nutchwax.jar class org.apache.nutch.indexer.DeleteDuplicates
Usage: <indexes>...
$ hadoop jar nutchwax/build/nutchwax.jar class org.apache.nutch.indexer.IndexMerger
Usage: IndexMerger [-workingdir <workingdir>] outputIndex indexesDir...
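
Continuing the sketch with hypothetical paths -- dedup the new delta index, then merge it into a single index directory:

$ hadoop jar nutchwax/build/nutchwax.jar class org.apache.nutch.indexer.DeleteDuplicates outputs/index-delta
$ hadoop jar nutchwax/build/nutchwax.jar class org.apache.nutch.indexer.IndexMerger outputs/index-delta-merged outputs/index-delta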

The new merged index can now be added to the index already deployed. You could do this by merging the two indices into one -- see the above cited usage for IndexMerger -- or you could have the search application open both the old and new indices. Here is how you would do the latter. Assuming the currently running index is also the result of a merge, its deployed directory name will be index as opposed to indexes. To have the search application search against both the old and the new index, make a directory indexes under the search webapp and move the old index directory into it. Also move in the new, merged index delta (it must be named something other than the old index directory, but otherwise the name can be anything). Finally, you need to add an empty file named index.done to both indices or they won't be used by the search application:

$ touch ./indexes/index/index.done
$ touch ./indexes/new-index-delta/index.done

Restart and queries should now hit both indices (be sure you've not left the old 'index' behind -- it should be moved, not copied, under the indexes directory).
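
Put together, the deploy steps might look like the following (the webapp path and index names are hypothetical):

$ cd $TOMCAT_HOME/webapps/archive-access-nutch   # wherever your search webapp is deployed
$ mkdir indexes
$ mv index indexes/                              # move (do not copy) the old merged index
$ mv /path/to/new-index-delta indexes/           # add the newly merged delta alongside it
$ touch indexes/index/index.done indexes/new-index-delta/index.done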

Querying

When I try and open the opensearch servlet under tomcat, I get a complaint about missing TransformerFactoryImpl.

Restart Tomcat with a 1.4.x JDK. See this link for more on the issue: http://forum.java.sun.com/thread.jspa?tstart=30&forumID=34&threadID=542044&trange=15 (Note that it speaks of xml-apis.jar; I had success removing xmlParserAPIs.jar).
Why is encoding of non-ascii characters all messed up?

See useBodyEncodingForURI in the Tomcat Configuration Reference. Edit $TOMCAT_HOME/conf/server.xml and add useBodyEncodingForURI="true" to the Connector element. Here is what it looks like after the edit:

<!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
    <Connector port="8080" maxHttpHeaderSize="8192"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               connectionTimeout="20000" disableUploadTimeout="true"
               useBodyEncodingForURI="true"
               />

What fields can I query against?

The set of query fields depends on configuration during indexing and configuration of the search engine at query time. Generally in NutchWAX, you can query against the following fields:

Name        Query Time Weight  Source    Notes
host        2.0                Nutch     Unstored, indexed and tokenized. E.g. host:gov
site        1.0                Nutch     Unstored, indexed and un-tokenized. Site has zero weight, which means you must pass a term plus site: E.g. John Glenn site:atc.nasa.gov
url         4.0                Nutch     Stored, indexed and tokenized. See exacturl also.
date        1.0                NutchWAX  Stored, not indexed, not tokenized. Date is the 14-digit ARC date: E.g. "date:20060110101010". Ranges can be queried by passing two dates delimited by a hyphen: E.g. "date:20060110101010-20060111101010".
collection  0                  NutchWAX  Stored, indexed, not tokenized. A zero weight means you must pass a term as well as the collection name; collection alone is not sufficient. E.g. "collection:nara john glenn".
arcname     1.0                NutchWAX  Stored, indexed, not tokenized.
type        0.1                NutchWAX  Not stored, indexed, not tokenized.
exacturl    1.0                NutchWAX  Because 'url' is tokenized, use this field to query for an exact url in the index.

It's possible to search exclusively against title, content, and anchors, but it requires adding the query-time plugins to the NutchWAX configuration.

What fields can I expect to see in results?

The fields available to search results vary with configuration -- check out the explain link to see all available in your current install -- but in NutchWAX generally you can expect the following fields to be present (unless the field was empty for the particular document): url, title, date, arcdate, arcname, arcoffset, collection, primarytype, and subtype.

I don't seem to be seeing all hits from a site? Why? (Or, what does "Hits 1-3 (out of about 6 total matching pages)" mean?)

The default is to show only one or two hits per site (Google shows a maximum of two). Append the hitsPerSite parameter to your query to change this. E.g. add '&hitsPerSite=3' to the end of your query in the location box to see a maximum of three hits from each site (set it to zero to see all hits from a site).
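
For example (the host, port and webapp name are whatever your install uses):

http://localhost:8080/archive-access-nutch/search.jsp?query=john+glenn&hitsPerSite=0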

How to sort results by date?

http://localhost:8080/archive-access-nutch/search.jsp?query=traditional+irish+music+paddy&hitsPerPage=100&dedupField=date&hitsPerDup=100&sort=date

...and then in reverse:

http://localhost:8080/archive-access-nutch/search.jsp?query=traditional+irish+music+paddy&hitsPerPage=100&dedupField=date&hitsPerDup=100&sort=date&reverse=true

hitsPerPage says how many hits to return per results page. dedupField says what field to dedup the hit results on (the default is 'site'). hitsPerDup says how many hits per dedupField value to return as part of the results (the default is 2, so by default we only ever return two hits from any one site). sort is the field you want to sort on. reverse is self-explanatory.

How to query for mimetypes?

Use the type query field. NutchWAX -- like Nutch -- adds the mimetype, the primary type, and the subtype to the type field. This means you can query for the mimetype 'text/html' by querying type:text/html, for the primary type 'text' by querying type:text, or for the subtype 'html' by querying type:html, etc.

Tell me more about how scoring is done in nutch/NutchWAX (Or 'explain' the explain page)?

See How is scoring done in Nutch? (Or, explain the "explain" page?) and How can I influence Nutch scoring? over on the Nutch FAQ page.

What is this RSS symbol in search results all about?

See What is the RSS symbol in search results all about? in the Nutch FAQ.

Why do I get an NPE when I go to access a NutchWAX page in tomcat?

Does your NPE have a Root Cause that looks like the below?

java.lang.NullPointerException
    at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
    at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
    at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
    at org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
    at org.apache.jsp.search_jsp._jspService(search_jsp.java:79)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)

If so, set the searcher.dir property in hadoop-site.xml to point at your index and segments.
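
A sketch of what that property might look like (the path is hypothetical; point it at the directory holding your index or indexes and segments):

    <property>
      <name>searcher.dir</name>
      <value>/path/to/outputs</value>
    </property>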

How do I run the distributed searcher?

See the Distributed Searching section of the NutchHadoopTutorial for a description of how it generally works; note that running the search servers is done differently in NutchWAX. See ${NUTCHWAX_HOME}/bin/start-slave-searcher.sh for a sample startup script (adjacent are shutdown scripts and a script to start up a cluster of searchers). Amend these sample scripts to suit your environment.

MapReduce

Where can I learn about mapreduce?

See The Wikipedia MapReduce page.

Where can I learn more about setup and operation of hadoop, the mapreduce and distributed filesystem project nutchwax runs atop?

See the hadoop package documentation. Has notes on getting started, standalone and distributed operation, etc.

Old NutchWAX (pre-release 0.6.0, pre-move to mapreduce)

How do I set the default parser, the one that is called when no explicit parser is available?

It's already set up for you in the default config. Here is what the 'parse-default' plugin does. If a resource has a content type for which there is no parser -- e.g. if there is no image or audio parser mentioned in the nutch-site.xml plugin.includes -- native Nutch passes all such resources to the html parser (for non-html types it will return a failed parse). The way the Nutch ParserFactory figures out which parser to use as the default is by looking at the plugin.xml of each parser; the first one it finds with an empty pathSuffix is the one it uses as the default. To change this behavior, we've filled in nutch/src/plugin/parse-html/plugin.xml#pathSuffix with 'html' in the html parse plugin that is part of NutchWAX, and have added our own default parser, parse-default, to plugin.includes in nutch-site.xml, with an empty pathSuffix in its plugin.xml.
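
As a rough illustration only (the value below is abridged and hypothetical -- check conf/nutch-site.xml in your install for the real plugin.includes), the point is simply that parse-default appears in the list:

    <property>
      <name>plugin.includes</name>
      <value>protocol-.*|parse-(text|html|default)|index-basic|query-(basic|site|url)</value>
    </property>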

If boost is zero, nothing shows in the search results?

By design. Query plugins with a boost of zero get converted into filters. It could be made to work when every clause in a query has zero boost, for example by arranging to boost an arbitrary field.

What are the important environment variables?

NUTCH_HEAPSIZE and NUTCH_OPTS influence nutch script operations (memory allocated, etc.). JAVA_OPTS defaults to '-Xmx400m -server' when running segmenting.
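
For example (the values here are illustrative only):

$ export NUTCH_HEAPSIZE=2000   # heap in MB for JVMs launched by the nutch script
$ export NUTCH_OPTS="-server"
$ export JAVA_OPTS="-Xmx400m -server"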

Which steps can be distributed?

Igor Ranitovic wrote:

> If I want to merge indexes from 20 different machines what happens to links
> db?

The normal order I do things is:

1. create segments, on multiple machines in parallel
2. update db from segs, on a single machine that can access all segs
3. update segs from db, on a single machine that can access all segs
4. index segments, on multiple machines in parallel
5. dedup segment indexes, on a single machine that can access all segs
6. merge indexes, on a single machine that can access all segs

In the next few months, as I port stuff to use MapReduce, we'll get rid of the
single-machine bottlenecks of steps 2, 3, 5 and 6.  MapReduce should also make
it easy and reliable to script steps 1-6 on a bunch of machines without manual
intervention.

> These are the steps that I have done so far: create segments, link db, and
> index on each of individual machines.   Now, I want to run deduping and
> merging on aggregated segments/indexes from all 20 machines but I am afraid
> that this approach will drop link db info?  At this point, is it too late to
> 're-run' updatedb and updatesegs on aggregated segments since the segments
> have been already updated with link information?

You can always create a new db and update it from a set of segments. Or, if 
you have a db that's been updated with a subset of the segments then you can
update it with the rest.  Then you'll want to re-update all of the segments,
so they know about all of the new links in the db.  You can re-update segments
repeatedly too, but each time it adds link information you need to reindex the
segment before that link information is used.

If we need to do this a lot then there's a way to structure Lucene indexes so
that we can, e.g., re-build the index of incoming anchors without re-indexing
all of the the content, titles, urls, etc.

http://www.mail-archive.com/java-dev@lucene.apache.org/msg00414.html

Doug 

How to approach incremental indexing?

1. Segment the new ARCs.
2. Update the new segments into the old db.
3. Update from the old db against the new segments only; otherwise you will have to
   reindex the old segments.
4. Index the new segments.
5. Dedup everything, because new links may be in the old segments.
6. Merge all segments.

What are these data and index files in nutch segments under data?

See NutchFileFormats