This project is a search engine for web archive collections. Used with the (non-distributable) Internet Archive Wayback Machine or with the freely available WERA or open source wayback applications, you have a complete access tool for small to medium web archive collections (up to 500 million documents, or about 150k ARC files).
See Full Text Search of Web Archive Collections for a fuller, if now dated, treatment of the problems this project addresses.
| [top] |
Yes. See wayback-NutchWAX for instructions on how.
| [top] |
See WERA.
| [top] |
See Building from source in the javadoc overview.
| [top] |
It deduplicates content by an MD5 hash of the content. The dedup step runs after the indexing step and adds a '.del' file -- a bit vector of documents to ignore -- to the index just made. Merging, the subsequent step, will skip over documents marked in the '.del' file.
Before the move to a mapreduce base -- i.e. NutchWAX 0.6.0 -- dedup would deduplicate by MD5 of content AND by URL. The URL deduplication is now done implicitly by the framework since the Nutch mapreduce jobs use the page's URL as the mapreduce key; this means only one version of a page will prevail, usually the latest crawled. NutchWAX has improved this situation somewhat by adding the collection name to the key used throughout mapreduce tasks. This makes it so you can have as many versions of a page as you have collections.
| [top] |
If you can't move to better quality hardware -- ECC memory, etc. -- then
skip checksum errors by setting the hadoop configuration
io.skip.checksum.errors. You will also need to
apply the patch that is in the NutchWAX README to your hadoop install.
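A minimal sketch of that setting in hadoop-site.xml (the property is standard Hadoop; the surrounding configuration file is assumed to already exist):
<property>
  <!-- Skip entries that fail checksum verification instead of throwing -->
  <name>io.skip.checksum.errors</name>
  <value>true</value>
</property>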
| [top] |
Run the following to see the usage:
% ${HADOOP_HOME}/bin/hadoop jar nutchwax-job-0.11.0-SNAPSHOT.jar class org.apache.nutch.segment.SegmentMerger
Here is an example that merges two segments into a new merged-segments directory:
% ${HADOOP_HOME}/bin/hadoop jar nutchwax-job-0.11.0-SNAPSHOT.jar class org.apache.nutch.segment.SegmentMerger ~/tmp/crawl/segments_merged/ ~/tmp/crawl/segments/20070406155807-test/ ~/tmp/crawl/segments/20070406155856-test/
| [top] |
Sorting an index will usually return better quality results in less time. Most of Nutch is built into the NutchWAX jar. To run the Nutch index sorter, do the following:
$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexSorter
When the index is sorted, you might as well set searcher.max.hits to, e.g., 1000; since a sorted index returns the top-ranked documents first, this limits the number of hits anyone sees to the best 1000.
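A sketch of the corresponding hadoop-site.xml entry (the value of 1000 is only an example):
<property>
  <!-- Example value; search stops after this many hits are found -->
  <name>searcher.max.hits</name>
  <value>1000</value>
</property>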
See the end of How do I merge segments in NutchWAX for how to run multiple concurrent sorts.
| [top] |
If creating multiple indices, you may want to make use of the NutchWAX facility that runs a mapreduce job to farm out index merges, hdfs-to-local copies, and index sorting across the cluster so they run concurrently rather than in series. For the usage, run the following:
stack@debord:~/workspace$ ${HADOOP_HOME}/bin/hadoop jar nutchwax.jar help multiple
Usage: multiple input output
Runs concurrently all commands listed in inputs.
Arguments:
input Directory of input files with each line describing task to run
output Output directory.
Example input lines:
An input line to specify a merge would look like:
org.apache.nutch.indexer.IndexMerger -workingdir /3/hadoop-tmp index-monday indexes-monday
Note that the named class must implement org.apache.hadoop.util.ToolBase
To copy from hdfs://HOST:PORT/user/stack/index-monday to
file:///0/searcher.dir/index:
org.apache.hadoop.fs.FsShell -get /user/stack/index-monday /0/searcher.dir/index
org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl
Note that IndexSorter refers to local filesystem and not to hdfs and is RAM-bound. Set
task child RAM with the mapred.child.java.opts property in your hadoop-site.xml.
It takes inputs and outputs directories. The latter is usually not used but is required
by the framework. The inputs directory contains files that list, one per line, a job to
run on a remote machine. Here is an example line from an input that would run an
index merge of the directory indexes-monday into index-monday
using /tmp as working directory:
org.apache.nutch.indexer.IndexMerger -workingdir /tmp index-monday indexes-monday
If the inputs had a line per day of the week, we'd run seven tasks, each merging a day's indices. If the cluster had 7 machines, the 7 tasks would run concurrently.
Here is how you would specify a copy task that copied hdfs:///user/stack/index-monday
to file:///0/searcher.dir/index:
org.apache.hadoop.fs.FsShell -get /user/stack/index-monday /0/searcher.dir/index
In a similar fashion its possible to run multiple concurrent index sorts. Here is an example line from the inputs:
org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl
Note that the IndexSorter references the local filesystem explicitly (your index cannot be in hdfs when you run the sort). Also, index sorting is RAM-bound, so you will probably need to up the RAM allocated to task children (set the mapred.child.java.opts property in your hadoop-site.xml).
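Putting it together, here is a sketch of kicking off the job once your job lines are written into files under an inputs directory (directory names are illustrative):
% ${HADOOP_HOME}/bin/hadoop jar nutchwax.jar multiple inputs outputs
And a sketch of the RAM setting mentioned above, in hadoop-site.xml (the value is only an example):
<property>
  <!-- Example heap size for task children; size to suit your sorts -->
  <name>mapred.child.java.opts</name>
  <value>-Xmx1000m</value>
</property>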
| [top] |
Here is a sketch of how to do it for now. Later we'll add better documentation as we have more experience running incrementals. Outstanding issues are how new versions of a page play with the old versions. If new ARCs are given a collection name on import that is the same as that of a collection already in the index, then likely the newer page will replace the older version. There will be some aggregation of page metadata but only one version of the page will persist in the index. Otherwise, if newer ARCs are given a new collection name, both versions will appear in the index, but anchor text is distributed only within the confines of the collection (pages are keyed by URL+collection name) so the two versions may score very differently. For example, if you are adding a small collection to a big collection, a profusion of inlinks to the page from the big collection may cause it to score much higher than the page from the small collection. There's work to do here still. Meantime, here is a recipe for incremental updates.
Choose a collection name with the above in mind. Run the import step to ingest the newly accumulated ARCs. This will produce a new segment. Note its name.
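The import usage can be printed the same way as the update usage shown below (assuming the driver's 'help' command covers import as it does update):
% ${HADOOP_HOME}/bin/hadoop jar nutchwax.jar help import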
Next update the crawldb -- the 'update' step -- with the content of the new segment. Here is the usage for the update step:
stack@debord:~/workspace$ ./hadoop-0.5.0/bin/hadoop jar nutchwax/build/nutchwax.jar help update
Usage: hadoop jar nutchwax.jar update output [segments...]
Arguments:
output Directory to write crawldb under.
Options:
segments List of segments to update crawldb with. If none supplied, updates
using latest segment found.
Pass the new segment (or segments) and the new ARC content will be added to
the pre-existing crawldb. Do the same for the linkdb, the 'invert' step (be
sure to read its usage so you pass the options in the correct order).
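Again, the usage can be printed the same way (assuming 'help invert' is supported like 'help update'):
% ${HADOOP_HOME}/bin/hadoop jar nutchwax.jar help invert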
Next up is indexing, but let's pause. The NutchWAX index step takes an output directory and a list of segments, outputting a new index at output/indexes. You probably already have an output/indexes in place with the content of your initial indexing. You could move it aside, but it's possible to access more indexing options by invoking the NutchwaxIndexer class directly rather than going via the Nutchwax driver class:
$ ./hadoop/bin/hadoop jar nutchwax/build/nutchwax.jar class org.archive.access.nutch.NutchwaxIndexer
Usage: index crawldb linkdb segment ...
Now you can pass where you want the index written (at the cost of having to explicitly stipulate locations for crawldb, linkdb, etc.).
Run the (optional) dedup and the merge step. Again you'll need access to more options so you can specify the particular index you want deduped or merged:
$ hadoop jar nutchwax/build/nutchwax.jar class org.apache.nutch.indexer.DeleteDuplicates
Usage: indexes...
$ hadoop jar nutchwax/build/nutchwax.jar class org.apache.nutch.indexer.IndexMerger
Usage: IndexMerger [-workingdir workingdir] outputIndex indexesDir...
The new merged index can now be added to the index already deployed.
You could do this by merging the two indices into one -- see the above-cited usage
for IndexMerger -- or you could have the
search application open both the old and new indices. Here is how you would
do the latter. Assuming the currently running index is also the result of a
merge, then its deployed directory name will be index as opposed to
indexes. To have the search application search against
both the old and the new index, make a directory indexes
under the search webapp and move into it the old index
directory. Also move in the new, merged index delta (it must be named
something other than the old index directory, but otherwise names can be anything). Finally,
you need to add an empty file named index.done to both indices
else they won't be used by the search application:
$ touch ./indexes/index/index.done
$ touch ./indexes/new-index-delta/index.done
Restart, and queries should now hit both indices (be sure you've not left the old 'index' in place -- that it's been moved, not copied, under the indexes directory).
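Here is a sketch of the whole shuffle under the search webapp (the webapp path and the new-index-delta name are illustrative):
$ cd ${TOMCAT_HOME}/webapps/your-search-webapp
$ mkdir indexes
$ mv index indexes/index
$ mv /path/to/new-merged-index indexes/new-index-delta
$ touch indexes/index/index.done indexes/new-index-delta/index.done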
| [top] |
The archive-mapred jar has classes that will help you aggregate the content of
the userlogs directory across the cluster. To stream the content of one remote userlog directory,
do the following:
% ${HADOOP_HOME}/bin/hadoop jar archive-mapred-0.2.0-SNAPSHOT.jar org.archive.mapred.ArchiveTaskLog http://192.168.1.107:50060/logs/userlogs/task_0019_m_000000_0/syslog/
The archive-mapred jar also has a primitive mapreduce job, based on the HADOOP-1199 contribution, for streaming all logging from a particular job. To run it, do the following:
% ${HADOOP_HOME}/bin/hadoop jar archive-mapred-0.2.0-SNAPSHOT.jar org.archive.mapred.TaskLogInputFormat /home/stack/tmp/outputs/ jobid
| [top] |
See useBodyEncodingForURI in the Tomcat Configuration
Reference. Edit $TOMCAT_HOME/conf/server.xml
and add useBodyEncodingForURI="true" to the Connector element. Here is what it looks like
when the edit has been added:
<!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
<Connector port="8080" maxHttpHeaderSize="8192"
    maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
    enableLookups="false" redirectPort="8443" acceptCount="100"
    connectionTimeout="20000" disableUploadTimeout="true"
    useBodyEncodingForURI="true"
/>
| [top] |
The set of query fields depends on configuration during indexing and configuration of the search engine at query time. Generally in NutchWAX, you can query against the following fields:
| Name | Query Time Weight | Source | Notes |
| host | 2.0 | Nutch | Unstored, indexed, and tokenized. E.g. host:gov |
| site | 1.0 | Nutch | Unstored, indexed, and un-tokenized. Site has zero weight, which means you must pass a term plus site: E.g. John Glenn site:atc.nasa.gov |
| url | 4.0 | Nutch | Stored, indexed, and tokenized. See exacturl also. |
| date | 1.0 | NutchWAX | Stored, not indexed, not tokenized. Date is the 14-digit ARC date: E.g. "date:20060110101010". Can query ranges by passing two dates delimited by a hyphen: E.g. "date:20060110101010-20060111101010". |
| collection | 0 | NutchWAX | Stored, indexed, not tokenized. A zero weight means you must pass a term as well as the collection name; collection alone is not sufficient. E.g. "collection:nara john glenn". |
| arcname | 1.0 | NutchWAX | Stored, indexed, not tokenized. |
| type | 0.1 | NutchWAX | Not stored, indexed, not tokenized. |
| exacturl | 1.0 | NutchWAX | Because 'url' is tokenized, use this query field to query for an exact url in the index. |
It's possible to search exclusively against title, content, and anchors, but it requires adding the query-time plugins to the NutchWAX configuration.
| [top] |
The fields available to search results vary with configuration -- check out the explain link to see all fields available in your current install -- but in NutchWAX you can generally expect the following fields to be present (unless the field was empty for the particular document): url, title, date, arcdate, arcname, arcoffset, collection, primarytype, and subtype.
| [top] |
The default is to show only 1 or 2 hits per site (Google shows a maximum of two). Append the hitsPerSite parameter to your query to change this. E.g. add 'hitsPerSite=3' to the end of your query in the location box to see a maximum of three hits from each site (set it to zero to see all hits from a site).
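For example (hostname and webapp path are illustrative, in the style of the query URLs below):
http://localhost:8080/archive-access-nutch/search.jsp?query=john+glenn&hitsPerSite=0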
| [top] |
http://localhost:8080/archive-access-nutch/search.jsp?query=traditional+irish+music+paddy&hitsPerPage=100&dedupField=date&hitsPerDup=100&sort=date
... and then in reverse:
http://localhost:8080/archive-access-nutch/search.jsp?query=traditional+irish+music+paddy&hitsPerPage=100&dedupField=date&hitsPerDup=100&sort=date&reverse=true
The hitsPerPage says how many hits to return per results page.
The dedupField says what field to dedup the hit results on. Default is 'site'.
The hitsPerDup says how many of dedupField to return as part of results
(Default is 2 so we only ever return 2 hits from any one site by default).
sort is the field you want to sort on.
reverse reverses the sort order.
| [top] |
Use the type query field. NutchWAX -- like Nutch -- adds the
mimetype, the primary type and subtype to a type field. This
means that you can query for the mimetypes 'text/html' by querying
type:text/html, or for primary type 'text' by
querying type:text, or for subtype 'html' by querying
type:html, etc.
| [top] |
See How is scoring done in Nutch? (Or, explain the "explain" page?) and How can I influence Nutch scoring? over on the Nutch FAQ page.
| [top] |
See What is the RSS symbol in search results all about? in the Nutch FAQ.
| [top] |
Does your NPE have a Root Cause that looks like the below?
java.lang.NullPointerException
at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
at org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
at org.apache.jsp.search_jsp._jspService(search_jsp.java:79)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
Set the searcher.dir in hadoop-site.xml to point to your index and segments.
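A sketch of that setting (the value is an example path; point it at the directory holding your index/indexes and segments):
<property>
  <!-- Example path; use your own deploy location -->
  <name>searcher.dir</name>
  <value>/0/searcher.dir</value>
</property>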
| [top] |
See the Distributed Searching section of the
NutchHadoopTutorial for
a description of how it generally works; note that running the search servers is done differently
in NutchWAX. See ${NUTCHWAX_HOME}/bin/start-slave-searcher.sh for a
sample startup script (Adjacent are shutdown scripts and a script to start up
a cluster of searchers). Amend these sample scripts to suit your environment.
| [top] |
See the Wikipedia MapReduce page.
| [top] |
See the hadoop package documentation. It has notes on getting started, standalone and distributed operation, etc.
| [top] |
It's already set up for you in the default config. Here is what the 'parse-default' plugin does. If a resource has a content type for which there is no parser -- e.g. if there is no image or audio parser mentioned in the nutch-site.xml plugin.includes -- all such resources are passed to the html parser in native Nutch (for non-html types it will return a failed parse). The way the Nutch ParserFactory figures out which parser to use as the default is by looking at the plugin.xml of each parser; the first it finds with an empty pathSuffix is the one it uses as the default. To change this behavior, we've filled in nutch/src/plugin/parse-html/plugin.xml#pathSuffix with 'html' in the html parse plugin that is part of NutchWAX, and we've added our own default parser, parse-default, to the plugin.includes in nutch-site.xml, with an empty pathSuffix in its plugin.xml.
| [top] |
By design. Query fields with a boost of zero get converted to filters; if everything in a query were zero-boost, we would have to boost an arbitrary field.
| [top] |
NUTCH_HEAPSIZE and NUTCH_OPTS influence nutch script operations (memory allocated, etc.). JAVA_OPTS defaults to '-Xmx400m -server' when running segmenting.
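For example, a sketch of setting these in the shell before invoking the nutch script (values are illustrative; NUTCH_HEAPSIZE is in megabytes):
% export NUTCH_HEAPSIZE=2000
% export NUTCH_OPTS=-server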
| [top] |
Igor Ranitovic wrote:
> If I want to merge indexes from 20 different machines what happens to links
> db?
The normal order I do things is:
1. create segments, on multiple machines in parallel
2. update db from segs, on a single machine that can access all segs
3. update segs from db, on a single machine that can access all segs
4. index segments, on multiple machines in parallel
5. dedup segment indexes, on a single machine that can access all segs
6. merge indexes, on a single machine that can access all segs
In the next few months, as I port stuff to use MapReduce, we'll get rid of the single-machine bottlenecks of steps 2, 3, 5 and 6. MapReduce should also make it easy and reliable to script steps 1-6 on a bunch of machines without manual intervention.
> These are the steps that I have done so far: create segments, link db, and
> index on each of individual machines. Now, I want to run deduping and
> merging on aggregated segments/indexes from all 20 machines but I am afraid
> that this approach will drop link db info? At this point, is it too late to
> 're-run' updatedb and updatesegs on aggregated segments since the segments
> have been already updated with link information?
You can always create a new db and update it from a set of segments. Or, if you have a db that's been updated with a subset of the segments then you can update it with the rest. Then you'll want to re-update all of the segments, so they know about all of the new links in the db. You can re-update segments repeatedly too, but each time it adds link information you need to reindex the segment before that link information is used.
If we need to do this a lot then there's a way to structure Lucene indexes so that we can, e.g., re-build the index of incoming anchors without re-indexing all of the content, titles, urls, etc. http://www.mail-archive.com/java-dev@lucene.apache.org/msg00414.html
Doug
| [top] |
1. Segment the new ARCs.
2. Update the new segments into the old db.
3. Update from the old db against the new segments only; otherwise you will have to reindex the old segments.
4. Index the new segments.
5. Dedup everything, because new links may be in old segments.
6. Merge all segments.
| [top] |
See NutchFileFormats
| [top] |