Abstract
This release brings bug fixes and improvements in the quality of search results, but the main benefit of NutchWAX 0.10.0 is the move from hadoop 0.5.0 to hadoop 0.9.2. The upgraded hadoop platform makes indexing much more robust and noticeably faster.
Nutch has added pluggable URL normalization ('canonicalization' in heritrix-speak), along with the ability to filter by URL at each of the indexing steps. NutchWAX picks up this feature in this release (see Flexible URL normalization, NUTCH-365). Normally its operation will be of no concern, particularly as the default behavior is mild: it equates 'http://www.archive.org:80/' and 'http://www.archive.org/', for instance. It may become an issue if the engine you use to render found pages relies on an index other than the one made by NutchWAX, e.g., a wayback that made its own CDX or bdbje index. In that case the nutch normalization may at times disagree with the wayback normalization, and lookups into the alternate wayback index will fail.
The configuration that manages URL normalization is in nutch-default.xml. It's the property urlnormalizer.order, combined with mention of the normalizer plugins in plugin.includes.
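To make the failure mode concrete, here is a minimal Python sketch of the mild default-port normalization described above, and of what happens when a second index was keyed without it. This is not Nutch's or wayback's normalizer code: the function name strip_default_port and the dictionary other_index are hypothetical and exist only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports per scheme; dropping them is the mild normalization
# described above. This table is an assumption of the sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url):
    """Return url with a redundant default port removed and the host lowercased.

    Simplified on purpose: userinfo and IPv6 literals are not handled.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = "%s:%d" % (host, port)
    return urlunsplit((parts.scheme, host, parts.path or "/",
                       parts.query, parts.fragment))


# Hypothetical second index keyed WITHOUT this normalization; it stands in
# for an index built by another tool with different rules.
other_index = {"http://www.archive.org:80/": "record-1"}

key = strip_default_port("http://www.archive.org:80/")
print(key)                 # http://www.archive.org/
print(key in other_index)  # False: the two normalizations disagree
```

The lookup misses even though both strings name the same page, which is the kind of disagreement to watch for when the rendering engine keeps its own index.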
Complaints from the outlinks parser are now much tidier, occupying a single line rather than dumping a MalformedURLException stacktrace. Here are samples:
2006-12-15 11:19:16,749 WARN parse.OutlinkExtractor - Invalid url: 'NOTE:DON', skipping.
2006-12-15 11:19:16,750 WARN parse.OutlinkExtractor - Invalid url: 'DH0:LEVELS/', skipping.
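Because each complaint now fits on one line, the skipped outlinks are easy to pull out of a log. The Python sketch below assumes the message format is exactly as in the two samples above; the pattern is inferred from those samples only, and the helper name skipped_urls is hypothetical.

```python
import re

# Pattern inferred from the two sample warnings above; other log layouts
# (different appenders or formats) are not covered.
INVALID_URL = re.compile(
    r"WARN\s+parse\.OutlinkExtractor - Invalid url: '([^']*)', skipping\."
)


def skipped_urls(lines):
    """Yield every rejected outlink string found in an iterable of log lines."""
    for line in lines:
        # finditer copes with several warnings run together on one line.
        for match in INVALID_URL.finditer(line):
            yield match.group(1)


sample = [
    "2006-12-15 11:19:16,749 WARN parse.OutlinkExtractor - Invalid url: 'NOTE:DON', skipping.",
    "2006-12-15 11:19:16,750 WARN parse.OutlinkExtractor - Invalid url: 'DH0:LEVELS/', skipping.",
]
print(list(skipped_urls(sample)))  # ['NOTE:DON', 'DH0:LEVELS/']
```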
With this release, the default PDF parser has been switched to the nutch parse-pdf PDFBox-based parser. Previously, a NutchWAX plugin called parse-waxext was the default. The parse-waxext plugin ran an external dependency named xpdf, via the wrapper script parse-pdf.sh, to parse application/pdf document types. In primitive testing, nutch's parse-pdf comes close enough to the NutchWAX parse-waxext plugin in the number of PDFs successfully parsed (80% vs. 90% of the 158 PDFs contained in one ARC). We make the move in the name of minimizing the number of NutchWAX external dependencies and in the hope that parse-pdf will continue to improve with time. Should you run into problems with parse-pdf (in the past it would hang on PDFs from time to time), or should you require that ingest parse the maximum number of PDFs, switching back to parse-waxext is just a matter of configuration.
After ensuring xpdf is installed on all nodes, edit hadoop-site.xml. Add in the plugin.includes property from wax-default.xml and edit it so that it references parse-waxext rather than parse-pdf. You'll then need to copy wax-parse-plugins.xml to your hadoop conf directory and change its references to parse-pdf into references to parse-waxext.
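The plugin.includes edit can also be scripted. The following Python sketch rewrites the property in a hadoop-style configuration document. It assumes the plugin is named literally as parse-pdf in the value; if your value groups plugin names (for example inside a parenthesized alternation), edit it by hand instead. The SAMPLE value is a shortened, hypothetical one, not the real content of wax-default.xml, and the function name use_waxext_pdf_parser is likewise made up for this example.

```python
import xml.etree.ElementTree as ET


def use_waxext_pdf_parser(config_xml):
    """Return config_xml with parse-pdf swapped for parse-waxext in plugin.includes.

    Note: ElementTree drops XML comments and the XML declaration on output.
    """
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            value = prop.find("value")
            if value is not None and value.text:
                value.text = value.text.replace("parse-pdf", "parse-waxext")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, shortened plugin.includes value for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-text|parse-html|parse-pdf|index-basic</value>
  </property>
</configuration>"""

print(use_waxext_pdf_parser(SAMPLE))
```

The same text substitution applies to the copied wax-parse-plugins.xml, though that file has its own layout, so check the result by eye before deploying it to the cluster.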
It's now possible to configure the open source wayback to use NutchWAX indices for finding pages (and a page's embeds). One useful setup has the wayback and NutchWAX WARs deployed in the same container, with NutchWAX using the colocated wayback as the search result page renderer. See HOWTO: Configure Wayback to use NutchWAX index.
Table 1. Fixes and Additions
| ID | Type | Summary | Open Date | By | Filer |
|---|---|---|---|---|---|
| 1592768 | Add | Better job names and note job in jobtracker log | 2006-11-08 | stack-sf | stack-sf |
| 1632531 | Add | Use parse-pdf in place of xpdf | 2007-01-10 | stack-sf | stack-sf |
| 1288990 | Add | Configurable collection name in search.jsp | 2005-09-12 | stack-sf | stack-sf |
| 1503045 | Add | PDFs have URL for title | 2006-06-08 | stack-sf | stack-sf |
| 1407760 | Add | Can't do phrase search against 'title' | 2006-01-16 | stack-sf | stack-sf |
| 1506319 | Add | Port 80 messes up queries against urls | 2006-06-14 | nobody | stack-sf |
| 1616124 | Add | Move from nutch-0.8.1 to TRUNK and hadoop 0.9.2 | 2006-12-14 | stack-sf | stack-sf |
| 1567247 | Add | Remove harmless outlink parse fail messages | 2006-09-28 | stack-sf | stack-sf |
| 1567251 | Add | pdf parse of too long doc. failure msg cryptic | 2006-09-28 | nobody | stack-sf |
| 1631694 | Fix | CCE when doing initial update and specifying a segment | 2007-01-09 | stack-sf | nobody |
| 1636313 | Fix | If exact date passed, use it | 2007-01-15 | stack-sf | stack-sf |
| 1629593 | Fix | Add a NutchwaxLinkDbMerger | 2007-01-06 | stack-sf | stack-sf |
| 1591709 | Fix | spacer.gif shows high in search results | 2006-11-06 | nobody | stack-sf |
| 1619644 | Fix | standalone mode can't find parse-pdf.sh | 2006-12-20 | stack-sf | stack-sf |
| 1628157 | Fix | Query 'host' field is broken | 2007-01-04 | stack-sf | nobody |
| 1596432 | Fix | fix non-indexing of mimetype 'no-type' | 2006-11-14 | stack-sf | stack-sf |
| 1582980 | Fix | wax-parse-plugins.xml assigns javascript to parse-text | 2006-10-23 | stack-sf | nobody |
Abstract
NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to make an index of greater than 400 million documents.
Patches made to the Nutch 0.8.1 included in NutchWAX are listed in the NutchWAX README. NutchWAX 0.8.0 runs only on hadoop 0.5.0; it will fail to run on later versions.
You'll see lots of output like the below during the import step. It's harmless even though it's reported at the ERROR log level (subsequent to the 0.8.1 release of nutch, these messages are no longer reported at the ERROR log level). The HTML parser is reporting that something that looked like a link does not use a supported protocol. In the example below, margin-bottom is not a supported protocol:
06/12/04 15:36:35 ERROR parse.OutlinkExtractor: getOutlinks
java.net.MalformedURLException: unknown protocol: margin-bottom
    at java.net.URL.<init>(URL.java:574)
    at java.net.URL.<init>(URL.java:464)
    at java.net.URL.<init>(URL.java:413)
    at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
    at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
    at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
    at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
    at org.apache.nutch.parse.text.TextParser.getParse(TextParser.java:47)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.archive.access.nutch.ImportArcs.processRecord(ImportArcs.java:513)
    at org.archive.access.nutch.ImportArcs$IndexingThread.run(ImportArcs.java:324)
Table 2. Changes
| ID | Type | Summary | Open Date | By | Filer |
|---|---|---|---|---|---|
| 1247519 | Add | next/previous in search.jsp needs improvement | 2005-07-29 08:33 | stack-sf | stack-sf |
| 1611195 | Add | Set default html content limit to 101k. | 2006-12-07 15:44 | stack-sf | stack-sf |
| 1247141 | Add | Tools and doc. of incremental indexing | 2005-07-28 15:06 | stack-sf | stack-sf |
| 1566629 | Add | multiple collections in single index | 2006-09-27 13:52 | stack-sf | stack-sf |
| 1532024 | Add | nutchwax regression tests | 2006-07-31 14:32 | nobody | cathcart |
| 1611150 | Fix | 'arcname' (in 'explain') looks like unrelated URL | 2006-12-07 14:16 | stack-sf | gojomo |
| 1363012 | Fix | NegativeArraySizeException in search.jsp | 2005-11-21 10:49 | stack-sf | stack-sf |
| 1246906 | Fix | Fix nutchwax license | 2005-07-28 09:20 | stack-sf | stack-sf |
| 1477183 | Fix | Incremental indexing broken | 2006-04-26 12:42 | stack-sf | stack-sf |
| 1598060 | Fix | Distributed searcher mode doesn't work | 2006-11-16 14:49 | stack-sf | stack-sf |
| 1592633 | Fix | Missing closing tags for hadoop-site.xml ex | 2006-11-08 05:32 | stack-sf | nobody |
| 1581618 | Fix | No anchor text | 2006-10-20 16:53 | stack-sf | stack-sf |
| 1556559 | Fix | Sometimes webapp fails getting summary | 2006-09-11 11:14 | stack-sf | stack-sf |
| 1518430 | Fix | ARCName has filedesc prefix and arc suffix | 2006-07-06 16:03 | stack-sf | stack-sf |
| 1511418 | Fix | Collection name not passed to ImportArcs | 2006-06-23 08:34 | stack-sf | stack-sf |
Abstract
Move to mapreduce Nutch as base. Much has changed in the mapreduce version of NutchWAX: 0.6.0 bears little resemblance to previous releases, both in how it goes about its work and in how it is run by the user. Be prepared to leave aside all old NutchWAX assumptions.
Incremental indexing does not work in 0.6.0 [See [1477183] [nutchwax] Incremental indexing broken].
Abstract
Minor bug fixes.
Abstract
Last release before the move to mapreduce.
Minor fixes: Added Google-like results paging and built for a 1.4.x Java target.
Abstract
Bug fix.
Fix encoding issue in 0.4.0: [1348019] [nutchwax] Double encoding of disallowed xml chars
Abstract
Bug fixes.
NutchWAX has been built against Nutch 0.7.0. (There seem to be issues with the 0.7.1 build, and then some, so we have not built against the 0.7.1 release.)
General limitations of the current platform are listed in Section 7, Observations, on page 9 of Full Text Search of Web Archive Collections.
Table 4. Bugs/Features
| ID | Type | Summary | Open Date | By | Filer |
|---|---|---|---|---|---|
| 1608891 | Add | Fixes to make wayback use nutchwax index | 2006-12-04 17:35 | Maximilian Schoefmann | stack-sf |
| 1313214 | Add | Dedup'ing that considers collection field. | 2005-10-04 12:46 | stack-sf | stack-sf |
| 1309781 | Add | Add in skipping certain types if > size | 2005-09-30 14:01 | stack-sf | stack-sf |
| 1244843 | Add | Allow querying on mime primary and sub type | 2005-07-25 16:13 | stack-sf | stack-sf |
| 1280825 | Add | Make nutch merge segment work against nutchwax segments | 2005-09-02 10:00 | stack-sf | stack-sf |
| 1247571 | Fix | Items not getting indexed | 2005-07-29 09:55 | stack-sf | stack-sf |
| 1312212 | Fix | bad xml chars in search results | 2005-10-03 12:11 | stack-sf | stack-sf |
| 1244894 | Fix | Cannot query for non-ISO8859 characters | 2005-07-25 18:38 | stack-sf | stack-sf |
| 1312208 | Fix | Query time encoding issues | 2005-10-03 12:11 | stack-sf | stack-sf |
| 1312217 | Fix | Not indexing images | 2005-10-03 12:18 | stack-sf | stack-sf |
| 1244875 | Fix | exacturl encoding not working | 2005-07-25 17:21 | stack-sf | stack-sf |
| 1281697 | Fix | searching czech words not working | 2005-09-04 10:36 | stack-sf | kranach |