NutchWAX Release Notes

Internet Archive


Table of Contents

1. Release 0.10.0
1.1. Contributors
1.2. Changes
1.3. Fixes and Additions
2. Release 0.8.0
2.1. Known Limitations/Issues
2.2. Changes
3. Release 0.6.0
3.1. Known Limitations/Issues
4. Release 0.4.3 - 03/20/2006
4.1. Changes
5. Release 0.4.2 - 11/28/05
6. Release 0.4.1 - 11/04/05
7. Release 0.4.0 - 10/10/05
7.1. Known Limitations/Issues
7.2. Changes

1. Release 0.10.0

Abstract

This release includes bug fixes and improvements in the quality of search results, but the main benefit of NutchWAX 0.10.0 is the move from hadoop 0.5.0 to hadoop 0.9.2. The upgraded hadoop platform makes indexing much more robust and noticeably faster.

1.1. Contributors

  • Maximilian Schöfmann

1.2. Changes

1.2.1. URL Normalization and Filtering

Nutch has added pluggable URL normalization ('canonicalization' in heritrix-speak), and URLs can now also be filtered at each of the indexing steps. NutchWAX picks up this feature in this release (see Flexible URL normalization, NUTCH-365). Normally its operation will be of no concern, particularly as the default behavior is mild: it equates 'http://www.archive.org:80/' and 'http://www.archive.org/', for instance. It may become an issue if the engine you use to render found pages uses an index other than the one made by NutchWAX, e.g. a wayback that made its own CDX or bdbje index. In that case there may be times when the nutch normalization disagrees with the wayback normalization, and lookups into the alternate wayback index will fail. URL normalization is configured in nutch-default.xml by the urlnormalizer.order property, together with the normalizer plugins mentioned in plugin.includes.
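To make the default-port case concrete, here is a minimal sketch of that one normalization. It is not the Nutch normalizer plugin code; the class name DefaultPortNormalizer and its normalize method are hypothetical, and only the standard java.net.URL class is used. User info in the URL, if any, is not preserved by this sketch.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration only; not the Nutch urlnormalizer implementation.
public class DefaultPortNormalizer {

    /** Drops an explicit port when it equals the protocol's default port. */
    public static String normalize(String spec) {
        try {
            URL url = new URL(spec);
            int port = url.getPort(); // -1 when no port is given in the URL
            if (port != -1 && port == url.getDefaultPort()) {
                // Rebuild without the port; getFile() is the path plus any query.
                url = new URL(url.getProtocol(), url.getHost(), url.getFile());
            }
            return url.toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a parseable URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // Both lines print http://www.archive.org/
        System.out.println(normalize("http://www.archive.org:80/"));
        System.out.println(normalize("http://www.archive.org/"));
    }
}
```

If the renderer's index was keyed on the un-normalized form (with ':80'), a lookup using the normalized form would miss, which is the kind of disagreement described above.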

Complaints from the outlinks parser are now much tidier, occupying a single line rather than dumping a MalformedURLException stacktrace. Here are samples:

2006-12-15 11:19:16,749 WARN  parse.OutlinkExtractor - Invalid url: 'NOTE:DON', skipping.
2006-12-15 11:19:16,750 WARN  parse.OutlinkExtractor - Invalid url: 'DH0:LEVELS/', skipping.

1.2.3. parse-pdf

With this release, the default PDF parser has been switched to the nutch parse-pdf plugin, which is based on PDFBox. Previously, a NutchWAX plugin called parse-waxext was the default. The parse-waxext plugin parsed application/pdf documents by running an external dependency named xpdf via the wrapper script parse-pdf.sh. In primitive testing, nutch's parse-pdf comes close enough to parse-waxext in the number of PDFs successfully parsed (80% vs. 90% of all PDFs in an ARC that contained 158). We make the move to minimize the number of NutchWAX external dependencies and in the hope that parse-pdf will continue to improve with time. Should you run into problems with parse-pdf (it used to hang on PDFs from time to time), or should you require that ingest parse the maximum number of PDFs, switching back to parse-waxext is just a matter of configuration. After ensuring xpdf is installed on all nodes, edit hadoop-site.xml: add in the plugin.includes property from wax-default.xml and edit it so that it references parse-waxext rather than parse-pdf. Then copy wax-parse-plugins.xml to your hadoop conf directory and change its references to parse-pdf to parse-waxext.

1.2.4. NutchWAX and wayback integration

It's now possible to configure the open source wayback to use NutchWAX indices when finding pages (and a page's embeds). One useful setup has the wayback and NutchWAX WARs deployed in the same container, with NutchWAX using the colocated wayback as the search result page renderer. See HOWTO: Configure Wayback to use NutchWAX index.

1.3. Fixes and Additions

Table 1. Fixes and Additions

ID | Type | Summary | Open Date | By | Filer
1592768 | Add | Better job names and note job in jobtracker log | 2006-11-08 | stack-sf | stack-sf
1632531 | Add | Use parse-pdf in place of xpdf | 2007-01-10 | stack-sf | stack-sf
1288990 | Add | Configurable collection name in search.jsp | 2005-09-12 | stack-sf | stack-sf
1503045 | Add | PDFs have URL for title | 2006-06-08 | stack-sf | stack-sf
1407760 | Add | Can't do phrase search against 'title' | 2006-01-16 | stack-sf | stack-sf
1506319 | Add | Port 80 messes up queries against urls | 2006-06-14 | nobody | stack-sf
1616124 | Add | Move from nutch-0.8.1 to TRUNK and hadoop 0.9.2 | 2006-12-14 | stack-sf | stack-sf
1567247 | Add | Remove harmless outlink parse fail messages | 2006-09-28 | stack-sf | stack-sf
1567251 | Add | pdf parse of too long doc. failure msg cryptic | 2006-09-28 | nobody | stack-sf
1631694 | Fix | CCE when doing initial update and specifying a segment | 2007-01-09 | stack-sf | nobody
1636313 | Fix | If exact date passed, use it | 2007-01-15 | stack-sf | stack-sf
1629593 | Fix | Add a NutchwaxLinkDbMerger | 2007-01-06 | stack-sf | stack-sf
1591709 | Fix | spacer.gif shows high in search results | 2006-11-06 | nobody | stack-sf
1619644 | Fix | standalone mode can't find parse-pdf.sh | 2006-12-20 | stack-sf | stack-sf
1628157 | Fix | Query 'host' field is broken | 2007-01-04 | stack-sf | nobody
1596432 | Fix | fix non-indexing of mimetype 'no-type' | 2006-11-14 | stack-sf | stack-sf
1582980 | Fix | wax-parse-plugins.xml assigns javascript to parse-text | 2006-10-23 | stack-sf | nobody

2. Release 0.8.0

Abstract

NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to make an index of greater than 400 million documents.

2.1. Known Limitations/Issues

2.1.1. Nutch and Hadoop versions

Patches made to the Nutch 0.8.1 included in NutchWAX are listed in the NutchWAX README. NutchWAX 0.8.0 will only run on hadoop 0.5.0. It will fail to run on later versions.

2.1.2. HTML Parser ERROR: parse.OutlinkExtractor

You'll see lots of output like the below during the import step. It's harmless even though it's reported at the ERROR log level (subsequent to the 0.8.1 release of nutch, these messages are no longer reported at the ERROR log level). The HTML parser is reporting that something that looked like a link is not of a supported protocol. In the example below, margin-bottom is not a supported protocol:

06/12/04 15:36:35 ERROR parse.OutlinkExtractor: getOutlinks
java.net.MalformedURLException: unknown protocol: margin-bottom
	at java.net.URL.<init>(URL.java:574)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
	at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
	at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
	at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
	at org.apache.nutch.parse.text.TextParser.getParse(TextParser.java:47)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
	at org.archive.access.nutch.ImportArcs.processRecord(ImportArcs.java:513)
	at org.archive.access.nutch.ImportArcs$IndexingThread.run(ImportArcs.java:324)
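The root cause at the top of the trace above can be reproduced with the standard java.net.URL class alone. The sketch below is only an illustration, not NutchWAX or Nutch code; the class name OutlinkCheck, the method rejectionReason, and the sample string 'margin-bottom:10px' are assumptions chosen to resemble a CSS declaration that was mistaken for a link.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration only; not the Nutch OutlinkExtractor.
public class OutlinkCheck {

    /**
     * Returns null when the candidate parses as a URL with a protocol the
     * JVM knows, otherwise the message of the MalformedURLException.
     */
    public static String rejectionReason(String candidate) {
        try {
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Text before the first ':' is read as the protocol name.
        System.out.println(rejectionReason("margin-bottom:10px"));
        // A real link is accepted, so this prints null.
        System.out.println(rejectionReason("http://www.archive.org/"));
    }
}
```

The first call reports 'unknown protocol: margin-bottom', matching the exception message in the trace; the candidate is simply not a link, which is why the error is harmless.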

2.2. Changes

Table 2. Changes

ID | Type | Summary | Open Date | By | Filer
1247519 | Add | next/previous in search.jsp needs improvement | 2005-07-29 08:33 | stack-sf | stack-sf
1611195 | Add | Set default html content limit to 101k. | 2006-12-07 15:44 | stack-sf | stack-sf
1247141 | Add | Tools and doc. of incremental indexing | 2005-07-28 15:06 | stack-sf | stack-sf
1566629 | Add | multiple collections in single index | 2006-09-27 13:52 | stack-sf | stack-sf
1532024 | Add | nutchwax regression tests | 2006-07-31 14:32 | nobody | cathcart
1611150 | Fix | 'arcname' (in 'explain') looks like unrelated URL | 2006-12-07 14:16 | stack-sf | gojomo
1363012 | Fix | NegativeArraySizeException in search.jsp | 2005-11-21 10:49 | stack-sf | stack-sf
1246906 | Fix | Fix nutchwax license | 2005-07-28 09:20 | stack-sf | stack-sf
1477183 | Fix | Incremental indexing broken | 2006-04-26 12:42 | stack-sf | stack-sf
1598060 | Fix | Distributed searcher mode doesn't work | 2006-11-16 14:49 | stack-sf | stack-sf
1592633 | Fix | Missing closing tags for hadoop-site.xml ex | 2006-11-08 05:32 | stack-sf | nobody
1581618 | Fix | No anchor text | 2006-10-20 16:53 | stack-sf | stack-sf
1556559 | Fix | Sometimes webapp fails getting summary | 2006-09-11 11:14 | stack-sf | stack-sf
1518430 | Fix | ARCName has filedesc prefix and arc suffix | 2006-07-06 16:03 | stack-sf | stack-sf
1511418 | Fix | Collection name not passed to ImportArcs | 2006-06-23 08:34 | stack-sf | stack-sf

3. Release 0.6.0

Abstract

Move to mapreduce Nutch as base. Much has changed in the mapreduce version of NutchWAX. 0.6.0 bears little resemblance to previous releases, both in how it goes about its work and in how it's run by the user. Be prepared to leave aside all old NutchWAX assumptions.

3.1. Known Limitations/Issues

3.1.1. Incremental Indexing

Incremental indexing does not work in 0.6.0 [See [1477183] [nutchwax] Incremental indexing broken].

3.1.2. Incompatible

Indexes and segments made with 0.4.x NutchWAX will not work with the 0.6.0 release (and vice versa).

4. Release 0.4.3 - 03/20/2006

Abstract

Minor bug fixes.

4.1. Changes

Table 3. Bugs

ID | Type | Summary | Open Date | By | Filer
1454710 | Fix | Index '.arc' (as well as '.arc.gz'). | 2006-03-20 08:54 | stack-sf | stack-sf
1454714 | Fix | Null mimetype stops indexing | 2006-03-20 09:00 | stack-sf | stack-sf
1429788 | Fix | xml output destroyed by html entity encoding | 2006-03-20 08:59 | stack-sf | stack-sf

5. Release 0.4.2 - 11/28/05

Abstract

Last release before move to mapreduce

Minor fixes: Added Google-like results paging and built for a 1.4.x Java target.

6. Release 0.4.1 - 11/04/05

Abstract

Bug fix.

Fix encoding issue in 0.4.0: [1348019] [nutchwax] Double encoding of disallowed xml chars

7. Release 0.4.0 - 10/10/05

Abstract

Bug fixes.

NutchWAX has been built against Nutch 0.7.0. There seem to be issues with the 0.7.1 build, and then some, so NutchWAX has not been built against the 0.7.1 release.

7.1. Known Limitations/Issues

General limitations of the current platform are listed in Section 7, Observations, on page 9 of Full Text Search of Web Archive Collections.

7.1.1. PDFs

PDFs larger than 10 megabytes are skipped completely. Legitimate PDFs whose HTTP content-length does not strictly agree with the ARC length are also skipped.

7.2. Changes

Table 4. Bugs/Features

ID | Type | Summary | Open Date | By | Filer
1608891 | Add | Fixes to make wayback use nutchwax index | 2006-12-04 17:35 | Maximilian Schoefmann | stack-sf
1313214 | Add | Dedup'ing that considers collection field. | 2005-10-04 12:46 | stack-sf | stack-sf
1309781 | Add | Add in skipping certain types if > size | 2005-09-30 14:01 | stack-sf | stack-sf
1244843 | Add | Allow querying on mime primary and sub type | 2005-07-25 16:13 | stack-sf | stack-sf
1280825 | Add | Make nutch merge segment work against nutchwax segments | 2005-09-02 10:00 | stack-sf | stack-sf
1247571 | Fix | Items not getting indexed | 2005-07-29 09:55 | stack-sf | stack-sf
1312212 | Fix | bad xml chars in search results | 2005-10-03 12:11 | stack-sf | stack-sf
1244894 | Fix | Cannot query for non-ISO8859 characters | 2005-07-25 18:38 | stack-sf | stack-sf
1312208 | Fix | Query time encoding issues | 2005-10-03 12:11 | stack-sf | stack-sf
1312217 | Fix | Not indexing images | 2005-10-03 12:18 | stack-sf | stack-sf
1244875 | Fix | exacturl encoding not working | 2005-07-25 17:21 | stack-sf | stack-sf
1281697 | Fix | searching czech words not working | 2005-09-04 10:36 | stack-sf | kranach