Package org.archive.wayback.hadoop

Class Summary
AlphaPartitioner  
CDXCanonicalizingMapper  
CDXReducer  
CDXSort  
CDXSort.CDXCanonicalizerMapClass Mapper which reads an identity CDX line and outputs: key = canonicalized original URL + timestamp; value = everything else.
CDXSort.CDXMapClass Mapper which reads a canonicalized CDX line and splits it into: key = URL + timestamp; value = everything else.
CDXSort.DeReffingCDXCanonicalizerMapClass  
CDXSort.FunkyCDXCanonicalizerMapClass Mapper which reads an identity Funky-format CDX line and outputs: key = canonicalized original URL + timestamp; value = everything else. Input lines are a hybrid format with the following fields, in order:
    ORIG_URL
    DATE
    '-' (literal)
    MIME
    HTTP_CODE
    SHA1
    REDIRECT
    START_OFFSET
    ARC_PREFIX (sans .arc.gz)
    ROBOT_FLAG (combination of A, I, F for no-Archive, no-Index, no-Follow, or '-' if none)
  Example: http://www.myow.de:80/news_show.php? 20061126032815 - text/html 200 DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A
CDXSort.FunkyDeReffingCDXCanonicalizerMapClass  
CDXSortDriver  
LineDereferencingInputFormat FileInputFormat subclass which assumes each line of the configured input files is an hdfs:// pointer to the actual Text data.
LineDereferencingRecordReader RecordReader which reads pointers to actual files from an internal LineRecordReader, producing a LineRecordReader for the files pointed to by the actual input.
SortDriver  
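
The key/value split described for CDXSort.FunkyCDXCanonicalizerMapClass can be illustrated with a minimal plain-Java sketch. It is not the class's actual implementation: FunkyCdxSketch, toKeyValue, and canonicalize are hypothetical names, the canonicalizer is a lowercase-only stand-in for whatever URL canonicalizer the real mapper uses, and treating "everything else" as the remaining fields in their input order is an assumption, since this summary does not show the mapper's output layout.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not part of org.archive.wayback.hadoop. */
final class FunkyCdxSketch {
    // Field positions, in the order the class summary lists them.
    // Index 2 is the literal '-' placeholder.
    static final int ORIG_URL = 0;
    static final int DATE = 1;
    static final int FIELD_COUNT = 10;

    /** Stand-in canonicalizer (assumption): simply lowercases the URL. */
    static String canonicalize(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    /**
     * Splits one Funky-format line into a sort key and a value:
     * key = canonicalized original URL + " " + timestamp,
     * value = the remaining fields, space-joined in input order (assumption).
     */
    static Map.Entry<String, String> toKeyValue(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != FIELD_COUNT) {
            throw new IllegalArgumentException(
                "expected " + FIELD_COUNT + " fields, got " + fields.length);
        }
        String key = canonicalize(fields[ORIG_URL]) + " " + fields[DATE];
        String value = String.join(" ", Arrays.copyOfRange(fields, 2, fields.length));
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = toKeyValue(
            "http://www.myow.de:80/news_show.php? 20061126032815 - text/html 200 "
            + "DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A");
        System.out.println("key: " + kv.getKey());
        System.out.println("val: " + kv.getValue());
    }
}
```

Placing URL and timestamp together in the key means a sort on the key alone groups all captures of a URL and orders them by capture date, which is presumably why the mappers in this package emit that pairing.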
 

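The pointer-following behaviour described for LineDereferencingInputFormat and LineDereferencingRecordReader can be sketched without Hadoop as an iterator over an iterator. This is an illustration under stated assumptions, not the classes' real code: DereferencingLineIterator and its opener function are hypothetical, and the opener stands in for opening an hdfs:// path and reading its lines.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical illustration only; not part of org.archive.wayback.hadoop. */
final class DereferencingLineIterator implements Iterator<String> {
    private final Iterator<String> pointers;                     // lines of the pointer file
    private final Function<String, Iterator<String>> opener;     // stand-in for opening a pointed-to file
    private Iterator<String> current = Collections.emptyIterator();

    DereferencingLineIterator(Iterator<String> pointers,
                              Function<String, Iterator<String>> opener) {
        this.pointers = pointers;
        this.opener = opener;
    }

    @Override
    public boolean hasNext() {
        // Advance to the next pointed-to file until one yields a line
        // or the pointer list is exhausted.
        while (!current.hasNext()) {
            if (!pointers.hasNext()) {
                return false;
            }
            String pointer = pointers.next().trim();
            if (pointer.isEmpty()) {
                continue; // skip blank pointer lines (assumption)
            }
            current = opener.apply(pointer);
        }
        return true;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

The consumer sees one continuous stream of data lines and never handles the pointer lines themselves, which mirrors the summary's description of an outer line reader feeding inner line readers.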


Copyright © 2005-2011 Internet Archive. All Rights Reserved.