| Class Summary | |
|---|---|
| AlphaPartitioner | |
| CDXCanonicalizingMapper | |
| CDXReducer | |
| CDXSort | |
| CDXSort.CDXCanonicalizerMapClass | Mapper which reads an identity CDX line and outputs: key = canonicalized original URL + timestamp; value = everything else. |
| CDXSort.CDXMapClass | Mapper which reads a canonicalized CDX line and splits it into: key = URL + timestamp; value = everything else. |
| CDXSort.DeReffingCDXCanonicalizerMapClass | |
| CDXSort.FunkyCDXCanonicalizerMapClass | Mapper which reads an identity Funky-format CDX line and outputs: key = canonicalized original URL + timestamp; value = everything else. Input lines are a hybrid format with the fields, in order: ORIG_URL, DATE, '-' (literal), MIME, HTTP_CODE, SHA1, REDIRECT, START_OFFSET, ARC_PREFIX (sans .arc.gz), ROBOT_FLAG (a combination of A, I, F meaning no-Archive, no-Index, no-Follow, or '-' if none). Example: http://www.myow.de:80/news_show.php? 20061126032815 - text/html 200 DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A Need to: . |
| CDXSort.FunkyDeReffingCDXCanonicalizerMapClass | |
| CDXSortDriver | |
| LineDereferencingInputFormat | FileInputFormat subclass which assumes the configured input files are lines containing hdfs:// pointers to the actual Text data. |
| LineDereferencingRecordReader | RecordReader which reads pointers to actual files from an internal LineRecordReader, producing a LineRecordReader for the files pointed to by the actual input. |
| SortDriver | |
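
The key/value split described for CDXSort.FunkyCDXCanonicalizerMapClass can be illustrated with a small standalone sketch. It is not the class's actual implementation: the class name FunkyCdxLineSketch, its methods, and the assumption that fields are separated by whitespace are all hypothetical, and the URL canonicalization step that the real mapper performs is deliberately left out, so the key here uses the original URL unchanged.

```java
import java.util.Arrays;

// Hypothetical sketch; not the actual CDXSort.FunkyCDXCanonicalizerMapClass.
class FunkyCdxLineSketch {
    // Field positions, in the order the class summary lists them.
    static final int ORIG_URL = 0;
    static final int DATE = 1;
    static final int ROBOT_FLAG = 9;
    static final int FIELD_COUNT = 10;

    /** Holder for the key/value pair a mapper would emit. */
    static final class KeyValue {
        final String key;
        final String value;

        KeyValue(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Splits a line into fields, assuming whitespace separators. */
    static String[] fields(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != FIELD_COUNT) {
            throw new IllegalArgumentException(
                "expected " + FIELD_COUNT + " fields, got " + f.length);
        }
        return f;
    }

    /**
     * Builds key = URL + timestamp and value = everything else.
     * Canonicalization of the URL is omitted in this sketch.
     */
    static KeyValue split(String line) {
        String[] f = fields(line);
        String key = f[ORIG_URL] + " " + f[DATE];
        String value = String.join(" ", Arrays.copyOfRange(f, DATE + 1, f.length));
        return new KeyValue(key, value);
    }

    /** True if the robot flag contains the directive: 'A', 'I', or 'F'. */
    static boolean forbids(String robotFlag, char directive) {
        return !"-".equals(robotFlag) && robotFlag.indexOf(directive) >= 0;
    }

    public static void main(String[] args) {
        String line = "http://www.myow.de:80/news_show.php? 20061126032815 - "
            + "text/html 200 DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A";
        KeyValue kv = split(line);
        System.out.println("key   = " + kv.key);
        System.out.println("value = " + kv.value);
        System.out.println("no-Archive = " + forbids(fields(line)[ROBOT_FLAG], 'A'));
    }
}
```

Placing the URL and timestamp together in the key means a sort on the key alone orders records by URL and then by capture time, which appears to be the purpose of the split.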