org.archive.wayback.hadoop
Class CDXSort.FunkyCDXCanonicalizerMapClass
java.lang.Object
org.apache.hadoop.mapred.MapReduceBase
org.archive.wayback.hadoop.CDXSort.FunkyCDXCanonicalizerMapClass
- All Implemented Interfaces:
- Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>
- Enclosing class:
- CDXSort
public static class CDXSort.FunkyCDXCanonicalizerMapClass
- extends org.apache.hadoop.mapred.MapReduceBase
- implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>
Mapper that reads one line of "funky" hybrid-format CDX input and outputs:
key - canonicalized original URL + timestamp
val - everything else
Input lines are in a hybrid format with the following columns, in order:
ORIG_URL
DATE
'-' (literal)
MIME
HTTP_CODE
SHA1
REDIRECT
START_OFFSET
ARC_PREFIX (sans .arc.gz)
ROBOT_FLAG (any combination of A, I, F, meaning no-Archive, no-Index, no-Follow; or '-' if none)
Ex:
http://www.myow.de:80/news_show.php? 20061126032815 - text/html 200 DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A
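The column layout above can be sketched as a small parser. This is only an illustration, not the class's actual code: the `FunkyCdxFields` class, its constants, and the `RobotFlag` enum are hypothetical names, and splitting on runs of whitespace is an assumption, since this page does not state the delimiter.

```java
import java.util.EnumSet;

/** Hypothetical helper illustrating the ten-column input layout described above. */
final class FunkyCdxFields {
    /** Robot restrictions encoded by the letters A, I and F in the last column. */
    enum RobotFlag { NO_ARCHIVE, NO_INDEX, NO_FOLLOW }

    // Zero-based column positions, in the order listed in the class description.
    static final int ORIG_URL = 0, DATE = 1, DASH = 2, MIME = 3, HTTP_CODE = 4,
            SHA1 = 5, REDIRECT = 6, START_OFFSET = 7, ARC_PREFIX = 8, ROBOT_FLAG = 9;

    /** Splits a line into its ten columns (assumption: whitespace-delimited). */
    static String[] split(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 10) {
            throw new IllegalArgumentException(
                    "expected 10 columns but found " + fields.length);
        }
        return fields;
    }

    /** Decodes the ROBOT_FLAG column; "-" yields an empty set. */
    static EnumSet<RobotFlag> robotFlags(String column) {
        EnumSet<RobotFlag> flags = EnumSet.noneOf(RobotFlag.class);
        if (column.indexOf('A') >= 0) flags.add(RobotFlag.NO_ARCHIVE);
        if (column.indexOf('I') >= 0) flags.add(RobotFlag.NO_INDEX);
        if (column.indexOf('F') >= 0) flags.add(RobotFlag.NO_FOLLOW);
        return flags;
    }
}
```

Applied to the example line, `split` yields `text/html` at index `MIME` and `foo` at index `ARC_PREFIX`, and `robotFlags` on the last column returns a set containing only `NO_ARCHIVE`.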
Need to:
. replace col 3 with orig url
. replace col 1 with canonicalized orig url
. replace SHA1 with first 4 characters of SHA1
. append .arc.gz to ARC_PREFIX
. omit lines with ROBOT_FLAG containing 'A'
. remove last column
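The steps listed above can be sketched in plain Java, without the Hadoop types. This is a minimal sketch under stated assumptions, not the mapper's actual implementation: `FunkyCdxTransformSketch`, `toKeyValue` and `canonicalize` are hypothetical names, the lowercasing canonicalizer is only a placeholder for the real URL canonicalizer (which this page does not describe), and both the single-space separator and the skipping of malformed lines are assumptions.

```java
import java.util.Locale;

/** Hypothetical, Hadoop-free illustration of the line rewrite described above. */
final class FunkyCdxTransformSketch {

    /** Placeholder only: the real canonicalization rules are not shown on this page. */
    static String canonicalize(String origUrl) {
        return origUrl.toLowerCase(Locale.ROOT);
    }

    /**
     * Returns {key, value} for one input line, or null when the line is omitted.
     * key   = canonicalized original URL + " " + timestamp
     * value = the remaining output columns, space-separated (assumed separator)
     */
    static String[] toKeyValue(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 10) {
            return null; // assumption: lines without exactly ten columns are skipped
        }
        if (f[9].contains("A")) {
            return null; // omit lines whose ROBOT_FLAG contains 'A'
        }
        String sha1 = f[5];
        String shortSha1 = sha1.substring(0, Math.min(4, sha1.length()));

        String key = canonicalize(f[0]) + " " + f[1];
        String value = String.join(" ",
                f[0],              // original URL takes the place of the literal '-'
                f[3],              // MIME
                f[4],              // HTTP_CODE
                shortSha1,         // first 4 characters of SHA1
                f[6],              // REDIRECT
                f[7],              // START_OFFSET
                f[8] + ".arc.gz"); // ARC_PREFIX with the suffix restored
        return new String[] { key, value };     // ROBOT_FLAG column is dropped
    }
}
```

Under these assumptions, the example line itself produces no output because its ROBOT_FLAG is `A`. The same line with `-` in the last column yields the key `http://www.myow.de:80/news_show.php? 20061126032815` and the value `http://www.myow.de:80/news_show.php? text/html 200 DVKF - 91098929 foo.arc.gz`.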
- Version:
- $Date: 2010-09-29 05:28:38 +0700 (Wed, 29 Sep 2010) $, $Revision: 3262 $
- Author:
- brad
Method Summary
void map(org.apache.hadoop.io.LongWritable lineNumber,
org.apache.hadoop.io.Text line,
org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text> output,
org.apache.hadoop.mapred.Reporter reporter)
Methods inherited from class org.apache.hadoop.mapred.MapReduceBase
close, configure
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.mapred.JobConfigurable
configure
Constructor Detail
CDXSort.FunkyCDXCanonicalizerMapClass
public CDXSort.FunkyCDXCanonicalizerMapClass()
Method Detail
map
public void map(org.apache.hadoop.io.LongWritable lineNumber,
org.apache.hadoop.io.Text line,
org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text> output,
org.apache.hadoop.mapred.Reporter reporter)
throws IOException
- Specified by:
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>
- Throws:
IOException
Copyright © 2005-2011 Internet Archive. All Rights Reserved.