org.archive.wayback.hadoop
Class CDXSort.FunkyCDXCanonicalizerMapClass

java.lang.Object
  extended by org.apache.hadoop.mapred.MapReduceBase
      extended by org.archive.wayback.hadoop.CDXSort.FunkyCDXCanonicalizerMapClass
All Implemented Interfaces:
java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>
Enclosing class:
CDXSort

public static class CDXSort.FunkyCDXCanonicalizerMapClass
extends org.apache.hadoop.mapred.MapReduceBase
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>

Mapper which reads an identity Funky format CDX line, outputting:
  key - canonicalized original URL + timestamp
  val - everything else

Input lines are a hybrid format with the following columns, in order:
  1. ORIG_URL
  2. DATE
  3. '-' (literal)
  4. MIME
  5. HTTP_CODE
  6. SHA1
  7. REDIRECT
  8. START_OFFSET
  9. ARC_PREFIX (sans .arc.gz)
  10. ROBOT_FLAG (any combination of A, I, F, meaning no-Archive, no-Index, no-Follow; or '-' if none)

Ex:
  http://www.myow.de:80/news_show.php? 20061126032815 - text/html 200 DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A

Need to:
  . replace col 3 with the original URL
  . replace col 1 with the canonicalized original URL
  . replace SHA1 with the first 4 characters of SHA1
  . append .arc.gz to ARC_PREFIX
  . omit lines with ROBOT_FLAG containing 'A'
  . remove the last column
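The rewrite rules above can be sketched in plain Java, without Hadoop. This is an illustration, not the class's actual source. The class name FunkyCdxLineSketch, the convert method, whitespace as the column delimiter, and the handling of malformed lines are all assumptions. The canonicalizer is passed in as a function because the real Wayback canonicalization logic is not described on this page.

```java
import java.util.AbstractMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

// Hypothetical stand-alone sketch of the per-line rewrite described above.
final class FunkyCdxLineSketch {

    // Returns the (key, value) pair for one input line, or empty when the
    // line is omitted. Assumption: columns are separated by whitespace and a
    // line that does not have exactly 10 columns is skipped.
    static Optional<Map.Entry<String, String>> convert(
            String line, UnaryOperator<String> canonicalizer) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != 10) {
            return Optional.empty();
        }
        String origUrl = cols[0];
        String date = cols[1];
        // cols[2] is the literal '-' placeholder; the original URL takes its place.
        String mime = cols[3];
        String httpCode = cols[4];
        String sha1 = cols[5];
        String redirect = cols[6];
        String startOffset = cols[7];
        String arcPrefix = cols[8];
        String robotFlag = cols[9];

        // Omit lines flagged no-Archive.
        if (robotFlag.contains("A")) {
            return Optional.empty();
        }

        // Keep only the first 4 characters of the digest.
        String shortSha1 = sha1.length() > 4 ? sha1.substring(0, 4) : sha1;

        // key: canonicalized original URL + timestamp
        String key = canonicalizer.apply(origUrl) + " " + date;
        // val: everything else, with the robot flag column removed
        String value = String.join(" ", origUrl, mime, httpCode, shortSha1,
                redirect, startOffset, arcPrefix + ".arc.gz");

        Map.Entry<String, String> pair =
                new AbstractMap.SimpleImmutableEntry<>(key, value);
        return Optional.of(pair);
    }

    public static void main(String[] args) {
        // Placeholder canonicalizer (lower-casing only); NOT Wayback's real one.
        UnaryOperator<String> placeholder = url -> url.toLowerCase(Locale.ROOT);
        String line = "http://www.myow.de:80/news_show.php? 20061126032815 - "
                + "text/html 200 DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo -";
        convert(line, placeholder).ifPresent(
                kv -> System.out.println(kv.getKey() + " | " + kv.getValue()));
    }
}
```

The sample line in main is the documented example with its ROBOT_FLAG changed from A to '-'. As written in the description, the example carries the A flag, so under the stated rules it would be omitted rather than emitted.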

Version:
$Date: 2010-09-29 05:28:38 +0700 (Wed, 29 Sep 2010) $, $Revision: 3262 $
Author:
brad

Constructor Summary
CDXSort.FunkyCDXCanonicalizerMapClass()
           
 
Method Summary
 void map(org.apache.hadoop.io.LongWritable lineNumber, org.apache.hadoop.io.Text line, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text> output, org.apache.hadoop.mapred.Reporter reporter)
           
 
Methods inherited from class org.apache.hadoop.mapred.MapReduceBase
close, configure
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.mapred.JobConfigurable
configure
 
Methods inherited from interface java.io.Closeable
close
 

Constructor Detail

CDXSort.FunkyCDXCanonicalizerMapClass

public CDXSort.FunkyCDXCanonicalizerMapClass()

Method Detail

map

public void map(org.apache.hadoop.io.LongWritable lineNumber,
                org.apache.hadoop.io.Text line,
                org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text> output,
                org.apache.hadoop.mapred.Reporter reporter)
         throws IOException
Specified by:
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>
Throws:
IOException


Copyright © 2005-2011 Internet Archive. All Rights Reserved.