<?xml version='1.0' ?>
<!--See http://xml.resource.org/ for formatting tools that can deal with
    this RFC2629 (and beyond) XML format.

    I tried including xml-stylesheet with a pointer to the xslt from Section
    3.3 of http://xml.resource.org/authoring/draft-mrose-writing-rfcs.html
    but only seems to work in IE, not mozilla nor firefox, so whats the use.

    $Id: warc_file_format.xml 1235 2006-09-20 00:03:54Z stack-sf $
 -->

<!DOCTYPE rfc SYSTEM 'rfcXXXX.dtd' [

  <!ENTITY mdash '&#8212;' >

  <!ENTITY rfc0822 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0822.xml'>
  <!ENTITY rfc1034 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1034.xml'>
  <!ENTITY rfc1035 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1035.xml'>
  <!ENTITY rfc1884 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1884.xml'>
  <!ENTITY rfc1950 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1950.xml'>
  <!ENTITY rfc1951 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1951.xml'>
  <!ENTITY rfc1952 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1952.xml'>
  <!ENTITY rfc2045 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2045.xml'>
  <!ENTITY rfc2046 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2046.xml'>
  <!ENTITY rfc2048 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2048.xml'>
  <!ENTITY rfc2141 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2141.xml'>
  <!ENTITY rfc2234 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2234.xml'>
  <!ENTITY rfc2396 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2396.xml'>
  <!ENTITY rfc2540 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2540.xml'>
  <!ENTITY rfc2616 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2616.xml'>
  <!ENTITY rfc4027 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.4027.xml'>
  <!ENTITY rfc4501 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.4501.xml'>
]>
<?rfc symrefs="yes"?>
<?rfc toc="yes"?>
<!-- make a private memo for now, not an RFC or Internet-Draft -->
<?rfc private="IIPC Framework Working Group"?>  

<rfc> <!-- ipr="full3978"> elided until such time as IETF submission is planned -->
 <front>
  <title abbrev="WARC File Format, 0.10">
   The WARC File Format (Version 0.10)
  </title>

  <author initials="J." surname="Kunze" 
          fullname="John A. Kunze" role="editor"> 
   <organization>
    California Digital Library 
   </organization>
   <address>
    <postal>
     <street>415 20th St, 4th Floor</street>
     <city>Oakland</city> <region>CA</region>
     <code>94612-3550</code>
     <country>US</country>
    </postal>
    <email>jak@ucop.edu</email>
    <facsimile>+1 510-893-5212</facsimile>
   </address>
  </author>
  <author initials="A." surname="Arvidson" 
          fullname="Allan Arvidson"> 
   <organization>
    Kungliga biblioteket
    (National Library of Sweden)
   </organization>
   <address>
    <postal>
     <street>Box 5039</street>
     <code>10241</code> <city>Stockholm</city>
     <country>SE</country>
    </postal>
    <email>allan.arvidson@kb.se</email>
    <facsimile>+46 (0)8 463 4004</facsimile>
   </address>
  </author>
  <author initials="G." surname="Mohr" 
          fullname="Gordon Mohr"> 
   <organization>
    Internet Archive
   </organization>
   <address>
    <postal>
     <street>4 Funston Ave, Presidio</street>
     <city>San Francisco</city> <region>CA</region>
     <code>94117</code>
     <country>US</country>
    </postal>
    <email>gojomo@archive.org</email>
   </address>
  </author>
  <author initials="M." surname="Stack" 
          fullname="Michael Stack"> 
   <organization>
    Internet Archive
   </organization>
   <address>
    <postal>
     <street>4 Funston Ave, Presidio</street>
     <city>San Francisco</city> <region>CA</region>
     <code>94117</code>
     <country>US</country>
    </postal>
    <email>stack@archive.org</email>
   </address>
  </author>

  <date month="September" year="2006" />

  <abstract>

<t>The WARC (Web ARChive) format specifies a method for combining multiple 
digital resources into an aggregate archival file together with related 
information. Resources are dated, identified by URIs, and preceded by 
simple text headers. By convention, files of this format are named with 
the extension ".warc" and have the MIME type application/warc. The 
WARC file format is a revision and generalization of the ARC format 
used by the Internet Archive to store information blocks harvested by 
web crawlers. This document specifies version 0.10 of the WARC format.</t>

  </abstract>

 </front>

 <middle>

  <section title="Introduction">

<t>
Content on the World Wide Web is ephemeral.  Every day, websites and web pages
are created, changed, relocated, and removed.  For the past ten years, memory
institutions have tried to find the most appropriate means of collecting and
keeping track of this important, transitory material. One approach uses a
web crawler to take 'snapshots' of the web, or of parts of the web, at
particular moments in time.  Web crawlers are software programs which browse
the web in an automated manner according to a set of policies.  A crawler
starts with a list of URLs to visit.  As it visits each URL, it makes a copy
of the visited page and then extracts all hyperlinks -- links to other pages,
images, videos, scripting or style instructions, etc. -- to add to its queue
of URLs to visit next.  During any given web crawl, a crawler can collect
millions of pages.  The accumulated captures from many web crawls can run
into the billions.  An efficient format for storing and preserving the
captures of large-scale web harvests is critical.  This document describes
the WARC (Web ARChive) file format.
</t>

<t>
The WARC file format offers a convention for concatenating
multiple resource records, each consisting of a set of simple text headers
and an arbitrary data block, into one long file.  The WARC format is a
revision of the <xref target="ARC">ARC File Format</xref> which has
traditionally been used to store web crawls as sequences of content blocks
harvested from the World Wide Web.  The ARC file format has been in active
use by the Internet Archive (IA) since 1996.  Currently over 50 billion
objects are stored in ARCs at the IA. The ARC format is also being used
by several national libraries.
</t>

<t>
The motivation to revise the ARC format
arose from the discussion and experiences of the
<xref target="IIPC"> International Internet Preservation Consortium
(IIPC)</xref>, whose members include the national libraries of
Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden,
The British Library (UK), The Library of Congress (USA), and the Internet
Archive (IA).  Input from the California Digital Library and Los Alamos
National Laboratory, who have also established large repositories, was also
considered.</t>  

<t>In an ARC file, each capture is preceded by a one-line header that
briefly describes the harvested content and its length. This is directly
followed by the retrieval protocol response messages and content.  Only
the response is recorded. There is no provision for noting the request or for
adding metadata.  
</t>

<t>The WARC format generalizes the ARC format to better support the
harvesting, access, and exchange needs of archiving organizations. Besides the
primary content currently recorded, the revision accommodates related
secondary content, such as assigned metadata, abbreviated duplicate detection
events, and later-date transformations.
</t>

<t>
The WARC format is expected to become the standard way to structure, manage
and store billions of collected web resources. It will be used as the output
format by web harvesting applications, such as the open-source 
<xref target="HERITRIX">Heritrix</xref> web crawler, and as an input 
format for a wide array of cataloging and access tools.  
</t>


  </section>

  <section title="Goals">

<t>Goals of the WARC file format include the following.</t>

<list style="symbols">

<t>Ability to store both the payload content and control information
from mainstream Internet application layer protocols, including HTTP,
FTP, NNTP, and SMTP.</t>

<t>Ability to store arbitrary metadata linked to other stored data
(e.g., subject classifier, discovered language, encoding).</t>

<t>Support for data compression and maintenance of data record
integrity.</t>

<t>Ability to store all control information from the harvesting
protocol (e.g., request headers), not just response information.</t>

<t>Ability to store the results of data transformations linked to
other stored data.</t>

<t>Ability to store a duplicate detection event linked to other stored
data (to reduce storage in the presence of identical or substantially
similar resources).</t>

<t>Amenable to efficient processing.</t>


<t>Support for deterministic handling of long records (e.g.,
truncation, segmentation).</t>

</list>
<t>The WARC file format is made sufficiently different from the legacy ARC
format files so that software tools can unambiguously detect and correctly
process both WARC and ARC records; given the large amount of existing
archival data in the previous ARC format, it is important that access and
use of this legacy not be interrupted when transitioning to the WARC format.</t>

  </section>

  <section title="The WARC Record Model">

<t>A WARC format file is the simple concatenation of one or more WARC
records.  A WARC record consists of a record header followed by a record
content block and two newlines where newlines are CRLF as per other
Internet standards such as <xref target="RFC0822" />.  The first record in
a WARC file usually describes the records to follow.  Subsequent records
contain content blocks that are either the direct result of a retrieval
attempt — web pages, inline images, URL redirection information, DNS hostname
lookup results, standalone files, etc. — or they are synthesized content
blocks (e.g., metadata, transformed content) that provide additional
information about archived content. 
</t>

<t>
The format of a WARC file can be expressed in IETF Augmented Backus-Naur
Form (ABNF) grammar as specified in <xref target="RFC2234"/> as follows.
(All-caps "core" elements are as defined in RFC2234.)</t>

<figure>
 <artwork>
  warc-file   = 1*warc-record
  warc-record = header block CRLF CRLF
  header      = header-line CRLF *anvl-field CRLF
  block       = *OCTET
 </artwork>
</figure>

<t>The record <spanx style="emph">header-line</spanx> is a
newline-terminated sequence of whitespace-delimited text tokens
representing parameters such as record length, time of creation, and
subject URI.</t>

<figure>
 <artwork>
  header-line = warc-id tsp data-length tsp record-type tsp
                subject-uri tsp creation-date tsp
                record-id tsp content-type
  tsp         = 1*WSP
 </artwork>
</figure>

<t>The amount of whitespace between <spanx
style="emph">header-line</spanx> tokens is variable. This gives
archive builders the flexibility to add padding and later adjust
pre-written header parameters when final values are only completely
known after the record content <spanx style="emph">block</spanx> has
been written.  The header-line tokens are detailed later
in the section <xref target="pos_par" format="title" />.</t>

<t>Zero or more <spanx style="emph">anvl-field</spanx> named
parameters expressed in A Name-Value Language
<xref target="ANVL" /> follow the 
<spanx style="emph">header-line</spanx> in a line-oriented syntax
very similar to that of email headers <xref target="RFC0822" /> but with
unrestricted "text" values.  The precise format is as follows:</t>

<figure>
 <artwork>
  anvl-field  =  field-name ":" [ field-body ] CRLF
  field-name  =  1*&lt;any CHAR, excluding control-chars and ":">
  field-body  =  text [CRLF 1*WSP field-body]
  text        =  1*&lt;any UTF-8 character, including bare
                    CR and bare LF, but NOT including CRLF>
                                             ; (Octal, Decimal.)
  CHAR        =  &lt;any ASCII/UTF-8 character> ; (0-177,  0.-127.)
  CR          =  &lt;ASCII CR, carriage return> ; (   15,      13.)
  LF          =  &lt;ASCII LF, linefeed>        ; (   12,      10.)
  SPACE       =  &lt;ASCII SP, space>           ; (   40,      32.)
  HTAB        =  &lt;ASCII HT, horizontal-tab>  ; (   11,       9.)
  CRLF        =  CR LF
  WSP         =  SPACE / HTAB                ; semantics = SPACE
 </artwork>
</figure>
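
<t>As a non-normative illustration of the grammar above, the following
Python sketch splits a run of named parameters into name/value pairs.
The function name parse_anvl_fields is hypothetical, and two behaviours
are assumptions rather than requirements of this document: folded
continuation lines are joined with a single space, and whitespace
surrounding a field body is discarded.</t>

<figure>
 <artwork><![CDATA[
```python
from typing import List, Tuple


def parse_anvl_fields(data: bytes) -> List[Tuple[str, str]]:
    """Parse anvl-field lines up to the first empty line."""
    fields: List[Tuple[str, str]] = []
    for line in data.split(b"\r\n"):
        if line == b"":
            break  # empty line: end of the named parameters
        if line[:1] in (b" ", b"\t"):
            # Continuation: CRLF 1*WSP extends the previous body.
            if not fields:
                raise ValueError("continuation before any field")
            name, body = fields[-1]
            more = line.decode("utf-8").strip(" \t")
            # Assumption: folded text is joined with one space.
            fields[-1] = (name, body + " " + more)
            continue
        name_b, sep, body_b = line.partition(b":")
        if not sep or not name_b:
            raise ValueError("malformed field line: %r" % line)
        # Assumption: whitespace around the body is not significant.
        body = body_b.decode("utf-8").strip(" \t")
        fields.append((name_b.decode("utf-8"), body))
    return fields
```
]]></artwork>
</figure>

<t>Because only the CRLF pair is treated as a line break, a bare CR or
bare LF inside a value is left in place, as the "text" rule permits.</t>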

<t>The section <xref target="named_parameters" format="title" />
defines a number of mandatory and optional 
<spanx style="emph">anvl-field</spanx> named parameters.
</t>

<t>A content <spanx style="emph">block</spanx>, if any,
follows the named parameters.  It is described in the
section <xref target="record_content_block" format="title" />.
</t>

<t>
Two CRLF newlines, not counted in the declared record
<spanx style="emph">data-length</spanx>, terminate a WARC record.
</t>
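
<t>The record model described in this section can be sketched, purely
for illustration, as a small Python reader. It assumes the entire WARC
file is already in memory as a byte string and that every declared
data-length is accurate; the name iter_records is hypothetical and no
error recovery is attempted.</t>

<figure>
 <artwork><![CDATA[
```python
from typing import Iterator, Tuple


def iter_records(data: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (header, block) pairs from concatenated records."""
    pos = 0
    while pos < len(data):
        # The header ends at the first empty line (CRLF CRLF).
        header_end = data.find(b"\r\n\r\n", pos)
        if header_end < 0:
            raise ValueError("record header is not terminated")
        header = data[pos:header_end]
        header_line = header.split(b"\r\n", 1)[0]
        tokens = header_line.split(None, 2)
        if (len(tokens) < 3
                or not tokens[0].startswith(b"WARC/")
                or not tokens[1].isdigit()):
            raise ValueError("bad header-line at offset %d" % pos)
        # data-length counts from the "W" of warc-id through
        # the last octet of the content block.
        record_end = pos + int(tokens[1])
        block_start = header_end + 4
        if record_end < block_start:
            raise ValueError("data-length shorter than header")
        if data[record_end:record_end + 4] != b"\r\n\r\n":
            raise ValueError("missing record-ending newlines")
        yield header, data[block_start:record_end]
        pos = record_end + 4  # skip the two closing newlines
```
]]></artwork>
</figure>

<t>A production reader would more likely stream from disk and fall back
to scanning for the warc-id pattern when a length check fails; this
sketch simply raises an error instead.</t>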

  </section>

  <section anchor="record_types" title="Record Types">

<t>Every WARC record has a type.  There are 8 currently defined WARC
record types: 'warcinfo', 'response',
'resource', 'request', 'metadata', 'revisit', 'conversion', and
'continuation'. The purpose and use of each type is described below.</t>

<t> New record types that extend the WARC format may be defined in the
future.  WARC processing software should skip records of unknown
type.  A forum for proposals and discussion in advance of
standardization is the discussion list standards@netpreserve.org.</t>

   <section title="'warcinfo'">

<t>A 'warcinfo' record describes the records that follow it, up
through the end of the file, the end of input, or the next 'warcinfo' record.
Typically, this appears once and at the beginning of a WARC file.
For a web archive, it often contains a description of a web crawl
(e.g., depth, timeout, purpose).  The format of the description is
outside the scope of this document, but may include such things as
approximate maximum archive file size (e.g., 1GB) and site entry
point URIs for a targeted crawl.</t>

<t>So that multiple record excerpts from inside WARC files may also be
valid WARC files, it is not strictly required that the first record of
a legal WARC be a 'warcinfo' description. Also, to allow the
concatenation of WARC files into a larger valid WARC file, it is
allowable for 'warcinfo' records to appear in the middle of a WARC
file.
</t>

<t>The subject-uri of a 'warcinfo' record should be a URI name which
references the WARC file itself.</t> 

   </section>

   <section title="'response'">

<t>A 'response' record contains an entire protocol response, such as a full
HTTP response including headers and content-body, from a network
retrieval. Often the payload of such a response reflects the main
collection objective of the archiving service, whose responsibility it
is to distinguish payload from protocol headers during subsequent
processing. A response record often includes the named parameters
'IP-Address' and 'Related-Record-ID'.</t>

   </section>

   <section title="'resource'">

<t>A 'resource' record contains a resource, without full protocol response
information. For example: a file directly retrieved from a locally
accessible repository, or the result of a networked retrieval where
the protocol information has been discarded. A resource record often
includes the named parameter 'Related-Record-ID'.</t>

   </section>

   <section title="'request'">
<t>A 'request' record holds the manner in which a primary record's content was
requested.  In a web crawling context, this would hold the HTTP
request.  A request record often includes the named parameter
'Related-Record-ID'.</t>

   </section>

   <section title="'metadata'">
<t>A 'metadata' record contains content created in order to further describe,
explain, or accompany a harvested resource, in ways not covered by
other record types. A 'metadata' record will almost always refer to
another record of another type, with that other record holding original
harvested or transformed content. (However, it is allowable for a
'metadata' record to refer to any record type, including other
'metadata' records.) Any number of metadata records may be created that
reference one specific other record. The format of the metadata is outside
the scope of this document, but potential formats are
<xref target="ANVL" /> and <xref target="RDF" />. A metadata record 
often includes the named parameter 'Related-Record-ID'.</t>

   </section>

   <section title="'revisit'">

<t>A 'revisit' record describes the revisitation of content already archived,
and includes only an abbreviated content block which must be
interpreted relative to a previous record. Most typically, a 'revisit'
record is used instead of a 'response' or 'resource' record to
indicate that the content visited was either a complete or substantial
duplicate of material previously archived.</t>

<t>A 'revisit' record should only be used when interpreting the record
requires consulting a previous record; other record types should be
preferred if the current record is understandable standing
alone. (It is not required that any revisit of a previously-visited
URI use 'revisit', only those which refer back to other records.)</t>

<t>The format of a 'revisit' record's content block is outside the scope
of this document and may vary to accomplish different goals such as
recording the apparent magnitude of difference from the previous
visit, or to encode the visited content as a "diff" -- where "diff"
is the file comparison utility that outputs the differences between
two files -- of the content previously stored.  The purpose of this
record type is to reduce
storage redundancy when repeatedly retrieving identical or
little-changed content, while still recording that a revisit occurred,
plus details about the current state of the visited content relative
to the archived version. A revisit record requires the named parameter
'Related-Record-ID'.</t>

   </section>

   <section title="'conversion'">

<t>A 'conversion' record contains an alternative version of another record's
content that was created as the result of an archival
process. Typically, this is used to hold content transformations that
maintain viability of content after widely available rendering tools
for the originally stored format disappear. As needed, the original
content may be migrated (transformed) to a more viable format in order
to keep the information usable with current tools while minimizing
loss of information (intellectual content, look and feel, etc.). Any
number of conversion records may be created that reference a
specific source record, which may itself contain transformed
content. Each transformation should result in a freestanding, complete
record, with no dependency on survival of the original
record. Metadata records may be used to further describe
conversion records. A conversion record requires the named
parameter 'Related-Record-ID'.</t>

<t>Specification of the fields and metadata formats used to describe a 
'conversion' record is outside the scope of this document.</t>

   </section>

   <section title="'continuation'">

<t>A 'continuation' record needs to be logically appended to a prior record 
(e.g., from another WARC file) to create the logically complete full-sized
record. This is used when a record that would otherwise cause the WARC
file size to exceed a desired limit is broken into segments. See the
section <xref target="trunc_seg" format="title" />
for more information. A
continuation record requires the named parameters 'Segment-Origin-ID'
and 'Segment-Number'.
</t>
    </section>

  </section>
  
  <section anchor="record_header" title="Record Header">

<t>The WARC record header declares baseline identifying information about
the current record, and allows additional per-record
information. It consists of one line of required
positional parameters, then a variable number of lines of named
parameters.</t>

   <section anchor="pos_par" title="Positional Parameters">
<t>Positional parameters make up the first line in a WARC record,
the <spanx style="emph">header-line</spanx>,
and are separated from each other by one or more whitespace
characters (spaces or horizontal tabs).
Positional parameter order is significant.</t>  

<t>Using ABNF, the WARC record header-line parameters are:</t>

<figure>
 <artwork> 
  warc-id       = "WARC/" 1*DIGIT "." 1*DIGIT
  data-length   = 1*DIGIT
  record-type   = "warcinfo" / "response" / "resource"
                  / "request" / "metadata" / "revisit"
                  / "conversion" / "continuation" / future-type
  future-type   = 1*VCHAR
  subject-uri   = uri
  uri           = &lt;'URI' per RFC3986>
  creation-date = timestamp
  timestamp     = &lt;date per below>
  record-id     = uri
  content-type  = type "/" subtype *(";" parameter)
  type          = &lt;'type' per Section 5.1 of RFC2045>
  subtype       = &lt;'subtype' per Section 5.1 of RFC2045>
  parameter     = &lt;'parameter' per Section 5.1 of RFC2045>
 </artwork>
</figure>

<t>DIGIT and VCHAR are as defined in <xref target="RFC2234" />.
No parameter may be written with internal whitespace except
the last, content-type.</t>
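
<t>As an informal aid to reading the grammar, the following Python
sketch tokenizes a header-line. It splits on runs of spaces and tabs at
most six times, so that the seventh token, content-type, keeps any
internal whitespace. The names HeaderLine and parse_header_line are
hypothetical, and the checks shown are only a subset of what a strict
validator might enforce (URIs and the MIME type are not validated).</t>

<figure>
 <artwork><![CDATA[
```python
import re
from typing import NamedTuple


class HeaderLine(NamedTuple):
    warc_id: str
    data_length: int
    record_type: str
    subject_uri: str
    creation_date: str
    record_id: str
    content_type: str


def parse_header_line(line: str) -> HeaderLine:
    """Split a header-line into its seven positional parameters."""
    text = line.rstrip("\r\n")
    # At most 6 splits: content-type may contain whitespace.
    tokens = re.split(r"[ \t]+", text, maxsplit=6)
    if len(tokens) != 7:
        raise ValueError("expected 7 positional parameters")
    warc_id, length, rtype, subject, created, rid, ctype = tokens
    if not re.fullmatch(r"WARC/[0-9]+\.[0-9]+", warc_id):
        raise ValueError("bad warc-id: %r" % warc_id)
    if not re.fullmatch(r"[0-9]+", length):
        raise ValueError("bad data-length: %r" % length)
    if not re.fullmatch(r"[0-9]{14}", created):
        raise ValueError("bad creation-date: %r" % created)
    return HeaderLine(warc_id, int(length), rtype, subject,
                      created, rid, ctype.strip(" \t"))
```
]]></artwork>
</figure>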

<list style="hanging">

 <t hangText="warc-id">
A fixed pattern, "WARC/0.10", that appears first in every
record and hence begins the WARC file itself. It serves to identify
the file format and version to outside inspection, and to assist error
recovery when a process reading a WARC file fails to find the next
record boundary where expected. Occurrences of this string are not
definitively the same as record boundaries, since the string may by
chance occur inside a record.  However, it may still be useful to
locate such strings when attempting to recover from file corruption
which renders one or more data-length parameters unreliable.
The warc-id string may change in future versions, but will always
begin "WARC/", continue with version numbers, and end at whitespace.</t>
 
 <t hangText="data-length">
The number of octets in the record, starting with the first letter
("W") of the first token, through to the end of the content block 
&mdash; not including the 2 record-ending newlines.  After proceeding 
this many octets from that first character of the record header, there
should be two newlines and either the beginning of a new record or the
end of the file. (WARC reading implementations may choose to tolerate
more or fewer newlines at the end of a record.)

<vspace blankLines="2" />

If the next token does not match the first token of a WARC record,
then the previous data-length should be considered in error; corrective
action might include searching for a nearby occurrence of "WARC/0.10"
and other character patterns indicative of a legal record beginning.
</t>

 <t hangText="record-type">
The kind of WARC record. All record types are optional, though
starting all WARC files with a "warcinfo" record is
recommended. Record types are defined in the section
<xref target="record_types" format="title" />.</t>

 <t hangText="subject-uri">
The original URI whose collection gave rise to the information content
in this record. In the context of web harvesting, this is the URI that
was the target of a crawler's retrieval request. For a derived
record, such as a 'revisit', 'metadata', or 'conversion' record, it is a copy of
the subject-uri appearing in the original record to which the newer
record pertains.  The URI in this value should be properly escaped
according to <xref target="RFC2396"/> and written with no internal
whitespace.

 </t>

 <t hangText="creation-date">
A 14-digit timestamp in the format YYYYMMDDhhmmss representing the GMT
time when record creation began. Multiple records written as part of a
single collection action may share the same creation-date, even though
the times of their writing will not be exactly synchronized.
 </t>

 <t hangText="record-id">
An identifier assigned to the record that is globally unique for its
period of intended use.  No identifier scheme is mandated by this
specification, but each record-id should be a legal URI and clearly
indicate a documented and registered scheme to which it conforms
(e.g., via a URI scheme prefix such as "http:" or "urn:").
Care should be taken to ensure that this value is written with no
internal whitespace.

 </t>

 <t hangText="content-type">
The MIME type <xref target="RFC2045"/> of the information contained in
the record's content block.  For content in HTTP request and response
records, this should be 'application/http' as per Section 19.1 of
<xref target="RFC2616" /> (or 'application/http; msgtype=request' and
'application/http; msgtype=response' respectively).  In particular,
the content-type is not the value of the HTTP Content-Type header in an
HTTP response but a MIME type to describe the content body (hence
'application/http' if the content body contains response headers
and the response itself).  Whitespace (WSP <xref target="RFC2234" />)
delimiting 'parameters' or inside 'quoted-string' is allowed. This is
the only positional parameter that may legally contain whitespace.
 </t>
 
</list>
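
<t>Because data-length counts the octets of the header-line that
contains it, a writer cannot simply add the header and block sizes
once. One possible approach, sketched below in Python under stated
assumptions, is to recompute the length until it stops changing; the
other approach mentioned earlier is to pad the header-line with extra
whitespace. The function names are hypothetical, the caller is assumed
to pass a timezone-aware datetime, and the positional values are
assumed to contain no whitespace.</t>

<figure>
 <artwork><![CDATA[
```python
from datetime import datetime, timezone


def format_creation_date(when: datetime) -> str:
    """Render a timezone-aware datetime as YYYYMMDDhhmmss in GMT."""
    return when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def build_record(record_type, subject_uri, record_id,
                 content_type, block, when, named_fields=()):
    """Serialize one record with a self-consistent data-length."""
    created = format_creation_date(when)
    fields = b"".join(
        ("%s: %s\r\n" % (k, v)).encode("utf-8")
        for k, v in named_fields)

    def header(length):
        line = "WARC/0.10 %d %s %s %s %s %s\r\n" % (
            length, record_type, subject_uri, created,
            record_id, content_type)
        return line.encode("utf-8") + fields + b"\r\n"

    # The length appears inside the header it measures, so
    # iterate until the declared and actual values agree.
    length = 0
    while True:
        total = len(header(length)) + len(block)
        if total == length:
            break
        length = total
    # The two record-ending newlines are not counted.
    return header(length) + block + b"\r\n\r\n"
```
]]></artwork>
</figure>

<t>The loop terminates because the total can only grow when the number
of digits in the length grows, which happens at most a few times.</t>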

   </section>

   <section anchor="named_parameters" title="Named Parameters">
<t>Named parameters, also referred to as named fields, follow the
WARC record header-line and are written using the line-oriented
syntax, <xref target="ANVL" />, defined previously.  Normally,
named parameters are optional and their order is not significant;
however, specific record types require that certain named parameters
be present (and future extensions may have ordering requirements). If
there are no named parameters present, the entire WARC record header
is the line of positional parameters followed by one blank line (two
consecutive newlines).
</t>

<t>
Additional named parameters may be proposed by WARC users, who are
urged to publicly document and discuss with the WARC community new
named parameters before use.
</t>


<list style="hanging">

 <t hangText="IP-Address: IP-address">
The numeric Internet address contacted to retrieve any included
content. An IPv4 address should be written as a "dotted quad"; an IPv6
address as per <xref target="RFC1884"/>. For an HTTP retrieval, this
will be the IP address used at retrieval time corresponding to the
hostname in the record's subject-uri.
 </t>

 <t hangText="Checksum: algorithm:value">
An optional parameter indicating the name of a digest algorithm run
against the content block and the string representation of the resulting
value computed.  An example is:
<figure><artwork>
Checksum: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
</artwork></figure>
As of this writing, this document recommends no particular algorithm.
 </t>

 <t hangText="Related-Record-ID: record-id">
The identifier of the record for which the present record holds
related content. This parameter is required of the record types
'revisit' and 'conversion'. It is also required to associate records
of types 'request', 'response', 'resource', and 'metadata' with one
another, when desired. However, none of these record types necessarily
takes precedence over the others to become the referred-to (primary)
record. (Any of them may appear first or alone.)

<vspace blankLines="2" />

A potential strategy, after choosing one record to be primary, is to
extend its record-id as described in the appendix
<xref target="recordids" format="title"/>.
This creates satellite record-ids for related records
that contain the primary record-id as an initial substring, which
greatly optimizes the detection (and in some cases derivation) of
related records.


 </t>

 <t hangText="Segment-Origin-ID: record-id">
In a continuation record, this named parameter is mandatory.  It
identifies the record of the first segment of the set.
 </t>
 
 <t hangText="Segment-Number: integer">
In the first segment of a record that is completed in one or more
later 'continuation' WARC records, this parameter is mandatory. 
Its value is "1". In a 'continuation' record, this parameter is also
mandatory.  Its value is the sequence number of the current segment
in the logical whole record, increasing by 1 in each next segment.  
 </t>

 <t hangText="Truncated: reason-token">
When present, indicates that the current record ends before the
apparent end of the source material, but no continuation records are
forthcoming. Possible values indicate the reason for the truncation:
'length' for exceeding a desired length limit; 'time' for exceeding a
desired time limit during collection.
 </t>

 <t hangText="Warcinfo-ID: record-id">
When present, indicates the record-id of the associated 'warcinfo'
record for this record.  Typically, the Warcinfo-ID parameter is used
when the context of the applicable 'warcinfo' record is unavailable,
such as after distributing single records into separate WARC files.
WARC writing applications (such as web crawlers) may choose to record
this parameter routinely.
The Warcinfo-ID parameter overrides any association with a previously
occurring (in the WARC) 'warcinfo' record, thus providing a way to protect
the true association when records are combined from different WARCs.
Use of this parameter in a record of type 'warcinfo' is undefined and
reserved for possible future extension.
 </t>

</list>

<t>
Of note, there is nothing to preclude multiple instances of
a particular named field per record.  For example, a record may relate
to multiple other records.  In this case, a writer may output
multiple instances of the Related-Record-ID named field, one per
referred-to record.
</t>

  </section>
  </section>

  <section anchor="record_content_block" title="Record Content Block">
 
<t>Each record's content block contains zero or more bytes of data, 
interpreted according to the record type and any preceding headers,
up through the remaining number of octets as specified in the previously-given 
<spanx style="emph">data-length</spanx> parameter.
</t>


  </section>
  
  <section anchor="trunc_seg"
    title="Truncated and Segmented Records">

<t>For practical reasons, writers of the WARC format may place limits
on the time or storage allocated to archiving a single resource. As a
result, only a truncated portion of the original resource may be
available for saving into a WARC record.</t>

<t>Additionally, users will often want to keep individual WARC files
near or below some target size, such as 100MB or 1GB. If some
records would be too large to be contained by a single WARC file of
desired maximum size, those records will have to be split between
multiple WARC files.</t>

<t>This section defines mechanisms for indicating that a WARC record
has been truncated or split into multiple records, called segments,
across WARC files.</t>

<t>These mechanisms are provisional and subject to change. A superior
method of indicating truncation and segmentation may be developed,
which better allows the writing of records to begin without foreknowledge 
of their final length.</t>

   <section title="Record Truncation">

<t>Any record may indicate that truncation has occurred and give the
reason by the addition of a named 'Truncated' field in the record
header. Acceptable values for this field include 'time' for truncation
due to exceeding a time limit, and 'length' for truncation due to
exceeding a length limit.</t>

   </section>

   <section title="Record Segmentation">

<t>A record that will not fit into a single WARC file of desired
maximum size may be broken into any number of separate records, called
segments. As much as possible, segmentation should be avoided, and
where necessary, segments other than the first must be of record-type
'continuation'.</t>

<t>The first segment must carry the record-type (not 'continuation')
that the record would have had were it not broken into segments, and a
'Segment-Number' named field with a value of "1".</t>

<t>All subsequent segments must have a record type of 'continuation',
with an incremented 'Segment-Number' field. They must also include a
'Segment-Origin-ID' field with a value of the Record-ID of the record
containing the first segment of the set. All segments of a set must
have identical subject-uri parameters.</t>

<t>The last segment must contain an 'End-Length' named field 
specifying the total length, in bytes, of all segment content if
reassembled. The last segment may also contain a 'Truncated' field, if
appropriate. Segments other than the first should contain no other
named parameters, as they merely serve to continue the record data block
of the first record.</t>
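
<t>As a non-normative illustration, the following Python sketch shows
how a writer might attach the segmentation fields described above. The
function name, the dictionary layout, and the choice to split purely on
block size (ignoring header overhead) are assumptions made for this
example and are not defined by this specification.</t>

<figure><artwork><![CDATA[
```python
def split_into_segments(block, record_type, origin_id, max_len,
                        truncated=None):
    """Split `block` (bytes) into segment descriptions.

    Hypothetical helper: each result is a dict holding the record type,
    the named fields to write, and that segment's share of the data.
    """
    chunks = [block[i:i + max_len] for i in range(0, len(block), max_len)]
    segments = []
    for number, chunk in enumerate(chunks, start=1):
        fields = {"Segment-Number": str(number)}
        if number > 1:
            # Later segments point back at the record holding segment 1.
            fields["Segment-Origin-ID"] = origin_id
        if number == len(chunks):
            # Only the last segment states the total reassembled length.
            fields["End-Length"] = str(len(block))
            if truncated:
                fields["Truncated"] = truncated
        segments.append({
            # The first segment keeps the original record type.
            "record_type": record_type if number == 1 else "continuation",
            "fields": fields,
            "data": chunk,
        })
    return segments
```
]]></artwork></figure>

<t>A block that fits in a single chunk needs no segmentation, so a
caller would normally use such a helper only for oversized blocks.</t>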

<t>To reassemble all segments into the intended complete logical
record, all records with the same 'Segment-Origin-ID' value must be
collected and appended, in 'Segment-Number' order, to the origin
record.</t>
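
<t>The reassembly rule can be sketched in Python as follows. This is
an illustration only: it assumes records have already been parsed into
dictionaries with hypothetical 'record_id', 'fields', and 'data' keys,
which this specification does not define.</t>

<figure><artwork><![CDATA[
```python
def reassemble(records, origin_id):
    """Rebuild the logical block that starts in the record `origin_id`."""
    origin = next(r for r in records if r["record_id"] == origin_id)
    later = [r for r in records
             if r["fields"].get("Segment-Origin-ID") == origin_id]
    # Sort numerically so that segment 10 follows segment 9, not 1.
    later.sort(key=lambda r: int(r["fields"]["Segment-Number"]))
    data = origin["data"] + b"".join(r["data"] for r in later)
    # If the last segment declares End-Length, use it as a sanity check.
    last = later[-1] if later else origin
    end_length = last["fields"].get("End-Length")
    if end_length is not None and int(end_length) != len(data):
        raise ValueError("reassembled length does not match End-Length")
    return data
```
]]></artwork></figure>

<t>Comparing the result against 'End-Length' is a cheap way to detect
a missing segment before the data is used.</t>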


    </section>
  </section>

  <section title="WARC Application to Specific Protocols">

   <section title="HTTP and HTTPS">

<t>A full HTTP or HTTPS response, with protocol information and
content-body (if any), can be saved verbatim into a WARC file as a
'response' type record, with a MIME content-type of 'application/http'
(or 'application/http; msgtype=response').</t>

<t>A full HTTP or HTTPS request, including all request headers and
content-body (if any), can similarly be saved verbatim into a WARC
file as a 'request' type record, with a MIME content-type of
'application/http' (or 'application/http; msgtype=request').</t>

<t>For either a request or response, an 'IP-Address' field should be
used to record the network IP address to which the request was
directed, using the best available DNS information at the time.</t>

<t>Additional metadata about the HTTP or HTTPS transaction may be
stored in a 'metadata' type record, in a format to be specified
elsewhere. In particular, information about the secure session in
which an HTTPS transaction occurs, such as certificates presented or
consulted and authentication information exchanged, may be stored in
one or more 'metadata' type records.</t>

<t>The multiple records that pertain to a single HTTP or HTTPS
transaction form a logical group, and each has its own unique record-id
value. In order to associate the records, all but one must use
'Related-Record-ID' fields to refer to another record in the set.</t>

<t>As any mixture of record types may appear for a single collection
event, and in any order, there is no specific record type which is
automatically considered primary. Generally, all may refer back to the
one record which appeared first, but this is not required. (A
'request' record may refer to a 'response' record or vice-versa;
either could refer to a 'metadata' record or a 'metadata' record could
refer to either.) Multiple and bidirectional 'Related-Record-ID'
fields may appear.</t>
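
<t>Because links may point in either direction and no record is
primary, a reader that wants the full set of records for one collection
event has to treat 'Related-Record-ID' fields as undirected links. The
Python sketch below groups record-ids this way with a small union-find
structure. The input mapping is an assumed, already-parsed form and is
not part of this specification.</t>

<figure><artwork><![CDATA[
```python
def group_related(records):
    """Group record-ids connected by Related-Record-ID links.

    `records` maps each record-id to the list of record-ids named in
    its Related-Record-ID fields (hypothetical parsed representation).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for record_id, related in records.items():
        find(record_id)
        for other in related:
            # Merge the two groups regardless of link direction.
            parent[find(record_id)] = find(other)

    groups = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), set()).add(record_id)
    return list(groups.values())
```
]]></artwork></figure>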

<t>In the case where resources from a website have been harvested or
otherwise received without performing normal HTTP operations, or where
HTTP protocol information has been lost, it may be appropriate to
store the plain content in WARC 'resource' type records, under their
original subject-uri, but using the content MIME type in place of the
'application/http' type.</t>

   </section>

   <section title="DNS">

<t>A request for DNS information can be summarized in a URI in
accordance with <xref target="RFC4501" />. DNS information as
retrieved can be represented in
the formats specified by <xref target="RFC1035"/>, <xref
target="RFC2540" />, and <xref target="RFC4027"/>.</t>

<t>The results of a DNS lookup can thus be straightforwardly archived
in a WARC 'response' record under the appropriate DNS URI and MIME
type.  If present, the IP-Address named field should be the address of
the DNS server that provided the DNS record.
</t>

   </section>

   <section title="Other Resources with URIs, and Other Protocols">

<t>Any resource that can be identified with a URI, even if it is not
retrieved via an Internet operation, may be archived in a WARC file
under a 'resource' type record. This includes files that have
meaningful URIs retrieved from a locally-accessible filesystem or
other repository.</t>

<t>Specific conventions for other protocols and media types are
expected to be defined as necessary.  In general, the WARC format
should be capable of archiving any digital resource which has a URI, a
specific time of collection, and a discrete length.
</t>

<t>The 'request' and 'response' record types should be used for
verbatim or lossless transcripts of collection activity, including
protocol information. The 'resource' record type should be used for
content without any protocol-specific enveloping. Additional
information about a resource or transaction can be supplied in a
protocol- or media-appropriate manner with 'metadata' type
records.</t>

   </section>
  </section>

  <section title="Registration of MIME Media Type application/warc">

<t>This section describes, as per <xref target="RFC2048"/>, the MIME 
types associated with the WARC format.</t>

<t>MIME media type name: application</t>

<t>MIME subtype name: warc</t>

<t>Required parameters: None</t>

<t>Optional parameters: None</t>

<t>Encoding considerations:</t> 
<t>Content of this type is in 'binary' format.</t>

<t>Security considerations:</t>

<t>The WARC record syntax poses no direct risk to computers and
networks. Implementors need to be aware of source authority and
trustworthiness of information structured in WARC.  Readers and
writers subject themselves to all the risks that accompany normal
operation of data processing services (e.g., message length errors,
buffer overflow attacks).</t>

<t>Interoperability considerations: None</t>

<t>Published specification: TBD</t>

<t>Applications which use this media type: Large- and small-scale
archiving</t>

<t>Additional information: None</t>

<t>Person and email address to contact for further information:</t>

<t>Gordon Mohr gojomo@archive.org, John Kunze jak@ucop.edu</t>

<t>Intended usage: COMMON</t>

<t>Author/Change controller: IESG</t>

  </section>

  <section title="IANA Considerations">

<t>After IESG approval, IANA is expected to register the WARC type
"application/warc" using the application provided in this document.</t>

  </section>

  <section title="Acknowledgments">

<t>This document could not have been written without major
contributions from participants of the International Internet
Preservation Consortium, especially Steen Christensen and Julien
Masanes.</t>

  </section>

  <appendix anchor="recordids" 
    title="Considerations in Choice of record-id">

<t>The WARC format differs significantly from the ARC format in
requiring the record-id parameter. The record-id must be globally
unique for its period of intended use. If that period is indefinite,
the record-id should be maintained to a level appropriate for any
persistent identifier, in which case identifier opaqueness is usually
desirable.</t>

<t>There is no reason why the archiving institution may not choose
record-ids that are also "actionable" (submittable as retrieval
requests to widely available tools such as web browsers) as long as
there are providers to service them. This specification does not
dictate what identifier scheme to use; suitable schemes include <xref target="RFC2141">URN</xref>, <xref target="ARK"/>, 
<xref target="GUID"/>, etc.</t>

<t>Also worth considering is the establishment of lexical conventions
for record-ids that reveal or suggest relationships among content
blocks. Although the 'Related-Record-ID' parameter required of
'metadata', 'revisit', and 'conversion' records is sufficient to
convey relatedness in the context of a single WARC file, great
optimization can be had when relatedness can be inferred by third
parties through identifier comparison rather than by lookup in a
database or examination of the relevant WARC files.</t>

<t>These conventions are suggested by <xref target="RFC2396"/>,
formalized by the <xref target="ARK"/> scheme, and are applicable to
such things as the summarizing of large search results from
Internet-wide indexing engines. As an example of a convention that
could be adopted by users of any identifier scheme, the "/" character
could be reserved as a separator used to introduce an extension string
that is appended to a primary record-id. For example, suppose the
record-id of a primary block of captured content were:</t>

<figure><artwork>
http://abc.org/12026/987654321
</artwork></figure>

<t>The convention could also reserve the extension strings "_s", "_d",
and "_t" to indicate record-ids for secondary, duplicate, and
transform blocks, respectively. Over time this might result in the
assignment of record-ids such as:</t>

<figure><artwork>
http://abc.org/12026/987654321/_s1
http://abc.org/12026/987654321/_s2
http://abc.org/12026/987654321/_d9
http://abc.org/12026/987654321/_d10
http://abc.org/12026/987654321/_t
</artwork></figure>

<t>...in which an integer count may further extend the identifier 
when there is more than one relationship of the given type.</t>
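
<t>To make the idea of inferring relatedness by identifier comparison
concrete, here is a short Python sketch of the example convention. The
function names and the keyword spellings ('secondary', 'duplicate',
'transform') are invented for this illustration; only the "/"
separator and the "_s", "_d", and "_t" strings come from the example
above.</t>

<figure><artwork><![CDATA[
```python
SUFFIXES = {"secondary": "_s", "duplicate": "_d", "transform": "_t"}

def derived_record_id(primary_id, kind, count=None):
    """Append an extension string such as '/_s1' to a primary record-id."""
    ext = SUFFIXES[kind] + ("" if count is None else str(count))
    return primary_id + "/" + ext

def primary_of(record_id):
    """Return the primary record-id implied by a derived record-id."""
    head, sep, tail = record_id.rpartition("/")
    marker, count = tail[:2], tail[2:]
    if sep and marker in SUFFIXES.values() and (not count or count.isdigit()):
        return head
    # No recognised extension string: treat the id as primary.
    return record_id
```
]]></artwork></figure>

<t>Two record-ids can then be judged related without any database
lookup by checking whether they share the same primary record-id.</t>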

  </appendix>


  <appendix title="Compression Recommendations">

<t>The WARC format defines no internal compression. Whether and how
WARC files should be compressed is an external decision.</t>

<t>However, experience with the precursor ARC format at the Internet
Archive has demonstrated that applying simple standard compression can
result in significant storage savings, while preserving random access
to individual records.</t>

<t>For this purpose, the GZIP format with customary "deflate"
compression is recommended, as defined in <xref target="RFC1950"/>, 
<xref target="RFC1951"/>, and <xref target="RFC1952"/>.
 Source code implementing this format is freely
available, and the technique is free of patent encumbrances. The GZIP
format is also widely used and supported across many free and
commercial software packages and operating systems.</t>

<t>This section documents recommended, but optional, practices for
compressing WARC files with GZIP.</t>

   <appendix title="Record-at-a-time Compression">

<t>Per section 2.2 of the GZIP specification, a valid GZIP file
consists of any number of gzip "members", each independently
compressed.</t>

<t>Where possible, this property should be exploited to compress each
record of a WARC file independently. This results in a valid GZIP file
whose per-record subranges also stand alone as valid GZIP files.</t>

<t>External indexes of WARC file content may then be used to record
each record's starting position in the GZIP file, allowing for random
access of individual records without requiring decompression of all
preceding records.</t>

<t>Note that the application of this convention causes no change to
the uncompressed contents of an individual WARC record. In particular,
the declared record length remains the length of the uncompressed
record.</t>
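
<t>The following Python sketch illustrates record-at-a-time compression
and indexed random access using the standard 'gzip' and 'zlib'
modules. The function names and the in-memory list of offsets are
assumptions for this example; a real external index would be stored
separately and is not specified here.</t>

<figure><artwork><![CDATA[
```python
import gzip
import io
import zlib

def write_members(records, out):
    """Write each record as its own gzip member; return start offsets."""
    offsets = []
    for record in records:
        offsets.append(out.tell())
        out.write(gzip.compress(record))
    return offsets

def read_member(f, offset, chunk_size=4096):
    """Decompress only the gzip member that starts at `offset`."""
    f.seek(offset)
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    while not d.eof:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        parts.append(d.decompress(chunk))
    return b"".join(parts)
```
]]></artwork></figure>

<t>Because the concatenation of members is itself a valid GZIP file,
ordinary tools can still decompress the whole file in one pass, while
an index of offsets lets a reader jump straight to one record.</t>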

   </appendix>

   <appendix title="GZIP WARC File Name Suffix">
   <t>A GZIP-compressed WARC file should have the customary ".gz"
   appended to its name, making the complete suffix ".warc.gz".
   </t>

   </appendix>
 </appendix>

  <appendix title="WARC File Name and Size Recommendations">

<t>It is helpful to use practices within an institution that make it
unlikely or impossible to duplicate aggregate WARC file names. The
convention used inside the Internet Archive with ARC files is to name
files according to the following pattern:</t>

<t>Prefix-Timestamp-Serial-Crawlhost.warc.gz</t>

<t>Prefix is an abbreviation usually reflective of the project or
crawl that created this file.  Timestamp is a 14-digit GMT timestamp
indicating the time the file was initially begun. Serial is an
increasing serial-number within the process creating the files, often
(but not necessarily) unique with regard to the Prefix. Crawlhost is
the domain name or IP address of the machine creating the file.</t>
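
<t>As an illustration, a file name following this pattern could be
built as in the Python sketch below. The function name and the
zero-padding of the serial number to five digits are assumptions; the
pattern above does not fix a width for the serial.</t>

<figure><artwork><![CDATA[
```python
from datetime import datetime, timezone

def warc_file_name(prefix, started, serial, crawlhost):
    """Build Prefix-Timestamp-Serial-Crawlhost.warc.gz.

    `started` must be a timezone-aware datetime; it is converted to
    GMT and rendered as 14 digits (YYYYMMDDhhmmss).
    """
    timestamp = started.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # Five-digit padding is an arbitrary choice for this sketch.
    return f"{prefix}-{timestamp}-{serial:05d}-{crawlhost}.warc.gz"
```
]]></artwork></figure>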

<t>IIPC member institutions have expressed an interest in adopting a
common naming strategy, with per-institution unique identifiers to
assist in marking WARC files with their institution of origin. It is
proposed that all such WARC file names adhering to this future
convention begin with "iipc".</t>

<t>This specification does not require any particular WARC file naming
practice, but recommends conventions similar to the above be adopted
within WARC-creating institutions.  The file name prefix "iipc" should
be avoided unless participating in the IIPC naming registry.</t>

<t>1GB (10^9 bytes) is recommended as a practical target size for
WARC files, when record sizes allow. Oversized records may be
truncated, segmented, or simply placed in oversized WARC files, at a
project's discretion.</t>

  </appendix>

  <appendix title="Collected ABNF for WARC">
   <!--
       TODO: The dot after in ANVL zero?
   -->

<figure>
 <artwork>
  warc-file     = 1*warc-record
  warc-record   = header block CRLF CRLF
  header        = header-line CRLF *anvl-field CRLF
  block         = *OCTET

  header-line   = warc-id tsp data-length tsp record-type tsp
                    subject-uri tsp creation-date tsp
                    record-id tsp content-type
  tsp           = 1*WSP

  warc-id       = "WARC/" 1*DIGIT "." 1*DIGIT
  data-length   = 1*DIGIT
  record-type   = "warcinfo" / "response" / "resource" / "request" /
                    "metadata" / "revisit" / "conversion" /
                    "continuation" / future-type
  future-type   = 1*VCHAR
  subject-uri   = uri
  uri           = &lt;'URI' per RFC3986>
  creation-date = timestamp
  timestamp     = 14*14DIGIT        ; GMT formatted as YYYYMMDDhhmmss
  record-id     = uri
  content-type  = type "/" subtype *(";" parameter)
  type          = &lt;'type' per Section 5.1 of RFC2045>
  subtype       = &lt;'subtype' per Section 5.1 of RFC2045>
  parameter     = &lt;'parameter' per Section 5.1 of RFC2045>

  anvl-field    = field-name ":" [ field-body ] CRLF
  field-name    = 1*&lt;any CHAR, excluding control-chars and ":">
  field-body    = text [CRLF 1*WSP field-body]
  text          = 1*&lt;any UTF-8 character, including bare
                    CR and bare LF, but NOT including CRLF>
                    ; (Octal, Decimal.)
  CHAR          = &lt;any ASCII/UTF-8 character> ; (0-177,  0.-127.)
  CR            = &lt;ASCII CR, carriage return> ; (   15,      13.)
  LF            = &lt;ASCII LF, linefeed>        ; (   12,      10.)
  SPACE         = &lt;ASCII SP, space>           ; (   40,      32.)
  HTAB          = &lt;ASCII HT, horizontal-tab>  ; (   11,       9.)
  CRLF          = CR LF
  WSP           = SPACE / HTAB                   ; semantics = SPACE
 </artwork>
</figure>
<t>All-caps "core" elements in the above are as defined in 
<xref target="RFC2234" />.
</t>
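
<t>The header-line production can be approximated with a regular
expression, as in the following Python sketch. It is a simplification
rather than a validating parser: URIs and the content-type are matched
loosely, and the function and key names are invented for this
example.</t>

<figure><artwork><![CDATA[
```python
import re

HEADER_LINE = re.compile(
    rb"(WARC/\d+\.\d+)[ \t]+"   # warc-id
    rb"(\d+)[ \t]+"             # data-length
    rb"(\S+)[ \t]+"             # record-type
    rb"(\S+)[ \t]+"             # subject-uri
    rb"(\d{14})[ \t]+"          # creation-date
    rb"(\S+)[ \t]+"             # record-id
    rb"(\S.*)"                  # content-type, may carry parameters
)

NAMES = ("warc_id", "data_length", "record_type", "subject_uri",
         "creation_date", "record_id", "content_type")

def parse_header_line(line):
    """Split one header-line (bytes) into its positional parameters."""
    m = HEADER_LINE.fullmatch(line.rstrip(b"\r\n"))
    if m is None:
        raise ValueError("not a WARC header-line")
    fields = dict(zip(NAMES, (g.decode("ascii") for g in m.groups())))
    fields["data_length"] = int(fields["data_length"])
    return fields
```
]]></artwork></figure>

<t>Note that the examples in this document fold the header-line over
several lines for readability; such text would have to be joined into a
single line before being passed to a parser like this one.</t>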
  </appendix>

  <appendix title="Examples of WARC Records">

<t>Examples of each record-type are provided here. In some cases,
illustrative data is shown where conventions have not yet been
specified. Each record header-line is split over multiple lines for
readability; continuations of the single line are indented, and a
newline should only be considered to appear at the end of the last
indented line. Declared record lengths are approximate, and unique IDs
and checksums shown are plausible random filler.</t>

   <appendix title="Example of 'warcinfo' Record">

<t>The following 'warcinfo' example includes an XML description of the
enclosing WARC file that is loosely modelled after the descriptions
currently used in Internet Archive ARC files.  However, this is an
abbreviated and speculative illustration; the referenced
WARC-specific namespace 'http://archive.org/warc/0.10/' has not been
formally defined anywhere, and may not reflect eventual practice with
WARC files.</t>

<figure>
 <artwork><![CDATA[
WARC/0.10          223 warcinfo
    urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39
    20060919172014 urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39
    text/xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<warcmetadata
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:warc="http://archive.org/warc/0.10/">
  <warc:software>
  Heritrix 1.4.0 http://crawler.archive.org
  </warc:software>
  <warc:hostname>crawling017.archive.org</warc:hostname>
  <warc:ip>207.241.227.234</warc:ip>
  <dcterms:isPartOf>testcrawl-20050708</dcterms:isPartOf>
  <dc:description>testcrawl with WARC output</dc:description>
  <warc:operator>IA_Admin</warc:operator>
  <warc:http-header-user-agent>
  Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)
  </warc:http-header-user-agent>
  <dc:format>WARC file version 0.10</dc:format>
  <dcterms:conformsTo xsi:type="dcterms:URI">
  http://www.archive.org/documents/WarcFileFormat-0.10.html
  </dcterms:conformsTo>
</warcmetadata>
]]></artwork>
</figure>

<t>The first line (spread over four lines for readability) shows the
required line of positional parameters.  The
<spanx style="emph">data-length</spanx> is space-padded and the
<spanx style="emph">subject-uri</spanx> for this warcinfo
record is the same as its <spanx style="emph">record-id</spanx>.
This record has no named parameters, as evidenced by the single blank
line following the header-line. The content block is 'text/xml', as
declared in the header-line.  Two newlines follow the content
block.</t>

   </appendix>

   <appendix title="Example of 'request' Record">

<t>A 'request' record captures the protocol request used to collect a
resource. For example, to collect the resource
'http://www.archive.org/images/logoc.jpg', the following 'request'
record might be generated:</t>

<figure>
 <artwork><![CDATA[
WARC/0.10          501 request http://www.archive.org/images/logoc.jpg
    20060919172024 urn:uuid:4885803b-eebd-4b27-a090-144450c11594
    application/http;msgtype=request
Related-Record-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0

GET /images/logoc.jpg HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)
From: stack@example.org
Connection: close
Referer: http://www.archive.org/
Host: www.archive.org
Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824
]]></artwork>
</figure>

<t>
The <spanx style="emph">data-length</spanx> is space-padded.
The Related-Record-ID named field points to the related response
record (see <xref target="example_response" format="title" />).</t>

   </appendix>
   
   <appendix anchor="example_response" 
   title="Example of 'response' Record">

<t>The response record to match the above example request might look
like the following:</t>

<figure>
 <artwork><![CDATA[
WARC/0.10         2210 response http://www.archive.org/images/logoc.jpg
    20060919172024 urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0
    application/http;msgtype=response
Checksum: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
IP-Address: 207.241.233.58

HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg

[image/jpeg binary data here]
]]></artwork>
</figure>

<t>Note that the creation-date for the 'response' is identical to that
of the previous 'request' record.  The IP-Address named field is that
of the server that generated the response.</t>

   </appendix>

   <appendix title="Example of 'resource' Record">

<t>This same file, 'logoc.jpg', might be archived internally to an
organization under its local filesystem name. This could result in a
'resource' record:</t>

<figure>
 <artwork><![CDATA[
WARC/0.10 2210 resource file:///var/www/htdoc/images/logoc.jpg
    20060919172024 urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0
    image/jpeg
Checksum: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2

[image/jpeg binary data here]
]]></artwork>
</figure>

   </appendix>

   <appendix title="Example of 'metadata' Record">

<t>If some crawl-time metadata should be archived near the above
<xref target="example_response" format="title" />, a 'metadata'
record could be used like the following (using ANVL format):</t>

<figure>
 <artwork><![CDATA[
WARC/0.10 261 metadata http://www.archive.org/images/logoc.jpg
    20060919172024 urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943
    text/anvl
Related-Record-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0

via: http://www.archive.org/
pathFromSeed: E
downloadTimeMS: 565
]]></artwork>
</figure>

<t>Note again the same creation-date as the preceding related
records. A relationship is declared to the preceding 'response' record
but declaring a relationship to the 'request' would also be legal.</t>

   </appendix>

   <appendix title="Example of 'revisit' Record">

<t>If the same URI is later revisited and the content is unchanged, a
'revisit' record like the following (again with a speculative
content-type) could be generated:</t>

<figure>
 <artwork><![CDATA[
WARC/0.10 395 revisit http://www.archive.org/images/logoc.jpg
    20060919190040 urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb
    text/xml 
Related-Record-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0

<?xml version="1.0"?>
<revisit xmlns="http://archive.org/revisit/0.10/">
  <server-response-excerpt>
  HTTP/1.x 304 Not Modified
  Date: Mon, 08 Aug 2005 01:01:01 GMT
  Etag: "914480-1b2e-42ab8245"
  </server-response-excerpt>
</revisit>
]]></artwork>
</figure>

<t>Again, reference is made back to the original 'response' record. A
new creation-date reflects the time of revisit. This content block
hypothesizes including header excerpts from a server response to
explain the results of the revisit. (In this case, the remote server
indicated the resource was unchanged from the previous 'Etag' value.)
The actual formats for describing the result of a revisit remain to be
defined.</t>

   </appendix>
 
   <appendix title="Example of 'conversion' Record">

<t>At some future date, the 'image/jpeg' format may no longer be
considered viable, prompting a conversion of the original archive
content into a hypothetical new format, 'image/neoimg', which
generates a 3098-byte version of the same image. This could be
accommodated with a 'conversion' record:</t>

<figure>
 <artwork><![CDATA[
WARC/0.10 4111 conversion http://www.archive.org/images/logoc.jpg
    20160919190040 urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd
    image/neoimg
Related-Record-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0
Checksum: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK

[image/neoimg binary data here]
]]></artwork>
</figure>

<t>An accompanying 'metadata' record, referring to this 'conversion'
record, could contain additional details about the
transformation. (Alternatively, new named-fields in this record could
serve this role.)</t>

   </appendix>

   <appendix title="Example of 'continuation' Record">

<t>If the 'response' above had been so large that it would not fit
into a single WARC file of desired maximum size, it would have to be
segmented into separate smaller records. The first record would be as
before, except with one additional named field, 'Segment-Number', with
a value of '1', indicating that the record was the beginning of a
segmented record set.</t>

<t>The subsequent segment for that record would then look like
this:</t>

<figure>
 <artwork><![CDATA[
WARC/0.10 39514322 continuation http://www.archive.org/images/logoc.jpg
    20060919172024 urn:uuid:16da6da0-bcdc-49c3-927e-57494593eeee
    application/http;msgtype=response
Segment-Origin-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0
Segment-Number: 2

[Segment 2 application/http binary data here]
]]></artwork>
</figure>

<t>Note that the 'Segment-Origin-ID' refers to the first segment of
the set, the one with the 'Segment-Number: 1' named field.</t>

   </appendix>
  </appendix>
 </middle>

 <back>

   <references>
    
    <reference anchor="ANVL"
      target="http://www.cdlib.org/inside/diglib/ark/anvlspec.pdf">
     <front>
      <title>A Name-Value Language</title>
      <author initials="J." surname="Kunze" fullname="John A. Kunze" />
      <author initials="B." surname="Kahle" fullname="Brewster Kahle" />
      <author initials="J." surname="Masanes" fullname="Julien Masanes" />
      <author initials="G." surname="Mohr" fullname="Gordon Mohr" />
     </front>
     <format type="PDF" 
       target="http://www.cdlib.org/inside/diglib/ark/anvlspec.pdf" />
    </reference>

    <reference anchor="ARC"
      target="http://www.archive.org/web/researcher/ArcFileFormat.php">
     <front>
      <title>The ARC File Format</title>
      <author initials="M." surname="Burner" fullname="Mike Burner" />
      <author initials="B." surname="Kahle" fullname="Brewster Kahle" />
      <date month="September" year="1996" />
     </front>
     <format type="HTML"
       target="http://www.archive.org/web/researcher/ArcFileFormat.php" />
    </reference>

    <reference anchor="ARK"
      target="http://www.cdlib.org/inside/diglib/ark/arkspec.pdf">
     <front>
      <title>The ARK Persistent Identifier Scheme</title>
      <author initials="J." surname="Kunze" fullname="John A. Kunze" />
      <author initials="R." surname="Rodgers" fullname="R. P. C. Rodgers" />
      <date month="August" year="2005" />
     </front>
     <format type="PDF"
       target="http://www.cdlib.org/inside/diglib/ark/arkspec.pdf" />
    </reference>

    <reference anchor="GUID"
      target="http://en.wikipedia.org/wiki/GUID">
     <front>
      <title>Wikipedia: Globally Unique Identifiers</title>
     </front>
     <format type="HTML" 
       target="http://en.wikipedia.org/wiki/GUID" />
    </reference>

    <reference anchor="HERITRIX"
      target="http://crawler.archive.org"> 
     <front>
      <title>Heritrix Open Source Archival Web Crawler</title>
     </front>
     <format type="HTML"
       target="http://crawler.archive.org" />
    </reference>

    <reference anchor="IIPC"
      target="http://www.netpreserve.org/">
     <front>
      <title>International Internet Preservation Consortium (IIPC)</title>
     </front>
     <format type="HTML"
       target="http://www.netpreserve.org/" />
    </reference>

    <reference anchor="RDF"
      target="http://www.w3.org/RDF/">
     <front>
       <title>Resource Description Framework (RDF)</title>
     </front>
     <format type="HTML"
       target="http://www.w3.org/RDF/" />
    </reference>
    
    &rfc0822; <!-- mail: "Format of ARPA Internet Text Messages" -->
  
    <!-- &rfc1034;  DNS; currently unreferenced -->

    &rfc1035; <!-- DNS -->

    &rfc1884; <!-- IPv6 format -->

    &rfc1950; <!-- ZLIB -->

    &rfc1951; <!-- DEFLATE -->

    &rfc1952; <!-- GZIP -->

    &rfc2045; <!-- MIME -->
  
    &rfc2048; <!-- MIME: registration -->

    &rfc2141; <!-- URN -->

    &rfc2234; <!-- ABNF -->

    &rfc2396; <!-- URI -->

    &rfc2540; <!-- detached DNS -->

    &rfc2616; <!-- HTTP/1.1 -->
  
    &rfc4027; <!-- DNS media types -->

    &rfc4501; <!-- DNS URI -->

  </references>

 </back>

</rfc>
