IIPC Framework Working Group
 J. Kunze, Ed.
 California Digital Library
 A. Arvidson
 Kungliga biblioteket (National
 Library of Sweden)
 G. Mohr
 M. Stack
 Internet Archive
 March 2007


The WARC File Format (Version 0.16)

Abstract

The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. Resources are dated, identified by URIs, and preceded by simple text headers. By convention, files of this format are named with the extension ".warc" and have the MIME type application/warc. The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers. This document specifies version 0.16 of the WARC format.



Table of Contents

1.  Introduction
2.  Goals
3.  File and Record Model
4.  Named Fields
    4.1.  WARC-Record-ID (REQUIRED)
    4.2.  Content-Length (REQUIRED)
    4.3.  WARC-Date (REQUIRED)
    4.4.  WARC-Type (REQUIRED)
    4.5.  Content-Type
    4.6.  WARC-Concurrent-To
    4.7.  WARC-Block-Digest
    4.8.  WARC-Payload-Digest
    4.9.  WARC-IP-Address
    4.10.  WARC-Refers-To
    4.11.  WARC-Target-URI
    4.12.  WARC-Truncated
    4.13.  WARC-Warcinfo-ID
    4.14.  WARC-Filename
    4.15.  WARC-Profile
    4.16.  WARC-Identified-Payload-Type
    4.17.  WARC-Segment-Number
    4.18.  WARC-Segment-Origin-ID
    4.19.  WARC-Segment-Total-Length
5.  WARC Record Types
    5.1.  'warcinfo'
    5.2.  'response'
        5.2.1.  for 'http' and 'https' schemes
        5.2.2.  for other URI schemes
    5.3.  'resource'
        5.3.1.  for 'http' and 'https' schemes
        5.3.2.  for 'ftp' scheme
        5.3.3.  for 'dns' scheme
        5.3.4.  for other URI schemes
    5.4.  'request'
        5.4.1.  for 'http' and 'https' schemes
        5.4.2.  for other URI schemes
    5.5.  'metadata'
    5.6.  'revisit'
        5.6.1.  Profile: Identical Payload Digest
        5.6.2.  Profile: Server Not Modified
        5.6.3.  Other profiles
    5.7.  'conversion'
    5.8.  'continuation'
6.  Record Segmentation
7.  Registration of MIME Media Types application/warc and application/warc-fields
    7.1.  application/warc
    7.2.  application/warc-fields
8.  IANA Considerations
9.  Acknowledgments
Appendix A.  Compression Recommendations
Appendix A.1.  Record-at-a-time Compression
Appendix A.2.  GZIP WARC File Name Suffix
Appendix B.  WARC File Size and Name Recommendations
Appendix C.  Examples of WARC Records
Appendix C.1.  Example of 'warcinfo' Record
Appendix C.2.  Example of 'request' Record
Appendix C.3.  Example of 'response' Record
Appendix C.4.  Example of 'resource' Record
Appendix C.5.  Example of 'metadata' Record
Appendix C.6.  Example of 'revisit' Record
Appendix C.7.  Example of 'conversion' Record
Appendix C.8.  Example of Segmentation ('continuation' record)
10.  References
§  Authors' Addresses





1.  Introduction

Web sites and web pages emerge and disappear from the world wide web every day. For the past ten years, memory organizations have tried to find the most appropriate ways to collect and keep track of this vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program that browses the web in an automated manner according to a set of policies; starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page (e.g., links to other pages, images, videos, scripting or style instructions, etc.), and adds them to the list of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents a challenge.

At the same time, those same organizations have a rising need to archive large numbers of digital files not necessarily captured from the web (e.g., entire series of electronic journals, or data generated by environmental sensing equipment). A general requirement that appears to be emerging is for a container format that permits one file simply and safely to carry a very large number of constituent data objects for the purpose of storage, management, and exchange. Those data objects (or resources) must be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs only minimal knowledge of the nature of the objects.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is an extension of the ARC File Format (Burner, M. and B. Kahle, “The ARC File Format,” September 1996.) [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC file format has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC) [IIPC], whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive (IA). The California Digital Library and the Los Alamos National Laboratory also provided input on extending and generalizing the format.

The WARC format is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It will be used to build applications for harvesting (such as the open-source Heritrix [HERITRIX] web crawler), managing, accessing, and exchanging content.

Besides the primary content recorded in ARCs, the extended WARC format accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources. The extension may also be useful for more general applications than web archiving. To aid the development of tools that are backwards compatible, WARC content is clearly distinguishable from pre-revision ARC content.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).




2.  Goals

Goals of the WARC file format include the following.

The WARC file format is made sufficiently different from the legacy ARC format files so that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.




3.  File and Record Model

A WARC format file is the simple concatenation of one or more WARC records. The first record usually describes the records to follow. In general, record content is either the direct result of a retrieval attempt — web pages, inline images, URL redirection information, DNS hostname lookup results, standalone files, etc. — or is synthesized material (e.g., metadata, transformed content) that provides additional information about archived content.

A WARC record consists of a record header followed by a record content block and two newlines. The WARC record header consists of one first line declaring the record to be in the WARC format with a given version number, then a variable number of line-oriented named fields terminated by a blank line. With one major exception, allowing UTF-8, the WARC record header format largely follows the tradition of HTTP/1.1 (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.) [RFC2616] and [RFC2822] (Resnick, P., “Internet Message Format,” April 2001.) headers.

The top-level view of a WARC file can be expressed in an augmented Backus-Naur Form (BNF) grammar, reusing the augmented constructs defined in section 2.1 of HTTP/1.1 (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.) [RFC2616]. (In particular, note that to avoid the risk of confusion, where any WARC rule has the same name as an RFC2616 rule, the definition here has been made the same, EXCEPT in the case of the CHAR rule, which in WARC includes multibyte UTF-8 characters.)

  warc-file    = 1*warc-record
  warc-record  = header CRLF
                 block CRLF CRLF
  header       = version CRLF
                 warc-fields
  version      = "WARC/0.16" CRLF
  warc-fields  = *named-field CRLF
  block        = *OCTET

The record version-line appears first in every record and hence also begins the WARC file itself.
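
The layout described above can be illustrated with a minimal writer sketch. It follows the prose description (one version line, the named fields, a blank line, the block, then two CRLFs); the function name `build_record` and the sample field values are assumptions made for illustration and are not defined by this document.

```python
CRLF = b"\r\n"


def build_record(fields, block=b"", version="WARC/0.16"):
    """Serialize one record from (name, value) pairs and a block of octets."""
    head = [version.encode("ascii")]
    for name, value in fields:
        if name.lower() == "content-length":
            continue  # always recomputed from the block below
        head.append(f"{name}: {value}".encode("utf-8"))
    head.append(f"Content-Length: {len(block)}".encode("ascii"))
    # Header lines, the blank line, the block, then the two closing CRLFs.
    return CRLF.join(head) + CRLF + CRLF + block + CRLF + CRLF


# Hypothetical field values, for illustration only.
record = build_record(
    [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:00000000-0000-0000-0000-000000000001>"),
        ("WARC-Date", "2007-03-01T12:00:00Z"),
        ("WARC-Target-URI", "<file:///tmp/hello.txt>"),
        ("Content-Type", "text/plain"),
    ],
    b"hello",
)
```

Deriving Content-Length from the block, rather than accepting it from the caller, keeps the header and the block from disagreeing.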

The WARC record relies heavily on named fields. Each named field consists of a name followed by a colon (":") and the field value. Field names are case-insensitive. The field value MAY be preceded by any amount of linear whitespace (LWS), though a single space is preferred. Header fields can be extended over multiple lines by preceding each extra line with at least one space or tab character.

Named fields may appear in any order and field values may contain any UTF-8 character. Both defined-fields and extension-fields follow the generic named-field format. Extension fields may be used in extensions of the core format.

  named-field     = field-name ":" [ field-value ]
  field-name      = token
  field-value     = *( field-content | LWS )     ; further qualified
                                                 ; by field definitions
  field-content   = <the OCTETs making up the field-value
                    and consisting of either *TEXT or combinations
                    of token, separators, and quoted-string>
  token           = 1*<any CHAR except CTLs or separators>
  separators      = "(" | ")" | "<" | ">" | "@"
                      | "," | ";" | ":" | "\" | <">
                      | "/" | "[" | "]" | "?" | "="
                      | "{" | "}" | SP | HT
  TEXT            = <any OCTET except CTLs,
                    but including LWS>
  CHAR            = <UTF-8 characters; RFC3629>  ; (0-191, 194-244)
  CR              = <ASCII CR, carriage return>  ; (13)
  LF              = <ASCII LF, linefeed>         ; (10)
  SP              = <ASCII SP, space>            ; (32)
  HT              = <ASCII HT, horizontal-tab>   ; (9)
  CRLF            = CR LF
  LWS             = [CRLF] 1*( SP | HT )            ; semantics same as
                                                    ; single SP
  quoted-string   = ( <"> *(qdtext | quoted-pair ) <"> )
  qdtext          = <any TEXT except <">>
  quoted-pair     = "\" CHAR                ; single-character quoting
  uri             = "<" <'URI' per RFC3986> ">"
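
As a rough illustration of the named-field rules, the sketch below parses a header (without its version line) into name/value pairs, lowercasing names because they are case-insensitive and joining continuation lines with a single space because LWS has the semantics of a single SP. The function name `parse_fields` is an assumption; quoted-string and 'encoded-word' handling are deliberately left out.

```python
def parse_fields(header_bytes):
    """Parse named fields into a list of (lowercased-name, value) pairs."""
    fields = []
    for raw in header_bytes.decode("utf-8").split("\r\n"):
        if not raw:
            break  # the blank line terminates the named fields
        if raw[0] in " \t" and fields:
            # Continuation line: fold it into the previous field's value.
            name, value = fields[-1]
            fields[-1] = (name, (value + " " + raw.strip()).strip())
            continue
        name, _, value = raw.partition(":")
        fields.append((name.strip().lower(), value.strip()))
    return fields


sample = (
    b"WARC-Type: response\r\n"
    b"Content-Type: application/http;\r\n"
    b"\tmsgtype=response\r\n"
    b"content-length:42\r\n"
    b"\r\n"
)
parsed = parse_fields(sample)
```

A list of pairs is used instead of a dictionary so that a field appearing more than once is not silently lost.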

Although UTF-8 characters are allowed, the 'encoded-word' mechanism of [RFC2047] (Moore, K., “MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text,” November 1996.) MAY also be used when writing WARC fields and MUST also be understood by WARC reading software.

The rest of the WARC record grammar concerns defined-field parameters such as record identifier, record type, creation time, content length, and content type.

  defined-field  = WARC-Type
                 | WARC-Record-ID
                 | WARC-Date
                 | Content-Length
                 | Content-Type
                 | WARC-Concurrent-To
                 | WARC-Block-Digest
                 | WARC-Payload-Digest
                 | WARC-IP-Address
                 | WARC-Refers-To
                 | WARC-Target-URI
                 | WARC-Truncated
                 | WARC-Warcinfo-ID
                 | WARC-Filename                ; warcinfo only
                 | WARC-Profile                 ; revisit only
                 | WARC-Identified-Payload-Type
                 | WARC-Segment-Origin-ID       ; continuation only
                 | WARC-Segment-Number
                 | WARC-Segment-Total-Length    ; continuation only

Every WARC record has a type, reported in the WARC-Type field. There are eight WARC record types: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The relevant fields for each record type are further described in Section 5 (WARC Record Types). Each field's meaning and legal value format are described in Section 4 (Named Fields).

The record block contains OCTET content interpreted based on the record type and other header values. All records MUST include a Content-Length field to specify the length of the block.
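
Because the block is opaque and may itself contain CRLF sequences, a reader has to rely on Content-Length to find where each block ends. The following sketch shows one way to do that; the name `read_records` is an assumption, and for brevity it ignores folded header lines and keeps only the last value of a repeated field.

```python
import io


def read_records(stream):
    """Yield (version, fields, block) for each record in a binary stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of input
        fields = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b""):
                break  # the blank line ends the header
            name, _, value = line.decode("utf-8").partition(":")
            fields[name.strip().lower()] = value.strip()
        # Content-Length, not the block's content, decides where the block ends.
        block = stream.read(int(fields["content-length"]))
        stream.read(4)  # consume the two CRLFs that close the record
        yield version.strip().decode("ascii"), fields, block


data = (
    b"WARC/0.16\r\nWARC-Type: resource\r\nContent-Length: 7\r\n\r\nhel\r\nlo\r\n\r\n"
    b"WARC/0.16\r\nWARC-Type: metadata\r\nContent-Length: 0\r\n\r\n\r\n\r\n"
)
records = list(read_records(io.BytesIO(data)))
```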

Some record types also define a payload, such as a meaningful subset of the block or content from a predecessor record. Some headers pertain to the payload of a record rather than the block directly.

Content matching the warc-file rule has the MIME content-type "application/warc", registered below at Section 7.1 (application/warc).

Content matching only the warc-fields rule is useful as a simple descriptive format, and has MIME content-type "application/warc-fields", registered below at Section 7.2 (application/warc-fields).




4.  Named Fields

Named fields within a WARC record provide information about the current record, and allow additional per-record information. WARC both reuses appropriate headers from other standards and defines new headers, all beginning "WARC-", for WARC-specific purposes.

Because new fields may be defined in extensions to the core WARC format, WARC processing software MUST ignore fields with unrecognized names.




4.1.  WARC-Record-ID (REQUIRED)

An identifier assigned to the current record that is globally unique for its period of intended use. No identifier scheme is mandated by this specification, but each record-id MUST be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g., via a URI scheme prefix such as "http:" or "urn:"). Care should be taken to ensure that this value is written with no internal whitespace.

  WARC-Record-ID   = "WARC-Record-ID" ":" uri

All records MUST have a WARC-Record-ID field.
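
No identifier scheme is mandated, so the following is only one possible choice: a random UUID expressed in the "urn:uuid:" scheme and wrapped in the angle brackets required by the uri rule. The helper name `new_record_id` is an assumption.

```python
import uuid


def new_record_id():
    """Return a record identifier such as <urn:uuid:...> with no internal whitespace."""
    return f"<urn:uuid:{uuid.uuid4()}>"


record_id = new_record_id()
header_line = f"WARC-Record-ID: {record_id}"
```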




4.2.  Content-Length (REQUIRED)

The number of octets in the block, similar to [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.). If no block is present, a value of '0' (zero) MUST be used.

  Content-Length   = "Content-Length" ":" 1*DIGIT

All records MUST have a Content-Length field.




4.3.  WARC-Date (REQUIRED)

A UTC timestamp formatted according to the W3C profile of ISO8601, YYYY-MM-DDThh:mm:ssZ, described at [W3CDTF] (“Date and Time Formats (W3C profile of ISO8601)”). The timestamp MUST represent the instant that data collection for record creation began. Multiple records written as part of a single collection action MUST use the same WARC-Date, even though the times of their writing will not be exactly synchronized.

  WARC-Date   = "WARC-Date" ":" w3c-iso8601
  w3c-iso8601 = <YYYY-MM-DDThh:mm:ssZ>

All records MUST have a WARC-Date field.
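
A small sketch of producing the timestamp: the moment is converted to UTC and rendered to whole seconds with a literal "Z". The helper name `warc_date` is an assumption; rejecting naive datetimes is a design choice of this sketch, made so that a local time is never mislabelled as UTC.

```python
from datetime import datetime, timedelta, timezone


def warc_date(moment):
    """Format an aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# One timestamp is taken when collection begins and reused for every record
# written as part of the same collection action.
started = datetime(2007, 3, 1, 14, 30, 5, tzinfo=timezone(timedelta(hours=2)))
stamp = warc_date(started)
```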




4.4.  WARC-Type (REQUIRED)

The type of WARC record: one of 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', or 'continuation'. Types are further described in Section 5 (WARC Record Types).

A WARC file need not contain any particular record types, though starting all WARC files with a "warcinfo" record is RECOMMENDED.

  WARC-Type   = "WARC-Type" ":" record-type
  record-type = "warcinfo" | "response" | "resource"
              | "request" | "metadata" | "revisit"
              | "conversion" | "continuation" | future-type
  future-type = token

All records MUST have a WARC-Type field.

WARC processing software MUST ignore records of unrecognized type.




4.5.  Content-Type

The MIME type [RFC2045] (Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” November 1996.) of the information contained in the record's block. For example, in HTTP request and response records, this would be 'application/http' as per Section 19.1 of [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.) (or 'application/http; msgtype=request' and 'application/http; msgtype=response' respectively). In particular, the content-type is not the value of the HTTP Content-Type header in an HTTP response but a MIME type to describe the full archived HTTP message (hence 'application/http' if the block contains request or response headers).

  Content-Type   = "Content-Type" ":" media-type
  media-type     = type "/" subtype *( ";" parameter )
  type           = token
  subtype        = token
  parameter      = attribute "=" value
  attribute      = token
  value          = token | quoted-string

All records with a non-empty block (non-zero Content-Length), except 'continuation' records, SHOULD have a Content-Type field. A reader MAY attempt to guess the media type by inspecting the content and/or the name extension(s) of the URI used to identify the resource, but only if the media type is not given by a Content-Type field. If the media type remains unknown, the reader SHOULD treat it as type "application/octet-stream".




4.6.  WARC-Concurrent-To

The WARC-Record-IDs of any records created as part of the same operation as the current record.

  WARC-Concurrent-To = "WARC-Concurrent-To" ":" 1#uri

This field MAY be used to associate records of types 'request', 'response', 'resource', 'metadata', and 'revisit' with one another when they arise from a single collection action. (When so used, any WARC-Concurrent-To association MUST be considered bidirectional even if the header only appears on one record.)




4.7.  WARC-Block-Digest

An optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record.

  WARC-Block-Digest = "WARC-Block-Digest" ":" labelled-digest
  labelled-digest  = algorithm ":" digest-value
  algorithm        = token
  digest-value     = token

An example is a SHA-1 labeled Base32 ([RFC3548] (Josefsson, S., “The Base16, Base32, and Base64 Data Encodings,” July 2003.)) value:

WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ

This document recommends no particular algorithm.

Any record MAY have a WARC-Block-Digest field.
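
Since no algorithm is recommended, the sketch below uses SHA-1 with Base32 encoding purely as an example of building a labelled-digest value from a block. The function name `labelled_digest` and the sample block are assumptions.

```python
import base64
import hashlib


def labelled_digest(data, algorithm="sha1"):
    """Return 'algorithm:value', with the digest value encoded in Base32."""
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}:{base64.b32encode(digest).decode('ascii')}"


block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
header_line = f"WARC-Block-Digest: {labelled_digest(block)}"
```

A 20-octet SHA-1 digest encodes to exactly 32 Base32 characters with no padding, so the resulting value is a valid token under the grammar above.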




4.8.  WARC-Payload-Digest

An optional parameter indicating the algorithm name and calculated value of a digest applied to the payload referred to or contained by the record -- which is not necessarily equivalent to the record block.

  WARC-Payload-Digest = "WARC-Payload-Digest" ":" labelled-digest

An example is a SHA-1 labeled Base32 ([RFC3548] (Josefsson, S., “The Base16, Base32, and Base64 Data Encodings,” July 2003.)) value:

WARC-Payload-Digest: sha1:3EF4GH5IJ6KL7MN8OPQAB2CD

This document recommends no particular algorithm.

The payload of an application/http block is its 'entity-body' (per [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.)). In contrast to WARC-Block-Digest, the WARC-Payload-Digest field MAY also be used for data not actually present in the current record block, for example when a block is left off in accordance with a 'revisit' profile (see Section 5.6 ('revisit')).

The WARC-Payload-Digest field MAY be used on WARC records with a well-defined payload and MUST NOT be used on records without a well-defined payload.




4.9.  WARC-IP-Address

The numeric Internet address contacted to retrieve any included content. An IPv4 address MUST be written as a "dotted quad"; an IPv6 address MUST be written as per [RFC1884] (Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” December 1995.). For an HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record's target-URI.

  WARC-IP-Address   = "WARC-IP-Address" ":" (ipv4 | ipv6)
  ipv4              = <"dotted quad">
  ipv6              = <per section 2.2 of RFC1884>

The WARC-IP-Address field MAY be used on 'response', 'resource', 'request', 'metadata', and 'revisit' records, but MUST NOT be used on 'conversion' or 'continuation' records.
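
A writer may want to check that the value really is a numeric address before emitting the field. The sketch below leans on Python's standard `ipaddress` module; treating its textual output as acceptable under the RFC1884 text representation is an assumption of this sketch, and the helper name `ip_address_field` is hypothetical.

```python
import ipaddress


def ip_address_field(text):
    """Return a WARC-IP-Address header line, or raise ValueError for non-numeric input."""
    address = ipaddress.ip_address(text)  # rejects hostnames and malformed values
    return f"WARC-IP-Address: {address}"


line_v4 = ip_address_field("123.2.3.4")
line_v6 = ip_address_field("2001:db8::1")
```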




4.10.  WARC-Refers-To

The WARC-Record-ID of a single record for which the present record holds additional content.

  WARC-Refers-To     = "WARC-Refers-To" ":" uri

The WARC-Refers-To field MAY be used to associate a 'metadata' record to another record it describes. The WARC-Refers-To field MAY also be used to associate a record of type 'revisit' or 'conversion' with the preceding record which helped determine the present record content. The WARC-Refers-To field is undefined on other record types.




4.11.  WARC-Target-URI

The original URI whose collection gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the WARC-Target-URI appearing in the original record to which the newer record pertains. The URI in this value MUST be properly escaped according to [RFC3986] (Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifier (URI): Generic Syntax,” January 2005.) and written with no internal whitespace.

  WARC-Target-URI    = "WARC-Target-URI" ":" uri

All 'response', 'resource', 'request', 'revisit', and 'conversion' records MUST have a WARC-Target-URI field. A 'metadata' record MAY have a WARC-Target-URI field. A 'warcinfo' record MUST NOT have a WARC-Target-URI field.




4.12.  WARC-Truncated

For practical reasons, writers of the WARC format MAY place limits on the time or storage allocated to archiving a single resource. As a result, only a truncated portion of the original resource may be available for saving into a WARC record.

Any record MAY indicate that truncation of its content block has occurred and give the reason with a 'WARC-Truncated' field.

  WARC-Truncated    = "WARC-Truncated" ":" reason-token
  reason-token      = "length"         ; exceeds configured max length
                    | "time"           ; exceeds configured max time
                    | "disconnect"     ; network disconnect
                    | "unspecified"    ; other/unknown reason
                    | future-reason
  future-reason     = token

For example, if the collection of what appeared to be a multi-gigabyte resource was cut short after a transfer time limit was reached, the partial resource could be saved to a WARC record with this field.

The WARC-Truncated field MAY be used on any WARC record. The WARC field Content-Length MUST still report the actual truncated size of the record block.




4.13.  WARC-Warcinfo-ID

When present, indicates the WARC-Record-ID of the associated 'warcinfo' record for this record. Typically, the WARC-Warcinfo-ID field is used when the context of the applicable 'warcinfo' record is unavailable, such as after distributing single records into separate WARC files. WARC writing applications (such as web crawlers) MAY choose to always record this field.

  WARC-Warcinfo-ID = "WARC-Warcinfo-ID" ":" uri

The WARC-Warcinfo-ID field value overrides any association with a previously occurring (in the WARC) 'warcinfo' record, thus providing a way to protect the true association when records are combined from different WARCs.

The WARC-Warcinfo-ID field MAY be used in any record type except 'warcinfo'.




4.14.  WARC-Filename

The filename containing the current 'warcinfo' record.

  WARC-Filename = "WARC-Filename" ":" ( TEXT | quoted-string )

The WARC-Filename field MAY be used in 'warcinfo' type records and MUST NOT be used for other record types.




4.15.  WARC-Profile

A URI signifying the kind of analysis and handling applied in a 'revisit' record. (Like an XML namespace, the URI may, but need not, return human-readable or machine-readable documentation.) If reading software does not recognize the given URI as a supported kind of handling, it MUST NOT attempt to interpret the associated record block.

  WARC-Profile = "WARC-Profile" ":" uri

Section 5.6 ('revisit') defines two initial profile options for the WARC-Profile header for 'revisit' records.

The WARC-Profile field is REQUIRED on 'revisit' type records and undefined for other record types.




4.16.  WARC-Identified-Payload-Type

The content-type of the record's payload as determined by an independent check.

  WARC-Identified-Payload-Type = "WARC-Identified-Payload-Type" ":"
                                 media-type

The WARC-Identified-Payload-Type field MAY be used on WARC records with a well-defined payload and MUST NOT be used on records without a well-defined payload.




4.17.  WARC-Segment-Number

Reports the current record's relative ordering in a sequence of segmented records.

  WARC-Segment-Number = "WARC-Segment-Number" ":" 1*DIGIT

In the first segment of any record that is completed in one or more later 'continuation' WARC records, this parameter is REQUIRED. Its value there is "1". In a 'continuation' record, this parameter is also REQUIRED. Its value is the sequence number of the current segment in the logical whole record, increasing by 1 in each next segment.

See Section 6 (Record Segmentation) for full details on the use of WARC record segmentation.




4.18.  WARC-Segment-Origin-ID

Identifies the starting record in a series of segmented records whose content blocks are reassembled to obtain a logically complete content block.

  WARC-Segment-Origin-ID = "WARC-Segment-Origin-ID" ":"
                           <'msg-id' per RFC2045/RFC2822>

This field is REQUIRED on all 'continuation' records, and MUST NOT be used in other records. See Section 6 (Record Segmentation) for full details on the use of WARC record segmentation.




4.19.  WARC-Segment-Total-Length

In the final record of a segmented series, reports the total length of all segment content blocks when concatenated together.

  WARC-Segment-Total-Length = "WARC-Segment-Total-Length" ":" 1*DIGIT

This field is REQUIRED on the last 'continuation' record of a series, and MUST NOT be used elsewhere.

See Section 6 (Record Segmentation) for full details on the use of WARC record segmentation.
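
The three segment fields can be exercised together in a reassembly sketch. It assumes records have already been parsed into (fields, block) pairs with lowercased field names, and that WARC-Segment-Origin-ID carries the same text as the first segment's WARC-Record-ID; the function name `reassemble` and the identifiers in the example are hypothetical.

```python
def reassemble(first, continuations):
    """Concatenate a first segment and its 'continuation' records into one block."""
    first_fields, first_block = first
    if first_fields.get("warc-segment-number") != "1":
        raise ValueError("the first segment must carry WARC-Segment-Number 1")
    if not continuations:
        raise ValueError("a segmented record needs at least one continuation")
    origin_id = first_fields["warc-record-id"]
    ordered = sorted(continuations, key=lambda rec: int(rec[0]["warc-segment-number"]))
    parts = [first_block]
    for expected, (fields, block) in enumerate(ordered, start=2):
        if fields.get("warc-segment-origin-id") != origin_id:
            raise ValueError("continuation belongs to a different origin record")
        if int(fields["warc-segment-number"]) != expected:
            raise ValueError("missing or duplicated segment")
        parts.append(block)
    whole = b"".join(parts)
    # Only the last continuation carries the total length; use it as a check.
    total = ordered[-1][0].get("warc-segment-total-length")
    if total is None or int(total) != len(whole):
        raise ValueError("WARC-Segment-Total-Length is absent or does not match")
    return whole


origin = "<urn:example:origin>"  # hypothetical identifier
first = ({"warc-record-id": origin, "warc-segment-number": "1"}, b"abc")
second = ({"warc-segment-origin-id": origin, "warc-segment-number": "2"}, b"def")
third = ({"warc-segment-origin-id": origin, "warc-segment-number": "3",
          "warc-segment-total-length": "8"}, b"gh")
whole = reassemble(first, [third, second])
```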




5.  WARC Record Types

The purpose and use of each defined record type is described below.

Because new record types that extend the WARC format may be defined in future standards, WARC processing software MUST skip records of unknown type.




5.1.  'warcinfo'

A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or until next 'warcinfo' record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often contains information about the web crawl which generated the following records.

The format of this descriptive record block may vary, though the use of the "application/warc-fields" content-type is RECOMMENDED. Allowable fields include all [DCMI] (“DCMI Metadata Terms”) plus the following field definitions. All fields are OPTIONAL.

'operator'
Contact information for the operator who created this WARC resource. A name or name and email address is RECOMMENDED.
'software'
The software and software version used to create this WARC resource. For example, "heritrix/1.12.0".
'robots'
The robots policy followed by the harvester creating this WARC resource. The string 'classic' indicates the 1994 web robots exclusion standard rules are being obeyed.
'hostname'
The hostname of the machine that created this WARC resource, such as "crawling17.archive.org".
'ip'
The IP address of the machine that created this WARC resource, such as "123.2.3.4".
'http-header-user-agent'
The HTTP 'user-agent' header usually sent by the harvester along with each request. Note that if 'request' records are used to save verbatim requests, this information is redundant. (If a 'request' or 'metadata' record reports a different 'user-agent' for a specific request, the more specific information SHOULD be considered more reliable.)
'http-header-from'
The HTTP 'From' header usually sent by the harvester along with each request. (The same considerations as for 'user-agent' apply.)
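
To make the field list above concrete, here is a sketch that renders a handful of these fields as an "application/warc-fields" block. Ending the block with a blank line reflects one reading of the warc-fields rule and is an assumption, as is the helper name `warcinfo_block`; the operator value is invented, while the other values are the examples given in the list.

```python
def warcinfo_block(info):
    """Render a mapping of field names to values as warc-fields octets."""
    lines = [f"{name}: {value}" for name, value in info.items()]
    # One field per line, followed by the blank line that closes warc-fields.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")


block = warcinfo_block({
    "software": "heritrix/1.12.0",
    "hostname": "crawling17.archive.org",
    "ip": "123.2.3.4",
    "robots": "classic",
    "operator": "Example Operator <operator@example.org>",  # hypothetical contact
})
content_length = len(block)
```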

So that multiple record excerpts from inside WARC files are also valid WARC files, it is OPTIONAL that the first record of a legal WARC be a 'warcinfo' description. Also, to allow the concatenation of WARC files into a larger valid WARC file, it is allowable for 'warcinfo' records to appear in the middle of a WARC file.




5.2.  'response'

A 'response' record contains a complete scheme-specific response, including network protocol information where possible. The exact contents of a 'response' record are determined not just by the record type but also by the URI scheme of the record's target-URI, as described below.




5.2.1.  for 'http' and 'https' schemes

For a target-URI of the 'http' or 'https' schemes, a 'response' record block SHOULD contain the full HTTP response received over the network, including headers. That is, it contains the 'Response' message defined by section 6 of HTTP/1.1 (RFC2616).

The WARC record's Content-Type field SHOULD contain the value defined by HTTP/1.1, "application/http;msgtype=response". When software bugs, network issues, or implementation limits cause response-like material to be collected that is not perfectly compliant with HTTP specifications, WARC writing software MAY record the problematic content using its best effort determination of the interesting material boundaries. That is, neither the use of the 'response' record with an 'http' target-URI nor the 'application/http' content-type serves as an absolute guarantee that the contained material is a legal HTTP response.

A WARC-IP-Address field SHOULD be used to record the network IP address from which the response material was received.

When a 'response' is known to have been truncated, this MUST be noted using the WARC-Truncated field.

A WARC-Concurrent-To field (or fields) MAY be used to associate the 'response' to a matching 'request' record or concurrently-created 'metadata' record.

The payload of a 'response' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.)), with any transfer-encoding removed. If a truncated 'response' record block contains less than the full entity-body, the payload is considered truncated at the same position.
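
The payload definition can be sketched as follows: split the archived HTTP message at the first blank line and, if the headers announce chunked transfer coding, undo it. This is a simplified illustration under stated assumptions; the names `dechunk` and `http_payload` are hypothetical, trailers and other transfer codings are ignored, and a truncated chunked body simply yields whatever complete data was captured.

```python
def dechunk(body):
    """Remove chunked transfer coding, stopping quietly at truncation."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            break  # the chunk-size line is missing: the body was truncated
        size = int(body[pos:line_end].split(b";")[0], 16)
        if size == 0:
            break  # last-chunk marker
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


def http_payload(block):
    """Return the entity-body of an archived HTTP response block."""
    head, separator, body = block.partition(b"\r\n\r\n")
    if not separator:
        return b""  # headers were cut short, so no entity-body was captured
    header_lines = head.decode("iso-8859-1").lower().split("\r\n")[1:]
    chunked = any(
        line.startswith("transfer-encoding:") and "chunked" in line
        for line in header_lines
    )
    return dechunk(body) if chunked else body


response = (
    b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"
    b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
)
payload = http_payload(response)
```

The resulting octets are what a WARC-Payload-Digest would be computed over, as opposed to the full block used for WARC-Block-Digest.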

This document does not specify conventions for recording information about the 'https' secure socket transaction, such as certificates exchanged, consulted, or verified.




5.2.2.  for other URI schemes

This document does not specify the contents of the 'response' record for other URI schemes.




5.3.  'resource'

A 'resource' record contains a resource, without full protocol response information. For example: a file directly retrieved from a locally accessible repository, or the result of a networked retrieval where the protocol information has been discarded. The exact contents of a 'resource' record are determined not just by the record type but also by the URI scheme of the record's target-URI, as described below.

For all 'resource' records, the payload is defined as the record block.

A 'resource' record, with a synthesized target-URI, MAY also be used to archive other artifacts of a harvesting process inside WARC files.




5.3.1.  for 'http' and 'https' schemes

For a target-URI of the 'http' or 'https' schemes, a 'resource' record block MUST contain the returned 'entity-body' (per [RFC2616], with any transfer-encodings removed), possibly truncated.




5.3.2.  for 'ftp' scheme

For a target-URI of the 'ftp' scheme, a 'resource' record block MUST contain the complete file returned by an FTP operation, possibly truncated.




5.3.3.  for 'dns' scheme

For a target-URI of the 'dns' scheme ([RFC4501]), a 'resource' record MUST contain material of content-type 'text/dns' (registered by [RFC4027] and defined by [RFC2540] and [RFC1035]) representing the results of a single DNS lookup as described by the target-URI.




5.3.4.  for other URI schemes

This document does not specify the contents of the 'resource' record for other URI schemes.




5.4.  'request'

A 'request' record holds the details of a complete scheme-specific request, including network protocol information where possible. The exact contents of a 'request' record are determined not just by the record type but also by the URI scheme of the record's target-URI, as described below.




5.4.1.  for 'http' and 'https' schemes

For a target-URI of the 'http' or 'https' schemes, a 'request' record block SHOULD contain the full HTTP request sent over the network, including headers. That is, it contains the 'Request' message defined by section 5 of HTTP/1.1 (RFC2616).

The WARC record's Content-Type field SHOULD contain the value defined by HTTP/1.1, "application/http;msgtype=request".

A WARC-IP-Address field SHOULD be used to record the network IP address to which the request material was directed.

A WARC-Concurrent-To field (or fields) MAY be used to associate the 'request' to a matching 'response' record or concurrently-created 'metadata' record.

The payload of a 'request' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed. If a truncated 'request' record block contains less than the full entity-body, the payload is considered truncated at the same position.

This document does not specify conventions for recording information about the 'https' secure socket transaction, such as certificates exchanged, consulted, or verified.




5.4.2.  for other URI schemes

This document does not specify the contents of the 'request' record for other URI schemes.




5.5.  'metadata'

A 'metadata' record contains content created in order to further describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost always refer to another record of another type, with that other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other 'metadata' records.) Any number of metadata records MAY reference one specific other record.

The format of the metadata record block may vary. The "application/warc-fields" format, defined earlier, MAY be used. Allowable fields include all [DCMI] metadata terms plus the following field definitions. All fields are OPTIONAL.

'via'
The referring URI from which the archived URI was discovered.
'hopsFromSeed'
A symbolic string describing the type of each hop from a starting 'seed' URI to the current URI.
'fetchTimeMs'
Time in milliseconds that it took to collect the archived URI, starting from the initiation of network traffic.

A 'metadata' record MAY be associated with other records derived from the same collection event using the WARC-Concurrent-To header. A 'metadata' record MAY be associated to another record which it describes using the WARC-Refers-To header.
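To make the field list concrete, here is a minimal sketch of reading a 'metadata' record block in the "application/warc-fields" format. It assumes one "name: value" pair per line, with continuation lines beginning with whitespace as seen in the 'warcinfo' example of Appendix C.1. The parser name and its lenient handling of malformed lines are illustrative choices, not requirements of this document.

```python
from typing import Dict, List, Optional


def parse_warc_fields(text: str) -> Dict[str, List[str]]:
    """Parse 'name: value' lines into a dict of value lists."""
    fields: Dict[str, List[str]] = {}
    last: Optional[str] = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and last is not None:
            # Continuation line: append to the most recent value.
            joined = fields[last][-1] + " " + line.strip()
            fields[last][-1] = joined.strip()
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # assumption: silently skip lines without a colon
        last = name.strip()
        fields.setdefault(last, []).append(value.strip())
    return fields


block = "via: http://www.archive.org/\r\nhopsFromSeed: E\r\nfetchTimeMs: 565\r\n"
meta = parse_warc_fields(block)
fetch_ms = int(meta["fetchTimeMs"][0])
```

Values are kept in lists because a field name may in principle appear more than once. Only the first colon on a line separates name from value, so URIs in values survive intact.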




5.6.  'revisit'

A 'revisit' record describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record. Most typically, a 'revisit' record is used instead of a 'response' or 'resource' record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.

Using a 'revisit' record instead of another type is OPTIONAL, for when benefits of reduced storage size or improved cross-referencing of material are desired.

A 'revisit' record REQUIRES a WARC-Profile field which determines the interpretation of the record's fields and record block. Two initial values and their interpretation are described in the following sections. A reader which does not recognize the profile URI MUST NOT attempt to interpret the enclosing record or associated content body.

The purpose of this record type is to reduce storage redundancy when repeatedly retrieving identical or little-changed content, while still recording that a revisit occurred, plus details about the current state of the visited content relative to the archived version.




5.6.1.  Profile: Identical Payload Digest

This 'revisit' profile MAY be used whenever a subsequent consideration of a URI provides payload content which a strong digest function, such as SHA-1, indicates is identical to a previously recorded version.

To indicate this profile, use the URI:

http://netpreserve.org/warc/0.16/revisit/identical-payload-digest

To report the payload digest used for comparison, a 'revisit' record using this profile MUST include a WARC-Payload-Digest field, with a value of the digest that was calculated on the payload.

A 'revisit' record using this profile MAY have no record block, in which case a Content-Length of zero MUST be written. If a record block is present, it MUST be interpreted the same as a 'response' record type for the same URI, but truncated to avoid storing the duplicate content. A WARC-Truncated header with reason 'length' MUST be used for any identical-digest truncation.

For records using this profile, the payload is defined as the original payload content whose digest value was unchanged.

Using a WARC-Refers-To header to identify a specific prior record from which the matching content can be retrieved is RECOMMENDED, to minimize the risk of misinterpreting the 'revisit' record.
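The following sketch shows one way a writer might decide to emit a 'revisit' record under this profile. The `seen` index (payload digest mapped to the WARC-Record-ID of the earlier record) is a hypothetical structure, and the "sha1:" plus Base32 label format is assumed from the examples in Appendix C. Other required fields such as WARC-Record-ID, WARC-Date, and the target-URI are omitted for brevity.

```python
import base64
import hashlib
from typing import Dict, Optional

IDENTICAL_DIGEST_PROFILE = (
    "http://netpreserve.org/warc/0.16/revisit/identical-payload-digest"
)


def sha1_label(data: bytes) -> str:
    """Return a 'sha1:<Base32>' label (format assumed from Appendix C)."""
    digest = hashlib.sha1(data).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")


def revisit_fields(payload: bytes, seen: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return named fields for a 'revisit' record, or None for new content."""
    digest = sha1_label(payload)
    prior_id = seen.get(digest)
    if prior_id is None:
        return None  # payload not seen before: write a normal record
    return {
        "WARC-Type": "revisit",
        "WARC-Profile": IDENTICAL_DIGEST_PROFILE,
        "WARC-Payload-Digest": digest,   # MUST be present for this profile
        "WARC-Refers-To": prior_id,      # RECOMMENDED pointer to prior record
        "Content-Length": "0",           # no record block is written
    }


payload = b"unchanged image bytes"
seen = {sha1_label(payload): "<urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>"}
fields = revisit_fields(payload, seen)
```

This sketch writes no record block at all. A writer that keeps a truncated block instead would also need the WARC-Truncated field with reason 'length'.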




5.6.2.  Profile: Server Not Modified

This 'revisit' profile MAY be used whenever a subsequent consideration of a URI encounters an assertion from the providing server that the content has not changed, such as an HTTP "304 Not Modified" response.

To indicate this profile, use the URI:

http://netpreserve.org/warc/0.16/revisit/server-not-modified

A 'revisit' record using this profile MAY have no content body, in which case a Content-Length of zero MUST be written. If a content body is present, it should be interpreted the same as a 'response' record type for the same URI, truncated if desired.

Any 'Etag' or 'Last-Modified' header value on the server response MUST be reported in new fields provided by this profile, "WARC-Etag" or "WARC-Last-Modified" respectively.

For records using this profile, the payload is defined as the original payload content from which a 'Last-Modified' and/or 'ETag' value was taken.

Using a WARC-Refers-To header to identify a specific prior record from which the unmodified content can be retrieved is RECOMMENDED, to minimize the risk of misinterpreting the 'revisit' record.
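As a rough sketch of this profile, the function below inspects the header section of an HTTP "304 Not Modified" response and copies any 'ETag' or 'Last-Modified' value into the corresponding profile fields. The function name is hypothetical, header names are matched case-insensitively, and folded (multi-line) HTTP headers are not handled.

```python
from typing import Dict

NOT_MODIFIED_PROFILE = (
    "http://netpreserve.org/warc/0.16/revisit/server-not-modified"
)


def not_modified_fields(http_head: bytes) -> Dict[str, str]:
    """Build 'revisit' named fields from a 304 response header section."""
    lines = http_head.decode("iso-8859-1").split("\r\n")
    status = lines[0].split(" ", 2)
    if len(status) < 2 or status[1] != "304":
        raise ValueError("not a 304 Not Modified response")
    fields = {"WARC-Type": "revisit", "WARC-Profile": NOT_MODIFIED_PROFILE}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or malformed header
        key = name.strip().lower()
        if key == "etag":
            fields["WARC-Etag"] = value.strip()
        elif key == "last-modified":
            fields["WARC-Last-Modified"] = value.strip()
    return fields


head = (
    b"HTTP/1.1 304 Not Modified\r\n"
    b"Date: Tue, 06 Mar 2007 00:43:35 GMT\r\n"
    b'Etag: "3e45-67e-2ed02ec0"\r\n'
    b"\r\n"
)
fields = not_modified_fields(head)
```

Splitting each header at the first colon only keeps values such as dates, which themselves contain colons, intact.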




5.6.3.  Other profiles

Other documents may define additional profiles to accomplish other goals, such as recording the apparent magnitude of difference from the previous visit, or to encode the visited content as a "diff" -- where "diff" is the file comparison utility that outputs the differences between two files -- of the content previously stored.




5.7.  'conversion'

A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival process. Typically, this is used to hold content transformations that maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order to keep the information usable with current tools while minimizing loss of information (intellectual content, look and feel, etc). Any number of 'conversion' records MAY be created that reference a specific source record, which may itself contain transformed content. Each transformation SHOULD result in a freestanding, complete record, with no dependency on survival of the original record.

Metadata records MAY be used to further describe transformation records. Wherever practical, a 'conversion' record SHOULD contain a 'WARC-Refers-To' field to identify the prior material converted.

For 'conversion' records, the payload is defined as the record block.




5.8.  'continuation'

Record blocks from 'continuation' records must be appended to corresponding prior record block(s) (e.g., from other WARC files) to create the logically complete full-sized original record. That is, 'continuation' records are used when a record that would otherwise cause a WARC file size to exceed a desired limit is broken into segments. A continuation record MUST contain the named fields 'WARC-Segment-Origin-ID' and 'WARC-Segment-Number', and the last 'continuation' record of a series MUST contain a 'WARC-Segment-Total-Length' field. The full details of WARC record segmentation are described below in Section 6, Record Segmentation.




6.  Record Segmentation

A record that will not fit into a single WARC file of desired maximum size MAY be broken into a number of separate records, called segments.

The first segment of a segmented series MUST carry the original record-type (not 'continuation'), and a 'WARC-Segment-Number' field with a value of "1".

All subsequent segments MUST have a record type of 'continuation', with an incremented 'WARC-Segment-Number' field. They MUST also include a 'WARC-Segment-Origin-ID' field with a value of the WARC-Record-ID of the record containing the first segment of the set. All segments of a set MUST have identical target-URI values. Segments MAY have individual WARC-Block-Digest fields.

The last segment MUST contain a "WARC-Segment-Total-Length" field specifying the total length, in bytes, of all segment content blocks if reassembled. The last segment MAY also contain a 'WARC-Truncated' field, if appropriate.

Segments other than the first SHOULD NOT contain other optional fields, as segments merely serve to continue the record data block of the first record.

To reassemble all segments into the intended complete logical record, the content blocks of all records with the same 'WARC-Segment-Origin-ID' value are collected and appended, in 'WARC-Segment-Number' order, to the origin record's content block. The resulting assembled record adopts as its 'Content-Length' the 'WARC-Segment-Total-Length' value. It also adopts any 'WARC-Truncated' reason of the final segment.

Segmentation MUST NOT be used if there is another way to store the record within the desired WARC file target size. Specifically, if a record could be stored without segmentation by starting a new WARC file, segmentation MUST NOT be used. Further, when segmentation is used, the size of the first segment MUST be maximized. Specifically, the origin segment MUST be placed in a new WARC file, preceded only by a 'warcinfo' record (if any).

Segmentation MAY be applied to any original record type other than 'continuation', but its use on 'warcinfo', 'request', and 'metadata' records is NOT RECOMMENDED.
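The segmentation rules above can be sketched in Python as follows. This is a simplified illustration: it operates on record blocks only, takes the per-segment size budgets as given (a real writer would derive them from the space left in each WARC file), and omits other named fields such as WARC-Record-ID and WARC-Date. All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_block(block: bytes, first_size: int, rest_size: int) -> List[bytes]:
    """Cut a record block into segment blocks, the first one first_size long."""
    if first_size <= 0 or rest_size <= 0:
        raise ValueError("segment sizes must be positive")
    parts = [block[:first_size]]
    pos = first_size
    while pos < len(block):
        parts.append(block[pos:pos + rest_size])
        pos += rest_size
    return parts


def segment_fields(parts: List[bytes], record_type: str,
                   origin_id: str) -> List[Dict[str, str]]:
    """Build the segmentation-related named fields for each segment."""
    total = sum(len(p) for p in parts)
    out = []
    for number, part in enumerate(parts, start=1):
        fields = {
            # Only the first segment keeps the original record type.
            "WARC-Type": record_type if number == 1 else "continuation",
            "WARC-Segment-Number": str(number),
            "Content-Length": str(len(part)),
        }
        if number > 1:
            fields["WARC-Segment-Origin-ID"] = origin_id
        if number > 1 and number == len(parts):
            fields["WARC-Segment-Total-Length"] = str(total)
        out.append(fields)
    return out


def reassemble(segments: List[Tuple[Dict[str, str], bytes]]) -> bytes:
    """Join segment blocks in WARC-Segment-Number order and check the length."""
    ordered = sorted(segments, key=lambda s: int(s[0]["WARC-Segment-Number"]))
    numbers = [int(f["WARC-Segment-Number"]) for f, _ in ordered]
    if numbers != list(range(1, len(ordered) + 1)):
        raise ValueError("missing or duplicate segment")
    data = b"".join(block for _, block in ordered)
    # The last segment is expected to carry the total length.
    expected = int(ordered[-1][0]["WARC-Segment-Total-Length"])
    if len(data) != expected:
        raise ValueError("reassembled length does not match total length")
    return data


block = (bytes(range(256)) * 8)[:1902]  # stand-in for a 1902-byte record block
parts = split_block(block, 1600, 1600)
fields = segment_fields(parts, "response",
                        "<urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>")
restored = reassemble(list(zip(fields, parts))[::-1])  # order does not matter
```

With a 1902-byte block and a 1600-byte budget, this yields segments of 1600 and 302 bytes, the same sizes used in the segmentation example of Appendix C.8.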




7.  Registration of MIME Media Types application/warc and application/warc-fields

This section describes, as per [RFC2048], the MIME types associated with the WARC format.




7.1.  application/warc

MIME media type name: application

MIME subtype names: warc

Required parameters: None

Optional parameters: None

Encoding considerations:

Content of this type is in 'binary' format.

Security considerations:

The WARC record syntax poses no direct risk to computers and networks. Implementors need to be aware of source authority and trustworthiness of information structured in WARC. Readers and writers subject themselves to all the risks that accompany normal operation of data processing services (e.g., message length errors, buffer overflow attacks).

Interoperability considerations: None

Published specification: TBD

Applications which use this media type: Large- and small-scale archiving

Additional information: None

Person and email address to contact for further information:

Gordon Mohr gojomo@archive.org, John Kunze jak@ucop.edu

Intended usage: COMMON

Author/Change controller: IESG




7.2.  application/warc-fields

MIME media type name: application

MIME subtype names: warc-fields

Required parameters: None

Optional parameters: None

Encoding considerations:

Content of this type is in 'binary' format.

Security considerations:

The WARC field syntax poses no direct risk to computers and networks. Implementors need to be aware of source authority and trustworthiness of information structured in WARC. Readers and writers subject themselves to all the risks that accompany normal operation of data processing services (e.g., message length errors, buffer overflow attacks).

Interoperability considerations: None

Published specification: TBD

Applications which use this media type: Large- and small-scale archiving

Additional information: None

Person and email address to contact for further information:

Gordon Mohr gojomo@archive.org, John Kunze jak@ucop.edu

Intended usage: COMMON

Author/Change controller: IESG




8.  IANA Considerations

After IESG approval, IANA is expected to register the WARC type "application/warc" using the application provided in this document.




9.  Acknowledgments

This document could not have been written without major contributions from participants of the International Internet Preservation Consortium, especially Steen Christensen and Julien Masanès.




Appendix A.  Compression Recommendations

The WARC format defines no internal compression. Whether and how WARC files should be compressed is an external decision.

However, experience with the precursor ARC format at the Internet Archive has demonstrated that applying simple standard compression can result in significant storage savings, while preserving random access to individual records.

For this purpose, the GZIP format with customary "deflate" compression is RECOMMENDED, as defined in [RFC1950], [RFC1951], and [RFC1952]. Source code implementing this format is freely available, and the technique is free of patent encumbrances. The GZIP format is also widely used and supported across many free and commercial software packages and operating systems.

This section documents recommended, but optional, practices for compressing WARC files with GZIP.




Appendix A.1.  Record-at-a-time Compression

Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed.

Where possible, this property SHOULD be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

Note that the application of this convention causes no change to the uncompressed contents of an individual WARC record.
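A minimal sketch of record-at-a-time compression, using Python's standard gzip and zlib modules, is shown below. Each record becomes its own gzip member, the member offsets form a simple in-memory index, and a single record is read back by starting decompression at its offset. The function names and the list-based index are illustrative assumptions, and the record contents are placeholders rather than complete WARC records.

```python
import gzip
import zlib
from typing import List, Tuple


def compress_records(records: List[bytes]) -> Tuple[bytes, List[int]]:
    """Compress each record as its own gzip member; return data and offsets."""
    out = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(out))       # starting position of this member
        out += gzip.compress(record)   # one independent gzip member
    return bytes(out), offsets


def read_record_at(data: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at the given offset."""
    # wbits=31 selects the gzip container; decompression stops at the end
    # of the member, and any later members are left in unused_data.
    decompressor = zlib.decompressobj(wbits=31)
    return decompressor.decompress(data[offset:])


records = [b"WARC/0.16\r\nfirst record\r\n", b"WARC/0.16\r\nsecond record\r\n"]
blob, offsets = compress_records(records)
second = read_record_at(blob, offsets[1])
everything = gzip.decompress(blob)  # the whole file is still valid GZIP
```

In practice the offsets would be stored in an external index alongside each record's URI and date, so that a reader can seek directly to the member it needs.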




Appendix A.2.  GZIP WARC File Name Suffix

A GZIP-compressed WARC file SHOULD have the customary ".gz" appended to its name, making the complete suffix ".warc.gz".




Appendix B.  WARC File Size and Name Recommendations

1GB (10^9 bytes) is RECOMMENDED as a practical target size for WARC files, when record sizes allow. Oversized records may be truncated, segmented, or placed in oversized WARC files, at a project's discretion.

It is helpful to use practices within an institution that make it unlikely or impossible to duplicate aggregate WARC file names. The convention used inside the Internet Archive with ARC files is to name files according to the following pattern:

Prefix-Timestamp-Serial-Crawlhost.warc.gz

Prefix is an abbreviation usually reflective of the project or crawl that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often (but not necessarily) unique with regard to the Prefix. Crawlhost is the domain name or IP address of the machine creating the file.

IIPC member institutions have expressed an interest in adopting a common naming strategy, with per-institution unique identifiers to assist in marking WARC files with their institution of origin. It is proposed that all such WARC file names adhering to this future convention begin "iipc".

This specification does not require any particular WARC file naming practice, but conventions similar to the above are RECOMMENDED within WARC-creating institutions. The file name prefix "iipc" SHOULD NOT be used unless participating in a future IIPC naming registry.
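For illustration only, the Prefix-Timestamp-Serial-Crawlhost pattern described above could be generated as in the sketch below. The function name and the five-digit zero padding of the serial number are assumptions, since this document does not fix a serial width.

```python
from datetime import datetime, timezone


def warc_file_name(prefix: str, started: datetime, serial: int,
                   crawlhost: str) -> str:
    """Build a name of the form Prefix-Timestamp-Serial-Crawlhost.warc.gz."""
    # 14-digit GMT timestamp (YYYYMMDDhhmmss); 'started' should be
    # timezone-aware, otherwise it is interpreted as local time.
    stamp = started.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{prefix}-{stamp}-{serial:05d}-{crawlhost}.warc.gz"


name = warc_file_name(
    "testcrawl",
    datetime(2006, 9, 19, 17, 20, 14, tzinfo=timezone.utc),
    1,
    "crawling017.archive.org",
)
# name == "testcrawl-20060919172014-00001-crawling017.archive.org.warc.gz"
```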




Appendix C.  Examples of WARC Records




Appendix C.1.  Example of 'warcinfo' Record

WARC/0.16
WARC-Type: warcinfo
WARC-Date: 2006-09-19T17:20:14Z
WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
Content-Type: application/warc-fields
Content-Length: 381

software: Heritrix 1.12.0 http://crawler.archive.org
hostname: crawling017.archive.org
ip: 207.241.227.234
isPartOf: testcrawl-20050708
description: testcrawl with WARC output
operator: IA_Admin
http-header-user-agent:
 Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)
format: WARC file version 0.16
conformsTo:
 http://www.archive.org/documents/WarcFileFormat-0.16.html





Appendix C.2.  Example of 'request' Record

WARC/0.16
WARC-Type: request
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
Content-Length: 236
WARC-Record-ID: <urn:uuid:4885803b-eebd-4b27-a090-144450c11594>
Content-Type: application/http;msgtype=request
WARC-Concurrent-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>

GET /images/logoc.jpg HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)
From: stack@example.org
Connection: close
Referer: http://www.archive.org/
Host: www.archive.org
Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824





Appendix C.3.  Example of 'response' Record

WARC/0.16
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-IP-Address: 207.241.233.58
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: application/http;msgtype=response
WARC-Identified-Payload-Type: image/jpeg
Content-Length: 1902

HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg

[image/jpeg binary data here]





Appendix C.4.  Example of 'resource' Record

WARC/0.16
WARC-Type: resource
WARC-Target-URI: file:///var/www/htdoc/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: image/jpeg
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
Content-Length: 1662

[image/jpeg binary data here]





Appendix C.5.  Example of 'metadata' Record

WARC/0.16
WARC-Type: metadata
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: application/warc-fields
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
Content-Length: 59

via: http://www.archive.org/
hopsFromSeed: E
fetchTimeMs: 565





Appendix C.6.  Example of 'revisit' Record

WARC/0.16
WARC-Type: revisit
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2007-03-06T00:43:35Z
WARC-Profile: http://netpreserve.org/warc/0.16/revisit/server-not-modified
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: message/http
Content-Length: 226

HTTP/1.x 304 Not Modified
Date: Tue, 06 Mar 2007 00:43:35 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4
Connection: Keep-Alive
Keep-Alive: timeout=15, max=100
Etag: "3e45-67e-2ed02ec0"





Appendix C.7.  Example of 'conversion' Record

WARC/0.16
WARC-Type: conversion
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2016-09-19T19:00:40Z
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK
Content-Type: image/neoimg
Content-Length: 934

[image/neoimg binary data here]





Appendix C.8.  Example of Segmentation ('continuation' record)

Let us take the example of the 'response' record given earlier, and segment it to fit within a WARC file no larger than 2K. The first WARC file would contain the first segment, a record of type 'response' with a WARC-Segment-Number of 1. Note that the block-digest has changed -- as the block is no longer the same as the standalone 'response' record -- but the payload-digest has not changed, as the reassembled record will have the same internal payload.

WARC/0.16
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:2ASS7ZUZY6ND6CCHXETFVJDENAWF7KQ2
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-IP-Address: 207.241.233.58
WARC-Record-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 1
Content-Type: application/http;msgtype=response
Content-Length: 1600

HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg

[first 1360 bytes of image/jpeg binary data here]


The next file would contain the 'continuation' record, with fields to identify the start of the segmentation series (WARC-Segment-Origin-ID), to indicate this record's place in the series (WARC-Segment-Number), and to report that this is the last record and what the total size is (WARC-Segment-Total-Length).

WARC/0.16
WARC-Type: continuation
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7
WARC-Record-ID: <urn:uuid:70653950-a77f-b212-e434-7a7c6ec909ef>
WARC-Segment-Origin-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 2
WARC-Segment-Total-Length: 1902
WARC-Identified-Payload-Type: image/jpeg
Content-Length: 302

[last 302 bytes of image/jpeg binary data here]





10. References

[ARC] Burner, M. and B. Kahle, “The ARC File Format,” September 1996 (HTML).
[HERITRIX] “Heritrix Open Source Archival Web Crawler” (HTML).
[IIPC] “International Internet Preservation Consortium (IIPC)” (HTML).
[W3CDTF] “Date and Time Formats (W3C profile of ISO8601)” (HTML).
[DCMI] “DCMI Metadata Terms” (HTML).
[RFC1035] Mockapetris, P., “Domain names - implementation and specification,” STD 13, RFC 1035, November 1987.
[RFC1884] Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” RFC 1884, December 1995.
[RFC1950] Deutsch, L. and J-L. Gailly, “ZLIB Compressed Data Format Specification version 3.3,” RFC 1950, May 1996 (TXT, PS, PDF).
[RFC1951] Deutsch, P., “DEFLATE Compressed Data Format Specification version 1.3,” RFC 1951, May 1996 (TXT, PS, PDF).
[RFC1952] Deutsch, P., Gailly, J-L., Adler, M., Deutsch, L., and G. Randers-Pehrson, “GZIP file format specification version 4.3,” RFC 1952, May 1996 (TXT, PS, PDF).
[RFC2045] Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” RFC 2045, November 1996.
[RFC2047] Moore, K., “MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text,” RFC 2047, November 1996 (TXT, HTML, XML).
[RFC2048] Freed, N., Klensin, J., and J. Postel, “Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures,” BCP 13, RFC 2048, November 1996 (TXT, HTML, XML).
[RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997 (TXT, HTML, XML).
[RFC2540] Eastlake, D., “Detached Domain Name System (DNS) Information,” RFC 2540, March 1999.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” RFC 2616, June 1999 (TXT, PS, PDF, HTML, XML).
[RFC2822] Resnick, P., “Internet Message Format,” RFC 2822, April 2001.
[RFC3548] Josefsson, S., “The Base16, Base32, and Base64 Data Encodings,” RFC 3548, July 2003.
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifier (URI): Generic Syntax,” STD 66, RFC 3986, January 2005 (TXT, HTML, XML).
[RFC4027] Josefsson, S., “Domain Name System Media Types,” RFC 4027, April 2005.
[RFC4501] Josefsson, S., “Domain Name System Uniform Resource Identifiers,” RFC 4501, May 2006.



Authors' Addresses

  John A. Kunze (editor)
  California Digital Library
  415 20th St, 4th Floor
  Oakland, CA 94612-3550
  US
Fax:  +1 510-893-5212
Email:  jak@ucop.edu
  
  Allan Arvidson
  Kungliga biblioteket (National Library of Sweden)
  Box 5039
  Stockholm 10241
  SE
Fax:  +46 (0)8 463 4004
Email:  allan.arvidson@kb.se
  
  Gordon Mohr
  Internet Archive
  4 Funston Ave, Presidio
  San Francisco, CA 94117
  US
Email:  gojomo@archive.org
  
  Michael Stack
  Internet Archive
  4 Funston Ave, Presidio
  San Francisco, CA 94117
  US
Email:  stack@archive.org