NOTE: This document is out of date.

The latest version can be found in this directory of WARC specification drafts.
IIPC Framework Working Group
 J. Kunze, Ed.
 California Digital Library
 A. Arvidson
 Kungliga biblioteket (National
 Library of Sweden)
 G. Mohr
 M. Stack
 Internet Archive
 January 2006

The WARC File Format (Version 0.9)

Abstract

The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. Resources are dated, identified by URIs, and preceded by simple text headers. By convention, files of this format are named with the extension ".warc" and have the MIME type application/warc. The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers. This document specifies version 0.9 of the WARC format.



Table of Contents

1.  Introduction
2.  Goals
3.  The WARC Record Model
4.  Record Types
    4.1.  'warcinfo'
    4.2.  'response'
    4.3.  'resource'
    4.4.  'request'
    4.5.  'metadata'
    4.6.  'revisit'
    4.7.  'conversion'
    4.8.  'continuation'
5.  Record Header
    5.1.  Positional Parameters
    5.2.  Named Parameters
6.  Record Content Block
7.  Truncated and Segmented Records
    7.1.  Record Truncation
    7.2.  Record Segmentation
8.  WARC Application to Specific Protocols
    8.1.  HTTP and HTTPS
    8.2.  DNS
    8.3.  Other Resources with URIs, and Other Protocols
9.  Compression Recommendations
    9.1.  Record-at-a-time Compression
    9.2.  GZIP extra field: skip-lengths ('sl')
    9.3.  GZIP WARC File Name Suffix
10.  WARC File Name and Size Recommendations
11.  Registration of MIME Media Type application/warc
12.  IANA Considerations
13.  Acknowledgements
Appendix A.  Considerations in Choice of record-id
Appendix B.  Examples of WARC Records
Appendix B.1.  Example of 'warcinfo' Record
Appendix B.2.  Example of 'request' Record
Appendix B.3.  Example of 'response' Record
Appendix B.4.  Example of 'resource' Record
Appendix B.5.  Example of 'metadata' Record
Appendix B.6.  Example of 'revisit' Record
Appendix B.7.  Example of 'conversion' Record
Appendix B.8.  Example of 'continuation' Record
Appendix C.  Collected BNF for WARC
14.  References
§  Authors' Addresses





1. Introduction

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records, each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is a revision of the ARC file format [ARC] (Burner, M. and B. Kahle, “The ARC File Format,” September 1996.) that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web.

The original ARC file format is used internally by the Internet Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The motivation to revise the format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC) [IIPC], whose members include the IA and the national libraries of a dozen countries. The revised format is expected to be a standard way to structure, manage and store billions of collected web resources. For example, WARC will be an output format of harvesting software, such as the open-source Heritrix [HERITRIX] web crawler, and an input format for a wide array of cataloguing and access tools.

The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that are backwards compatible, WARC content is clearly distinguishable from pre-revision ARC content.




2. Goals

Goals of the WARC file format include the following.




3. The WARC Record Model

A WARC format file is the simple concatenation of one or more WARC records. A record consists of a record header followed by a record content block and two newlines. (Newlines are CRLF as per other Internet standards.) This can be summarized in the following [RFC2234] (Crocker, D., Ed. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” November 1997.) IETF ABNF grammar. (All-caps "core" elements are as defined in RFC2234.)

  warc-file   = 1*warc-record
  warc-record = header block CRLF CRLF
  header      = header-line CRLF *anvl-field CRLF
  block       = *OCTET

Elements of this grammar are further specified and explained in sections that follow.

The record header-line is a newline-terminated sequence of whitespace-delimited text tokens representing parameters such as record length, time of creation, and subject URI.

  header-line = warc-id tsp data-length tsp record-type tsp
                subject-uri tsp creation-date tsp content-type tsp
                record-id
  tsp         = 1*WSP

The amount of whitespace between header-line tokens is variable. This gives archive builders the flexibility to add padding and later adjust pre-written header parameters when final values are only completely known after the record content block has been written.
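
As an illustration of the header-line grammar above, the seven positional parameters could be separated as in the following Python sketch. The names HeaderLine and parse_header_line are hypothetical and not part of this specification, and the sketch assumes the line has already been isolated from the rest of the record.

```python
from typing import NamedTuple


class HeaderLine(NamedTuple):
    """The seven positional parameters, in header-line order."""
    warc_id: str
    data_length: int
    record_type: str
    subject_uri: str
    creation_date: str
    content_type: str
    record_id: str


def parse_header_line(line: bytes) -> HeaderLine:
    """Split one header-line into its seven positional parameters."""
    # split() with no argument treats any run of whitespace as a single
    # separator, which covers the variable-width "tsp" padding.
    tokens = line.rstrip(b"\r\n").decode("utf-8").split()
    if len(tokens) != 7:
        raise ValueError(f"expected 7 tokens, found {len(tokens)}")
    if not tokens[0].startswith("warc/"):
        raise ValueError("header-line does not begin with 'warc/'")
    return HeaderLine(tokens[0], int(tokens[1]), *tokens[2:])
```

Because the padding is ignored on reading, a writer may reserve extra spaces around data-length and overwrite them once the final length is known.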

After the header-line come zero or more named ANVL [ANVL] (Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language”) fields in a line-oriented syntax very similar to that of email headers [RFC0822] (Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.) but with unrestricted "text" values (none of its 13 reserved special characters). The precise format is as follows:

  anvl-field  =  field-name ":" [ field-body ] CRLF
  field-name  =  1*<any CHAR, excluding control-chars and ":">
  field-body  =  text [CRLF LWSP-char field-body]
  text        =  1*<any UTF-8 character, including bare
                    CR and bare LF, but NOT including CRLF>
                                             ; (Octal, Decimal.)
  CHAR        =  <any ASCII/UTF-8 character> ; (0-177,  0.-127.)
  CR          =  <ASCII CR, carriage return> ; (   15,      13.)
  LF          =  <ASCII LF, linefeed>        ; (   12,      10.)
  SPACE       =  <ASCII SP, space>           ; (   40,      32.)
  HTAB        =  <ASCII HT, horizontal-tab>  ; (   11,       9.)
  CRLF        =  CR LF
  LWSP-char   =  SPACE / HTAB                ; semantics = SPACE

This document defines a number of named fields that may appear as an anvl-field. Note that the smallest possible named-field section is empty, so that the header-line's CRLF is followed directly by the single CRLF that ends the header.
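
The named-field syntax above might be read as in the following Python sketch. The function name parse_named_fields is hypothetical. Trimming whitespace around a value and joining continuation lines with a single space are assumptions borrowed from email-header practice rather than requirements stated here.

```python
def parse_named_fields(raw: bytes) -> list:
    """Parse named-field lines into a list of (name, value) pairs.

    `raw` is the header text after the header-line's CRLF, up to but not
    including the blank line that ends the header.
    """
    fields = []
    # Split on CRLF only: bare CR and bare LF are legal inside a value.
    for line in raw.decode("utf-8").split("\r\n"):
        if not line:
            continue
        if line[0] in " \t":
            # Continuation line: LWSP-char has the semantics of a space.
            if not fields:
                raise ValueError("continuation line before any field")
            name, body = fields[-1]
            fields[-1] = (name, body + " " + line.lstrip(" \t"))
        else:
            name, sep, body = line.partition(":")
            if not sep or not name:
                raise ValueError(f"not a named field: {line!r}")
            fields.append((name, body.strip(" \t")))
    return fields
```

A list of pairs is returned rather than a dictionary because the same field name may appear more than once.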

Following the headers comes the content block, if any, which may contain arbitrary binary data, up through the remaining number of octets as specified in the previously-given data-length parameter. Finally come two CRLF newlines, not counted in the declared record data-length.

It is often the case that the first record of a WARC file has the record-type 'warcinfo' and is used to describe the records that follow it. It is always the case that the concatenation of any two WARC files is a syntactically correct WARC file; care should be taken, however, when concatenation would inadvertently cause 'warcinfo' records to appear at points in the result that would create confusion.

Subsequent records contain content blocks that are either the direct result of a retrieval attempt — web pages, inline images, URL redirection information, DNS hostname lookup results, standalone files, etc. — or they are synthesized content blocks (e.g., metadata, transformed content) that provide additional information about archived content. Any content block may contain arbitrary text or binary data.
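
The record model of this section can be sketched as a simple reader over an uncompressed WARC file. This is a minimal Python illustration, not a reference implementation: the names iter_records and records_from_bytes are hypothetical, and the reader is deliberately strict (it insists on exactly two CRLFs after each record and makes no attempt at error recovery).

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_records(stream: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (header, block) pairs from an uncompressed WARC 0.9 stream."""
    while True:
        first = stream.readline()      # header-line, including its CRLF
        if not first:
            return                     # clean end of input
        tokens = first.split()
        if len(tokens) != 7 or not tokens[0].startswith(b"warc/"):
            raise ValueError("not at a record boundary")
        # data-length counts from the first octet of the header-line, so
        # the octets already consumed are subtracted from it.
        remaining = int(tokens[1]) - len(first)
        if remaining < 0:
            raise ValueError("data-length is shorter than the header-line")
        rest = stream.read(remaining)
        if len(rest) != remaining:
            raise ValueError("record shorter than its declared data-length")
        # The header ends at the first blank line (CRLF CRLF).
        header, sep, block = (first + rest).partition(b"\r\n\r\n")
        if not sep:
            raise ValueError("header is not terminated by a blank line")
        if stream.read(4) != b"\r\n\r\n":
            raise ValueError("record is not followed by two CRLFs")
        yield header + b"\r\n", block


def records_from_bytes(data: bytes) -> list:
    """Convenience wrapper for data that is already in memory."""
    return list(iter_records(io.BytesIO(data)))
```

Only the header is searched for the blank line, so a content block that itself contains CRLF CRLF is returned intact.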




4. Record Types

There are 8 currently defined WARC record types: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The purpose and use of each type is described below.

New record types that extend the WARC format may be defined in the future. WARC processing software should skip records of unknown type. A forum in which new types are likely to be proposed and discussed in advance of standardization is the discussion list standards@netpreserve.org.




4.1. 'warcinfo'

A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or another 'warcinfo' record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often contains a description of a web crawl (e.g., depth, timeout, purpose). The format of the description is outside the scope of this document, but may include such things as (a) approximate maximum archive file size (e.g., 500MB), (b) rate of crawling, and (c) site entry point URIs for a targeted crawl.

So that multiple record excerpts from inside WARC files may also be valid WARC files, it is not strictly required that the first record of a legal WARC be a 'warcinfo' description. Also, to allow the concatenation of WARC files into a larger valid WARC file, it is allowable for 'warcinfo' records to appear in the middle of a WARC file.

The subject-uri of a 'warcinfo' record should be a URI name, synthesized as necessary, which references the WARC file itself.




4.2. 'response'

A 'response' record contains an entire protocol response, such as a full HTTP response including headers and content-body, from an Internet retrieval. Often the payload of such a response reflects the main collection objective of the archiving service, whose responsibility it is to distinguish payload from protocol headers during subsequent processing. A response record often includes the named parameters 'IP-Address' and 'Related-Record-ID'.




4.3. 'resource'

A 'resource' record contains a resource, without full protocol response information. For example: a file directly retrieved from a locally accessible repository, or the result of a networked retrieval where the protocol information has been discarded. A resource record often includes the named parameter 'Related-Record-ID'.




4.4. 'request'

A 'request' record holds the manner in which a primary record's content was requested. (In a web crawling context, this would hold the HTTP request.) A request record often includes the named parameter 'Related-Record-ID'.




4.5. 'metadata'

A 'metadata' record contains content created in order to further describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost always refer to another record of another type, with that other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other 'metadata' records, or to refer to no other individual record at all.) Any number of metadata records may be created that reference one specific other record. The format of the metadata is outside the scope of this document, but potential formats are [ANVL] (Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language”) and [RDF] or other XML-based formats. A metadata record often includes the named parameter 'Related-Record-ID'.




4.6. 'revisit'

A 'revisit' record describes the revisitation of content already archived, and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is used instead of a 'response' or 'resource' record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.

A 'revisit' record should only be used when interpreting the record requires consulting a previous record; other record types should be preferred if the current record is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.)

The format of a 'revisit' record's content block will be specified elsewhere, and may vary to accomplish different goals, such as recording the apparent magnitude of difference from the previous visit, or to encode the visited content as a "diff" of the content previously stored. The purpose of this record type is to reduce storage redundancy when repeatedly retrieving identical or little-changed content, while still recording that a revisit occurred, plus details about the current state of the visited content relative to the archived version. A revisit record requires the named parameter 'Related-Record-ID'.




4.7. 'conversion'

A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival process. Typically, this is used to hold content transformations that maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order to keep the information usable with current tools while minimizing loss of information (intellectual content, look and feel, etc). Any number of transformation records may be created that reference a specific source record, which may itself contain transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record. Metadata records may be used to further describe transformation records. A conversion record requires the named parameter 'Related-Record-ID'.

Specification of the fields and metadata formats used to describe a 'conversion' record is outside the scope of this document.




4.8. 'continuation'

A 'continuation' record needs to be logically appended to a prior record (e.g., from another WARC file) to create the logically complete full-sized record. This is used when a record that would otherwise cause the WARC file size to exceed a desired limit is broken into segments. See the section on Truncated and Segmented Records for more information. A continuation record requires the named parameters 'Segment-Origin-ID' and 'Segment-Number', and often includes the named parameter 'Related-Record-ID'.




5. Record Header

The WARC record header declares baseline identifying information about the current record, and allows additional per-record information. It consists of one first line of required positional parameters, then a variable number of lines of named parameters.

Positional parameters on the first header line are tokens separated from each other by one or more spaces. Positional parameter order is significant.

One of the parameters, data-length, indicates the combined length of the header and block sections of this record (excepting the final CRLF CRLF), in octets, counting from the first character of the record header first line. (As specified below, this character is always "w"). The data-length is thus the most important header parameter for efficient bulk scanning which may need to skip entire records.
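
Because data-length counts the header as well as the block, it includes its own digits. A writer that knows the whole record in advance could resolve this as in the Python sketch below. The function build_record and its parameters are hypothetical names, and `named_fields` is assumed to be already serialized as CRLF-terminated lines.

```python
def build_record(record_type: str, subject_uri: str, creation_date: str,
                 content_type: str, record_id: str, block: bytes,
                 named_fields: bytes = b"") -> bytes:
    """Serialize one record, computing the self-inclusive data-length."""
    tail = " ".join([record_type, subject_uri, creation_date,
                     content_type, record_id]).encode("utf-8")
    # Everything counted by data-length except its own digits:
    # "warc/0.9 " + " " + tail + CRLF + named fields + CRLF + block.
    fixed = (len(b"warc/0.9 ") + len(b" ") + len(tail) + len(b"\r\n")
             + len(named_fields) + len(b"\r\n") + len(block))
    # The number of digits depends on the total, so search for the digit
    # count that is consistent with the value it produces.
    digits = 1
    while len(str(fixed + digits)) != digits:
        digits += 1
    data_length = fixed + digits
    header_line = b"warc/0.9 %d " % data_length + tail + b"\r\n"
    # The two record-ending CRLFs are not counted in data-length.
    return header_line + named_fields + b"\r\n" + block + b"\r\n\r\n"
```

A streaming writer that does not know the block length in advance would instead reserve padded space for data-length and fill it in afterwards, as the variable-whitespace rule permits.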

The header-line parameters are:

  warc-id       = "warc/0.9"
  data-length   = 1*DIGIT
  record-type   = "warcinfo" / "response" / "resource" / "request"
                  / "metadata" / "revisit" / "conversion"
                  / "continuation" / future-type
  future-type   = 1*VCHAR
  subject-uri   = uri
  uri           = <'URI' per RFC3986>
  creation-date = timestamp
  timestamp     = <date per below>
  content-type  = type "/" subtype
  type          = <'type' per RFC2045>
  subtype       = <'subtype' per RFC2045>
  record-id     = uri

The warc-id string may change in future versions, but will always begin "warc/", continue with version numbers, and end at whitespace.

Named parameters after the header-line, if any, follow the line-oriented syntax defined previously (also known as ANVL [ANVL] (Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language”)). Normally, named parameters are optional and their order is insignificant; however, specific record types require that certain named parameters be present (and future extensions may have ordering requirements). If there are no named parameters present, the entire WARC record header is the line of positional parameters followed by one blank line (two consecutive newlines).




5.1. Positional Parameters

This section describes each of the individual positional parameters of the WARC header-line.

warc-id
A fixed pattern, "warc/0.9", that appears first in every record and hence begins the WARC file itself. It serves to identify the file format and version to outside inspection, and to assist error recovery when a process reading a WARC file fails to find the next record boundary where expected. Occurrences of this string are not definitively the same as record boundaries, since the string may by chance occur inside a record. However, it may still be useful to locate such strings when attempting to recover from file corruption which renders one or more data-length parameters unreliable.
data-length
The number of octets in the record, starting with the first letter ("w") of the first token, through to the end of the content block — not including the 2 record-ending newlines. After proceeding this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the end of the file. (WARC reading implementations may choose to tolerate more or fewer newlines at the end of a record.)

If the next token does not match the first token of a WARC record, then the previous data-length should be considered in error; corrective action might include searching for a nearby occurrence of "warc/0.9" and other character patterns indicative of a legal record beginning.
record-type
The kind of WARC record. All record types are optional, though starting all WARC files with a "warcinfo" record is recommended. Record types are defined in Section 4 (Record Types).
subject-uri
The original URI whose collection gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a synthesized value for the creation name of the WARC file, as a URI.

Care should be taken to ensure that the URI in this value is properly escaped (per [RFC2396] (Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” August 1998.)) and that it is written with no internal whitespace.
creation-date
A 14-digit timestamp in the format YYYYMMDDhhmmss representing the GMT time when record creation began. Multiple records written as part of a single collection action may share the same creation-date, even though the times of their writing will not be exactly synchronized.
content-type
The MIME type [RFC2045] (Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” November 1996.) of the information contained in the record's content block. (Type and subtype only.) For content in HTTP request and response records, this should be "message/http"; in particular, it is not the content-type of any HTTP content body.

Care should be taken to ensure that this value is written with no internal whitespace.

record-id
An identifier assigned to the record that is globally unique for its period of intended use. No identifier scheme is mandated by this specification, but each record-id should be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g., via a URI scheme prefix such as "http:").

Care should be taken to ensure that this value is written with no internal whitespace.
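
As an example of the creation-date parameter described above, the 14-digit form can be produced and read with standard date handling. The following Python sketch uses hypothetical function names and treats GMT as UTC.

```python
from datetime import datetime, timezone


def format_creation_date(moment: datetime) -> str:
    """Render a timezone-aware datetime as YYYYMMDDhhmmss in GMT."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def parse_creation_date(stamp: str) -> datetime:
    """Parse a 14-digit creation-date into an aware UTC datetime."""
    if len(stamp) != 14 or not (stamp.isascii() and stamp.isdigit()):
        raise ValueError(f"not a 14-digit timestamp: {stamp!r}")
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)
```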



5.2. Named Parameters

Named parameters, also referred to as named fields, are optional except as noted otherwise. Additional named parameters may be proposed by WARC users, who are urged to publicly document and discuss with the WARC community new named parameters before use.

IP-Address: IP-address
The numeric Internet address contacted to retrieve any included content. An IPv4 address should be written as a "dotted quad"; an IPv6 address as per [RFC1884] (Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” December 1995.). For an HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record's subject-uri.
Checksum: algorithm:value
An optional parameter indicating that just before the content block for this record was stored, the named digest algorithm was run, and it computed the string represented by the given value. An example is:
Checksum: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
As of this writing, this document recommends no particular algorithm, though a future recommendation is possible.
Related-Record-ID: record-id
The identifier of the record for which the present record holds related content. This parameter is required of the record types 'revisit' and 'conversion'. It is also required to associate records of types 'request', 'response', 'resource', and 'metadata' with one another, when desired. However, none of these record types necessarily takes precedence over the others to become the referred-to (primary) record. (Any of them may appear first or alone.)

A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about record-id considerations. This creates satellite record-ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some cases derivation) of related records.
Segment-Origin-ID: record-id
In a continuation record, this identifies the record of the first segment of the set.
Segment-Number: integer
In the first segment of a record that is completed in one or more later 'continuation' WARC records, this parameter is "1". In a 'continuation' record, this parameter is the sequence number of the current segment in the logical whole record, increasing by 1 in each next segment.
Truncated: reason-token
When present, indicates that the current record ends before the apparent end of the source material, but no continuation records are forthcoming. Possible values indicate the reason for the truncation: 'length' for exceeding a desired length limit; 'time' for exceeding a desired time limit during collection.
Warcinfo-ID: record-id
When present, indicates the record-id of the associated 'warcinfo' record for this record. Typically, the Warcinfo-ID parameter is used when the context of the applicable 'warcinfo' record is unavailable, such as after distributing single records into separate WARC files. WARC writing applications (such as web crawlers) may choose to record this parameter routinely (e.g., before computing checksums). The Warcinfo-ID parameter overrides any association with a previously occurring (in the WARC) 'warcinfo' record, thus providing a way to protect the true association when records are combined from different WARCs. Use of this parameter in a record of type 'warcinfo' is undefined and reserved for possible future extension.



6. Record Content Block

Each record's content block contains zero or more bytes of data, interpreted according to the record type and any preceding headers. For 'response', '




7. Truncated and Segmented Records

For practical reasons, writers of the WARC format may place limits on the time or storage allocated to archiving a single resource. As a result, only a truncated portion of the original resource may be available for saving into a WARC record.

Additionally, users will often want to keep individual WARC files near or below some target size, such as 100MB or 500MB. If some records would be too large to be contained by a single WARC file of desired maximum size, those records will have to be split between multiple WARC files.

This section defines mechanisms for indicating that a WARC record has been truncated or split into multiple records, called segments, across WARC files.

These mechanisms are provisional and subject to change. A superior method of indicating truncation and segmentation may be developed, which better allows the writing of records to begin without foreknowledge of their final length.




7.1. Record Truncation

Any record may indicate that truncation has occurred and give the reason by the addition of a named 'Truncated' field in the record header. Acceptable values for this field include 'time' for truncation due to exceeding a time limit, and 'length' for truncation due to exceeding a length limit.




7.2. Record Segmentation

A record that will not fit into a single WARC file of desired maximum size may be broken into any number of separate records, called segments. As much as possible, segmentation should be avoided, and where necessary, segments other than the first must be of record-type 'continuation'.

The first segment must carry the record-type (not 'continuation') that the record would have had were it not broken into segments, and a 'Segment-Number' named field with a value of "1".

All subsequent segments must have a record type of 'continuation', with an incremented 'Segment-Number' field. They must also include a 'Segment-Origin-ID' field with a value of the Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters.

The last segment must contain an 'End-Length' named field specifying the total length, in bytes, of all segment content if reassembled. The last segment may also contain a 'Truncated' field, if appropriate. Segments other than the first should contain no other named fields, as they merely serve to continue the record data block of the first record.

To reassemble all segments into the intended complete logical record, all records with the same 'Segment-Origin-ID' value must be collected and appended, in 'Segment-Number' order, to the origin record.
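
The reassembly procedure might look like the following Python sketch. The Segment type and the reassemble function are hypothetical. The sketch assumes each record has already been parsed into its record-id, a dictionary of named fields, and its content block, and that all candidate continuation records are available in memory.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    record_id: str
    fields: dict      # named fields, e.g. {"Segment-Number": "2"}
    block: bytes      # this segment's content block


def reassemble(origin: Segment, candidates: list) -> bytes:
    """Join the origin record's block with its continuation blocks."""
    if origin.fields.get("Segment-Number") != "1":
        raise ValueError("origin record must carry Segment-Number 1")
    # Keep only continuations that point at this origin, in segment order.
    parts = sorted(
        (s for s in candidates
         if s.fields.get("Segment-Origin-ID") == origin.record_id),
        key=lambda s: int(s.fields["Segment-Number"]),
    )
    numbers = [int(s.fields["Segment-Number"]) for s in parts]
    if numbers != list(range(2, 2 + len(parts))):
        raise ValueError(f"missing or duplicate segments: {numbers}")
    content = origin.block + b"".join(s.block for s in parts)
    # Only the last segment carries End-Length, the total reassembled size.
    if parts and "End-Length" in parts[-1].fields:
        if int(parts[-1].fields["End-Length"]) != len(content):
            raise ValueError("reassembled length does not match End-Length")
    return content
```

When End-Length is present, it is used as a completeness check, since a missing final segment cannot be detected from the segment numbers alone.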




8. WARC Application to Specific Protocols




8.1. HTTP and HTTPS

A full HTTP or HTTPS response, with protocol information and content-body (if any), can be saved verbatim into a WARC file as a 'response' type record, with a MIME content-type of "message/http" (or "message/http;msgtype=response").

A full HTTP or HTTPS request, including all request headers and content-body (if any), can similarly be saved verbatim into a WARC file as a "request" type record, with a MIME content-type of "message/http" (or "message/http;msgtype=request").

For either a request or response, an 'IP-Address' field should be used to record the network IP address to which the request was directed, using the best available DNS information at the time.

Additional metadata about the HTTP or HTTPS transaction may be stored in a 'metadata' type record, in a format to be specified elsewhere. In particular, information about the secure session in which an HTTPS transaction occurs, such as certificates presented or consulted and authentication information exchanged, may be stored in one or more 'metadata' type records.

The multiple records which pertain to a single HTTP or HTTPS logical group of records will all have unique record-id values. In order to associate the records, all but one must use 'Related-Record-ID' fields to refer to another record in the set.

As any mixture of record types may appear for a single collection event, and in any order, there is no specific record type which is automatically considered primary. Generally, all may refer back to the one record which appeared first, but this is not required. (A 'request' record may refer to a 'response' record or vice-versa; either could refer to a 'metadata' record or a 'metadata' record could refer to either.) Multiple and bidirectional 'Related-Record-ID' fields may appear.

In the case where resources from a website have been harvested or otherwise received without performing normal HTTP operations, or where HTTP protocol information has been lost, it may be appropriate to store the plain content in WARC 'resource' type records, under their original subject-uri, but using the content MIME type in place of the "message/http" type.




8.2. DNS

A request for DNS information can be summarized in a URI in accordance with an IETF Network Working Group draft proposal [DNS-URI] (Josefsson, S., “Domain Name System Uniform Resource Identifiers,” May 2005.). DNS information as retrieved can be represented in the formats specified by [RFC1035] (Mockapetris, P., “Domain names - implementation and specification,” November 1987.), [RFC2540] (Eastlake, D., “Detached Domain Name System (DNS) Information,” March 1999.), and [RFC4027] (Josefsson, S., “Domain Name System Media Types,” April 2005.).

The results of a DNS lookup can thus be straightforwardly archived in a WARC 'response' record under the appropriate DNS URI and MIME type.




8.3. Other Resources with URIs, and Other Protocols

Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file under a 'resource' type record. This includes files that have meaningful URIs retrieved from a locally-accessible filesystem or other repository.

Specific conventions for other protocols and media types are expected to be defined as necessary. In general, the WARC format should be capable of archiving any digital resource which has a URI, a specific time of collection, and a discrete length.

The 'request' and 'response' record types should be used for verbatim or lossless transcripts of collection activity, including protocol information. The 'resource' record type should be used for content without any protocol-specific enveloping. Additional information about a resource or transaction can be supplied in a protocol- or media-appropriate manner with 'metadata' type records.




9. Compression Recommendations

The WARC format defines no internal compression. Whether and how WARC files should be compressed is an external decision.

However, experience with the precursor ARC format at the Internet Archive has demonstrated that applying simple standard compression can result in significant storage savings, while preserving random access to individual records.

For this purpose, the GZIP format with customary "deflate" compression is recommended, as defined in [RFC1950] (Deutsch, L. and J-L. Gailly, “ZLIB Compressed Data Format Specification version 3.3,” May 1996.), [RFC1951] (Deutsch, P., “DEFLATE Compressed Data Format Specification version 1.3,” May 1996.), and [RFC1952] (Deutsch, P., Gailly, J-L., Adler, M., Deutsch, L., and G. Randers-Pehrson, “GZIP file format specification version 4.3,” May 1996.). Source code implementing this format is freely available, and the technique is free of patent encumbrances. The GZIP format is also widely used and supported across many free and commercial software packages and operating systems.

This section documents recommended, but optional, practices for compressing WARC files with GZIP.




9.1. Record-at-a-time Compression

Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed.

Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

Note that the application of this convention causes no change to the uncompressed contents of an individual WARC record. In particular, the declared record length remains the length of the uncompressed record.
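
As an illustration only, the following Python sketch compresses each record as its own GZIP member and then reads one record back from a recorded offset. The function names and the in-memory list of offsets are assumptions made for this example; they stand in for whatever external index an implementation actually keeps.

```python
import gzip
import zlib

def write_compressed_records(records):
    """Compress each record as its own gzip member; return (blob, offsets)."""
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))      # starting position of this member
        blob += gzip.compress(record)  # one complete, independent gzip member
    return bytes(blob), offsets

def read_record_at(blob, offset):
    """Decompress only the gzip member that starts at the given offset."""
    # wbits=31 selects gzip framing; decompression stops at the end of the
    # first member, so later members are never inflated.
    decompressor = zlib.decompressobj(wbits=31)
    return decompressor.decompress(blob[offset:])

# Hypothetical stand-ins for complete uncompressed WARC records.
records = [b"first record\r\n\r\n", b"second record\r\n\r\n", b"third record\r\n\r\n"]
blob, offsets = write_compressed_records(records)
second = read_record_at(blob, offsets[1])
```

Because the concatenation is itself a valid GZIP file, a general-purpose tool that decompresses the whole blob recovers the uncompressed records in order.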




9.2. GZIP extra field: skip-lengths ('sl')

Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after reading a small portion of a record, would like to skip to the next full record. In the absence of an external, precalculated index, using only the WARC record's uncompressed length would require the complete current record to be decompressed to find the start of the next record.

Section 2.3.1.1 of the GZIP format specification makes an allowance for arbitrary extension fields, called "extra-fields". We define here a new GZIP extra-field, "skip-lengths", identified by the two byte id "sl" (0x73, 0x6C).

This field, when present, must contain two 4-byte unsigned integer values, with least significant byte first (as per other multi-byte values in the GZIP format). The first integer, compressed-skip-length, is a number of compressed bytes that may be skipped, from the beginning of the current GZIP member, to reach a distinct following member. (This value may be the exact length of the current member, but may also indicate a length of several related concatenated members.) The second integer, uncompressed-skip-length, is the number of uncompressed bytes that will be passed over when skipping the compressed-skip-length bytes forward.

With the help of these values, a decompressor can often skip forward past large ranges of the compressed input that are not of interest, restarting decompression at the targeted next member, while retaining knowledge of exactly how many bytes of uncompressed data have been skipped.

If the skip-length value is zero, the field should be ignored as if it were not present. (Compressors writing this field may use a zero value to reserve space for an as-yet-unknown skip-length, filling in the value if possible later.)

This extra-field will be registered with the GZIP authors as appropriate.
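
The field layout described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, each member here records only its own length as the compressed-skip-length, and the header is assembled by hand following the member layout in [RFC1952] (10 fixed bytes, a 2-byte XLEN, then subfields).

```python
import struct
import zlib

SL_ID = b"sl"  # subfield id bytes 0x73, 0x6C

def gzip_member_with_sl(data):
    """Build one gzip member carrying an 'sl' extra field for its own length."""
    compressor = zlib.compressobj(wbits=-15)  # raw deflate, no wrapper
    deflated = compressor.compress(data) + compressor.flush()
    # fixed header (10) + XLEN (2) + subfield header (4) + two uint32 (8)
    # + deflate stream + CRC32 and ISIZE trailer (8)
    member_length = 10 + 2 + 4 + 8 + len(deflated) + 8
    subfield = SL_ID + struct.pack("<HII", 8, member_length, len(data))
    # ID1, ID2, CM=8 (deflate), FLG=0x04 (FEXTRA), MTIME=0, XFL=0, OS=255
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0x04, 0, 0, 255)
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + struct.pack("<H", len(subfield)) + subfield + deflated + trailer

def read_skip_lengths(blob, offset):
    """Return (compressed_skip, uncompressed_skip) for the member at offset, or None."""
    id1, id2, _cm, flags = struct.unpack_from("<BBBB", blob, offset)
    if (id1, id2) != (0x1F, 0x8B) or not (flags & 0x04):
        return None
    (xlen,) = struct.unpack_from("<H", blob, offset + 10)
    pos, end = offset + 12, offset + 12 + xlen
    while pos + 4 <= end:
        sub_id = blob[pos:pos + 2]
        (sub_len,) = struct.unpack_from("<H", blob, pos + 2)
        if sub_id == SL_ID and sub_len == 8:
            compressed, uncompressed = struct.unpack_from("<II", blob, pos + 4)
            # A zero value is a reserved placeholder; treat the field as absent.
            return (compressed, uncompressed) if compressed else None
        pos += 4 + sub_len
    return None

def member_offsets(blob):
    """Walk from member to member using only the 'sl' fields, without inflating."""
    offsets, pos = [], 0
    while pos < len(blob):
        offsets.append(pos)
        skip = read_skip_lengths(blob, pos)
        if skip is None:
            break  # no usable skip-length; a reader would have to decompress
        pos += skip[0]
    return offsets
```

A reader without knowledge of the 'sl' subfield still decompresses such members normally, since GZIP decoders pass over extra fields they do not recognize.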




9.3. GZIP WARC File Name Suffix

A WARC file compressed with the extra GZIP field conventions described in this document is a legal GZIP file. To ensure that it is properly recognized by GZIP tools, its name should have the customary ".gz" appended to it, making the complete suffix, ".warc.gz". GZIP software that does not recognize the extra GZIP fields will simply pass over them without benefit or harm.




10. WARC File Name and Size Recommendations

It is helpful to use practices within an institution that make it unlikely or impossible to duplicate aggregate WARC file names. The convention used inside the Internet Archive with ARC files is to name files according to the following pattern:

Prefix-Timestamp-Serial-Crawlhost.warc.gz

Prefix is an abbreviation usually reflective of the project or crawl that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often (but not necessarily) unique with regard to the Prefix. Crawlhost is the domain name or IP address of the machine creating the file.
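
A name following this pattern could be assembled as in the Python sketch below. The function name is hypothetical, and the five-digit zero-padded serial is an assumption taken from the file name shown in the 'warcinfo' example of Appendix B.1; the pattern itself does not fix a width.

```python
from datetime import datetime, timezone

def warc_file_name(prefix, started, serial, crawlhost, serial_width=5):
    """Build a Prefix-Timestamp-Serial-Crawlhost.warc.gz file name.

    'started' must be a timezone-aware datetime; it is converted to GMT
    and rendered as a 14-digit timestamp (YYYYMMDDhhmmss).
    """
    timestamp = started.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{prefix}-{timestamp}-{serial:0{serial_width}d}-{crawlhost}.warc.gz"

name = warc_file_name(
    "test",
    datetime(2005, 7, 8, 1, 1, 1, tzinfo=timezone.utc),
    1,
    "crawl017.archive.org",
)
```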

IIPC member institutions have expressed an interest in adopting a common naming strategy, with per-institution unique identifiers to assist in marking WARC files with their institution of origin. It is proposed that all such WARC file names adhering to this future convention begin "iipc".

This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted within WARC-creating institutions. The file name prefix "iipc" should be avoided unless participating in the IIPC naming registry.

500MB (5x10^8 bytes) is recommended as a practical target size for WARC files, when record sizes allow. Oversized records may be truncated, segmented, or simply placed in oversized WARC files, at a project's discretion.




11. Registration of MIME Media Type application/warc

This section describes, as per [RFC2048] (Freed, N., Klensin, J., and J. Postel, “Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures,” November 1996.), the MIME types associated with the WARC format.

MIME media type name: application

MIME subtype name: warc

Required parameters: None

Optional parameters: None

Encoding considerations:

UTF-8 is the default character encoding for the textual information defined by the WARC format. However, any binary data may be included as blocks within the WARC format, and so only the "8bit" and "binary" encodings are allowable.

Security considerations:

The WARC record syntax poses no direct risk to computers and networks. Implementors need to be aware of source authority and trustworthiness of information structured in WARC. Readers and writers subject themselves to all the risks that accompany normal operation of data processing services (e.g., message length errors, buffer overflow attacks).

Interoperability considerations: None

Published specification: TBD

Applications which use this media type: Large- and small-scale archiving

Additional information: None

Person and email address to contact for further information:

Gordon Mohr gojomo@archive.org, John Kunze jak@ucop.edu

Intended usage: COMMON

Author/Change controller: IESG




12. IANA Considerations

After IESG approval, IANA is expected to register the WARC type "application/warc" using the application provided in this document.




13. Acknowledgements

This document could not have been written without major contributions from participants of the International Internet Preservation Consortium, especially Steen Christensen and Julien Masanes.




Appendix A. Considerations in Choice of record-id

The WARC format differs significantly from the ARC format in requiring the record-id parameter. The record-id must be globally unique for its period of intended use. If that period is indefinite, the record-id should be maintained to a level appropriate for any persistent identifier, in which case identifier opaqueness is usually desirable.

There is no reason why the archiving institution may not choose record-ids that are also "actionable" (submittable as retrieval requests to widely available tools such as web browsers) as long as there are providers to service them. This specification does not dictate what identifier scheme to use; suitable schemes include URN [RFC2141] (Moats, R., “URN Syntax,” May 1997.), [ARK] (Kunze, J. and R. Rodgers, “The ARK Persistent Identifier Scheme,” August 2005.), [GUID] (“Wikipedia: Globally Unique Identifiers”), etc.

Also worth considering is the establishment of lexical conventions for record-ids that reveal or suggest relationships among content blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to convey relatedness in the context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a database or examination of the relevant WARC files.

These conventions are suggested by [RFC2396] (Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” August 1998.), formalized by the [ARK] (Kunze, J. and R. Rodgers, “The ARK Persistent Identifier Scheme,” August 2005.) scheme, and are applicable to such things as the summarizing of large search results from Internet-wide indexing engines. As an example of a convention that could be adopted by users of any identifier scheme, the "/" character could be reserved as a separator used to introduce an extension string that is appended to a primary record-id. If the record-id of a primary block of captured content were,

http://abc.org/12026/987654321

the convention could also reserve the extension strings "_s", "_d", and "_t" to indicate record-ids for secondary, duplicate, and transform blocks, respectively. Over time this might result in the assignment of record-ids such as,

http://abc.org/12026/987654321/_s1
http://abc.org/12026/987654321/_s2
http://abc.org/12026/987654321/_d9
http://abc.org/12026/987654321/_d10
http://abc.org/12026/987654321/_t

...in which an integer count may further extend the identifier when there is more than one relationship of the given type.
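
To make the benefit of such a convention concrete, the Python sketch below derives extended record-ids and recovers the primary record-id by string comparison alone. It is purely illustrative: the function names are hypothetical, and the "/" separator and "_s", "_d", "_t" strings are only the example convention discussed here, not a requirement of this specification.

```python
import re

# Matches "<primary>/_s", "<primary>/_d10", "<primary>/_t", and so on.
EXTENDED_ID = re.compile(r"(.+)/_([sdt])(\d*)")

def extend_record_id(primary_id, kind, count=None):
    """Append an extension string for a secondary (s), duplicate (d),
    or transform (t) block, with an optional integer count."""
    if kind not in ("s", "d", "t"):
        raise ValueError("kind must be 's', 'd', or 't'")
    suffix = f"_{kind}" + ("" if count is None else str(count))
    return f"{primary_id}/{suffix}"

def primary_of(record_id):
    """Infer the primary record-id from an extended one, without any lookup."""
    match = EXTENDED_ID.fullmatch(record_id)
    return match.group(1) if match else record_id

primary = "http://abc.org/12026/987654321"
duplicate = extend_record_id(primary, "d", 10)
related = primary_of(duplicate) == primary
```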




Appendix B. Examples of WARC Records

Examples of each record-type are provided here. In some cases, illustrative data is shown where conventions have not yet been specified. Each record header-line is split over multiple lines for readability; continuations of the single line are indented, and a newline should only be considered to appear at the end of the last indented line. Declared record lengths are approximate, and unique IDs and checksums shown are plausible random filler.




Appendix B.1. Example of 'warcinfo' Record

The following 'warcinfo' example includes an XML description of the enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an abbreviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.9" has not been formally defined anywhere, and may not reflect eventual practice with WARC files.

warc/0.9 1012 warcinfo
    filedesc:test-20050708010101-00001-crawl017.archive.org.warc.gz
    20050708010101 text/xml uuid:cbad35b7-e591-4b43-8a67-9d1d8f9ef4cd

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<warcmetadata
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:warc="http://archive.org/warc/0.9/">
<warc:software>
Heritrix 1.4.0 http://crawler.archive.org
</warc:software>
<warc:hostname>crawling017.archive.org</warc:hostname>
<warc:ip>207.241.227.234</warc:ip>
<dcterms:isPartOf>testcrawl-20050708</dcterms:isPartOf>
<dc:description>testcrawl with WARC output</dc:description>
<warc:operator>IA_Admin</warc:operator>
<warc:http-header-user-agent>
Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)
</warc:http-header-user-agent>
<dc:format>WARC file version 0.9</dc:format>
<dcterms:conformsTo xsi:type="dcterms:URI">
http://www.archive.org/documents/WarcFileFormat.php
</dcterms:conformsTo>
</warcmetadata>

The first line (spread over three lines for readability) shows the required line of positional parameters. This record has no named fields, as evidenced by the single blank line following the header-line. The content block is "text/xml", as declared in the header-line. Two newlines follow the content block.




Appendix B.2. Example of 'request' Record

A 'request' record captures the protocol request used to collect a resource. For example, to collect the resource "http://www.archive.org/images/logo.jpg", the following 'request' record might be generated:

warc/0.9 298 request http://www.archive.org/images/logo.jpg
    20050708010101 message/http
    uuid:f569983a-ef8c-4e62-b347-295b227c3e51
IP-Address: 207.241.224.241

GET /images/logo.jpg HTTP/1.0
Host: www.archive.org
User-Agent: Mozilla/5.0 (compatible; crawler/1.4 +http://example.com)




Appendix B.3. Example of 'response' Record

The archived response to the above request might look like the following.

warc/0.9 7583 response http://www.archive.org/images/logo.jpg
    20050708010101 message/http
    uuid:a4b26b6b-f918-4136-af04-f859d75aebe5
IP-Address: 207.241.224.241
Related-Record-ID: uuid:f569983a-ef8c-4e62-b347-295b227c3e51
Checksum: sha1:2ZWC6JAT6KNXKD37F7MOEKXQMRY75YY4

HTTP/1.x 200 OK
Date: Fri, 08 Jul 2005 01:01:01 GMT
Server: Apache/1.3.33 (Debian GNU/Linux) PHP/5.0.4-0.3
Last-Modified: Sun, 12 Jun 2005 00:31:01 GMT
Etag: "914480-1b2e-42ab8245"
Accept-Ranges: bytes
Content-Length: 6958
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: image/jpeg

[6958 bytes of binary data here]

Note the 'Related-Record-ID' named field referring back to the generating 'request' record, and the creation-date identical to the previous record.




Appendix B.4. Example of 'resource' Record

This same file, "logo.jpg", might be archived internally to an organization under its local filesystem name. This could result in a 'resource' record:

warc/0.9 7141 resource file://webserver/htdoc/images/logo.jpg
    20050710010101 image/jpeg
    uuid:a6c3132b-49b8-4fd5-8072-45ce66d48a4b
Checksum: sha1:37F7MOEKXQMRY75YY42ZWC6JAT6KNXKD

[6958 bytes of binary data here]
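
The 'Checksum' values in these examples are random filler, but their shape (the label "sha1:" followed by 32 characters) matches a Base32-encoded SHA-1 digest. The Python sketch below shows how a value of that shape could be produced. Both the Base32 encoding and the choice to hash only the content block are assumptions made for this illustration; this section does not define either.

```python
import base64
import hashlib

def sha1_checksum(block):
    """Return 'sha1:' plus the Base32-encoded SHA-1 digest of the given bytes.

    Assumption: the checksum covers the content block only.
    """
    digest = hashlib.sha1(block).digest()  # 20 bytes
    return "sha1:" + base64.b32encode(digest).decode("ascii")

# Hypothetical stand-in for the 6958 bytes of image data in the example.
value = sha1_checksum(b"example content block")
```

A 20-byte digest encodes to exactly 32 Base32 characters, so no padding characters appear in the result.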



Appendix B.5. Example of 'metadata' Record

If some crawl-time metadata should be archived near the above response, a 'metadata' record could be used like the following (with a purely speculative XML format):

warc/0.9 395 metadata http://www.archive.org/images/logo.jpg
    20050708010101 text/xml
    uuid:a4acff63-c213-4f35-9652-41a0e2dfc492
Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

<?xml version="1.0"?>
<harvestmetadata
xmlns="http://archive.org/harvest/0.9/">
<discovered-via>http://www.archive.org</discovered-via>
<download-time-ms>565</download-time-ms>
</harvestmetadata>

Note again the same creation-date as the preceding related records. A relationship is declared to the preceding 'response' record, but declaring a relationship to the 'request' would also be legal.




Appendix B.6. Example of 'revisit' Record

If the same URI is later revisited and the content is unchanged, a 'revisit' record like the following (again with a speculative content-type) could be generated:

warc/0.9 395 revisit http://www.archive.org/images/logo.jpg
    20050808010101 text/xml
    uuid:ad522b3b-d68c-464a-b5e2-38149cfb511d
Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

<?xml version="1.0"?>
<revisit
xmlns="http://archive.org/revisit/0.9/">
<server-response-excerpt>
HTTP/1.x 304 Not Modified
Date: Mon, 08 Aug 2005 01:01:01 GMT
Etag: "914480-1b2e-42ab8245"
</server-response-excerpt>
</revisit>

Again, reference is made back to the original 'response' record. A new creation-date reflects the time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) The actual formats for describing the result of a revisit remain to be defined.




Appendix B.7. Example of 'conversion' Record

At some future date, the "image/jpeg" format may no longer be considered viable, prompting a conversion of the original archive content into a hypothetical new format, "image/neoimg", which generates a 3098 byte version of the same image. This could be accommodated with a 'conversion' record:

warc/0.9 4111 conversion http://www.archive.org/images/logo.jpg
    20150708010101 image/neoimg
    uuid:c631da8a-e8db-44a8-84c5-9cc848dff35a
Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5
Checksum: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK

[3098 bytes of binary data here]

An accompanying 'metadata' record, referring to this 'conversion' record, could contain additional details about the transformation. (Alternatively, new named-fields in this record could serve this role.)




Appendix B.8. Example of 'continuation' Record

If the 'response' above had been so large that it would not fit into a single WARC file of desired maximum size, it would have to be segmented into separate smaller records. The first record would be as before, except with one additional named field, 'Segment-Number', with a value of "1", indicating that the record was the beginning of a segmented record set.

The subsequent segment for that record would then look like this:

warc/0.9 39514322 continuation http://www.archive.org/images/logo.jpg
    20150708010101 message/http
    uuid:c0d36ada-af8c-4608-8409-e60818b1d9e9
Segment-Number: 2
Segment-Origin-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

[39514114 bytes of binary data here]

Note that the 'Segment-Origin-ID' refers to the first segment of the set, the one with the "Segment-Number: 1" named field.
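
A reader could rejoin such a set as in the Python sketch below. It is a minimal illustration under stated assumptions: records are represented as hypothetical (record-id, named-fields, block) tuples that some parser has already produced, and the first segment is treated as its own origin because it carries 'Segment-Number' but no 'Segment-Origin-ID'.

```python
def reassemble(records):
    """Join segmented records; return a dict mapping origin record-id to bytes."""
    groups = {}
    for record_id, fields, block in records:
        if "Segment-Number" not in fields:
            continue  # not part of a segmented set
        number = int(fields["Segment-Number"])
        # Segment 1 has no Segment-Origin-ID, so its own record-id is the origin.
        origin = fields.get("Segment-Origin-ID", record_id)
        groups.setdefault(origin, {})[number] = block
    result = {}
    for origin, segments in groups.items():
        numbers = sorted(segments)
        if numbers != list(range(1, len(numbers) + 1)):
            raise ValueError(f"missing segment for {origin}")
        result[origin] = b"".join(segments[n] for n in numbers)
    return result

first = ("uuid:a4b26b6b-f918-4136-af04-f859d75aebe5",
         {"Segment-Number": "1"}, b"HTTP/1.x 200 OK\r\n")
second = ("uuid:c0d36ada-af8c-4608-8409-e60818b1d9e9",
          {"Segment-Number": "2",
           "Segment-Origin-ID": "uuid:a4b26b6b-f918-4136-af04-f859d75aebe5"},
          b"...rest of the response...")
joined = reassemble([second, first])
```

Segments are ordered by number rather than by arrival, since the segments of one set may be spread across several WARC files.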




Appendix C. Collected BNF for WARC

  warc-file     = 1*warc-record
  warc-record   = header block CRLF CRLF
  header        = header-line CRLF *anvl-field CRLF
  block         = *OCTET

  header-line   = warc-id tsp data-length tsp record-type tsp
                    subject-uri tsp creation-date tsp
                    content-type tsp record-id
  tsp           = 1*WSP

  warc-id       = "warc/" DIGIT "." DIGIT
  data-length   = 1*DIGIT
  record-type   = "warcinfo" / "response" / "resource" / "request" /
                    "metadata" / "revisit" / "conversion" /
                    "continuation" / future-type
  future-type   = 1*VCHAR
  subject-uri   = uri
  uri           = <'URI' per RFC3986>
  creation-date = timestamp
  timestamp     = <date per below>
  content-type  = type "/" subtype
  type          = <'type' per RFC2045>
  subtype       = <'subtype' per RFC2045>
  record-id     = uri

  anvl-field    = field-name ":" [ field-body ] CRLF
  field-name    = 1*<any CHAR, excluding control-chars and ":">
  field-body    = text [CRLF LWSP-char field-body]
  text          = 1*<any UTF-8 character, including bare
                    CR and bare LF, but NOT including CRLF>
                    ; (Octal, Decimal.)
  CHAR          = <any ASCII/UTF-8 character> ; (0-177,  0.-127.)
  CR            = <ASCII CR, carriage return> ; (   15,      13.)
  LF            = <ASCII LF, linefeed>        ; (   12,      10.)
  SPACE         = <ASCII SP, space>           ; (   40,      32.)
  HTAB          = <ASCII HT, horizontal-tab>  ; (   11,       9.)
  CRLF          = CR LF
  LWSP-char     = SPACE / HTAB                ; semantics = SPACE
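
As a reading aid for the grammar above, the Python sketch below parses a record header (the header-line followed by any named fields and the terminating blank line) into its positional and named parameters. The function and key names are hypothetical, and the parser is deliberately lenient: it does not validate URIs, timestamps, or content types.

```python
import re

POSITIONAL = ("warc_id", "data_length", "record_type", "subject_uri",
              "creation_date", "content_type", "record_id")

def parse_header(header):
    """Parse header bytes (header-line CRLF *anvl-field CRLF)."""
    text = header.decode("utf-8")
    # Unfold field-bodies: CRLF followed by SPACE or HTAB continues a field.
    text = re.sub(r"\r\n[ \t]+", " ", text)
    lines = text.split("\r\n")
    values = lines[0].split()  # positional parameters are separated by 1*WSP
    if len(values) != len(POSITIONAL):
        raise ValueError("header-line must carry seven positional parameters")
    positional = dict(zip(POSITIONAL, values))
    positional["data_length"] = int(positional["data_length"])
    named = {}
    for line in lines[1:]:
        if not line:
            break  # the blank line ends the header
        name, _, body = line.partition(":")
        named[name] = body.strip()
    return positional, named

header = (b"warc/0.9 7583 response http://www.archive.org/images/logo.jpg "
          b"20050708010101 message/http "
          b"uuid:a4b26b6b-f918-4136-af04-f859d75aebe5\r\n"
          b"IP-Address: 207.241.224.241\r\n"
          b"Related-Record-ID: uuid:f569983a-ef8c-4e62-b347-295b227c3e51\r\n"
          b"\r\n")
positional, named = parse_header(header)
```

Splitting each named field at the first ":" keeps values such as "uuid:..." intact, since only the field-name is forbidden from containing a colon.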



14. References

[ANVL] Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language” (PDF).
[ARC] Burner, M. and B. Kahle, “The ARC File Format,” September 1996 (HTML).
[ARK] Kunze, J. and R. Rodgers, “The ARK Persistent Identifier Scheme,” August 2005 (PDF).
[GUID] “Wikipedia: Globally Unique Identifiers” (HTML).
[HERITRIX] “Heritrix Open Source Archival Web Crawler” (HTML).
[IIPC] “International Internet Preservation Consortium (IIPC)” (HTML).
[RDF] “Resource Description Framework (RDF)” (HTML).
[RFC0822] Crocker, D., “Standard for the format of ARPA Internet text messages,” STD 11, RFC 822, August 1982.
[RFC1035] Mockapetris, P., “Domain names - implementation and specification,” STD 13, RFC 1035, November 1987.
[RFC1884] Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” RFC 1884, December 1995.
[RFC1950] Deutsch, L. and J-L. Gailly, “ZLIB Compressed Data Format Specification version 3.3,” RFC 1950, May 1996 (TXT, PS, PDF).
[RFC1951] Deutsch, P., “DEFLATE Compressed Data Format Specification version 1.3,” RFC 1951, May 1996 (TXT, PS, PDF).
[RFC1952] Deutsch, P., Gailly, J-L., Adler, M., Deutsch, L., and G. Randers-Pehrson, “GZIP file format specification version 4.3,” RFC 1952, May 1996 (TXT, PS, PDF).
[RFC2045] Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” RFC 2045, November 1996.
[RFC2048] Freed, N., Klensin, J., and J. Postel, “Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures,” BCP 13, RFC 2048, November 1996 (TXT, HTML, XML).
[RFC2141] Moats, R., “URN Syntax,” RFC 2141, May 1997 (TXT, HTML, XML).
[RFC2234] Crocker, D., Ed. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” RFC 2234, November 1997 (TXT, HTML, XML).
[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” RFC 2396, August 1998 (TXT, HTML, XML).
[RFC2540] Eastlake, D., “Detached Domain Name System (DNS) Information,” RFC 2540, March 1999.
[RFC4027] Josefsson, S., “Domain Name System Media Types,” RFC 4027, April 2005.
[DNS-URI] Josefsson, S., “Domain Name System Uniform Resource Identifiers,” May 2005 (TXT).



Authors' Addresses

  John A. Kunze (editor)
  California Digital Library
  415 20th St, 4th Floor
  Oakland, CA 94612-3550
  US
Fax:  +1 510-893-5212
Email:  jak@ucop.edu
  
  Allan Arvidson
  Kungliga biblioteket (National Library of Sweden)
  Box 5039
  Stockholm 10241
  SE
Fax:  +46 (0)8 463 4004
Email:  allan.arvidson@kb.se
  
  Gordon Mohr
  Internet Archive
  4 Funston Ave, Presidio
  San Francisco, CA 94117
  US
Email:  gojomo@archive.org
  
  Michael Stack
  Internet Archive
  4 Funston Ave, Presidio
  San Francisco, CA 94117
  US
Email:  stack@archive.org