The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. Resources are dated, identified by URIs, and preceded by simple text headers. By convention, files of this format are named with the extension ".warc" and have the MIME type application/warc. The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers. This document specifies version 0.10 of the WARC format.
3. The WARC Format
4. Record Types
5. Named Parameters
5.1. IP-Address: IP-address
5.2. Checksum: algorithm:value
5.3. Related-Record-ID: record-id
5.4. Segment-Origin-ID: record-id
5.5. Segment-Number: integer
5.6. Truncated: reason-token
5.7. Warcinfo-ID: record-id
6. Truncated and Segmented Records
6.1. Record Truncation
6.2. Record Segmentation
7. WARC Application to Specific Protocols
7.1. HTTP and HTTPS
7.3. Other Resources with URIs, and Other Protocols
8. Registration of MIME Media Type application/warc
9. IANA Considerations
Appendix A. Considerations in Choice of record-id
Appendix B. Compression Recommendations
Appendix B.1. Record-at-a-time Compression
Appendix B.2. GZIP WARC File Name Suffix
Appendix C. WARC File Name and Size Recommendations
Appendix D. Collected ABNF for WARC
Appendix E. Examples of WARC Records
Appendix E.1. Example of 'warcinfo' Record
Appendix E.2. Example of 'request' Record
Appendix E.3. Example of 'response' Record
Appendix E.4. Example of 'resource' Record
Appendix E.5. Example of 'metadata' Record
Appendix E.6. Example of 'revisit' Record
Appendix E.7. Example of 'conversion' Record
Appendix E.8. Example of 'continuation' Record
§ Authors' Addresses
Content on the World Wide Web is ephemeral. Every day, websites and web pages are created, changed, and relocated, or they disappear. For the past ten years, memory institutions have tried to find the most appropriate means of collecting and keeping track of this important, transitory material. One approach uses a web crawler to take 'snapshots' of the web, or of parts of the web, at particular moments in time. Web crawlers are software programs which browse the web in an automated manner according to a set of policies. A crawler starts with a list of URLs to visit. As it visits each URL, it makes a copy of the visited page and then extracts all hyperlinks -- links to other pages, images, videos, scripting or style instructions, etc. -- to add to its queue of URLs to visit next. During any given web crawl, a crawler can collect millions of pages. The accumulated captures from many web crawls can run into the billions. An efficient format for storing and preserving the captures of large-scale web harvests is critical. This document describes the WARC (Web ARChive) file format.
The WARC file format offers a convention for concatenating multiple resource records, each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is a revision of the ARC File Format (Burner, M. and B. Kahle, “The ARC File Format,” September 1996.) [ARC] which has traditionally been used to store web crawls as sequences of content blocks harvested from the World Wide Web. The ARC file format has been in active use by the Internet Archive (IA) since 1996. Currently over 50 billion objects are stored in ARCs at the IA. The ARC format is also being used by several national libraries.
The motivation to revise the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC) [IIPC], whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive (IA). Input from the California Digital Library and Los Alamos National Laboratory, who have also established large repositories, was also considered.
In an ARC file, each capture is preceded by a one-line header that briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. Only the response is recorded. There is no provision for noting the request or for adding metadata.
The WARC format generalizes the ARC format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.
The WARC format is expected to become the standard way to structure, manage and store billions of collected web resources. It will be used as the output format by web harvesting applications, such as the open-source Heritrix [HERITRIX] web crawler, and as an input format for a wide array of cataloging and access tools.
Goals of the WARC file format include the following.
The WARC file format is made sufficiently different from the legacy ARC format files so that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.
A WARC format file is the simple concatenation of one or more WARC records. A WARC record consists of a record header followed by a record content block and two newlines where newlines are CRLF as per other Internet standards such as [RFC0822] (Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.). The first record in a WARC file usually describes the records to follow. Subsequent records contain content blocks that are either the direct result of a retrieval attempt — web pages, inline images, URL redirection information, DNS hostname lookup results, standalone files, etc. — or they are synthesized content blocks (e.g., metadata, transformed content) that provide additional information about archived content.
The format of a WARC file can be expressed in IETF Augmented Backus-Naur Form (ABNF) grammar as specified in [RFC2234] (Crocker, D., Ed. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” November 1997.) as follows. (All-caps "core" elements are as defined in RFC2234.)
    warc-file   = 1*warc-record
    warc-record = header block CRLF CRLF
    header      = header-line CRLF *anvl-field CRLF
    block       = *OCTET
The WARC record header declares baseline identifying information about the current record, and allows additional per-record information. It consists of one line of required positional parameters, the record header-line, then a variable number of lines of named parameters (anvl-fields).
The record header-line is a newline-terminated sequence of whitespace-delimited text tokens representing parameters such as record length, time of creation, and subject URI.
    header-line = warc-id tsp data-length tsp record-type tsp
                  subject-uri tsp creation-date tsp
                  record-id tsp content-type
    tsp         = 1*WSP
The amount of whitespace between header-line parameters is variable. This gives archive builders the flexibility to add padding and later adjust pre-written header parameters when final values are only completely known after the record content block has been written.
The WARC record header-line tokens are defined as follows:
    warc-id       = "WARC/" 1*DIGIT "." 1*DIGIT
    data-length   = 1*DIGIT
    record-type   = "warcinfo" / "response" / "resource" / "request"
                    / "metadata" / "revisit" / "conversion"
                    / "continuation" / future-type
    future-type   = 1*VCHAR
    subject-uri   = uri
    uri           = <'URI' per RFC3986>
    creation-date = timestamp
    timestamp     = <date per below>
    record-id     = uri
    content-type  = type "/" subtype *(";" parameter)
    type          = <'type' per Section 5.1 of RFC2045>
    subtype       = <'subtype' per Section 5.1 of RFC2045>
    parameter     = <'parameter' per Section 5.1 of RFC2045>
DIGIT and VCHAR are as specified in [RFC2234] (Crocker, D., Ed. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” November 1997.). No parameter may be written with internal whitespace except the last, content-type.
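As an illustration of the header-line grammar, the following Python sketch splits a header-line into its seven positional parameters. The names HeaderLine and parse_header_line are hypothetical and not part of this specification; the sketch assumes the line has already been decoded to text and checks only the warc-id and data-length syntax.

```python
import re
from typing import NamedTuple


class HeaderLine(NamedTuple):
    warc_id: str
    data_length: int
    record_type: str
    subject_uri: str
    creation_date: str
    record_id: str
    content_type: str


def parse_header_line(line: str) -> HeaderLine:
    """Split one record header-line into its seven positional parameters."""
    # WSP is SPACE or HTAB only; drop the line terminator and any padding.
    text = line.rstrip("\r\n").strip(" \t")
    # maxsplit=6 leaves whitespace inside the final content-type untouched.
    tokens = re.split(r"[ \t]+", text, maxsplit=6)
    if len(tokens) != 7:
        raise ValueError(f"expected 7 positional parameters, got {len(tokens)}")
    warc_id, length, rtype, uri, date, rid, ctype = tokens
    if not re.fullmatch(r"WARC/[0-9]+\.[0-9]+", warc_id):
        raise ValueError(f"bad warc-id: {warc_id!r}")
    if not re.fullmatch(r"[0-9]+", length):
        raise ValueError(f"bad data-length: {length!r}")
    return HeaderLine(warc_id, int(length), rtype, uri, date, rid, ctype)
```

Limiting the split to six separators is what allows content-type, and only content-type, to carry internal whitespace.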
warc-id
A fixed pattern, "WARC/0.10", that appears first in every record and hence begins the WARC file itself. It serves to identify the file format and version to outside inspection, and to assist error recovery when a process reading a WARC file fails to find the next record boundary where expected. Occurrences of this string are not definitively the same as record boundaries, since the string may by chance occur inside a record. However, it may still be useful to locate such strings when attempting to recover from file corruption which renders one or more data-length parameters unreliable. The warc-id string may change in future versions, but will always begin "WARC/", continue with version numbers, and end at whitespace.
data-length
The number of octets in the record, starting with the first letter ("W") of the first token, through to the end of the content block, not including the 2 record-ending newlines. After proceeding this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the end of the file. (WARC reading implementations may choose to tolerate more or fewer newlines at the end of a record.)
If the next token does not match the first token of a WARC record, then the previous data-length should be considered in error; corrective action might include searching for a nearby occurrence of "WARC/0.10" and other character patterns indicative of a legal record beginning.
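The use of data-length to step from record to record, with a fallback search when a length proves unreliable, might be sketched as follows. This is a minimal illustration that assumes the whole file is held in memory as bytes; iter_records, skip_newlines, and RECORD_START are hypothetical names, and the recovery heuristic shown (resume at the next string that looks like a warc-id followed by a data-length) is only one of many possible corrective actions.

```python
import re
from typing import Iterator

# A candidate record start: warc-id, whitespace, then data-length.
RECORD_START = re.compile(rb"WARC/[0-9]+\.[0-9]+[ \t]+([0-9]+)[ \t]")


def skip_newlines(buf: bytes, pos: int) -> int:
    """Advance past any run of CRLF pairs (tolerating more or fewer than two)."""
    while buf.startswith(b"\r\n", pos):
        pos += 2
    return pos


def iter_records(buf: bytes) -> Iterator[bytes]:
    """Yield each record (header plus content block) found in buf."""
    pos = 0
    while pos < len(buf):
        m = RECORD_START.match(buf, pos)
        if m is not None:
            end = pos + int(m.group(1))
            after = skip_newlines(buf, end)
            # Trust data-length only if it covers at least the tokens already
            # read and lands on end of input or on another record start.
            plausible = m.end() <= end <= len(buf) and (
                after == len(buf) or RECORD_START.match(buf, after) is not None
            )
            if plausible:
                yield buf[pos:end]
                pos = after
                continue
        # data-length missing or in error: resynchronise on the next warc-id.
        nxt = RECORD_START.search(buf, pos + 1)
        if nxt is None:
            return
        pos = nxt.start()
```

A record whose data-length is judged to be in error is skipped entirely in this sketch; a production reader would more likely report it and attempt to salvage its content.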
record-type
The kind of WARC record. All record types are optional, though starting all WARC files with a "warcinfo" record is recommended. Record types are defined in the section Record Types.
subject-uri
The original URI whose collection gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. The URI in this value should be properly escaped according to [RFC2396] (Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” August 1998.) and written with no internal whitespace.
creation-date
A 14-digit timestamp in the format YYYYMMDDhhmmss representing the GMT time when record creation began. Multiple records written as part of a single collection action may share the same creation-date, even though the times of their writing will not be exactly synchronized.
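For illustration, a creation-date value can be produced and read back with Python's standard datetime module. The function names below are hypothetical; the sketch assumes the caller supplies a timezone-aware datetime so that conversion to GMT is unambiguous.

```python
import re
from datetime import datetime, timezone


def format_creation_date(moment: datetime) -> str:
    """Render an aware datetime as a 14-digit YYYYMMDDhhmmss GMT timestamp."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def parse_creation_date(stamp: str) -> datetime:
    """Parse a 14-digit creation-date into an aware GMT datetime."""
    if not re.fullmatch(r"[0-9]{14}", stamp):
        raise ValueError(f"not a 14-digit timestamp: {stamp!r}")
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
```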
record-id
An identifier assigned to the record that is globally unique for its period of intended use. No identifier scheme is mandated by this specification, but each record-id should be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g., via a URI scheme prefix such as "http:" or "urn:"). Care should be taken to ensure that this value is written with no internal whitespace.
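Since no identifier scheme is mandated, the following is only one possible choice: a sketch that mints a record-id as a random UUID in URN form using Python's standard uuid module. The function name new_record_id is hypothetical, and an archive with persistence requirements may well prefer a different scheme.

```python
import uuid


def new_record_id() -> str:
    """Return a fresh record-id such as 'urn:uuid:...' with no whitespace."""
    # uuid4() is random, so uniqueness is probabilistic rather than guaranteed.
    return uuid.uuid4().urn
```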
content-type
The MIME type [RFC2045] (Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” November 1996.) of the information contained in the record's content block. For content in HTTP request and response records, this should be 'application/http' as per Section 19.1 of [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.) (or 'application/http; msgtype=request' and 'application/http; msgtype=response' respectively). In particular, the content-type is not the value of the HTTP Content-Type header in an HTTP response but a MIME type to describe the content body (hence 'application/http' if the content body contains response headers and the response itself). Whitespace (WSP [RFC2234] (Crocker, D., Ed. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” November 1997.)) delimiting 'parameters' or inside 'quoted-string' is allowed. This is the only positional parameter that may legally contain whitespace.
Zero or more anvl-field named parameters expressed in A Name-Value Language [ANVL] (Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .) follow the header-line in a line-oriented syntax very similar to that of email headers [RFC0822] (Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.) but with unrestricted "text" values. The precise format is as follows:
    anvl-field = field-name ":" [ field-body ] CRLF
    field-name = 1*<any CHAR, excluding control-chars and ":">
    field-body = text [CRLF 1*WSP field-body]
    text       = 1*<any UTF-8 character, including bare CR
                   and bare LF, but NOT including CRLF>
                                              ; (Octal, Decimal.)
    CHAR       = <any ASCII/UTF-8 character>  ; (0-177, 0.-127.)
    CR         = <ASCII CR, carriage return>  ; ( 15, 13.)
    LF         = <ASCII LF, linefeed>         ; ( 12, 10.)
    SPACE      = <ASCII SP, space>            ; ( 40, 32.)
    HTAB       = <ASCII HT, horizontal-tab>   ; ( 11, 9.)
    CRLF       = CR LF
    WSP        = SPACE / HTAB                 ; semantics = SPACE
Normally, named parameters are optional and their order is not significant; however, specific record types require that certain named parameters be present (and future extensions may have ordering requirements). If there are no named parameters present, the entire WARC record header is the line of positional parameters followed by one blank line (two consecutive newlines). See the section Named Parameters for the list of allowed values.
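A reader of the named parameters might be sketched as below. The function name parse_named_parameters is hypothetical. The sketch assumes the text following the header-line has already been decoded, that surrounding whitespace in a value is not significant, and that a continuation line is joined to the previous value with a single space; it returns a list of pairs rather than a mapping because a name may legitimately occur more than once.

```python
from typing import List, Tuple


def parse_named_parameters(text: str) -> List[Tuple[str, str]]:
    """Parse anvl-field lines up to the blank line that ends the header."""
    fields: List[Tuple[str, str]] = []
    # Split on CRLF only: bare CR and bare LF may appear inside values.
    for line in text.split("\r\n"):
        if line == "":
            break  # blank line: end of record header
        if line[0] in " \t":
            # Continuation line: fold into the previous field's value.
            if not fields:
                raise ValueError("continuation line before any field")
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip(" \t"))
        else:
            name, sep, value = line.partition(":")
            if not sep or not name:
                raise ValueError(f"not a named parameter: {line!r}")
            fields.append((name, value.strip(" \t")))
    return fields
```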
Each record's content block contains zero or more bytes of data, interpreted according to the record type and any preceding headers, up through the remaining number of octets as specified in the previously-given data-length parameter.
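Because data-length counts the header as well as the content block, a writer must know the size of the header-line before it can write the length that the header-line contains. One way around this, relying on the variable whitespace permitted between positional parameters, is to reserve a fixed-width, space-padded field for data-length. The sketch below illustrates that idea; build_record, its arguments, and the 12-character field width are assumptions made for the example, not requirements of the format.

```python
from typing import Iterable, Tuple

LENGTH_WIDTH = 12  # assumed width of the padded data-length field


def build_record(
    record_type: str,
    subject_uri: str,
    creation_date: str,
    record_id: str,
    content_type: str,
    block: bytes,
    named: Iterable[Tuple[str, str]] = (),
) -> bytes:
    """Serialise one record, including its two trailing newlines."""
    prefix = b"WARC/0.10 "
    rest = " ".join([record_type, subject_uri, creation_date, record_id, content_type])
    fields = "".join(f"{name}: {value}\r\n" for name, value in named)
    # header = header-line CRLF *anvl-field CRLF, then the content block.
    after_length = f" {rest}\r\n{fields}\r\n".encode("utf-8") + block
    # data-length runs from the "W" of warc-id to the end of the block.
    total = len(prefix) + LENGTH_WIDTH + len(after_length)
    padded = str(total).rjust(LENGTH_WIDTH).encode("ascii")
    return prefix + padded + after_length + b"\r\n\r\n"
```

Since the field width is fixed, the total can be computed before the digits are known, and a writer streaming a large block could equally overwrite the reserved field after the block has been written.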
Every WARC record has a type. There are 8 currently defined WARC record types: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The purpose and use of each type is described below.
New record types that extend the WARC format may be defined in the future. WARC processing software should skip records of unknown type. A forum for proposals and discussion in advance of standardization is the discussion list email@example.com.
A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or until the next 'warcinfo' record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often contains a description of a web crawl (e.g., depth, timeout, purpose). The format of the description is outside the scope of this document, but may include such things as approximate maximum archive file size (e.g., 1GB) and site entry point URIs for a targeted crawl.
So that multiple record excerpts from inside WARC files may also be valid WARC files, it is not strictly required that the first record of a legal WARC be a 'warcinfo' description. Also, to allow the concatenation of WARC files into a larger valid WARC file, it is allowable for 'warcinfo' records to appear in the middle of a WARC file.
The subject-uri of a 'warcinfo' record should be a URI name which references the WARC file itself.
A 'response' record contains an entire protocol response, such as a full HTTP response including headers and content-body, from a network retrieval. Often the payload of such a response reflects the main collection objective of the archiving service, whose responsibility it is to distinguish payload from protocol headers during subsequent processing. A response record often includes the named parameters 'IP-Address' and 'Related-Record-ID'.
A 'resource' record contains a resource, without full protocol response information. For example: a file directly retrieved from a locally accessible repository, or the result of a networked retrieval where the protocol information has been discarded. A resource record often includes the named parameter 'Related-Record-ID'.
A 'request' record holds the manner in which a primary record's content was requested. In a web crawling context, this would hold the HTTP request. A request record often includes the named parameter 'Related-Record-ID'.
A 'metadata' record contains content created in order to further describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost always refer to another record of another type, with that other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other 'metadata' records.) Any number of metadata records may be created that reference one specific other record. The format of the metadata is outside the scope of this document, but potential formats are [ANVL] (Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .) and [RDF] (“Resource Description Framework (RDF)”). A metadata record often includes the named parameter 'Related-Record-ID'.
A 'revisit' record describes the revisitation of content already archived, and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is used instead of 'response' or 'resource' record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.
A 'revisit' record should only be used when interpreting the record requires consulting a previous record; other record types should be preferred if the current record is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.)
The format of a 'revisit' record's content block is outside the scope of this document and may vary to accomplish different goals such as recording the apparent magnitude of difference from the previous visit, or to encode the visited content as a "diff" -- where "diff" is the file comparison utility that outputs the differences between two files -- of the content previously stored. The purpose of this record type is to reduce storage redundancy when repeatedly retrieving identical or little-changed content, while still recording that a revisit occurred, plus details about the current state of the visited content relative to the archived version. A revisit record requires the named parameter 'Related-Record-ID'.
A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival process. Typically, this is used to hold content transformations that maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order to keep the information usable with current tools while minimizing loss of information (intellectual content, look and feel, etc). Any number of transformation records may be created that reference a specific source record, which may itself contain transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record. Metadata records may be used to further describe transformation records. A conversion record requires the named parameter 'Related-Record-ID'.
Specification of the fields and metadata formats used to describe a 'conversion' record is outside the scope of this document.
A 'continuation' record needs to be logically appended to a prior record (e.g., from another WARC file) to create the logically complete full-sized record. This is used when a record that would otherwise cause the WARC file size to exceed a desired limit is broken into segments. See the section Truncated and Segmented Records for more information. A continuation record requires the named parameters 'Segment-Origin-ID' and 'Segment-Number'.
Allowed named parameters are detailed below. Of note, there is nothing to preclude multiple instances of a particular named field per record. For example, a record may relate to multiple other records. In this case, a writer may output multiple instances of the Related-Record-ID named field, one per referred-to record.
Additional named parameters may be proposed by WARC users, who are urged to publicly document and discuss with the WARC community new named parameters before use.
IP-Address
The numeric Internet address contacted to retrieve any included content. An IPv4 address should be written as a "dotted quad"; an IPv6 address as per [RFC1884] (Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” December 1995.). For an HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record's subject-uri.
Checksum
An optional parameter indicating the name of a digest algorithm run against the content block and the string representation of the resulting value computed. The value takes the form algorithm:value.
As of this writing, this document recommends no particular algorithm.
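Purely as an illustration of the algorithm:value form, the sketch below computes a Checksum value over a content block with Python's standard hashlib module. The function name, the default choice of SHA-1, and the use of lowercase hexadecimal for the string representation are all assumptions of the example; none is mandated here.

```python
import hashlib


def checksum_value(block: bytes, algorithm: str = "sha1") -> str:
    """Return 'algorithm:value' for the given content block."""
    # Hexadecimal is an assumed string representation of the digest.
    digest = hashlib.new(algorithm, block).hexdigest()
    return f"{algorithm}:{digest}"
```

A writer following this sketch would emit the result as the body of a Checksum named parameter.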
Related-Record-ID
The identifier of the record for which the present record holds related content. This parameter is required of the record types 'revisit' and 'conversion'. It is also required to associate records of types 'request', 'response', 'resource', and 'metadata' with one another, when desired. However, none of these record types necessarily takes precedence over the others to become the referred-to (primary) record. (Any of them may appear first or alone.)
A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the appendix Considerations in Choice of record-id. This creates satellite record-ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some cases derivation) of related records.
Segment-Origin-ID
In a continuation record, this named parameter is mandatory. It identifies the record of the first segment of the set.
Segment-Number
In the first segment of a record that is completed in one or more later 'continuation' WARC records, this parameter is mandatory. Its value is "1". In a 'continuation' record, this parameter is also mandatory. Its value is the sequence number of the current segment in the logical whole record, increasing by 1 in each next segment.
Truncated
When present, indicates that the current record ends before the apparent end of the source material, but no continuation records are forthcoming. Possible values indicate the reason for the truncation: 'length' for exceeding a desired length limit; 'time' for exceeding a desired time limit during collection.
Warcinfo-ID
When present, indicates the record-id of the associated 'warcinfo' record for this record. Typically, the Warcinfo-ID parameter is used when the context of the applicable 'warcinfo' record is unavailable, such as after distributing single records into separate WARC files. WARC writing applications (such as web crawlers) may choose to record this parameter routinely. The Warcinfo-ID parameter overrides any association with a previously occurring (in the WARC) 'warcinfo' record, thus providing a way to protect the true association when records are combined from different WARCs. Use of this parameter in a record of type 'warcinfo' is undefined and reserved for possible future extension.
For practical reasons, writers of the WARC format may place limits on the time or storage allocated to archiving a single resource. As a result, only a truncated portion of the original resource may be available for saving into a WARC record.
Additionally, users will often want to keep individual WARC files near or below some target size, such as 100MB or 1GB. If some records would be too large to be contained by a single WARC file of desired maximum size, those records will have to be split between multiple WARC files.
This section defines mechanisms for indicating that a WARC record has been truncated or split into multiple records, called segments, across WARC files.
These mechanisms are provisional and subject to change. A superior method of indicating truncation and segmentation may be developed, which better allows the writing of records to begin without foreknowledge of their final length.
Any record may indicate that truncation has occurred and give the reason by the addition of a named 'Truncated' field in the record header. Acceptable values for this field include 'time' for truncation due to exceeding a time limit, and 'length' for truncation due to exceeding a length limit.
A record that will not fit into a single WARC file of desired maximum size may be broken into any number of separate records, called segments. As much as possible, segmentation should be avoided, and where necessary, segments other than the first must be of record-type 'continuation'.
The first segment must carry the record-type (not 'continuation') that the record would have had were it not broken into segments, and a 'Segment-Number' named field with a value of "1".
All subsequent segments must have a record type of 'continuation', with an incremented 'Segment-Number' field. They must also include a 'Segment-Origin-ID' field with a value of the Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters.
The last segment must contain an "End-Length" named field specifying the total length, in bytes, of all segment content if reassembled. The last segment may also contain a 'Truncated' field, if appropriate. Segments other than the first should contain no other named parameters, as they merely serve to continue the record data block of the first record.
To reassemble all segments into the intended complete logical record, all records with the same 'Segment-Origin-ID' value must be collected and appended, in 'Segment-Number' order, to the origin record.
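The reassembly procedure might be sketched as follows. The Record class and the reassemble function are hypothetical stand-ins for whatever in-memory representation a reader uses; the sketch assumes each record's named parameters are available as a simple name-to-value mapping and that all candidate records have already been gathered from the relevant WARC files.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    record_id: str
    record_type: str
    block: bytes
    params: Dict[str, str] = field(default_factory=dict)


def reassemble(origin: Record, records: List[Record]) -> bytes:
    """Append continuation blocks to the origin block in Segment-Number order."""
    parts = [
        r for r in records
        if r.record_type == "continuation"
        and r.params.get("Segment-Origin-ID") == origin.record_id
    ]
    parts.sort(key=lambda r: int(r.params["Segment-Number"]))
    numbers = [int(r.params["Segment-Number"]) for r in parts]
    # The origin is segment 1; continuations must then run 2, 3, ... unbroken.
    if origin.params.get("Segment-Number") != "1" or numbers != list(range(2, len(parts) + 2)):
        raise ValueError("missing, duplicated, or misnumbered segment")
    data = origin.block + b"".join(p.block for p in parts)
    # If the last segment states an End-Length, use it as a consistency check.
    end_length = parts[-1].params.get("End-Length") if parts else None
    if end_length is not None and int(end_length) != len(data):
        raise ValueError("reassembled length does not match End-Length")
    return data
```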
A full HTTP or HTTPS response, with protocol information and content-body (if any), can be saved verbatim into a WARC file as a 'response' type record, with a MIME content-type of 'application/http' (or 'application/http; msgtype=response').
A full HTTP or HTTPS request, including all request headers and content-body (if any), can similarly be saved verbatim into a WARC file as a 'request' type record, with a MIME content-type of 'application/http' (or 'application/http; msgtype=request').
For either a request or response, an 'IP-Address' field should be used to record the network IP address to which the request was directed, using the best available DNS information at the time.
Additional metadata about the HTTP or HTTPS transaction may be stored in a 'metadata' type record, in a format to be specified elsewhere. In particular, information about the secure session in which an HTTPS transaction occurs, such as certificates presented or consulted and authentication information exchanged, may be stored in one or more 'metadata' type records.
The multiple records which pertain to a single HTTP or HTTPS logical group of records will all have unique record-id values. In order to associate the records, all but one must use 'Related-Record-ID' fields to refer to another record in the set.
As any mixture of record types may appear for a single collection event, and in any order, there is no specific record type which is automatically considered primary. Generally, all may refer back to the one record which appeared first, but this is not required. (A 'request' record may refer to a 'response' record or vice-versa; either could refer to a 'metadata' record or a 'metadata' record could refer to either.) Multiple and bidirectional 'Related-Record-ID' fields may appear.
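Because references may point in either direction and no record is inherently primary, a reader that wants to recover each logical group can treat 'Related-Record-ID' fields as undirected links and collect connected sets. The sketch below does this with a small union-find structure; group_related and its input shape (pairs of a record-id and the list of record-ids it refers to) are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def group_related(links: Iterable[Tuple[str, List[str]]]) -> List[Set[str]]:
    """Group record-ids that are connected by Related-Record-ID references."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for record_id, related in links:
        find(record_id)  # register records that have no references at all
        for other in related:
            parent[find(record_id)] = find(other)

    groups: Dict[str, Set[str]] = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), set()).add(record_id)
    return list(groups.values())
```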
In the case where resources from a website have been harvested or otherwise received without performing normal HTTP operations, or where HTTP protocol information has been lost, it may be appropriate to store the plain content in WARC 'resource' type records, under their original subject-uri, but using the content MIME type in place of the 'application/http' type.
A request for DNS information can be summarized in a URI in accordance with [RFC4501] (Josefsson, S., “Domain Name System Uniform Resource Identifiers,” May 2006.). DNS information as retrieved can be represented in the formats specified by [RFC1035] (Mockapetris, P., “Domain names - implementation and specification,” November 1987.), [RFC2540] (Eastlake, D., “Detached Domain Name System (DNS) Information,” March 1999.), and [RFC4027] (Josefsson, S., “Domain Name System Media Types,” April 2005.).
The results of a DNS lookup can thus be straightforwardly archived in a WARC 'response' record under the appropriate DNS URI and MIME type. If present, the IP-Address named field should be the address of the DNS server that provided the DNS record.
Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file under a 'resource' type record. This includes files that have meaningful URIs retrieved from a locally-accessible filesystem or other repository.
Specific conventions for other protocols and media types are expected to be defined as necessary. In general, the WARC format should be capable of archiving any digital resource which has a URI, a specific time of collection, and a discrete length.
The 'request' and 'response' record types should be used for verbatim or lossless transcripts of collection activity, including protocol information. The 'resource' record type should be used for content without any protocol-specific enveloping. Additional information about a resource or transaction can be supplied in a protocol- or media-appropriate manner with 'metadata' type records.
This section describes, as per [RFC2048] (Freed, N., Klensin, J., and J. Postel, “Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures,” November 1996.), the MIME types associated with the WARC format.
MIME media type name: application
MIME subtype names: warc
Required parameters: None
Optional parameters: None
Encoding considerations: Content of this type is in 'binary' format.
Security considerations: The WARC record syntax poses no direct risk to computers and networks. Implementors need to be aware of source authority and trustworthiness of information structured in WARC. Readers and writers subject themselves to all the risks that accompany normal operation of data processing services (e.g., message length errors, buffer overflow attacks).
Interoperability considerations: None
Published specification: TBD
Applications which use this media type: Large- and small-scale archiving
Additional information: None
Person and email address to contact for further information:
Gordon Mohr firstname.lastname@example.org, John Kunze email@example.com
Intended usage: COMMON
Author/Change controller: IESG
After IESG approval, IANA is expected to register the WARC type "application/warc" using the application provided in this document.
This document could not have been written without major contributions from participants of the International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes.
The WARC format differs significantly from the ARC format in requiring the record-id parameter. The record-id must be globally unique for its period of intended use. If that period is indefinite, the record-id should be maintained to a level appropriate for any persistent identifier, in which case identifier opaqueness is usually desirable.
There is no reason why the archiving institution may not choose record-ids that are also "actionable" (submittable as retrieval requests to widely available tools such as web browsers) as long as there are providers to service them. This specification does not dictate what identifier scheme to use; suitable schemes include URN (Moats, R., “URN Syntax,” May 1997.) [RFC2141], [ARK] (Kunze, J. and R. Rodgers, “The ARK Persistent Identifier Scheme,” August 2005.), [GUID] (“Wikipedia: Globally Unique Identifiers”), etc.
Also worth considering is the establishment of lexical conventions for record-ids that reveal or suggest relationships among content blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to convey relatedness in the context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a database or examination of the relevant WARC files.
These conventions are suggested by [RFC2396], formalized by the [ARK] scheme, and are applicable to such things as the summarizing of large search results from Internet-wide indexing engines. As an example of a convention that could be adopted by users of any identifier scheme, the "/" character could be reserved as a separator used to introduce an extension string that is appended to a primary record-id. If the record-id of a primary block of captured content were,
The convention could also reserve the extension strings "_s", "_d", and "_t" to indicate record-ids for secondary, duplicate, and transform blocks, respectively. Over time this might result in the assignment of record-ids such as,
http://abc.org/12026/987654321/_s1
http://abc.org/12026/987654321/_s2
http://abc.org/12026/987654321/_d9
http://abc.org/12026/987654321/_d10
http://abc.org/12026/987654321/_t
...in which an integer count may further extend the identifier when there is more than one relationship of the given type.
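The extension convention can be sketched in a few lines of Python. This is an illustration only, not part of the format: the function names and the `EXTENSIONS` table are hypothetical, and the primary record-id used in the usage note is inferred from the sample identifiers above.

```python
from typing import Optional

# Extension strings reserved by the illustrative convention described above.
EXTENSIONS = {"secondary": "_s", "duplicate": "_d", "transform": "_t"}


def derive_record_id(primary_id: str, relation: str,
                     count: Optional[int] = None) -> str:
    """Append "/" plus a relationship extension to a primary record-id.

    An integer count is added only when more than one related block of
    the same kind exists.
    """
    suffix = EXTENSIONS[relation]
    if count is not None:
        suffix += str(count)
    return f"{primary_id.rstrip('/')}/{suffix}"


def primary_of(record_id: str) -> Optional[str]:
    """Infer the primary record-id by identifier comparison alone.

    Returns None when the identifier does not end in one of the
    reserved extension strings (optionally followed by digits).
    """
    head, sep, tail = record_id.rpartition("/")
    if not sep:
        return None
    for ext in EXTENSIONS.values():
        rest = tail[len(ext):]
        if tail.startswith(ext) and (rest == "" or rest.isdigit()):
            return head
    return None
```

For instance, `derive_record_id("http://abc.org/12026/987654321", "duplicate", 10)` yields the fourth identifier listed above, and `primary_of` recovers the primary record-id from it without consulting a database or the WARC files themselves.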
The WARC format defines no internal compression. Whether and how WARC files should be compressed is an external decision.
However, experience with the precursor ARC format at the Internet Archive has demonstrated that applying simple standard compression can result in significant storage savings, while preserving random access to individual records.
For this purpose, the GZIP format with customary "deflate" compression is recommended, as defined in [RFC1950], [RFC1951], and [RFC1952]. Source code implementing this format is freely available, and the technique is free of patent encumbrances. The GZIP format is also widely used and supported across many free and commercial software packages and operating systems.
This section documents recommended, but optional, practices for compressing WARC files with GZIP.
Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed.
Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.
External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.
Note that the application of this convention causes no change to the uncompressed contents of an individual WARC record. In particular, the declared record length remains the length of the uncompressed record.
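As a minimal sketch of record-at-a-time compression, the following Python uses the standard `gzip` module to write each record as its own gzip member and to build an external index of member offsets. The function names and the `(offset, length)` index layout are assumptions made for illustration; the specification does not define an index format.

```python
import gzip
from typing import BinaryIO, Iterable, List, Tuple


def write_compressed_warc(records: Iterable[bytes],
                          out: BinaryIO) -> List[Tuple[int, int]]:
    """Write each record as an independent gzip member.

    Returns one (offset, compressed_length) pair per record, suitable
    for storing in an external index.
    """
    index = []
    for record in records:
        offset = out.tell()
        member = gzip.compress(record)  # one complete gzip member
        out.write(member)
        index.append((offset, len(member)))
    return index


def read_record(f: BinaryIO, offset: int, length: int) -> bytes:
    """Decompress a single record without touching preceding records."""
    f.seek(offset)
    return gzip.decompress(f.read(length))
```

Because the members are simply concatenated, the output remains a valid GZIP file as a whole, while `read_record` can seek directly to any indexed record. The bytes passed in are the complete uncompressed records, so their declared lengths are unaffected.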
A GZIP-compressed WARC file should have the customary ".gz" appended to its name, making the complete suffix ".warc.gz".
It is helpful to use practices within an institution that make it unlikely or impossible to duplicate aggregate WARC file names. The convention used inside the Internet Archive with ARC files is to name files according to the following pattern:
Prefix is an abbreviation usually reflective of the project or crawl that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often (but not necessarily) unique with regard to the Prefix. Crawlhost is the domain name or IP address of the machine creating the file.
IIPC member institutions have expressed an interest in adopting a common naming strategy, with per-institution unique identifiers to assist in marking WARC files with their institution of origin. It is proposed that all such WARC file names adhering to this future convention begin "iipc".
This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted within WARC-creating institutions. The file name prefix "iipc" should be avoided unless participating in the IIPC naming registry.
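A file-naming helper along these lines might look as follows. This is a sketch under stated assumptions: joining the four components with hyphens in the order Prefix, Timestamp, Serial, Crawlhost, and zero-padding the serial to five digits, are choices made here for illustration. The function name and the `in_iipc_registry` flag are likewise hypothetical.

```python
from datetime import datetime, timezone


def warc_file_name(prefix: str, started: datetime, serial: int,
                   crawlhost: str, compressed: bool = True,
                   in_iipc_registry: bool = False) -> str:
    """Build a WARC file name from the four components described above.

    `started` must be a timezone-aware datetime; it is converted to GMT
    and rendered as a 14-digit YYYYMMDDhhmmss timestamp.
    ASSUMPTION: hyphen separators and a 5-digit serial are illustrative.
    """
    if prefix.lower().startswith("iipc") and not in_iipc_registry:
        raise ValueError('prefix "iipc" is reserved for the IIPC naming registry')
    timestamp = started.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    suffix = ".warc.gz" if compressed else ".warc"
    return f"{prefix}-{timestamp}-{serial:05d}-{crawlhost}{suffix}"
```

Including the start time, a per-process serial, and the creating host in the name is what makes accidental duplication of file names within an institution unlikely.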
1GB (10^9 bytes) is recommended as a practical target size for WARC files, when record sizes allow. Oversized records may be truncated, segmented, or simply placed in oversized WARC files, at a project's discretion.
warc-file      = 1*warc-record
warc-record    = header block CRLF CRLF
header         = header-line CRLF *anvl-field CRLF
block          = *OCTET
header-line    = warc-id tsp data-length tsp record-type tsp
                 subject-uri tsp creation-date tsp
                 record-id tsp content-type
tsp            = 1*WSP
warc-id        = "WARC/" 1*DIGIT "." 1*DIGIT
data-length    = 1*DIGIT
record-type    = "warcinfo" / "response" / "request" / "metadata" /
                 "revisit" / "conversion" / "continuation" /
                 future-type
future-type    = 1*VCHAR
subject-uri    = uri
uri            = <'URI' per RFC3986>
creation-date  = timestamp
timestamp      = 14*14DIGIT   ; GMT formatted as YYYYMMDDhhmmss
record-id      = uri
content-type   = type "/" subtype *(";" parameter)
type           = <'type' per Section 5.1 of RFC2045>
subtype        = <'subtype' per Section 5.1 of RFC2045>
parameter      = <'parameter' per Section 5.1 of RFC2045>
anvl-field     = field-name ":" [ field-body ] CRLF
field-name     = 1*<any CHAR, excluding control-chars and ":">
field-body     = text [CRLF 1*WSP field-body]
text           = 1*<any UTF-8 character, including bare
                 CR and bare LF, but NOT including CRLF>
                                              ; (Octal, Decimal.)
CHAR           = <any ASCII/UTF-8 character>  ; (0-177, 0.-127.)
CR             = <ASCII CR, carriage return>  ; ( 15, 13.)
LF             = <ASCII LF, linefeed>         ; ( 12, 10.)
SPACE          = <ASCII SP, space>            ; ( 40, 32.)
HTAB           = <ASCII HT, horizontal-tab>   ; ( 11, 9.)
CRLF           = CR LF
WSP            = SPACE / HTAB                 ; semantics = SPACE

All-caps "core" elements in the above are as defined in [RFC2234].
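To make the header-line grammar concrete, here is a minimal Python sketch that splits a header-line into its seven positional parameters. The `HeaderLine` type and `parse_header_line` function are hypothetical names; the sketch checks only the warc-id, data-length, and timestamp productions and does not validate URIs or the content-type syntax.

```python
import re
from typing import NamedTuple


class HeaderLine(NamedTuple):
    version: str        # digits from warc-id, e.g. "0.10"
    data_length: int
    record_type: str
    subject_uri: str
    creation_date: str  # 14-digit GMT timestamp, YYYYMMDDhhmmss
    record_id: str
    content_type: str


_WARC_ID = re.compile(r"WARC/([0-9]+\.[0-9]+)")
_DIGITS = re.compile(r"[0-9]+")
_TIMESTAMP = re.compile(r"[0-9]{14}")


def parse_header_line(line: str) -> HeaderLine:
    """Parse one header-line (without its terminating CRLF)."""
    # tsp = 1*WSP, so fields are separated by runs of spaces or tabs.
    # maxsplit=6 keeps the content-type, the last field, in one piece.
    fields = re.split(r"[ \t]+", line.strip(" \t\r\n"), maxsplit=6)
    if len(fields) != 7:
        raise ValueError("header-line needs 7 positional parameters")
    warc_id, length, rtype, subject, date, rid, ctype = fields
    match = _WARC_ID.fullmatch(warc_id)
    if match is None:
        raise ValueError(f"bad warc-id: {warc_id!r}")
    if _DIGITS.fullmatch(length) is None:
        raise ValueError(f"bad data-length: {length!r}")
    if _TIMESTAMP.fullmatch(date) is None:
        raise ValueError(f"bad creation-date: {date!r}")
    return HeaderLine(match.group(1), int(length), rtype, subject,
                      date, rid, ctype)
```

Unknown record types are accepted as-is, mirroring the `future-type` production, so that a reader built this way can skip records it does not understand by using the declared data-length.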
Examples of each record-type are provided here. In some cases, illustrative data is shown where conventions have not yet been specified. Each record header-line is split over multiple lines for readability; continuations of the single line are indented, and a newline should only be considered to appear at the end of the last indented line. Declared record lengths are approximate, and unique IDs and checksums shown are plausible random filler.
The following 'warcinfo' example includes an XML description of the enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an abbreviated and speculative illustration; the referenced WARC-specific namespace 'http://archive.org/warc/0.10/' has not been formally defined anywhere, and may not reflect eventual practice with WARC files.
WARC/0.10 223 warcinfo
  urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39 20060919172014
  urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39 text/xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<warcmetadata
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:warc="http://archive.org/warc/0.10/">
  <warc:software>
    Heritrix 1.4.0 http://crawler.archive.org
  </warc:software>
  <warc:hostname>crawling017.archive.org</warc:hostname>
  <warc:ip>22.214.171.124</warc:ip>
  <dcterms:isPartOf>testcrawl-20050708</dcterms:isPartOf>
  <dc:description>testcrawl with WARC output</dc:description>
  <warc:operator>IA_Admin</warc:operator>
  <warc:http-header-user-agent>
    Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)
  </warc:http-header-user-agent>
  <dc:format>WARC file version 0.10</dc:format>
  <dcterms:conformsTo xsi:type="dcterms:URI">
    http://www.archive.org/documents/WarcFileFormat-0.10.html
  </dcterms:conformsTo>
</warcmetadata>

The first line (spread over three lines for readability) shows the required line of positional parameters. The data-length is space-padded and the subject-uri for this warcinfo record is the same as its record-id. This record has no named parameters, as evidenced by the single blank line following the header-line. The content block is 'text/xml', as declared in the header-line. Two newlines follow the content block.
A 'request' record captures the protocol request used to collect a resource. For example, to collect the resource 'http://www.archive.org/images/logoc.jpg', the following 'request' record might be generated:
WARC/0.10 501 request
  http://www.archive.org/images/logoc.jpg 20060919172024
  urn:uuid:4885803b-eebd-4b27-a090-144450c11594
  application/http;msgtype=request
Related-Record-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0

GET /images/logoc.jpg HTTP/1.0^M
User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)
From: firstname.lastname@example.org
Connection: close
Referer: http://www.archive.org/
Host: www.archive.org
Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824

The data-length is space-padded. The Related-Record-ID named field points to the related response record (see Example of 'response' Record).
The response record to match the above example request might look like the following:
WARC/0.10 2210 response
  http://www.archive.org/images/logoc.jpg 20060919172024
  urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0
  application/http;msgtype=response
Checksum: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
IP-Address: 126.96.36.199

HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg

[image/jpeg binary data here]
Note that the creation-date for the 'response' is identical to that of the previous 'request' record. The IP-Address named field is that of the server that generated the response.
This same file, 'logoc.jpg', might be archived internally to an organization under its local filesystem name. This could result in a 'resource' record:
WARC/0.10 2210 resource
  file://var/www/htdoc/images/logoc.jpg 20060919172024
  urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0
  image/jpeg
Checksum: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2

[image/jpeg binary data here]

If some crawl-time metadata should be archived near the above 'response' record (see Example of 'response' Record), a 'metadata' record like the following could be used (using ANVL format):
WARC/0.10 261 metadata
  http://www.archive.org/images/logoc.jpg 20060919172024
  urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943
  text/anvl
Related-Record-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0

via: http://www.archive.org/
pathFromSeed: E
downloadTimeMS: 565
Note again the same creation-date as the preceding related records. A relationship is declared to the preceding 'response' record but declaring a relationship to the 'request' would also be legal.
If the same URI is later revisited and the content is unchanged, a 'revisit' record like the following (again with a speculative content-type) could be generated:
WARC/0.10 395 revisit
  http://www.archive.org/images/logoc.jpg 20060919190040
  urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb
  text/xml
Related-Record-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0

<?xml version="1.0"?>
<revisit xmlns="http://archive.org/revisit/0.10/">
  <server-response-excerpt>
    HTTP/1.x 304 Not Modified
    Date: Mon, 08 Aug 2005 01:01:01 GMT
    Etag: "914480-1b2e-42ab8245"
  </server-response-excerpt>
</revisit>
Again, reference is made back to the original 'response' record. A new creation-date reflects the time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) The actual formats for describing the result of a revisit remain to be defined.
At some future date, the 'image/jpeg' format may no longer be considered viable, prompting a conversion of the original archive content into a hypothetical new format, 'image/neoimg', which generates a 3098 byte version of the same image. This could be accommodated with a 'conversion' record:
WARC/0.10 4111 conversion
  http://www.archive.org/images/logoc.jpg 20160919190040
  urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd
  image/neoimg
Related-Record-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0
Checksum: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK

[image/neoimg binary data here]
An accompanying 'metadata' record, referring to this 'conversion' record, could contain additional details about the transformation. (Alternatively, new named-fields in this record could serve this role.)
If the 'response' above had been so large that it would not fit into a single WARC file of desired maximum size, it would have to be segmented into separate smaller records. The first record would be as before, except with one additional named field, 'Segment-Number', with a value of '1', indicating that the record was the beginning of a segmented record set.
The subsequent segment for that record would then look like this:
WARC/0.10 39514322 continuation
  http://www.archive.org/images/logoc.jpg 20160919172024
  urn:uuid:16da6da0-bcdc-49c3-927e-57494593eeee
  application/http;msgtype=response
Segment-Origin-ID: urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0
Segment-Number: 2

[Segment 2 application/http binary data here]
Note that the 'Segment-Origin-ID' refers to the first segment of the set, the one with the 'Segment-Number: 1' named field.
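The segmentation scheme in this example can be sketched as a small Python generator. The name `segment_block` and its tuple layout are assumptions made for illustration; assigning a fresh record-id to each 'continuation' record and writing the header-lines are left to the caller.

```python
from typing import Dict, Iterator, Tuple


def segment_block(block: bytes, max_segment: int, origin_id: str,
                  record_type: str) -> Iterator[Tuple[str, Dict[str, str], bytes]]:
    """Yield (record_type, named_fields, payload) for each segment.

    The first segment keeps the original record type and gains
    'Segment-Number: 1'. Later segments become 'continuation' records
    that point back to the first via 'Segment-Origin-ID'.
    """
    if max_segment <= 0:
        raise ValueError("max_segment must be positive")
    if len(block) <= max_segment:
        # Small enough for one record: no segmentation fields needed.
        yield record_type, {}, block
        return
    starts = range(0, len(block), max_segment)
    for number, start in enumerate(starts, start=1):
        fields = {"Segment-Number": str(number)}
        rtype = record_type
        if number > 1:
            rtype = "continuation"
            fields["Segment-Origin-ID"] = origin_id
        yield rtype, fields, block[start:start + max_segment]
```

Concatenating the payloads in Segment-Number order reproduces the original content block, which is what allows a reader to reassemble the record from segments stored in separate WARC files.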
[ANVL] Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language.”
[ARC] Burner, M. and B. Kahle, “The ARC File Format,” September 1996.
[ARK] Kunze, J. and R. Rodgers, “The ARK Persistent Identifier Scheme,” August 2005.
[GUID] “Wikipedia: Globally Unique Identifiers.”
[HERITRIX] “Heritrix Open Source Archival Web Crawler.”
[IIPC] “International Internet Preservation Consortium (IIPC).”
[RDF] “Resource Description Framework (RDF).”
[RFC0822] Crocker, D., “Standard for the format of ARPA Internet text messages,” STD 11, RFC 822, August 1982.
[RFC1035] Mockapetris, P., “Domain names - implementation and specification,” STD 13, RFC 1035, November 1987.
[RFC1884] Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” RFC 1884, December 1995.
[RFC1950] Deutsch, L. and J-L. Gailly, “ZLIB Compressed Data Format Specification version 3.3,” RFC 1950, May 1996.
[RFC1951] Deutsch, P., “DEFLATE Compressed Data Format Specification version 1.3,” RFC 1951, May 1996.
[RFC1952] Deutsch, P., Gailly, J-L., Adler, M., Deutsch, L., and G. Randers-Pehrson, “GZIP file format specification version 4.3,” RFC 1952, May 1996.
[RFC2045] Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” RFC 2045, November 1996.
[RFC2048] Freed, N., Klensin, J., and J. Postel, “Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures,” BCP 13, RFC 2048, November 1996.
[RFC2141] Moats, R., “URN Syntax,” RFC 2141, May 1997.
[RFC2234] Crocker, D., Ed. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” RFC 2234, November 1997.
[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” RFC 2396, August 1998.
[RFC2540] Eastlake, D., “Detached Domain Name System (DNS) Information,” RFC 2540, March 1999.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” RFC 2616, June 1999.
[RFC4027] Josefsson, S., “Domain Name System Media Types,” RFC 4027, April 2005.
[RFC4501] Josefsson, S., “Domain Name System Uniform Resource Identifiers,” RFC 4501, May 2006.
John A. Kunze (editor)
California Digital Library
415 20th St, 4th Floor
Oakland, CA 94612-3550

Kungliga biblioteket (National Library of Sweden)
Fax: +46 (0)8 463 4004

4 Funston Ave, Presidio
San Francisco, CA 94117

4 Funston Ave, Presidio
San Francisco, CA 94117