IIPC Framework Working Group
J. Kunze, Ed.
California Digital Library
A. Arvidson
Kungliga biblioteket (National Library of Sweden)
G. Mohr
M. Stack
Internet Archive
January 2006

The WARC File Format (Version 0.9)

Abstract

The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. Resources are dated, identified by URIs, and preceded by simple text headers. By convention, files of this format are named with the extension ".warc" and have the MIME type application/warc.

The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers. This document specifies version 0.9 of the WARC format.

Kunze, et al. [Page 1] WARC File Format, 0.9 January 2006

Table of Contents

1. Introduction
2. Goals
3. The WARC Record Model
4. Record Types
   4.1. 'warcinfo'
   4.2. 'response'
   4.3. 'resource'
   4.4. 'request'
   4.5. 'metadata'
   4.6. 'revisit'
   4.7. 'conversion'
   4.8. 'continuation'
5. Record Header
   5.1. Positional Parameters
   5.2. Named Parameters
6. Record Content Block
7. Truncated and Segmented Records
   7.1.
Record Truncation
   7.2. Record Segmentation
8. WARC Application to Specific Protocols
   8.1. HTTP and HTTPS
   8.2. DNS
   8.3. Other Resources with URIs, and Other Protocols
9. Compression Recommendations
   9.1. Record-at-a-time Compression
   9.2. GZIP extra field: skip-lengths ('sl')
   9.3. GZIP WARC File Name Suffix
10. WARC File Name and Size Recommendations
11. Registration of MIME Media Type application/warc
12. IANA Considerations
13. Acknowledgements
Appendix A. Considerations in Choice of record-id
Appendix B. Examples of WARC Records
   Appendix B.1. Example of 'warcinfo' Record
   Appendix B.2. Example of 'request' Record
   Appendix B.3. Example of 'response' Record
   Appendix B.4. Example of 'resource' Record
   Appendix B.5. Example of 'metadata' Record
   Appendix B.6. Example of 'revisit' Record
   Appendix B.7. Example of 'conversion' Record
   Appendix B.8. Example of 'continuation' Record
Appendix C. Collected BNF for WARC
14. References
Authors' Addresses

1.
Introduction

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records, each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is a revision of the ARC file format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web.

The original ARC format file is used internally by the Internet Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content.

The motivation to revise the format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC) [IIPC], whose members include the IA and the national libraries of a dozen countries. The revised format is expected to be a standard way to structure, manage and store billions of collected web resources. For example, WARC will be an output format of harvesting software, such as the open-source Heritrix [HERITRIX] web crawler, and an input format for a wide array of cataloguing and access tools.

The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that are backwards compatible, WARC content is clearly distinguishable from pre-revision ARC content.

2. Goals

Goals of the WARC file format include the following.
   o  Ability to store both the payload content and control information from mainstream Internet application layer protocols, including HTTP, FTP, NNTP, and SMTP.

   o  Ability to store arbitrary metadata linked to other stored data (e.g., subject classifier, discovered language, encoding).

   o  Support for data compression and maintenance of data record integrity.

   o  Ability to store all control information from the harvesting protocol (e.g., request headers), not just response information.

   o  Ability to store the results of data transformations linked to other stored data.

   o  Ability to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources).

   o  Amenable to efficient processing.

   o  Sufficiently different from the legacy ARC format files that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.

   o  Ability to store globally unique record identifiers.

   o  Support for deterministic handling of long records (e.g., truncation, segmentation).

3. The WARC Record Model

A WARC format file is the simple concatenation of one or more WARC records. A record consists of a record header followed by a record content block and two newlines. (Newlines are CRLF as per other Internet standards.) This can be summarized in the following [RFC2234] IETF ABNF grammar. (All-caps "core" elements are as defined in RFC2234.)

      warc-file   = 1*warc-record
      warc-record = header block CRLF CRLF
      header      = header-line CRLF *anvl-field CRLF
      block       = *OCTET

Elements of this grammar are further specified and explained in sections that follow.
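As a rough illustration of the grammar above, the following Python sketch frames one record from its parts and concatenates records into a file. The names assemble_record and assemble_file are hypothetical and are not defined by this specification. The header-line and each named field are treated here as opaque byte strings; their syntax, including the data-length value that the header-line must carry, is covered in later sections and is not computed by this sketch.

```python
from typing import Iterable

CRLF = b"\r\n"


def assemble_record(header_line: bytes,
                    anvl_fields: Iterable[bytes],
                    block: bytes) -> bytes:
    """Frame one record: warc-record = header block CRLF CRLF.

    Hypothetical helper. header_line and each entry of anvl_fields are
    passed without their terminating CRLF.
    """
    # header = header-line CRLF *anvl-field CRLF
    header = header_line + CRLF
    for field in anvl_fields:
        # Each anvl-field is terminated by its own CRLF.
        header += field + CRLF
    # The final CRLF is the blank line that ends the header.
    header += CRLF
    # Two CRLF newlines close the record after the content block.
    return header + block + CRLF + CRLF


def assemble_file(records: Iterable[bytes]) -> bytes:
    """warc-file = 1*warc-record: plain concatenation of records."""
    data = b"".join(records)
    if not data:
        raise ValueError("a WARC file needs at least one record")
    return data
```

With no named fields, the header reduces to the header-line followed by one blank line, so an empty block yields the header-line followed by four CRLF pairs in total.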
The record _header-line_ is a newline-terminated sequence of whitespace-delimited text tokens representing parameters such as record length, time of creation, and subject URI.

      header-line = warc-id tsp data-length tsp record-type tsp
                    subject-uri tsp creation-date tsp
                    content-type tsp record-id
      tsp         = 1*WSP

The amount of whitespace between _header-line_ tokens is variable. This gives archive builders the flexibility to add padding and later adjust pre-written header parameters when final values are only completely known after the record content _block_ has been written.

After the _header-line_ come zero or more named ANVL [ANVL] fields in a line-oriented syntax very similar to that of email headers [RFC0822] but with unrestricted "text" values (none of its 13 reserved special characters). The precise format is as follows:

      anvl-field  =  field-name ":" [ field-body ] CRLF
      field-name  =  1*
      field-body  =  text [CRLF LWSP-char field-body]
      text        =  1*
                                                ; (Octal, Decimal.)
      CHAR        =  <any ASCII character>      ; (0-177, 0.-127.)
      CR          =  <ASCII CR, carriage return> ; (   15,     13.)
      LF          =  <ASCII LF, linefeed>       ; (   12,     10.)
      SPACE       =  <ASCII SP, space>          ; (   40,     32.)
      HTAB        =  <ASCII HT, horizontal-tab> ; (   11,      9.)
      CRLF        =  CR LF
      LWSP-char   =  SPACE / HTAB               ; semantics = SPACE

This document defines a number of named fields that may appear as an _anvl-field_. Note that the smallest possible _anvl-field_ section is a single CRLF, indicating no named fields.

Following the headers comes the content _block_, if any, which may contain arbitrary binary data, up through the remaining number of octets as specified in the previously-given _data-length_ parameter. Finally come two CRLF newlines, not counted in the declared record _data-length_.

It is often the case that the first record of a WARC file has the record-type 'warcinfo' and is used to describe the records that follow it.
It is always the case that the concatenation of any two WARC files is a syntactically correct WARC file; care should be taken, however, when concatenation would inadvertently cause 'warcinfo' records to appear at points in the result that would create confusion.

Subsequent records contain content blocks that are either the direct result of a retrieval attempt -- web pages, inline images, URL redirection information, DNS hostname lookup results, standalone files, etc. -- or they are synthesized content blocks (e.g., metadata, transformed content) that provide additional information about archived content. Any content block may contain arbitrary text or binary data.

4. Record Types

There are 8 currently defined WARC record types: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The purpose and use of each type is described below.

New record types that extend the WARC format may be defined in the future. WARC processing software should skip records of unknown type. A forum in which new types are likely to be proposed and discussed in advance of standardization is the discussion list standards@netpreserve.org.

4.1. 'warcinfo'

A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or another 'warcinfo' record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often contains a description of a web crawl (e.g., depth, timeout, purpose). The format of the description is outside the scope of this document, but may include such things as (a) approximate maximum archive file size (e.g., 500MB), (b) rate of crawling, and (c) site entry point URIs for a targeted crawl.

So that multiple record excerpts from inside WARC files may also be valid WARC files, it is not strictly required that the first record of a legal WARC be a 'warcinfo' description.
Also, to allow the concatenation of WARC files into a larger valid WARC file, it is allowable for 'warcinfo' records to appear in the middle of a WARC file.

The subject-uri of a 'warcinfo' record should be a URI name, synthesized as necessary, which references the WARC file itself.

4.2. 'response'

A 'response' record contains an entire protocol response, such as a full HTTP response including headers and content-body, from an Internet retrieval. Often the payload of such a response reflects the main collection objective of the archiving service, whose responsibility it is to distinguish payload from protocol headers during subsequent processing. A response record often includes the named parameters 'IP-Address' and 'Related-Record-ID'.

4.3. 'resource'

A 'resource' record contains a resource, without full protocol response information. For example: a file directly retrieved from a locally accessible repository, or the result of a networked retrieval where the protocol information has been discarded. A resource record often includes the named parameter 'Related-Record-ID'.

4.4. 'request'

A 'request' record holds the manner in which a primary record's content was requested. (In a web crawling context, this would hold the HTTP request.) A request record often includes the named parameter 'Related-Record-ID'.

4.5. 'metadata'

A 'metadata' record contains content created in order to further describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost always refer to another record of another type, with that other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other 'metadata' records, or to refer to no other individual record at all.) Any number of metadata records may be created that reference one specific other record.
The format of the metadata is outside the scope of this document, but potential formats are [ANVL] and [RDF] or other XML-based formats. A metadata record often includes the named parameter 'Related-Record-ID'.

4.6. 'revisit'

A 'revisit' record describes the revisitation of content already archived, and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is used instead of a 'response' or 'resource' record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.

A 'revisit' record should only be used when interpreting the record requires consulting a previous record; other record types should be preferred if the current record is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.)

The format of a 'revisit' record's content block will be specified elsewhere, and may vary to accomplish different goals, such as recording the apparent magnitude of difference from the previous visit, or to encode the visited content as a "diff" of the content previously stored. The purpose of this record type is to reduce storage redundancy when repeatedly retrieving identical or little-changed content, while still recording that a revisit occurred, plus details about the current state of the visited content relative to the archived version. A revisit record requires the named parameter 'Related-Record-ID'.

4.7. 'conversion'

A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival process. Typically, this is used to hold content transformations that maintain viability of content after widely available rendering tools for the originally stored format disappear.
As needed, the original content may be migrated (transformed) to a more viable format in order to keep the information usable with current tools while minimizing loss of information (intellectual content, look and feel, etc.). Any number of transformation records may be created that reference a specific source record, which may itself contain transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record. Metadata records may be used to further describe transformation records. A conversion record requires the named parameter 'Related-Record-ID'.

Specification of the fields and metadata formats used to describe a 'conversion' record is outside the scope of this document.

4.8. 'continuation'

A 'continuation' record needs to be logically appended to a prior record (e.g., from another WARC file) to create the logically complete full-sized record. This is used when a record that would otherwise cause the WARC file size to exceed a desired limit is broken into segments. See the section on Truncated and Segmented Records for more information. A continuation record requires the named parameters 'Segment-Origin-ID' and 'Segment-Number', and often includes the named parameter 'Related-Record-ID'.

5. Record Header

The WARC record header declares baseline identifying information about the current record, and allows additional per-record information. It consists of one first line of required positional parameters, then a variable number of lines of named parameters.

Positional parameters on the first header line are tokens separated from each other by one or more spaces. Positional parameter order is significant. One of the parameters, data-length, indicates the combined length of the header and block sections of this record (excepting the final CRLF CRLF), in octets, counting from the first character of the record header first line.
(As specified below, this character is always "w".) The data-length is thus the most important header parameter for efficient bulk scanning, which may need to skip entire records. The header-line parameters are:

      warc-id        = "warc/0.9"
      data-length    = 1*DIGIT
      record-type    = "warcinfo" / "response" / "resource" / "request"
                       / "metadata" / "revisit" / "conversion"
                       / "continuation" / future-type
      future-type    = 1*VCHAR
      subject-uri    = uri
      uri            = <'URI' per RFC3986>
      creation-date  = timestamp
      timestamp      =
      content-type   = type "/" subtype
      type           = <'type' per RFC2045>
      subtype        = <'subtype' per RFC2045>
      record-id      = uri

The warc-id string may change in future versions, but will always begin "warc/", continue with version numbers, and end at whitespace.

Named parameters after the header-line, if any, follow the line-oriented syntax defined previously (also known as ANVL [ANVL]). Normally, named parameters are optional and their order is insignificant; however, specific record types require that certain named parameters be present (and future extensions may have ordering requirements). If there are no named parameters present, the entire WARC record header is the line of positional parameters followed by one blank line (two consecutive newlines).

5.1. Positional Parameters

This section describes each of the individual positional parameters of the WARC header-line.

warc-id

   A fixed pattern, "warc/0.9", that appears first in every record and hence begins the WARC file itself. It serves to identify the file format and version to outside inspection, and to assist error recovery when a process reading a WARC file fails to find the next record boundary where expected. Occurrences of this string are not definitively the same as record boundaries, since the string may by chance occur inside a record.
   However, it may still be useful to locate such strings when attempting to recover from file corruption which renders one or more data-length parameters unreliable.

data-length

   The number of octets in the record, starting with the first letter ("w") of the first token, through to the end of the content block -- not including the 2 record-ending newlines. After proceeding this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the end of the file. (WARC reading implementations may choose to tolerate more or fewer newlines at the end of a record.) If the next token does not match the first token of a WARC record, then the previous data-length should be considered in error; corrective action might include searching for a nearby occurrence of "warc/0.9" and other character patterns indicative of a legal record beginning.

record-type

   The kind of WARC record. All record types are optional, though starting all WARC files with a "warcinfo" record is recommended. Record types are defined in the Record Types section.

subject-uri

   The original URI whose collection gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a synthesized value for the creation name of the WARC file, as a URI. Care should be taken to ensure that the URI in this value is properly escaped (per [RFC2396]) and that it is written with no internal whitespace.

creation-date

   A 14-digit timestamp in the format YYYYMMDDhhmmss representing the GMT time when record creation began.
   Multiple records written as part of a single collection action may share the same creation-date, even though the times of their writing will not be exactly synchronized.

content-type

   The MIME type [RFC2045] of the information contained in the record's content block. (Type and subtype only.) For content in HTTP request and response records, this should be "message/http"; in particular, it is not the content-type of any HTTP content body. Care should be taken to ensure that this value is written with no internal whitespace.

record-id

   An identifier assigned to the record that is globally unique for its period of intended use. No identifier scheme is mandated by this specification, but each record-id should be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g., via a URI scheme prefix such as "http:"). Care should be taken to ensure that this value is written with no internal whitespace.

5.2. Named Parameters

Named parameters, also referred to as named fields, are optional except as noted otherwise. Additional named parameters may be proposed by WARC users, who are urged to publicly document and discuss with the WARC community new named parameters before use.

IP-Address: IP-address

   The numeric Internet address contacted to retrieve any included content. An IPv4 address should be written as a "dotted quad"; an IPv6 address as per [RFC1884]. For an HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record's subject-uri.

Checksum: algorithm:value

   An optional parameter indicating that just before the content block for this record was stored, the named digest algorithm was run, and it computed the string represented by the given value. An example is:

      Checksum: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ

   As of this writing, this document recommends no particular algorithm, though a future recommendation is possible.
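As an informal illustration of the Checksum parameter described above, the following Python sketch produces one such named field. It rests on two assumptions that this document does not state: that the digest is computed over the bytes of the content block, and that the value is written in lowercase hexadecimal. The function name checksum_field is hypothetical.

```python
import hashlib


def checksum_field(block: bytes, algorithm: str = "sha1") -> str:
    """Return a 'Checksum: algorithm:value' named field for a content block.

    Assumptions (not fixed by the specification): the digest covers the
    content block only, and the value is hex-encoded.
    """
    # hashlib.new accepts the algorithm name and the initial data.
    digest = hashlib.new(algorithm, block).hexdigest()
    return "Checksum: " + algorithm + ":" + digest
```

A writer would run this just before storing the content block and emit the returned string as one named field line in the record header.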
Related-Record-ID: record-id

   The identifier of the record for which the present record holds related content. This parameter is required of the record types 'revisit' and 'conversion'. It is also required to associate records of types 'request', 'response', 'resource', and 'metadata' with one another, when desired. However, none of these record types necessarily takes precedence over the others to become the referred-to (primary) record. (Any of them may appear first or alone.)

   A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about record-id considerations. This creates satellite record-ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some cases derivation) of related records.

Segment-Origin-ID: record-id

   In a continuation record, this identifies the record of the first segment of the set.

Segment-Number: integer

   In the first segment of a record that is completed in one or more later 'continuation' WARC records, this parameter is "1". In a 'continuation' record, this parameter is the sequence number of the current segment in the logical whole record, increasing by 1 in each next segment.

Truncated: reason-token

   When present, indicates that the current record ends before the apparent end of the source material, but no continuation records are forthcoming. Possible values indicate the reason for the truncation: 'length' for exceeding a desired length limit; 'time' for exceeding a desired time limit during collection.

Warcinfo-ID: record-id

   When present, indicates the record-id of the associated 'warcinfo' record for this record. Typically, the Warcinfo-ID parameter is used when the context of the applicable 'warcinfo' record is unavailable, such as after distributing single records into separate WARC files.
   WARC writing applications (such as web crawlers) may choose to record this parameter routinely (e.g., before computing checksums). The Warcinfo-ID parameter overrides any association with a previously occurring (in the WARC) 'warcinfo' record, thus providing a way to protect the true association when records are combined from different WARCs. Use of this parameter in a record of type 'warcinfo' is undefined and reserved for possible future extension.

6. Record Content Block

Each record's content block contains zero or more bytes of data, interpreted according to the record type and any preceding headers. For 'response', '

7. Truncated and Segmented Records

For practical reasons, writers of the WARC format may place limits on the time or storage allocated to archiving a single resource. As a result, only a truncated portion of the original resource may be available for saving into a WARC record.

Additionally, users will often want to keep individual WARC files near or below some target size, such as 100MB or 500MB. If some records would be too large to be contained by a single WARC file of desired maximum size, those records will have to be split between multiple WARC files.

This section defines mechanisms for indicating that a WARC record has been truncated or split into multiple records, called segments, across WARC files. These mechanisms are provisional and subject to change. A superior method of indicating truncation and segmentation may be developed, which better allows the writing of records to begin without foreknowledge of their final length.

7.1. Record Truncation

Any record may indicate that truncation has occurred and give the reason by the addition of a named 'Truncated' field in the record header.
Acceptable values for this field include 'time' for truncation due to exceeding a time limit, and 'length' for truncation due to exceeding a length limit.

7.2. Record Segmentation

A record that will not fit into a single WARC file of desired maximum size may be broken into any number of separate records, called segments. As much as possible, segmentation should be avoided, and where necessary, segments other than the first must be of record-type 'continuation'.

The first segment must carry the record-type (not 'continuation') that the record would have had were it not broken into segments, and a 'Segment-Number' named field with a value of "1".

All subsequent segments must have a record type of 'continuation', with an incremented 'Segment-Number' field. They must also include a 'Segment-Origin-ID' field with a value of the record-id of the record containing the first segment of the set.

All segments of a set must have identical subject-uri parameters.

The last segment must contain an 'End-Length' named field specifying the total length, in bytes, of all segment content if reassembled. The last segment may also contain a 'Truncated' field, if appropriate.

Segments other than the first should contain no other named fields, as they merely serve to continue the record data block of the first record.

To reassemble all segments into the intended complete logical record, all records with the same 'Segment-Origin-ID' value must be collected and appended, in 'Segment-Number' order, to the origin record.

8. WARC Application to Specific Protocols

8.1. HTTP and HTTPS

A full HTTP or HTTPS response, with protocol information and content-body (if any), can be saved verbatim into a WARC file as a 'response' type record, with a MIME content-type of "message/http" (or "message/http;msgtype=response").
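The response convention above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function name build_response_record and its parameters are hypothetical, the caller is assumed to supply a subject-uri and record-id that contain no whitespace, and only the 'IP-Address' named field is written. Because data-length counts the header octets, including its own digits, the sketch recomputes the header until the declared and actual lengths agree; padding the header-line with extra whitespace, which the record model also permits, would be an alternative.

```python
import time
from typing import Optional

CRLF = b"\r\n"


def build_response_record(subject_uri: str, record_id: str,
                          ip_address: str, http_message: bytes,
                          created: Optional[float] = None) -> bytes:
    """Wrap a verbatim HTTP response in a 'response' record (sketch)."""
    # creation-date: 14-digit GMT timestamp, YYYYMMDDhhmmss.
    creation_date = time.strftime("%Y%m%d%H%M%S", time.gmtime(created))
    named = ("IP-Address: " + ip_address).encode("ascii") + CRLF

    # data-length covers header plus block, so it depends on its own
    # digit count; iterate until the declared value matches reality.
    data_length = 0
    while True:
        header_line = " ".join([
            "warc/0.9", str(data_length), "response", subject_uri,
            creation_date, "message/http", record_id,
        ]).encode("ascii")
        header = header_line + CRLF + named + CRLF
        actual = len(header) + len(http_message)
        if actual == data_length:
            break
        data_length = actual

    # The two closing CRLF newlines are not counted in data-length.
    return header + http_message + CRLF + CRLF
```

A reader can check such a record by skipping data-length octets from the first "w" and confirming that exactly two CRLF newlines follow.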
A full HTTP or HTTPS request, including all request headers and content-body (if any), can similarly be saved verbatim into a WARC file as a 'request' type record, with a MIME content-type of "message/http" (or "message/http;msgtype=request").

For either a request or response, an 'IP-Address' field should be used to record the network IP address to which the request was directed, using the best available DNS information at the time.

Additional metadata about the HTTP or HTTPS transaction may be stored in a 'metadata' type record, in a format to be specified elsewhere. In particular, information about the secure session in which an HTTPS transaction occurs, such as certificates presented or consulted and authentication information exchanged, may be stored in one or more 'metadata' type records.

The multiple records which pertain to a single HTTP or HTTPS logical group of records will all have unique record-id values. In order to associate the records, all but one must use 'Related-Record-ID' fields to refer to another record in the set. As any mixture of record types may appear for a single collection event, and in any order, there is no specific record type which is automatically considered primary. Generally, all may refer back to the one record which appeared first, but this is not required. (A 'request' record may refer to a 'response' record or vice-versa; either could refer to a 'metadata' record or a 'metadata' record could refer to either.) Multiple and bidirectional 'Related-Record-ID' fields may appear.

In the case where resources from a website have been harvested or otherwise received without performing normal HTTP operations, or where HTTP protocol information has been lost, it may be appropriate to store the plain content in WARC 'resource' type records, under their original subject-uri, but using the content MIME type in place of the "message/http" type.

8.2.
DNS

A request for DNS information can be summarized in a URI in accordance with an IETF Network Working Group draft proposal [DNS-URI]. DNS information as retrieved can be represented in the formats specified by [RFC1035], [RFC2540], and [RFC4027]. The results of a DNS lookup can thus be straightforwardly archived in a WARC 'response' record under the appropriate DNS URI and MIME type.

8.3. Other Resources with URIs, and Other Protocols

Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file under a 'resource' type record. This includes files that have meaningful URIs retrieved from a locally-accessible filesystem or other repository.

Specific conventions for other protocols and media types are expected to be defined as necessary. In general, the WARC format should be capable of archiving any digital resource which has a URI, a specific time of collection, and a discrete length. The 'request' and 'response' record types should be used for verbatim or lossless transcripts of collection activity, including protocol information. The 'resource' record type should be used for content without any protocol-specific enveloping. Additional information about a resource or transaction can be supplied in a protocol- or media-appropriate manner with 'metadata' type records.

9. Compression Recommendations

The WARC format defines no internal compression. Whether and how WARC files should be compressed is an external decision.

However, experience with the precursor ARC format at the Internet Archive has demonstrated that applying simple standard compression can result in significant storage savings, while preserving random access to individual records.

For this purpose, the GZIP format with customary "deflate" compression is recommended, as defined in [RFC1950], [RFC1951], and [RFC1952].
Source code implementing this format is freely available, and the technique is free of patent encumbrances. The GZIP format is also widely used and supported across many free and commercial software packages and operating systems.

This section documents recommended, but optional, practices for compressing WARC files with GZIP.

9.1. Record-at-a-time Compression

Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed.

Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

Note that the application of this convention causes no change to the uncompressed contents of an individual WARC record. In particular, the declared record length remains the length of the uncompressed record.

9.2. GZIP extra field: skip-lengths ('sl')

Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after reading a small portion of a record, would like to skip to the next full record. In the absence of an external, precalculated index, using only the WARC record's uncompressed length would require the complete current record to be decompressed to find the start of the next record.

Section 2.3.1.1 of the GZIP format specification makes an allowance for arbitrary extension fields, called "extra-fields". We define here a new GZIP extra-field, "skip-lengths", identified by the two-byte id "sl" (0x73, 0x6C).
This field, when present, must contain two 4-byte unsigned integer values, with least significant byte first (as per other multi-byte values in the GZIP format). The first integer, compressed-skip-length, is a number of compressed bytes that may be skipped, from the beginning of the current GZIP member, to reach a distinct following member. (This value may be the exact length of the current member, but may also indicate the combined length of several related concatenated members.) The second integer, uncompressed-skip-length, is the number of uncompressed bytes that will be passed over when skipping the compressed-skip-length bytes forward.

With the help of these values, a decompressor can often skip forward past large ranges of the compressed input that are not of interest, restarting decompression at the targeted next member, while retaining knowledge of exactly how many bytes of uncompressed data have been skipped.

If the skip-length value is zero, the field should be ignored as if it were not present. (Compressors writing this field may use a zero value to reserve space for an as-yet-unknown skip-length, filling in the value later if possible.)

This extra-field will be registered with the GZIP authors as appropriate.

9.3. GZIP WARC File Name Suffix

A WARC file compressed with the extra GZIP field conventions described in this document is a legal GZIP file. To ensure that it is properly recognized by GZIP tools, its name should have the customary ".gz" appended to it, making the complete suffix ".warc.gz". GZIP software that does not recognize the extra GZIP fields will simply pass over them without benefit or harm.

10. WARC File Name and Size Recommendations

It is helpful to use practices within an institution that make it unlikely or impossible to duplicate aggregate WARC file names.
The convention used inside the Internet Archive with ARC files is to name files according to the following pattern:

   Prefix-Timestamp-Serial-Crawlhost.warc.gz

Prefix is an abbreviation usually reflective of the project or crawl that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often (but not necessarily) unique with regard to the Prefix. Crawlhost is the domain name or IP address of the machine creating the file.

IIPC member institutions have expressed an interest in adopting a common naming strategy, with per-institution unique identifiers to assist in marking WARC files with their institution of origin. It is proposed that all such WARC file names adhering to this future convention begin "iipc".

This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted within WARC-creating institutions. The file name prefix "iipc" should be avoided unless participating in the IIPC naming registry.

500MB (5x10^8 bytes) is recommended as a practical target size for WARC files, when record sizes allow. Oversized records may be truncated, segmented, or simply placed in oversized WARC files, at a project's discretion.

11. Registration of MIME Media Type application/warc

This section describes, as per [RFC2048], the MIME types associated with the WARC format.

MIME media type name: application

MIME subtype names: warc

Required parameters: None

Optional parameters: None

Encoding considerations: UTF-8 is the default character encoding for the textual information defined by the WARC format. However, any binary data may be included as blocks within the WARC format, and so only "8bit" and "binary" encoding is allowable.

Security considerations: The WARC record syntax poses no direct risk to computers and networks.
Implementors need to be aware of source authority and trustworthiness of information structured in WARC. Readers and writers subject themselves to all the risks that accompany normal operation of data processing services (e.g., message length errors, buffer overflow attacks).

Interoperability considerations: None

Published specification: TBD

Applications which use this media type: Large- and small-scale archiving

Additional information: None

Person and email address to contact for further information: Gordon Mohr <gojomo@archive.org>, John Kunze <jak@ucop.edu>

Intended usage: COMMON

Author/Change controller: IESG

12. IANA Considerations

After IESG approval, IANA is expected to register the WARC type "application/warc" using the application provided in this document.

13. Acknowledgements

This document could not have been written without major contributions from participants of the International Internet Preservation Consortium, especially Steen Christensen and Julien Masanes.

Appendix A. Considerations in Choice of record-id

The WARC format differs significantly from the ARC format in requiring the record-id parameter. The record-id must be globally unique for its period of intended use. If that period is indefinite, the record-id should be maintained to a level appropriate for any persistent identifier, in which case identifier opaqueness is usually desirable. There is no reason why the archiving institution may not choose record-ids that are also "actionable" (submittable as retrieval requests to widely available tools such as web browsers) as long as there are providers to service them. This specification does not dictate what identifier scheme to use; suitable schemes include URN [RFC2141], [ARK], [GUID], etc.

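As a minimal sketch of one such choice, the following Python fragment mints record-ids in the "uuid:" form that the examples in Appendix B happen to use. The function name new_record_id is hypothetical, the "uuid:" prefix is simply the form seen in those examples rather than a scheme this specification mandates, and a random (version 4) UUID is only one of several ways to obtain practical global uniqueness without coordination between writers.

```python
import uuid


def new_record_id() -> str:
    """Return a record-id in the 'uuid:' form shown in Appendix B.

    Hypothetical helper: the specification does not require UUIDs or
    any other particular identifier scheme.
    """
    # uuid4() is built from random bits, so independent writers are
    # very unlikely to mint the same identifier.
    return "uuid:" + str(uuid.uuid4())


if __name__ == "__main__":
    print(new_record_id())
```

An identifier minted this way is opaque and not "actionable"; an institution wanting resolvable record-ids would substitute a scheme for which it can provide a resolution service.
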
Also worth considering is the establishment of lexical conventions for record-ids that reveal or suggest relationships among content blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to convey relatedness in the context of a single WARC file, considerable optimization is possible when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a database or examination of the relevant WARC files. These conventions are suggested by [RFC2396], formalized by the [ARK] scheme, and are applicable to such things as the summarizing of large search results from Internet-wide indexing engines.

As an example of a convention that could be adopted by users of any identifier scheme, the "/" character could be reserved as a separator used to introduce an extension string that is appended to a primary record-id. Suppose the record-id of a primary block of captured content were

   http://abc.org/12026/987654321

The convention could also reserve the extension strings "_s", "_d", and "_t" to indicate record-ids for secondary, duplicate, and transform blocks, respectively. Over time this might result in the assignment of record-ids such as

   http://abc.org/12026/987654321/_s1
   http://abc.org/12026/987654321/_s2
   http://abc.org/12026/987654321/_d9
   http://abc.org/12026/987654321/_d10
   http://abc.org/12026/987654321/_t

in which an integer count may further extend the identifier when there is more than one relationship of the given type.

Appendix B. Examples of WARC Records

Examples of each record-type are provided here. In some cases, illustrative data is shown where conventions have not yet been specified. Each record header-line is split over multiple lines for readability; continuations of the single line are indented, and a newline should only be considered to appear at the end of the last indented line.
Declared record lengths are approximate, and unique IDs and checksums shown are plausible random filler.

Appendix B.1. Example of 'warcinfo' Record

The following 'warcinfo' example includes an XML description of the enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an abbreviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.9" has not been formally defined anywhere, and may not reflect eventual practice with WARC files.

   warc/0.9 1012 warcinfo
     filedesc:test-20050708010101-00001-crawl017.archive.org.warc.gz
     20050708010101 text/xml uuid:cbad35b7-e591-4b43-8a67-9d1d8f9ef4cd

   Heritrix 1.4.0 http://crawler.archive.org crawling017.archive.org 207.241.227.234 testcrawl-20050708 testcrawl with WARC output IA_Admin Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org) WARC file version 0.9 http://www.archive.org/documents/WarcFileFormat.php

The first line (spread over three lines for readability) shows the required line of positional parameters. This record has no named fields, as evidenced by the single blank line following the header-line. The content block is "text/xml", as declared in the header-line. Two newlines follow the content block.

Appendix B.2. Example of 'request' Record

A 'request' record captures the protocol request used to collect a resource. For example, to collect the resource "http://www.archive.org/images/logo.jpg", the following 'request' record might be generated:

   warc/0.9 298 request http://www.archive.org/images/logo.jpg
     20050708010101 message/http
     uuid:f569983a-ef8c-4e62-b347-295b227c3e51
   IP-Address: 207.241.224.241

   GET /images/logo.jpg HTTP/1.0
   Host: www.archive.org
   User-Agent: Mozilla/5.0 (compatible; crawler/1.4 +http://example.com)

Appendix B.3. Example of 'response' Record

The archived response to the above request might look like the following.

   warc/0.9 7583 response http://www.archive.org/images/logo.jpg
     20050708010101 message/http
     uuid:a4b26b6b-f918-4136-af04-f859d75aebe5
   IP-Address: 207.241.224.241
   Related-Record-ID: uuid:f569983a-ef8c-4e62-b347-295b227c3e51
   Checksum: sha1:2ZWC6JAT6KNXKD37F7MOEKXQMRY75YY4

   HTTP/1.x 200 OK
   Date: Fri, 08 Jul 2005 01:01:01 GMT
   Server: Apache/1.3.33 (Debian GNU/Linux) PHP/5.0.4-0.3
   Last-Modified: Sun, 12 Jun 2005 00:31:01 GMT
   Etag: "914480-1b2e-42ab8245"
   Accept-Ranges: bytes
   Content-Length: 6958
   Keep-Alive: timeout=15, max=100
   Connection: Keep-Alive
   Content-Type: image/jpeg

   [6958 bytes of binary data here]

Note the 'Related-Record-ID' named field referring back to the generating 'request' record, and the creation-date identical to the previous record.

Appendix B.4. Example of 'resource' Record

This same file, "logo.jpg", might be archived internally to an organization under its local filesystem name. This could result in a 'resource' record:

   warc/0.9 7141 resource file://webserver/htdoc/images/logo.jpg
     20050710010101 image/jpeg
     uuid:a6c3132b-49b8-4fd5-8072-45ce66d48a4b
   Checksum: sha1:37F7MOEKXQMRY75YY42ZWC6JAT6KNXKD

   [6958 bytes of binary data here]

Appendix B.5. Example of 'metadata' Record

If some crawl-time metadata should be archived near the above response, a 'metadata' record could be used like the following (with a purely speculative XML format):

   warc/0.9 395 metadata http://www.archive.org/images/logo.jpg
     20050708010101 text/xml
     uuid:a4acff63-c213-4f35-9652-41a0e2dfc492
   Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

   http://www.archive.org 565

Note again the same creation-date as the preceding related records. A relationship is declared to the preceding 'response' record, but declaring a relationship to the 'request' would also be legal.

Appendix B.6. Example of 'revisit' Record

If the same URI is later revisited and the content is unchanged, a 'revisit' record like the following (again with a speculative content-type) could be generated:

   warc/0.9 395 revisit http://www.archive.org/images/logo.jpg
     20050808010101 text/xml
     uuid:ad522b3b-d68c-464a-b5e2-38149cfb511d
   Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

   HTTP/1.x 304 Not Modified
   Date: Mon, 08 Aug 2005 01:01:01 GMT
   Etag: "914480-1b2e-42ab8245"

Again, reference is made back to the original 'response' record. A new creation-date reflects the time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) The actual formats for describing the result of a revisit remain to be defined.

Appendix B.7. Example of 'conversion' Record

At some future date, the "image/jpeg" format may no longer be considered viable, prompting a conversion of the original archive content into a hypothetical new format, "image/neoimg", which generates a 3098 byte version of the same image. This could be accommodated with a 'conversion' record:

   warc/0.9 4111 conversion http://www.archive.org/images/logo.jpg
     20150708010101 image/neoimg
     uuid:c631da8a-e8db-44a8-84c5-9cc848dff35a
   Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5
   Checksum: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK

   [3098 bytes of binary data here]

An accompanying 'metadata' record, referring to this 'conversion' record, could contain additional details about the transformation. (Alternatively, new named-fields in this record could serve this role.)

Appendix B.8. Example of 'continuation' Record

If the 'response' above had been so large that it would not fit into a single WARC file of desired maximum size, it would have to be segmented into separate smaller records. The first record would be as before, except with one additional named field, 'Segment-Number', with a value of "1", indicating that the record was the beginning of a segmented record set. The subsequent segment for that record would then look like this:

   warc/0.9 39514322 continuation http://www.archive.org/images/logo.jpg
     20150708010101 message/http
     uuid:c0d36ada-af8c-4608-8409-e60818b1d9e9
   Segment-Number: 2
   Segment-Origin-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

   [39514114 bytes of binary data here]

Note that the 'Segment-Origin-ID' refers to the first segment of the set, the one with the "Segment-Number: 1" named field.

Appendix C. Collected BNF for WARC

   warc-file     = 1*warc-record
   warc-record   = header block CRLF CRLF
   header        = header-line CRLF *anvl-field CRLF
   block         = *OCTET
   header-line   = warc-id tsp data-length tsp record-type tsp
                   subject-uri tsp creation-date tsp
                   content-type tsp record-id
   tsp           = 1*WSP
   warc-id       = "warc/" DIGIT "." DIGIT
   data-length   = 1*DIGIT
   record-type   = "warcinfo" / "response" / "request" / "metadata" /
                   "revisit" / "conversion" / "continuation" /
                   future-type
   future-type   = 1*VCHAR
   subject-uri   = uri
   uri           = <'URI' per RFC3986>
   creation-date = timestamp
   timestamp     =
   content-type  = type "/" subtype
   type          = <'type' per RFC2045>
   subtype       = <'subtype' per RFC2045>
   record-id     = uri
   anvl-field    = field-name ":" [ field-body ] CRLF
   field-name    = 1*
   field-body    = text [CRLF LWSP-char field-body]
   text          = 1*
                                                 ; (Octal, Decimal.)
   CHAR          = <any ASCII character>         ; (0-177, 0.-127.)
   CR            = <ASCII CR, carriage return>   ; (  15,     13.)
   LF            = <ASCII LF, linefeed>          ; (  12,     10.)
   SPACE         = <ASCII SP, space>             ; (  40,     32.)
   HTAB          = <ASCII HT, horizontal-tab>    ; (  11,      9.)
   CRLF          = CR LF
   LWSP-char     = SPACE / HTAB                  ; semantics = SPACE

14. References

[ANVL] Kunze, J., Kahle, B., Masanes, J., and G. Mohr, "A Name-Value Language".
[ARC] Burner, M. and B. Kahle, "The ARC File Format", September 1996.
[ARK] Kunze, J. and R. Rodgers, "The ARK Persistent Identifier Scheme", August 2005.
[GUID] "Wikipedia: Globally Unique Identifiers".
[HERITRIX] "Heritrix Open Source Archival Web Crawler".
[IIPC] "International Internet Preservation Consortium (IIPC)".
[RDF] "Resource Description Framework (RDF)".
[RFC0822] Crocker, D., "Standard for the format of ARPA Internet text messages", STD 11, RFC 822, August 1982.
[RFC1035] Mockapetris, P., "Domain names - implementation and specification", STD 13, RFC 1035, November 1987.
[RFC1884] Hinden, R. and S. Deering, "IP Version 6 Addressing Architecture", RFC 1884, December 1995.
[RFC1950] Deutsch, L. and J-L. Gailly, "ZLIB Compressed Data Format Specification version 3.3", RFC 1950, May 1996.
[RFC1951] Deutsch, P., "DEFLATE Compressed Data Format Specification version 1.3", RFC 1951, May 1996.
[RFC1952] Deutsch, P., Gailly, J-L., Adler, M., Deutsch, L., and G. Randers-Pehrson, "GZIP file format specification version 4.3", RFC 1952, May 1996.
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, November 1996.
[RFC2048] Freed, N., Klensin, J., and J. Postel, "Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures", BCP 13, RFC 2048, November 1996.
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
[RFC2234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997.
[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998.
[RFC2540] Eastlake, D., "Detached Domain Name System (DNS) Information", RFC 2540, March 1999.
[RFC4027] Josefsson, S., "Domain Name System Media Types", RFC 4027, April 2005.
[DNS-URI] Josefsson, S., "Domain Name System Uniform Resource Identifiers", May 2005.

Authors' Addresses

John A. Kunze (editor)
California Digital Library
415 20th St, 4th Floor
Oakland, CA 94612-3550
US
Fax: +1 510-893-5212
Email: jak@ucop.edu

Allan Arvidson
Kungliga biblioteket (National Library of Sweden)
Box 5039
Stockholm 10241
SE
Fax: +46 (0)8 463 4004
Email: allan.arvidson@kb.se

Gordon Mohr
Internet Archive
4 Funston Ave, Presidio
San Francisco, CA 94117
US
Email: gojomo@archive.org

Michael Stack
Internet Archive
4 Funston Ave, Presidio
San Francisco, CA 94117
US
Email: stack@archive.org