1 2 3 4 5 6 7 Network Working Group P. Deutsch 8 Request for Comments: 1950 Aladdin Enterprises 9 Category: Informational J-L. Gailly 10 Info-ZIP 11 May 1996 12 13 14 ZLIB Compressed Data Format Specification version 3.3 15 16 Status of This Memo 17 18 This memo provides information for the Internet community. This memo 19 does not specify an Internet standard of any kind. Distribution of 20 this memo is unlimited. 21 22 IESG Note: 23 24 The IESG takes no position on the validity of any Intellectual 25 Property Rights statements contained in this document. 26 27 Notices 28 29 Copyright (c) 1996 L. Peter Deutsch and Jean-Loup Gailly 30 31 Permission is granted to copy and distribute this document for any 32 purpose and without charge, including translations into other 33 languages and incorporation into compilations, provided that the 34 copyright notice and this notice are preserved, and that any 35 substantive changes or deletions from the original are clearly 36 marked. 37 38 A pointer to the latest version of this and related documentation in 39 HTML format can be found at the URL 40 <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>. 41 42 Abstract 43 44 This specification defines a lossless compressed data format. The 45 data can be produced or consumed, even for an arbitrarily long 46 sequentially presented input data stream, using only an a priori 47 bounded amount of intermediate storage. The format presently uses 48 the DEFLATE compression method but can be easily extended to use 49 other compression methods. It can be implemented readily in a manner 50 not covered by patents. This specification also defines the ADLER-32 51 checksum (an extension and improvement of the Fletcher checksum), 52 used for detection of data corruption, and provides an algorithm for 53 computing it. 54 55 56 57 58 Deutsch & Gailly Informational [Page 1] 59 60 RFC 1950 ZLIB Compressed Data Format Specification May 1996 61 62 63 Table of Contents 64 65 1. Introduction ................................................... 2 66 1.1. Purpose ................................................... 2 67 1.2. Intended audience ......................................... 3 68 1.3. Scope ..................................................... 3 69 1.4. Compliance ................................................ 3 70 1.5. Definitions of terms and conventions used ................ 3 71 1.6. Changes from previous versions ............................ 3 72 2. Detailed specification ......................................... 3 73 2.1. Overall conventions ....................................... 3 74 2.2. Data format ............................................... 4 75 2.3. Compliance ................................................ 7 76 3. References ..................................................... 7 77 4. Source code .................................................... 8 78 5. Security Considerations ........................................ 8 79 6. Acknowledgements ............................................... 8 80 7. Authors' Addresses ............................................. 8 81 8. Appendix: Rationale ............................................ 9 82 9. Appendix: Sample code ..........................................10 83 84 1. Introduction 85 86 1.1. Purpose 87 88 The purpose of this specification is to define a lossless 89 compressed data format that: 90 91 * Is independent of CPU type, operating system, file system, 92 and character set, and hence can be used for interchange; 93 94 * Can be produced or consumed, even for an arbitrarily long 95 sequentially presented input data stream, using only an a 96 priori bounded amount of intermediate storage, and hence can 97 be used in data communications or similar structures such as 98 Unix filters; 99 100 * Can use a number of different compression methods; 101 102 * Can be implemented readily in a manner not covered by 103 patents, and hence can be practiced freely. 104 105 The data format defined by this specification does not attempt to 106 allow random access to compressed data. 107 108 109 110 111 112 113 114 Deutsch & Gailly Informational [Page 2] 115 116 RFC 1950 ZLIB Compressed Data Format Specification May 1996 117 118 119 1.2. Intended audience 120 121 This specification is intended for use by implementors of software 122 to compress data into zlib format and/or decompress data from zlib 123 format. 124 125 The text of the specification assumes a basic background in 126 programming at the level of bits and other primitive data 127 representations. 128 129 1.3. Scope 130 131 The specification specifies a compressed data format that can be 132 used for in-memory compression of a sequence of arbitrary bytes. 133 134 1.4. Compliance 135 136 Unless otherwise indicated below, a compliant decompressor must be 137 able to accept and decompress any data set that conforms to all 138 the specifications presented here; a compliant compressor must 139 produce data sets that conform to all the specifications presented 140 here. 141 142 1.5. Definitions of terms and conventions used 143 144 byte: 8 bits stored or transmitted as a unit (same as an octet). 145 (For this specification, a byte is exactly 8 bits, even on 146 machines which store a character on a number of bits different 147 from 8.) See below, for the numbering of bits within a byte. 148 149 1.6. Changes from previous versions 150 151 Version 3.1 was the first public release of this specification. 152 In version 3.2, some terminology was changed and the Adler-32 153 sample code was rewritten for clarity. In version 3.3, the 154 support for a preset dictionary was introduced, and the 155 specification was converted to RFC style. 156 157 2. Detailed specification 158 159 2.1. Overall conventions 160 161 In the diagrams below, a box like this: 162 163 +---+ 164 | | <-- the vertical bars might be missing 165 +---+ 166 167 168 169 170 Deutsch & Gailly Informational [Page 3] 171 172 RFC 1950 ZLIB Compressed Data Format Specification May 1996 173 174 175 represents one byte; a box like this: 176 177 +==============+ 178 | | 179 +==============+ 180 181 represents a variable number of bytes. 182 183 Bytes stored within a computer do not have a "bit order", since 184 they are always treated as a unit. However, a byte considered as 185 an integer between 0 and 255 does have a most- and least- 186 significant bit, and since we write numbers with the most- 187 significant digit on the left, we also write bytes with the most- 188 significant bit on the left. In the diagrams below, we number the 189 bits of a byte so that bit 0 is the least-significant bit, i.e., 190 the bits are numbered: 191 192 +--------+ 193 |76543210| 194 +--------+ 195 196 Within a computer, a number may occupy multiple bytes. All 197 multi-byte numbers in the format described here are stored with 198 the MOST-significant byte first (at the lower memory address). 199 For example, the decimal number 520 is stored as: 200 201 0 1 202 +--------+--------+ 203 |00000010|00001000| 204 +--------+--------+ 205 ^ ^ 206 | | 207 | + less significant byte = 8 208 + more significant byte = 2 x 256 209 210 2.2. Data format 211 212 A zlib stream has the following structure: 213 214 0 1 215 +---+---+ 216 |CMF|FLG| (more-->) 217 +---+---+ 218 219 220 221 222 223 224 225 226 Deutsch & Gailly Informational [Page 4] 227 228 RFC 1950 ZLIB Compressed Data Format Specification May 1996 229 230 231 (if FLG.FDICT set) 232 233 0 1 2 3 234 +---+---+---+---+ 235 | DICTID | (more-->) 236 +---+---+---+---+ 237 238 +=====================+---+---+---+---+ 239 |...compressed data...| ADLER32 | 240 +=====================+---+---+---+---+ 241 242 Any data which may appear after ADLER32 are not part of the zlib 243 stream. 244 245 CMF (Compression Method and flags) 246 This byte is divided into a 4-bit compression method and a 4- 247 bit information field depending on the compression method. 248 249 bits 0 to 3 CM Compression method 250 bits 4 to 7 CINFO Compression info 251 252 CM (Compression method) 253 This identifies the compression method used in the file. CM = 8 254 denotes the "deflate" compression method with a window size up 255 to 32K. This is the method used by gzip and PNG (see 256 references [1] and [2] in Chapter 3, below, for the reference 257 documents). CM = 15 is reserved. It might be used in a future 258 version of this specification to indicate the presence of an 259 extra field before the compressed data. 260 261 CINFO (Compression info) 262 For CM = 8, CINFO is the base-2 logarithm of the LZ77 window 263 size, minus eight (CINFO=7 indicates a 32K window size). Values 264 of CINFO above 7 are not allowed in this version of the 265 specification. CINFO is not defined in this specification for 266 CM not equal to 8. 267 268 FLG (FLaGs) 269 This flag byte is divided as follows: 270 271 bits 0 to 4 FCHECK (check bits for CMF and FLG) 272 bit 5 FDICT (preset dictionary) 273 bits 6 to 7 FLEVEL (compression level) 274 275 The FCHECK value must be such that CMF and FLG, when viewed as 276 a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG), 277 is a multiple of 31. 278 279 280 281 282 Deutsch & Gailly Informational [Page 5] 283 284 RFC 1950 ZLIB Compressed Data Format Specification May 1996 285 286 287 FDICT (Preset dictionary) 288 If FDICT is set, a DICT dictionary identifier is present 289 immediately after the FLG byte. The dictionary is a sequence of 290 bytes which are initially fed to the compressor without 291 producing any compressed output. DICT is the Adler-32 checksum 292 of this sequence of bytes (see the definition of ADLER32 293 below). The decompressor can use this identifier to determine 294 which dictionary has been used by the compressor. 295 296 FLEVEL (Compression level) 297 These flags are available for use by specific compression 298 methods. The "deflate" method (CM = 8) sets these flags as 299 follows: 300 301 0 - compressor used fastest algorithm 302 1 - compressor used fast algorithm 303 2 - compressor used default algorithm 304 3 - compressor used maximum compression, slowest algorithm 305 306 The information in FLEVEL is not needed for decompression; it 307 is there to indicate if recompression might be worthwhile. 308 309 compressed data 310 For compression method 8, the compressed data is stored in the 311 deflate compressed data format as described in the document 312 "DEFLATE Compressed Data Format Specification" by L. Peter 313 Deutsch. (See reference [3] in Chapter 3, below) 314 315 Other compressed data formats are not specified in this version 316 of the zlib specification. 317 318 ADLER32 (Adler-32 checksum) 319 This contains a checksum value of the uncompressed data 320 (excluding any dictionary data) computed according to Adler-32 321 algorithm. This algorithm is a 32-bit extension and improvement 322 of the Fletcher algorithm, used in the ITU-T X.224 / ISO 8073 323 standard. See references [4] and [5] in Chapter 3, below) 324 325 Adler-32 is composed of two sums accumulated per byte: s1 is 326 the sum of all bytes, s2 is the sum of all s1 values. Both sums 327 are done modulo 65521. s1 is initialized to 1, s2 to zero. The 328 Adler-32 checksum is stored as s2*65536 + s1 in most- 329 significant-byte first (network) order. 330 331 332 333 334 335 336 337 338 Deutsch & Gailly Informational [Page 6] 339 340 RFC 1950 ZLIB Compressed Data Format Specification May 1996 341 342 343 2.3. Compliance 344 345 A compliant compressor must produce streams with correct CMF, FLG 346 and ADLER32, but need not support preset dictionaries. When the 347 zlib data format is used as part of another standard data format, 348 the compressor may use only preset dictionaries that are specified 349 by this other data format. If this other format does not use the 350 preset dictionary feature, the compressor must not set the FDICT 351 flag. 352 353 A compliant decompressor must check CMF, FLG, and ADLER32, and 354 provide an error indication if any of these have incorrect values. 355 A compliant decompressor must give an error indication if CM is 356 not one of the values defined in this specification (only the 357 value 8 is permitted in this version), since another value could 358 indicate the presence of new features that would cause subsequent 359 data to be interpreted incorrectly. A compliant decompressor must 360 give an error indication if FDICT is set and DICTID is not the 361 identifier of a known preset dictionary. A decompressor may 362 ignore FLEVEL and still be compliant. When the zlib data format 363 is being used as a part of another standard format, a compliant 364 decompressor must support all the preset dictionaries specified by 365 the other format. When the other format does not use the preset 366 dictionary feature, a compliant decompressor must reject any 367 stream in which the FDICT flag is set. 368 369 3. References 370 371 [1] Deutsch, L.P.,"GZIP Compressed Data Format Specification", 372 available in ftp://ftp.uu.net/pub/archiving/zip/doc/ 373 374 [2] Thomas Boutell, "PNG (Portable Network Graphics) specification", 375 available in ftp://ftp.uu.net/graphics/png/documents/ 376 377 [3] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification", 378 available in ftp://ftp.uu.net/pub/archiving/zip/doc/ 379 380 [4] Fletcher, J. G., "An Arithmetic Checksum for Serial 381 Transmissions," IEEE Transactions on Communications, Vol. COM-30, 382 No. 1, January 1982, pp. 247-252. 383 384 [5] ITU-T Recommendation X.224, Annex D, "Checksum Algorithms," 385 November, 1993, pp. 144, 145. (Available from 386 gopher://info.itu.ch). ITU-T X.244 is also the same as ISO 8073. 387 388 389 390 391 392 393 394 Deutsch & Gailly Informational [Page 7] 395 396 RFC 1950 ZLIB Compressed Data Format Specification May 1996 397 398 399 4. Source code 400 401 Source code for a C language implementation of a "zlib" compliant 402 library is available at ftp://ftp.uu.net/pub/archiving/zip/zlib/. 403 404 5. Security Considerations 405 406 A decoder that fails to check the ADLER32 checksum value may be 407 subject to undetected data corruption. 408 409 6. Acknowledgements 410 411 Trademarks cited in this document are the property of their 412 respective owners. 413 414 Jean-Loup Gailly and Mark Adler designed the zlib format and wrote 415 the related software described in this specification. Glenn 416 Randers-Pehrson converted this document to RFC and HTML format. 417 418 7. Authors' Addresses 419 420 L. Peter Deutsch 421 Aladdin Enterprises 422 203 Santa Margarita Ave. 423 Menlo Park, CA 94025 424 425 Phone: (415) 322-0103 (AM only) 426 FAX: (415) 322-1734 427 EMail: <ghost@aladdin.com> 428 429 430 Jean-Loup Gailly 431 432 EMail: <gzip@prep.ai.mit.edu> 433 434 Questions about the technical content of this specification can be 435 sent by email to 436 437 Jean-Loup Gailly <gzip@prep.ai.mit.edu> and 438 Mark Adler <madler@alumni.caltech.edu> 439 440 Editorial comments on this specification can be sent by email to 441 442 L. Peter Deutsch <ghost@aladdin.com> and 443 Glenn Randers-Pehrson <randeg@alumni.rpi.edu> 444 445 446 447 448 449 450 Deutsch & Gailly Informational [Page 8] 451 452 RFC 1950 ZLIB Compressed Data Format Specification May 1996 453 454 455 8. Appendix: Rationale 456 457 8.1. Preset dictionaries 458 459 A preset dictionary is specially useful to compress short input 460 sequences. The compressor can take advantage of the dictionary 461 context to encode the input in a more compact manner. The 462 decompressor can be initialized with the appropriate context by 463 virtually decompressing a compressed version of the dictionary 464 without producing any output. However for certain compression 465 algorithms such as the deflate algorithm this operation can be 466 achieved without actually performing any decompression. 467 468 The compressor and the decompressor must use exactly the same 469 dictionary. The dictionary may be fixed or may be chosen among a 470 certain number of predefined dictionaries, according to the kind 471 of input data. The decompressor can determine which dictionary has 472 been chosen by the compressor by checking the dictionary 473 identifier. This document does not specify the contents of 474 predefined dictionaries, since the optimal dictionaries are 475 application specific. Standard data formats using this feature of 476 the zlib specification must precisely define the allowed 477 dictionaries. 478 479 8.2. The Adler-32 algorithm 480 481 The Adler-32 algorithm is much faster than the CRC32 algorithm yet 482 still provides an extremely low probability of undetected errors. 483 484 The modulo on unsigned long accumulators can be delayed for 5552 485 bytes, so the modulo operation time is negligible. If the bytes 486 are a, b, c, the second sum is 3a + 2b + c + 3, and so is position 487 and order sensitive, unlike the first sum, which is just a 488 checksum. That 65521 is prime is important to avoid a possible 489 large class of two-byte errors that leave the check unchanged. 490 (The Fletcher checksum uses 255, which is not prime and which also 491 makes the Fletcher check insensitive to single byte changes 0 <-> 492 255.) 493 494 The sum s1 is initialized to 1 instead of zero to make the length 495 of the sequence part of s2, so that the length does not have to be 496 checked separately. (Any sequence of zeroes has a Fletcher 497 checksum of zero.) 498 499 500 501 502 503 504 505 506 Deutsch & Gailly Informational [Page 9] 507 508 RFC 1950 ZLIB Compressed Data Format Specification May 1996 509 510 511 9. Appendix: Sample code 512 513 The following C code computes the Adler-32 checksum of a data buffer. 514 It is written for clarity, not for speed. The sample code is in the 515 ANSI C programming language. Non C users may find it easier to read 516 with these hints: 517 518 & Bitwise AND operator. 519 >> Bitwise right shift operator. When applied to an 520 unsigned quantity, as here, right shift inserts zero bit(s) 521 at the left. 522 << Bitwise left shift operator. Left shift inserts zero 523 bit(s) at the right. 524 ++ "n++" increments the variable n. 525 % modulo operator: a % b is the remainder of a divided by b. 526 527 #define BASE 65521 /* largest prime smaller than 65536 */ 528 529 /* 530 Update a running Adler-32 checksum with the bytes buf[0..len-1] 531 and return the updated checksum. The Adler-32 checksum should be 532 initialized to 1. 533 534 Usage example: 535 536 unsigned long adler = 1L; 537 538 while (read_buffer(buffer, length) != EOF) { 539 adler = update_adler32(adler, buffer, length); 540 } 541 if (adler != original_adler) error(); 542 */ 543 unsigned long update_adler32(unsigned long adler, 544 unsigned char *buf, int len) 545 { 546 unsigned long s1 = adler & 0xffff; 547 unsigned long s2 = (adler >> 16) & 0xffff; 548 int n; 549 550 for (n = 0; n < len; n++) { 551 s1 = (s1 + buf[n]) % BASE; 552 s2 = (s2 + s1) % BASE; 553 } 554 return (s2 << 16) + s1; 555 } 556 557 /* Return the adler32 of the bytes buf[0..len-1] */ 558 559 560 561 562 Deutsch & Gailly Informational [Page 10] 563 564 RFC 1950 ZLIB Compressed Data Format Specification May 1996 565 566 567 unsigned long adler32(unsigned char *buf, int len) 568 { 569 return update_adler32(1L, buf, len); 570 } 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 Deutsch & Gailly Informational [Page 11] 619