[ The following analysis was done by one of my colleagues at work. We are building an email search and storage subsystem, and the sizes and uniqueness of attachments were interesting. The document looks at the size of the bodies and attachments (not the headers) of over 18 million corporate emails. Here is his analysis. Thanks, Peter. ]
The following is an analysis of a large collection of email messages. Some notes:
Item | Value | Percentage |
---|---|---|
Total Included Assets | 18,568,722 | |
Total Bad Assets | 36,619 | 0.2% |
Total Asset Parts | 40,885,616 | 100.0% |
Total Clear Size | 1,424,967,245,112 | 100.0% |
Total Deflated Size | 793,987,580,620 | 100.0% |
Unique Parts | 21,213,630 | 51.9% |
Unique Clear Size | 700,819,336,446 | 49.2% |
Unique Deflated Size | 361,929,754,290 | 45.6% |
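
The percentages above can be reproduced from the raw totals. The following is a small Python sketch (the language and the variable names are assumptions for illustration, not the tooling used for the analysis). It recomputes the three "Unique" percentages and also derives two figures the table does not state directly: the deflated size as a share of the clear size, and the unique deflated size as a share of the total clear size.

```python
# Totals copied from the summary table.
total_parts = 40_885_616
unique_parts = 21_213_630
total_clear = 1_424_967_245_112
total_deflated = 793_987_580_620
unique_clear = 700_819_336_446
unique_deflated = 361_929_754_290


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Figures that appear in the table.
unique_parts_pct = pct(unique_parts, total_parts)
unique_clear_pct = pct(unique_clear, total_clear)
unique_deflated_pct = pct(unique_deflated, total_deflated)

# Derived figures (not stated in the table).
deflate_ratio_pct = pct(total_deflated, total_clear)
dedup_plus_deflate_pct = pct(unique_deflated, total_clear)

print(unique_parts_pct, unique_clear_pct, unique_deflated_pct)
print(deflate_ratio_pct, dedup_plus_deflate_pct)
```

On these totals, deflate alone brings the data to roughly 56% of its clear size, and storing only the unique parts in deflated form brings it to roughly a quarter.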
The next table is a bit strange. Don't read it across the rows: each column is an independent histogram (unfortunately). For example, in the <16b row, 14,356 parts were under 16 bytes in clear form, 108,351 parts deflated to under 16 bytes, 7,409 unique parts were under 16 bytes in clear form, and 4,249 unique parts deflated to under 16 bytes.
Size Bucket | Total Clear Parts | Total Deflated Parts | Unique Clear Parts | Unique Deflated Parts |
---|---|---|---|---|
<1b | 10,645,434 | 10,645,434 | 2 | 2 |
<2b | 51,275 | 0 | 12 | 0 |
<4b | 39,115 | 0 | 160 | 0 |
<8b | 10,065 | 0 | 1,422 | 0 |
<16b | 14,356 | 108,351 | 7,409 | 4,249 |
<32b | 41,912 | 40,519 | 23,212 | 21,505 |
<64b | 160,667 | 186,782 | 60,399 | 65,489 |
<128b | 971,882 | 1,126,347 | 242,473 | 350,998 |
<256b | 1,496,287 | 2,580,006 | 1,127,850 | 1,713,909 |
<512b | 2,878,892 | 5,742,009 | 1,658,630 | 3,828,936 |
<1k | 4,519,315 | 6,743,444 | 3,134,157 | 5,328,728 |
<2k | 4,795,537 | 5,979,510 | 3,775,815 | 5,005,490 |
<4k | 4,719,521 | 3,059,155 | 3,851,320 | 2,510,388 |
<8k | 3,506,609 | 1,360,555 | 2,931,410 | 941,587 |
<16k | 2,173,706 | 767,800 | 1,752,428 | 394,225 |
<32k | 1,554,018 | 623,130 | 1,040,344 | 249,027 |
<64k | 1,219,832 | 578,461 | 666,137 | 233,756 |
<128k | 757,401 | 505,886 | 348,406 | 207,676 |
<256k | 535,854 | 281,989 | 242,087 | 120,564 |
<512k | 313,384 | 250,260 | 136,979 | 106,641 |
<1m | 229,601 | 162,524 | 100,396 | 68,163 |
<2m | 143,667 | 85,135 | 62,202 | 35,827 |
<4m | 67,973 | 39,598 | 31,307 | 17,065 |
<8m | 26,946 | 11,237 | 12,337 | 5,483 |
<16m | 6,390 | 6,516 | 3,518 | 3,418 |
<32m | 5,260 | 786 | 2,799 | 388 |
<64m | 575 | 136 | 321 | 88 |
<128m | 110 | 39 | 77 | 22 |
<256m | 24 | 5 | 14 | 4 |
<512m | 6 | 2 | 5 | 2 |
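
For readers who want to build a similar histogram, here is a minimal Python sketch of one way it could be done. It is not the code behind the numbers above. The function names, the use of SHA-256 to detect duplicates, and the use of `zlib.compress` for deflation are all assumptions. The sketch also assumes that an empty part counts as zero bytes when deflated, which is what the identical clear and deflated counts in the <1b row suggest.

```python
import hashlib
import zlib
from collections import Counter


def bucket_label(size):
    """Label for the smallest power-of-two limit strictly greater than size."""
    limit = 1 << size.bit_length()  # 0 -> 1, 1 -> 2, 15 -> 16, 16 -> 32
    if limit < 1024:
        return f"<{limit}b"
    if limit < 1024 ** 2:
        return f"<{limit // 1024}k"
    return f"<{limit // 1024 ** 2}m"


def size_histograms(parts):
    """Build four independent histograms from an iterable of bytes objects."""
    names = ("total_clear", "total_deflated", "unique_clear", "unique_deflated")
    hist = {name: Counter() for name in names}
    seen = set()  # digests of parts already counted as unique
    for data in parts:
        # Assumption: empty parts are stored as zero bytes, not compressed.
        deflated_len = len(zlib.compress(data)) if data else 0
        clear = bucket_label(len(data))
        deflated = bucket_label(deflated_len)
        hist["total_clear"][clear] += 1
        hist["total_deflated"][deflated] += 1
        digest = hashlib.sha256(data).digest()
        if digest not in seen:
            seen.add(digest)
            hist["unique_clear"][clear] += 1
            hist["unique_deflated"][deflated] += 1
    return hist


if __name__ == "__main__":
    sample = [b"", b"hello", b"hello", b"x" * 5000]
    for name, counts in size_histograms(sample).items():
        print(name, dict(counts))
```

Because each part is bucketed once by its clear size and once by its deflated size, the same part usually lands in different rows of different columns. That is why the rows of the table cannot be read across.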
The following table estimates how much space would be wasted as slack (the unused tail of each file's last block) for various filesystem block sizes.
Block Size | Total Unused Slack Space (bytes) | Percent Overhead |
---|---|---|
512b | 5,326,831,438 | 1.5% |
1k | 11,470,299,470 | 3.1% |
2k | 26,209,376,590 | 6.8% |
4k | 62,533,699,918 | 14.7% |
8k | 142,984,419,662 | 28.3% |
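
The slack for a single file is the gap between its size and the next multiple of the block size. A minimal Python sketch of that calculation follows. The function names are hypothetical, and the sketch treats a zero-byte file as using no blocks, which is an assumption that real filesystems may not share. The second half checks an inference from the numbers rather than anything the analysis states: the Percent Overhead column appears to be the slack divided by the unique deflated size plus the slack.

```python
def slack_bytes(size, block_size):
    """Unused bytes in the last block of a file that is `size` bytes long."""
    return (-size) % block_size


def slack_overhead(sizes, block_size):
    """Return (total slack, slack as a percentage of allocated space)."""
    slack = sum(slack_bytes(s, block_size) for s in sizes)
    allocated = sum(sizes) + slack
    percent = 100.0 * slack / allocated if allocated else 0.0
    return slack, percent


# Check the inferred formula against the figures in the tables.
unique_deflated = 361_929_754_290  # "Unique Deflated Size" from the summary
table_slack = {
    512: 5_326_831_438,
    1024: 11_470_299_470,
    2048: 26_209_376_590,
    4096: 62_533_699_918,
    8192: 142_984_419_662,
}
computed = {
    block: round(100.0 * slack / (unique_deflated + slack), 1)
    for block, slack in table_slack.items()
}
print(computed)
```

Under that reading, the percentages describe how much of the allocated disk space would be slack if only the unique, deflated parts were stored as individual files.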