
Email Size Analysis

[ The following analysis was done by one of my colleagues at work. We are building an email search and storage subsystem, and the sizes and uniqueness of attachments were of interest. The document looks at the size of the body and attachments (not the headers) of over 18 million corporate emails. Here is his analysis. Thanks, Peter. ]

The following is an analysis of a large collection of email messages. Some notes:

Item                            Value                Percentage
Total Included Assets           18,568,722
Total Bad Assets                36,619               0.2%
Total Asset Parts               40,885,616           100.0%
Total Clear Size (bytes)        1,424,967,245,112    100.0%
Total Deflated Size (bytes)     793,987,580,620      100.0%
Unique Parts                    21,213,630           51.9%
Unique Clear Size (bytes)       700,819,336,446      49.2%
Unique Deflated Size (bytes)    361,929,754,290      45.6%
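
As a quick check on the summary figures, the percentages can be recomputed from the raw values. This is only a sketch: it assumes each Unique row is expressed as a fraction of the matching Total row and that the sizes are in bytes. The variable names and the two derived ratios at the end are illustrative and not part of the original analysis.

```python
# Values copied from the summary table.
total_assets = 18_568_722
bad_assets = 36_619
total_parts = 40_885_616
total_clear = 1_424_967_245_112
total_deflated = 793_987_580_620
unique_parts = 21_213_630
unique_clear = 700_819_336_446
unique_deflated = 361_929_754_290


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Percentages as listed in the table.
print("bad assets            ", pct(bad_assets, total_assets))
print("unique parts          ", pct(unique_parts, total_parts))
print("unique clear size     ", pct(unique_clear, total_clear))
print("unique deflated size  ", pct(unique_deflated, total_deflated))

# Derived ratios (not in the table): effect of deflate alone,
# and of de-duplication plus deflate together.
print("deflated / clear      ", pct(total_deflated, total_clear))
print("unique deflated / clear", pct(unique_deflated, total_clear))
```

The last ratio suggests that storing each unique part once, deflated, would take roughly a quarter of the space of the raw clear data.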

The next table is a bit strange. Don't read across the rows as if the values were related: the columns are independent (unfortunately), and each one is its own count of files by size. So in the <16b row, 14,356 clear files were under 16 bytes, 108,351 files deflated to under 16 bytes, 7,409 unique clear files were that size, and 4,249 unique files deflated to that size.

Byte Size   Total Clear Bytes   Total Deflated Bytes   Unique Clear Bytes   Unique Deflated Bytes
<1b         10,645,434          10,645,434             2                    2
<2b         51,275              0                      12                   0
<4b         39,115              0                      160                  0
<8b         10,065              0                      1,422                0
<16b        14,356              108,351                7,409                4,249
<32b        41,912              40,519                 23,212               21,505
<64b        160,667             186,782                60,399               65,489
<128b       971,882             1,126,347              242,473              350,998
<256b       1,496,287           2,580,006              1,127,850            1,713,909
<512b       2,878,892           5,742,009              1,658,630            3,828,936
<1k         4,519,315           6,743,444              3,134,157            5,328,728
<2k         4,795,537           5,979,510              3,775,815            5,005,490
<4k         4,719,521           3,059,155              3,851,320            2,510,388
<8k         3,506,609           1,360,555              2,931,410            941,587
<16k        2,173,706           767,800                1,752,428            394,225
<32k        1,554,018           623,130                1,040,344            249,027
<64k        1,219,832           578,461                666,137              233,756
<128k       757,401             505,886                348,406              207,676
<256k       535,854             281,989                242,087              120,564
<512k       313,384             250,260                136,979              106,641
<1m         229,601             162,524                100,396              68,163
<2m         143,667             85,135                 62,202               35,827
<4m         67,973              39,598                 31,307               17,065
<8m         26,946              11,237                 12,337               5,483
<16m        6,390               6,516                  3,518                3,418
<32m        5,260               786                    2,799                388
<64m        575                 136                    321                  88
<128m       110                 39                     77                   22
<256m       24                  5                      14                   4
<512m       6                   2                      5                    2
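
For anyone wanting to produce a similar table from their own mail store, here is a minimal sketch of the bucketing. Several things in it are assumptions rather than facts about the original analysis: that each row counts sizes below its own limit and at or above the previous row's limit (the first column sums to approximately the total part count, which suggests this), that "deflated" means zlib, that empty parts are left uncompressed (the <1b row has identical clear and deflated counts), and that uniqueness is decided by a SHA-256 digest of the content. The function names are hypothetical.

```python
import hashlib
import zlib
from collections import Counter


def bucket_label(size: int) -> str:
    """Label the power-of-two bucket for a size, e.g. 15 -> '<16b', 16 -> '<32b'."""
    limit = 1 << size.bit_length()  # smallest power of two strictly greater than size
    if limit < 1 << 10:
        return f"<{limit}b"
    if limit < 1 << 20:
        return f"<{limit >> 10}k"
    return f"<{limit >> 20}m"


def deflated_size(data: bytes) -> int:
    """Assumption: empty parts stay empty; everything else goes through zlib."""
    return len(zlib.compress(data)) if data else 0


def size_histograms(parts):
    """Build four independent histograms from an iterable of bytes objects."""
    names = ("total_clear", "total_deflated", "unique_clear", "unique_deflated")
    hists = {name: Counter() for name in names}
    seen = set()
    for data in parts:
        clear = bucket_label(len(data))
        packed = bucket_label(deflated_size(data))
        hists["total_clear"][clear] += 1
        hists["total_deflated"][packed] += 1
        digest = hashlib.sha256(data).digest()
        if digest not in seen:  # first time this exact content is seen
            seen.add(digest)
            hists["unique_clear"][clear] += 1
            hists["unique_deflated"][packed] += 1
    return hists


if __name__ == "__main__":
    sample = [b"", b"", b"hi", b"hi", b"x" * 5000]
    for name, hist in size_histograms(sample).items():
        print(name, dict(hist))
```

Because the clear and deflated sizes of the same part usually land in different buckets, the columns of the resulting table cannot be compared row by row, which matches the caveat above.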

The following table estimates the amount of space wasted for a given filesystem block size.

Block Size   Total Unused Slack Space (bytes)   Percent Overhead
512b         5,326,831,438                      1.5%
1k           11,470,299,470                     3.1%
2k           26,209,376,590                     6.8%
4k           62,533,699,918                     14.7%
8k           142,984,419,662                    28.3%
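
The analysis does not spell out how slack was computed, so the following is a sketch under stated assumptions: each file is padded up to a whole number of blocks, and the overhead is slack divided by allocated space (data plus slack). The published numbers appear consistent with this when applied to the unique deflated size of 361,929,754,290 bytes. The function names are hypothetical.

```python
def slack(size: int, block_size: int) -> int:
    """Bytes left unused in the last block when `size` bytes are stored."""
    return -size % block_size


def overhead_percent(sizes, block_size: int) -> float:
    """Slack as a percentage of the space allocated (data plus slack)."""
    sizes = list(sizes)
    data = sum(sizes)
    wasted = sum(slack(s, block_size) for s in sizes)
    allocated = data + wasted
    return 100 * wasted / allocated if allocated else 0.0


# Cross-check against the published figures.
# Assumption: slack was measured over the unique deflated parts.
UNIQUE_DEFLATED = 361_929_754_290
TABLE = {
    512: (5_326_831_438, 1.5),
    1024: (11_470_299_470, 3.1),
    2048: (26_209_376_590, 6.8),
    4096: (62_533_699_918, 14.7),
    8192: (142_984_419_662, 28.3),
}

for block, (wasted, listed) in TABLE.items():
    allocated = UNIQUE_DEFLATED + wasted
    # If the assumption holds, allocated space is a whole number of blocks.
    print(block, allocated % block, round(100 * wasted / allocated, 1), listed)
```

Under these assumptions the remainder printed for every row should be zero, and the recomputed percentage should agree with the listed one.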
