MAMA: Basic document structure
- Previous article—MAMA: W3C validator research
- Next article—MAMA: Document encodings
- Table of contents
Index:
- Introduction
- Document statistics
- Byte order marks
- Doctypes
- A document's "Tag Ratio"
- Markup elements
- Markup attributes
Introduction
To begin MAMA's look at markup practices and trends, we first look at overall document sizes, then examine some of the basic structural components of a document (Byte Order Marks and Doctypes). Finally, the full frequency tables for both elements and attributes are presented. Most readers should find the breakdowns in the individual sections sufficient. Those wishing to dig deeper into this research are encouraged to study the complete, unvarnished element and attribute frequency tables, which allow quicker cross-comparison between markup topics.
Document statistics
Document size
This first metric is the integer character length of the original main document. No document dependencies are counted in this number. The average document size of MAMA's analyzed URLs was 16,406 characters. In all, ~55 URLs hit MAMA's hard size limit of 5 megabytes.
| Size range | Frequency | Size range | Frequency | Size range | Frequency | ||
|---|---|---|---|---|---|---|---|
| =0 | 2,217 | >8000 && <=9000 | 136,348 | >35000 && <=40000 | 76,277 | ||
| >0 && <= 500 | 137,827 | >9000 && <=10000 | 127,766 | >40000 && <=45000 | 59,142 | ||
| >500 && <=1000 | 202,031 | >10000 && <=12000 | 229,676 | >45000 && <=50000 | 44,190 | ||
| >1000 && <=2000 | 255,084 | >12000 && <=14000 | 194,834 | >50000 && <=75000 | 112,481 | ||
| >2000 && <=3000 | 188,206 | >14000 && <=16000 | 162,359 | >75000 && <=100000 | 40,349 | ||
| >3000 && <=4000 | 170,332 | >16000 && <=18000 | 135,076 | >100000 && <=150000 | 27,382 | ||
| >4000 && <=5000 | 159,744 | >18000 && <=20000 | 112,276 | >150000 && <=200000 | 7,972 | ||
| >5000 && <=6000 | 156,531 | >20000 && <=25000 | 213,093 | >200000 && <=250000 | 3,092 | ||
| >6000 && <=7000 | 152,619 | >25000 && <=30000 | 147,698 | >250000 && <=300000 | 1,643 | ||
| >7000 && <=8000 | 144,561 | >30000 && <=35000 | 104,822 | >300000 | 3,552 | 
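A frequency table like the one above can be produced by assigning each document length to its size range. The following is a minimal Python sketch of that binning, not MAMA's actual code; the names `UPPER_BOUNDS`, `size_range`, and `size_histogram` are hypothetical, and only the range boundaries come from the table.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds (in characters) of the size ranges used in the table above.
UPPER_BOUNDS = [
    0, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000,
    12000, 14000, 16000, 18000, 20000, 25000, 30000, 35000, 40000, 45000,
    50000, 75000, 100000, 150000, 200000, 250000, 300000,
]


def size_range(length):
    """Return the label of the size range that a character length falls into."""
    # bisect_left finds the first bound that is >= length, which matches
    # the "> lower && <= upper" convention of the table.
    i = bisect_left(UPPER_BOUNDS, length)
    if i == 0:
        return "=0"
    if i == len(UPPER_BOUNDS):
        return f">{UPPER_BOUNDS[-1]}"
    return f">{UPPER_BOUNDS[i - 1]} && <={UPPER_BOUNDS[i]}"


def size_histogram(documents):
    """Count how many documents fall into each size range."""
    return Counter(size_range(len(doc)) for doc in documents)
```

Using `bisect_left` on the upper bounds keeps the boundary handling in one place: a document of exactly 500 characters lands in the range ending at 500, and one of 501 characters lands in the next range.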
Document Frame/IFrame sizes
This is the aggregate character length of all Frames and IFrames used in a document. In all, 80.78% of pages had a Frame/IFrame length of 0, which is expected: any non-zero value means that Frames or IFrames are part of the document infrastructure. The average length of the combined Frame/IFrame components was 3,060.4 characters, but this average includes all the pages that had no Frames or IFrames. Among pages that actually used them, the average Frame/IFrame length was 15,919.8 characters.
| Size range | Frequency | Size range | Frequency | Size range | Frequency | ||
|---|---|---|---|---|---|---|---|
| =0 | 2,834,569 | >8000 && <=9000 | 27,863 | >35000 && <=40000 | 12,034 | ||
| >0 && <= 500 | 26,035 | >9000 && <=10000 | 25,025 | >40000 && <=45000 | 8,786 | ||
| >500 && <=1000 | 35,043 | >10000 && <=12000 | 43,865 | >45000 && <=50000 | 6,408 | ||
| >1000 && <=2000 | 50,640 | >12000 && <=14000 | 37,049 | >50000 && <=75000 | 16,642 | ||
| >2000 && <=3000 | 41,304 | >14000 && <=16000 | 29,324 | >75000 && <=100000 | 6,411 | ||
| >3000 && <=4000 | 38,274 | >16000 && <=18000 | 24,789 | >100000 && <=150000 | 4,929 | ||
| >4000 && <=5000 | 35,519 | >18000 && <=20000 | 20,177 | >150000 && <=200000 | 3,313 | ||
| >5000 && <=6000 | 32,526 | >20000 && <=25000 | 41,618 | >200000 && <=250000 | 880 | ||
| >6000 && <=7000 | 31,593 | >25000 && <=30000 | 27,106 | >250000 && <=300000 | 376 | ||
| >7000 && <=8000 | 29,351 | >30000 && <=35000 | 17,032 | >300000 | 699 | 
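The two averages quoted for Frame/IFrame sizes are tied together by the share of pages with a length of 0: the overall average is the average over pages that use frames, scaled by the fraction of pages that use them. The sketch below checks this with the figures from the text; the function name `overall_mean` is hypothetical.

```python
def overall_mean(mean_when_present, share_zero):
    """Mean over all pages, given the mean over pages with a non-zero value
    and the share of pages whose value is 0."""
    return mean_when_present * (1.0 - share_zero)


# Figures from the text: 15,919.8 characters where frames are used,
# and 80.78% of pages with a Frame/IFrame length of 0.
estimate = overall_mean(15919.8, 0.8078)
print(f"{estimate:.1f}")
```

The result lands within one character of the reported 3,060.4; the small gap is what one would expect from the 80.78% figure being rounded.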
Document "extras" sizes
This value is the aggregate length of all the "extra" dependencies of a document (not counting embedded objects). It consists of all Frame and IFrame content (the Frame/IFrame size from the previous table), all external script content, and all CSS from external and imported stylesheets. Values of 0 are still expected to be well represented, but with several factors now in play, a page is much less likely to have no "extras" at all. The overall average length of all "extras" is 20,295.7 characters; it rises to 28,038.3 characters when only pages with at least one "extra" are counted.
| Size range | Frequency | Size range | Frequency | Size range | Frequency | ||
|---|---|---|---|---|---|---|---|
| =0 | 969,042 | >9000 && <=10000 | 53,271 | >40000 && <=45000 | 69,438 | ||
| >0 && <= 500 | 84,747 | >10000 && <=12000 | 92,431 | >45000 && <=50000 | 53,694 | ||
| >500 && <=1000 | 117,985 | >12000 && <=14000 | 76,680 | >50000 && <=60000 | 81,219 | ||
| >1000 && <=2000 | 178,577 | >14000 && <=16000 | 89,519 | >60000 && <=70000 | 68,595 | ||
| >2000 && <=3000 | 154,796 | >16000 && <=18000 | 73,095 | >70000 && <=80000 | 43,553 | ||
| >3000 && <=4000 | 120,169 | >18000 && <=20000 | 57,694 | >80000 && <=90000 | 33,830 | ||
| >4000 && <=5000 | 97,118 | >20000 && <=22500 | 101,774 | >90000 && <=100000 | 24,456 | ||
| >5000 && <=6000 | 88,678 | >22500 && <=25000 | 88,265 | >100000 && <=150000 | 68,781 | ||
| >6000 && <=7000 | 89,053 | >25000 && <=30000 | 137,810 | >150000 && <=200000 | 26,312 | ||
| >7000 && <=8000 | 66,891 | >30000 && <=35000 | 116,366 | >200000 && <=250000 | 13,022 | ||
| >8000 && <=9000 | 63,567 | >35000 && <=40000 | 89,906 | >250000 | 18,866 | 
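The two "extras" averages also imply what share of pages had no "extras" at all, which can be compared with the "=0" row of the table. This is a rough consistency check rather than part of MAMA; the function name is hypothetical, and the total URL count of 3,509,180 is an assumption taken from summing the three mode counts in one row of the rendering-mode summary table in the Doctypes section.

```python
def implied_zero_share(overall_mean, mean_when_present):
    """Share of pages with a value of 0, implied by the two averages."""
    return 1.0 - overall_mean / mean_when_present


# Averages from the text for the "extras" size.
from_averages = implied_zero_share(20295.7, 28038.3)

# "=0" frequency from the table, divided by the assumed total URL count.
from_table = 969_042 / 3_509_180

print(f"{from_averages:.4f} vs {from_table:.4f}")
```

Both routes give a share of roughly 27.6%, so the averages and the frequency table agree with each other.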
Byte Order Marks
A co-worker asked for MAMA to detect the presence of Byte Order Marks (BOMs), which are used to signal the use of some encoding flavors. The intent was to find real-world examples of pages that used these BOMs so that they could be tested in Opera. Alas, MAMA only detected 3 of the 8 types of BOMs it looked for in the URLs analyzed. To detect the following encodings, MAMA matched a Perl regular expression against the first 5 characters of each URL's document.
| BOM type | Perl regexp | 
|---|---|
| utf-32 (little-endian) | /^(\xff\xfe\x00\x00)/ | 
| utf-32 (big-endian) | /^(\x00\x00\xfe\xff)/ | 
| utf-16 (little-endian) | /^(\xff\xfe)/ | 
| utf-16 (big-endian) | /^(\xfe\xff)/ | 
| utf-8 | /^(\xef\xbb\xbf)/ | 
| utf-7 | /^(\x2b\x2f\x76\x38\x2d)/ | 
| scsu | /^(\x0e\xfe\xff)/ | 
| bocu-1 | /^(\xfb\xee\x28)/ | 
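The same signatures can be tested without regular expressions by comparing byte prefixes. The Python sketch below mirrors the table above; the names `BOM_SIGNATURES` and `detect_bom` are hypothetical, and the order in which MAMA applied its expressions is not stated in this article, so the ordering here is an assumption.

```python
# (name, byte signature) pairs taken from the table above.
# The UTF-32 little-endian signature starts with the UTF-16 little-endian
# one, so the longer signature has to be tested first.
BOM_SIGNATURES = [
    ("utf-32 (little-endian)", b"\xff\xfe\x00\x00"),
    ("utf-32 (big-endian)", b"\x00\x00\xfe\xff"),
    ("utf-16 (little-endian)", b"\xff\xfe"),
    ("utf-16 (big-endian)", b"\xfe\xff"),
    ("utf-8", b"\xef\xbb\xbf"),
    ("utf-7", b"\x2b\x2f\x76\x38\x2d"),
    ("scsu", b"\x0e\xfe\xff"),
    ("bocu-1", b"\xfb\xee\x28"),
]


def detect_bom(data):
    """Return the name of the BOM at the start of the raw bytes, or None."""
    head = data[:5]  # the longest signature in the table is 5 bytes
    for name, signature in BOM_SIGNATURES:
        if head.startswith(signature):
            return name
    return None
```

For example, `detect_bom(b"\xef\xbb\xbf<html>")` returns `"utf-8"`, while a document that starts directly with markup returns `None`.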
BOMs detected
The 3 BOMs were found in a total of 17,649 URLs (0.50% of all URLs analyzed). The BOM found most often is utf-8.
| BOM | Frequency | 
|---|---|
| utf-8 | 17,006 | 
| utf-16 (little-endian) | 647 | 
| utf-16 (big-endian) | 26 | 
Doctypes
The Doctype statement is used in two ways. Passively, it declares the markup standard to which the document is supposed to adhere, and a markup validator can use this information to check the document's conformance to that standard. We examine the validation aspect of the Doctype and its implications in a separate document. Actively, Doctypes have in recent years become the arbiter of the rendering mode that a browser will use. In this section we will look at some of the things we can easily glean from the Doctype, as well as at this more active role.
Anatomy of a Doctype statement
Now, we can take a look at the components of a Doctype to see what sort of information it can offer us:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
| Component | Description | 
|---|---|
| "<!DOCTYPE" | The beginning of the Doctype | 
| "html" | This string specifies the name of the root element for the markup type | 
| "PUBLIC" | This indicates the availability of the DTD resource. It can be a publicly accessible object ("PUBLIC") or a system resource ("SYSTEM") such as a local file or URL. HTML/XHTML DTDs are specified by "PUBLIC" identifiers. | 
| "-//W3C//DTD XHTML 1.0 Transitional//EN" | This is the Formal Public Identifier (FPI). This compact, quoted string gives a lot of information about the DTD, such as its Registration, Organization, Type, Label, and the Encoding language. For HTML/XHTML DTDs, the most interesting part of this is the label portion (the "XHTML 1.0 Transitional" part). If the processing entity does not already have local access to this DTD, it can get it from the System Identifier (next portion). | 
| "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" | The System Identifier (SI); the URL location of the DTD specified in the FPI | 
| ">" | The ending of the Doctype | 
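The components in the table above can be pulled apart mechanically. The following Python sketch uses one regular expression for the statement and a simple split for the FPI; it is an illustration only, not how MAMA stored or parsed Doctypes, and `DOCTYPE_RE`, `parse_doctype`, and `split_fpi` are hypothetical names. It only handles double-quoted identifiers.

```python
import re

# root element, optional PUBLIC/SYSTEM keyword, up to two quoted strings
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>]+)'
    r'(?:\s+(?P<keyword>PUBLIC|SYSTEM)'
    r'(?:\s+"(?P<first>[^"]*)")?'
    r'(?:\s+"(?P<second>[^"]*)")?)?'
    r'\s*>',
    re.IGNORECASE,
)


def parse_doctype(text):
    """Split a Doctype statement into root, keyword, FPI and SI (or None)."""
    m = DOCTYPE_RE.search(text)
    if m is None:
        return None
    keyword = m.group("keyword")
    first, second = m.group("first"), m.group("second")
    if keyword and keyword.upper() == "SYSTEM":
        # With SYSTEM there is no FPI; the quoted string is the SI.
        fpi, si = None, first
    else:
        fpi, si = first, second
    return {"root": m.group("root"), "keyword": keyword, "fpi": fpi, "si": si}


def split_fpi(fpi):
    """Split an FPI such as "-//W3C//DTD XHTML 1.0 Transitional//EN"."""
    parts = fpi.split("//")
    if len(parts) != 4:
        return None  # not in the four-part form shown above
    registration, organization, type_and_label, language = parts
    dtd_type, _, label = type_and_label.partition(" ")
    return {
        "registration": registration,
        "organization": organization,
        "type": dtd_type,
        "label": label,
        "language": language,
    }
```

Applied to the example statement above, `parse_doctype` yields the root element `html`, the keyword `PUBLIC`, the FPI, and the SI, and `split_fpi` then isolates the label `XHTML 1.0 Transitional`.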
Doctypes found by MAMA
The entire Doctype statement was stored in MAMA. In all, 1,788,294 of the URLs analyzed (50.96%) had a Doctype present. For the purposes of the full frequency table for Doctype, the values were normalized to lower case.
Doctype versions
Different HTML standards can be detected via unique strings in the Doctype statement. The leading space in most of the values below helps differentiate between HTML and XHTML versions. HTML 4 variants are twice as popular as any of the other versions.
| Doctype-version substring | Frequency | Doctype-version substring | Frequency | |
|---|---|---|---|---|
| " html 4" (HTML 4 variants) | 1,122,392 | "softquad" or "//sq//" | 9,950 | |
| " xhtml 1.0" | 548,307 | " html 2" | 7,640 | |
| " html 3.2" | 57,354 | " html 3.0" | 1,711 | |
| "ietf" | 34,965 | "WAP" | 131 | |
| " xhtml 1.1" | 20,958 | " xhtml 2" | 18 | 
Doctype flavors
Beginning with HTML 4.0, HTML was stratified into 3 separate variants: Strict, Transitional, and Frameset. The Label portion of the Doctype FPI reflects these variants, and we can easily discern the "flavors" of HTML by searching for the substrings. The Transitional configuration is more than 10 times as likely as the other types.
| Doctype-flavor substring | Frequency | 
|---|---|
| "Transitional" | 1,459,912 | 
| "Strict" | 130,191 | 
| "Frameset" | 64,516 | 
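The version and flavor counts above both come from plain substring searches on the Doctype statement. A minimal Python sketch of that idea follows; the lists and the function name are hypothetical, only some of the version substrings from the earlier table are included, and matching is done on the lower-cased statement as an assumption based on the normalization mentioned earlier.

```python
# Version substrings from the Doctype versions table; the leading space
# keeps " html 2" from matching inside "xhtml 2".
VERSION_SUBSTRINGS = [
    " html 4", " xhtml 1.0", " html 3.2", " xhtml 1.1",
    " html 2", " html 3.0", " xhtml 2",
]
# Flavor substrings from the table above, lower-cased.
FLAVOR_SUBSTRINGS = ["transitional", "strict", "frameset"]


def classify_doctype(doctype):
    """Return (versions, flavors) found in a Doctype by substring search."""
    lowered = doctype.lower()
    versions = [s.strip() for s in VERSION_SUBSTRINGS if s in lowered]
    flavors = [s for s in FLAVOR_SUBSTRINGS if s in lowered]
    return versions, flavors
```

Because the search covers the whole statement, a flavor can be picked up from the System Identifier as well as from the FPI label: an HTML 4.01 Doctype whose FPI has no flavor word but whose SI ends in `strict.dtd` is still counted as "strict".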
System Identifiers (SIs)
A look at the full Doctype frequency table shows a good balance between Doctypes that specify an SI and those that do not. A simplistic way to find SIs that use an absolute URI is to look for the string "http://" in the Doctype statements; doing so finds 880,702 matching URLs. However, URIs can be relative too, so we should expand our search. If instead of "http://" we look for ".dtd", this might be a good usage indicator for ALL Doctypes with SIs. Doing so finds 897,601 URLs, or 50.19% of all MAMA cases where a Doctype is present.
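The difference between the two searches shows up with a relative System Identifier. The short Python sketch below applies both heuristics; the function names are hypothetical, and the relative Doctype in the usage note is an illustrative string rather than an entry taken from MAMA's data.

```python
def has_absolute_si(doctype):
    """Heuristic from the text: an absolute SI contains "http://"."""
    return "http://" in doctype.lower()


def has_any_si(doctype):
    """Broader heuristic from the text: any SI is assumed to contain ".dtd"."""
    return ".dtd" in doctype.lower()
```

A Doctype such as `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">` is missed by the first check but caught by the second, which is the kind of case behind the gap between the 880,702 and 897,601 counts.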
Doctype switching: Standards vs. Quirks mode
Saarsoo produced a comparison of which pages were rendered in Standards vs. Quirks mode based on Henri Sivonen's excellent page on doctype switching. Using this page as a guide, we can construct a similar table, but with MAMA numbers included. To reduce the complexity of Sivonen's original table, we'll only show the columns of the most popular current browser sets in use: Mozilla/Safari, Opera 9, IE7/Opera7.1 and IE6/Opera7. Note that these groupings pair up browsers that have very similar quirks, almost standards, and standards modes. Standards, Almost Standards, and Quirks modes are listed as S, A, and Q respectively.
With the complexity of Sivonen's chart, one would expect the numbers for the different browsers to vary by a wider margin. It appears the main differences between most browsers are in Doctypes with lower representation in the wild. Generally, about 85% of all Web pages are rendered using Quirks mode, while the remaining ~15% of URLs are rendered using either Standards or Almost Standards modes. If we only look at URLs that have a Doctype, Standards and Almost Standards modes are used in ~30% of those cases.
| Doctype | MAMA frequency | Moz/ Safari | Opera9 | IE7/ Opera7.1 | IE6/ Opera7 | 
|---|---|---|---|---|---|
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"> | 6,745 | S | S | A | A | 
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"> | 2,488 | S | S | A | A | 
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/html4/strict.dtd"> | 42 | S | S | A | A | 
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> | 14,471 | S | S | A | A | 
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> | 90,296 | A | A | A | A | 
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd"> | 2,732 | A | A | A | A | 
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> | 2,185 | Q | Q | A | A | 
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> (w/o XML prolog): | 10,563 | S | S | A | A | 
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"> (w/o XML prolog): | 26 | S | S | A | A | 
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> (w/o XML prolog): | 58,086 | S | S | A | A | 
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> (w/o XML prolog): | 295,687 | A | A | A | A | 
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> (w/ XML prolog): | 3,475 | S | S | A | Q | 
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"> (w/ XML prolog): | 14 | S | S | A | Q | 
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> (w/ XML prolog): | 5,842 | S | S | A | Q | 
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> (w/ XML prolog): | 54,765 | A | A | A | Q | 
| <!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HTML//EN"> | 10 | S | Q | Q | Q | 
| <!DOCTYPE html> | 199 | S | S | A | A | 
| Browser | Standards Mode [%] | Almost Standards Mode [%] | Quirks Mode [%] | 
|---|---|---|---|
| Mozilla/Safari | 101,961 [2.91%] | 443,480 [12.64%] | 2,963,739 [84.46%] | 
| Opera 9 | 101,951 [2.91%] | 443,480 [12.64%] | 2,963,749 [84.46%] | 
| IE7/Opera 7.1 | 0 [0.0%] | 547,616 [15.61%] | 2,961,564 [84.39%] | 
| IE6/Opera 7.0 | 0 [0.0%] | 483,520 [13.78%] | 3,025,660 [86.22%] | 
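The percentages in the summary table, and the ~30% figure for URLs that have a Doctype, follow directly from the counts. The Python sketch below reproduces them for the Mozilla/Safari row; the variable names are hypothetical, and the total is simply the sum of the three mode counts in that row.

```python
# Mozilla/Safari row of the summary table above.
MODE_COUNTS = {
    "standards": 101_961,
    "almost standards": 443_480,
    "quirks": 2_963_739,
}
# Number of URLs with a Doctype, from "Doctypes found by MAMA".
DOCTYPE_URLS = 1_788_294

total = sum(MODE_COUNTS.values())
shares = {mode: round(100 * n / total, 2) for mode, n in MODE_COUNTS.items()}

# Pages without a Doctype are counted under Quirks mode here, so the
# non-Quirks pages are all pages that have a Doctype.
non_quirks = MODE_COUNTS["standards"] + MODE_COUNTS["almost standards"]
doctype_share = 100 * non_quirks / DOCTYPE_URLS

print(total, shares, round(doctype_share, 1))
```

The shares come out as 2.91%, 12.64%, and 84.46% of 3,509,180 URLs, and Standards plus Almost Standards account for about 30.5% of the URLs that have a Doctype.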
A document's "Tag Ratio"
During its analysis, MAMA kept track of the size of all the markup tags used, as well as the overall page size. The ratio of these two values provides some minor insight into authoring practices, and into how much plain text content authors have on their pages. Saarsoo did something similar in his study, but he called his ratio a "text percentage"; in his study, the plain-text portion of the page averaged about 20% of the overall size. In MAMA's case, the "Tag Ratio" is the total content within all tags divided by the overall page size. A low Tag Ratio reflects a relatively small amount of markup compared to the text content, while a high Tag Ratio reflects a large amount of markup compared to the text content. A Tag Ratio of 0 would be all plain text, while a Tag Ratio of 100.0 would be entirely tags, without even linefeeds or spaces between them. The average document had a Tag Ratio of 61.64%, so almost 2/3 of each document was tags. A full frequency table of Tag Ratios is also available.
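As a rough illustration of the definition, the following Python sketch computes a Tag Ratio by summing the lengths of everything between "<" and ">" and dividing by the page length. The regular expression is a naive stand-in and an assumption; MAMA's own tag parsing is not described here and will differ on comments, scripts, and malformed markup. The names `TAG_RE` and `tag_ratio` are hypothetical.

```python
import re

# Naive notion of a tag: "<", then anything up to the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratio(document):
    """Percentage of a document's characters that sit inside markup tags."""
    if not document:
        return 0.0  # avoid dividing by zero for empty documents
    tag_chars = sum(len(m.group(0)) for m in TAG_RE.finditer(document))
    return 100.0 * tag_chars / len(document)
```

With this definition, `"<p>hi</p>"` has 7 of its 9 characters inside tags, a page of plain text scores 0, and `"<br><br>"` scores 100.0.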
Markup elements
We will discuss many of these elements in more detail in their appropriate sections; here we will just take a quick look at the top 20, and say a little something about the overall rankings before moving on. There are no real surprises here in the rankings of the top elements. Comparing the chart below to Saarsoo's study, there is a little movement in the rankings but not until we get out of the top 10—and the top 50 elements from both share 49 elements in common! Hickson's study has some differences in ranking order even in its top 10. The discrepancies are very minor however, involving values that have very similar totals and adjacent positions in MAMA's list.
The most popular elements
- Basic document elements: HTML, HEAD and BODY
- Hyperlinks and images: A and IMG
- Tables (TABLE, TD and TR)
- A smattering of important elements used in the HEAD: TITLE, META, SCRIPT, LINK and STYLE
- Simple structural and formatting elements: BR, P, DIV, FONT, B, SPAN and STRONG
The full, unvarnished element list also reveals a significant number of irrelevant entries as you go deeper down the roster: it seems there is a lot of custom markup, typos, and script fragments out there (the script fragments may be artifacts of MAMA's parsing strategy).
| ELEMENT | Frequency | ELEMENT | Frequency | ELEMENT | Frequency | ||
|---|---|---|---|---|---|---|---|
| HEAD | 3,464,519 | TABLE | 2,894,184 | FONT | 2,061,417 | ||
| TITLE | 3,459,207 | TD | 2,891,972 | LINK | 2,018,510 | ||
| HTML | 3,452,975 | TR | 2,891,205 | B | 1,805,495 | ||
| BODY | 3,452,907 | BR | 2,859,662 | SPAN | 1,527,964 | ||
| A | 3,307,397 | P | 2,702,935 | STYLE | 1,313,454 | ||
| META | 3,276,347 | SCRIPT | 2,528,823 | STRONG | 1,102,056 | ||
| IMG | 3,219,487 | DIV | 2,499,779 | CENTER | 1,076,535 | 
Markup attributes
As with the discussions about markup elements, we will wait to talk more about attributes in the sections appropriate for each. Right now, we will again look at a top 20 list. The attributes found in the top 20 all come from only 7 different elements:
- A
- META
- IMG
- TABLE
- TD
- LINK
- SCRIPT
These results and their ordering compare favorably to the brief attribute data listed in Hickson's study.
| ELEMENT[Attribute] | Frequency | ELEMENT[Attribute] | Frequency | ELEMENT[Attribute] | Frequency | ||
|---|---|---|---|---|---|---|---|
| A[Href] | 3,304,834 | META[Name] | 2,710,638 | TD[Valign] | 2,189,287 | ||
| META[Content] | 3,273,610 | TABLE[Border] | 2,691,899 | LINK[Href] | 2,016,007 | ||
| IMG[Src] | 3,219,304 | TABLE[Width] | 2,637,117 | LINK[Rel] | 2,001,105 | ||
| IMG[Width] | 2,957,808 | TABLE[Cellpadding] | 2,585,020 | A[Target] | 1,978,018 | ||
| IMG[Height] | 2,945,989 | TABLE[Cellspacing] | 2,578,416 | TD[Align] | 1,977,367 | ||
| META[Http-equiv] | 2,826,859 | IMG[Alt] | 2,520,939 | SCRIPT[Language] | 1,965,725 | ||
| IMG[Border] | 2,810,265 | TD[Width] | 2,324,752 | LINK[Type] | 1,777,982 | 
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.