HTML5: HTML5 SYNTAX

The HTML and XHTML Syntax

It is useful to make a distinction between the vocabulary of an HTML document—the elements and attributes, and their meanings—and the syntax in which it is written.

HTML has a defined set of elements and attributes which can be used in a document; each designed for a specific purpose with their own meaning. Consider this set of elements to be analogous to the list of words in a dictionary. This includes elements for headings, paragraphs, lists, tables, links, form controls and many other features. This is the vocabulary of HTML. Similarly, just as natural languages have grammatical rules for how different words can be used, HTML has rules for where and how each element and attribute can be used.

The basic structure of elements in an HTML document is a tree structure. Most elements have at most one parent element, (except for the root element), and may have any number of child elements. This structure needs to be reflected in the syntax used to write the document.

Syntactic Overview

There are two syntaxes that can be used: the traditional HTML syntax, and the XHTML syntax. While these are similar, each is optimised for different needs and authoring habits. The former is more lenient in its design and handling requirements, and has a number of convenient shorthands for authors to use. The latter is based on XML and has much stricter syntactic requirements, designed to discourage the proliferation of syntactic errors.
The HTML syntax is loosely based upon the older, though very widely used syntax from HTML 4.01. Although it is inspired by its SGML origins, in practice, it really only shares minor syntactic similarities. This features a range of shorthand syntaxes, designed to make hand coding more convenient, such as allowing the omission of some optional tags and attribute values. Authors are free to choose whether or not they wish to take advantage of these shorthand features based upon their own personal preferences.
The following example illustrates a basic HTML document, demonstrating some shorthand syntax:

HTML Example:

<!DOCTYPE html>
<html>
 <head>
   <title>An HTML Document</title>
 </head>
 <body class=example>
   <h1>Example</h1>
   <p>This is an example HTML document.
 </body>
</html>

XHTML, however, is based on the much more strict XML syntax. While this too is inspired by SGML, this syntax requires documents to be well-formed, which some people prefer because of its stricter error handling, forcing authors to maintain cleaner markup.

XHTML Example:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
   <title>An HTML Document</title>
 </head>
 <body class="example">
   <h1>Example</h1>
   <p>This is an example HTML document.</p>
 </body>
</html>

Note: The XHTML document does not need to include the DOCTYPE because XHTML documents that are delivered correctly using an XML MIME type and are processed as XML by browsers, are always rendered in no quirks mode. However, the DOCTYPE may optionally be included, and should be included if the document uses the compatible subset of markup that is conforming in both HTML and XHTML, and is ever expected to be used in text/html environments.

Due to the similarities of both the HTML and XHTML syntaxes, it is possible to mark up documents using a common subset of the syntax that is the same in both, while avoiding the syntactic sugar that is unique to each. This type of document is known as a polyglot document because it simultaneously conforms to both syntaxes and may be handled as either. There are a number of issues involved with creating such documents and authors wishing to do so should familiarise themselves with the similarities and differences between HTML and XHTML.

The Syntax

There are a number of basic components make up the syntax of HTML, that are used throughout any document. These include the DOCTYPE declaration, elements, attributes, comments, text and CDATA sections.

DOCTYPE Declaration

The Document Type Declaration needs to be present at the beginning of a document that uses the HTML syntax. It may optionally be used within the XHTML syntax, but it is not required. The canonical DOCTYPE that most HTML documents should use is as follows:

<!DOCTYPE html>

For compatibility with legacy producers of HTML — that is, software that outputs HTML documents — an alternative DOCTYPE is available for use by systems that are unable to output the DOCTYPE given above. This limitation occurs in software that expects a DOCTYPE to include either a PUBLIC or SYSTEM identifier, and is unable to omit them. The canonical form of this DOCTYPE is as follows:

<!DOCTYPE html SYSTEM "about:legacy-compat">

Note: The term "legacy-compat" refers to compatibility with legacy producers only. In particular, it does not refer to compatibility with legacy browsers, which, in practice, ignore SYSTEM identifiers and DTDs.

In HTML, the DOCTYPE is case insensitive, except for the quoted string "about:legacy-compat", which must be written in lower case. This quoted string, however, may also be quoted with single quotes, rather than double quotes. The emphasised parts below illustrate which parts are case insensitive.

HTML Example:

<!DOCTYPE html>

<!DOCTYPE html SYSTEM "about:legacy-compat">

<!DOCTYPE html SYSTEM 'about:legacy-compat'>

The following are also valid alternatives in the HTML syntax:

HTML Example:

<!doctype html>

<!DOCTYPE HTML>

<!doctype html system 'about:legacy-compat'>

<!Doctype HTML System "about:legacy-compat">

For XHTML, it is recommended that the DOCTYPE be omitted because it is unnecessary. However, should you wish to use a DOCTYPE, note that the DOCTYPE is case sensitive, and only the canonical versions of these DOCTYPEs given above may be used.

XHTML Example:

<!DOCTYPE html>

<!DOCTYPE html SYSTEM "about:legacy-compat">

<!DOCTYPE html SYSTEM 'about:legacy-compat'>

However, there are no restrictions placed on the use of alternative DOCTYPEs in XHTML. You may, if you wish, use a custom DOCTYPE referring to a custom DTD, typically for validation purposes. Although, be advised that DTDs have a number of limitations compared with other alternative schema languages and validation techniques.

Historical Notes

This section needs revising and may be moved to an external document and simply referred to.

The DOCTYPE originates from HTML’s SGML lineage and, in previous levels of HTML, was originally used to refer to a Document Type Definition (DTD) — a formal declaration of the elements, attributes and syntactic features that could be used within the document. Those who are familiar with previous levels of HTML will notice that there is no PUBLIC identifier present in this DOCTYPE, which were used to refer to the DTD. Also, note that the about: URI scheme in the SYSTEM identifier of the latter DOCTYPE is used specifically because it cannot be resolved to any specific DTD.

As HTML5 is no longer formally based upon SGML, the DOCTYPE no longer serves this purpose, and thus no longer needs to refer to a DTD. However, due to legacy constraints, it has gained another very important purpose: triggering no-quirks mode in browsers.

HTML 5 defines three modes: quirks mode, limited quirks mode and no quirks mode, of which only the latter is considered conforming to use. The reason for this is due to backwards compatibility. The important thing to understand is that there are some differences in the way documents are visually rendered in each of the modes; and to ensure the most standards compliant rendering, it is important to ensure no-quirks mode is used.

HTML5

Wednesday, 4 January 2012

HTML5 SYNTAX

The HTML and XHTML Syntax

Syntactic Overview

The Syntax

DOCTYPE Declaration

No comments:

Post a Comment