The HTML and XHTML Syntax
It is useful to make a distinction between the vocabulary of an HTML
document—the elements and attributes, and their meanings—and the syntax
in which it is written.
HTML has a defined set of elements and attributes which can be used in
a document, each designed for a specific purpose and with its own meaning.
Consider this set of elements to be analogous to the list of words in a
dictionary. This includes elements for headings, paragraphs, lists,
tables, links, form controls and many other features. This is the
vocabulary of HTML. Similarly, just as natural languages have grammatical
rules for how different words can be used, HTML has rules for where and
how each element and attribute can be used.
The basic structure of elements in an HTML document is a tree structure.
Every element has exactly one parent element (except for the root
element, which has none), and may have any number of child elements. This structure needs
to be reflected in the syntax used to write the document.
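As a rough illustration of this tree structure, the following Python sketch parses a small document and prints each element indented beneath its parent. The document is written in the XHTML syntax only so that the standard library's XML parser can read it; the helper names `local_name` and `outline` are hypothetical and chosen for this example.

```python
import xml.etree.ElementTree as ET

# A small, well-formed document, written in the XHTML syntax so that a
# plain XML parser can read it.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>An HTML Document</title></head>
  <body><h1>Example</h1><p>This is an example HTML document.</p></body>
</html>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def outline(element, depth=0):
    """Return one indented line per element, children listed under their parent."""
    lines = ["  " * depth + local_name(element.tag)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)  # the root element is the only one with no parent
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The printed outline shows html at the top, head and body as its children, and title, h1 and p nested one level further down.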
Syntactic Overview
There are two syntaxes that can be used: the traditional HTML syntax, and the XHTML syntax. While these are similar, each is optimised for different needs and authoring habits. The former is more lenient in its design and handling requirements, and has a number of convenient shorthands for authors to use. The latter is based on XML and has much stricter syntactic requirements, designed to discourage the proliferation of syntactic errors.
The HTML syntax is loosely based upon the older, though very widely used, syntax from HTML 4.01. Although it is inspired by its SGML origins, in practice it shares only minor syntactic similarities with SGML. It features a range of shorthand syntaxes designed to make hand coding more convenient, such as allowing the omission of some optional tags and attribute values. Authors are free to choose whether or not to take advantage of these shorthand features, based upon their own preferences.
The following example illustrates a basic HTML document, demonstrating some shorthand syntax:
HTML Example:
<!DOCTYPE html>
<html>
  <head>
    <title>An HTML Document</title>
  </head>
  <body class=example>
    <h1>Example</h1>
    <p>This is an example HTML document.
  </body>
</html>
XHTML Example:
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>An HTML Document</title>
  </head>
  <body class="example">
    <h1>Example</h1>
    <p>This is an example HTML document.</p>
  </body>
</html>
Note: The XHTML document does not need to include the DOCTYPE because XHTML documents that are delivered correctly using an XML MIME type, and are processed as XML by browsers, are always rendered in no quirks mode. However, the DOCTYPE may optionally be included, and should be included if the document uses the compatible subset of markup that is conforming in both HTML and XHTML, and is ever expected to be used in text/html environments.
Due to the similarities of both the HTML and XHTML syntaxes, it is
possible to mark up documents using a common subset of the syntax
that is the same in both, while avoiding the syntactic sugar that is
unique to each. This type of document is known as a polyglot
document because it simultaneously conforms to both syntaxes and may
be handled as either. There are a number of issues involved with
creating such documents and authors wishing to do so should
familiarise themselves with the similarities and differences between
HTML and XHTML.
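One of the differences described above, the stricter syntactic requirements of XHTML, can be observed with any XML parser. The sketch below feeds both example documents to Python's standard XML parser: the HTML example uses shorthand (an unquoted attribute value and an omitted end tag) and is rejected, while the XHTML example is accepted. The function name `is_well_formed_xml` is hypothetical. Being well-formed XML is only a necessary condition for a polyglot document, not a sufficient one, so this is not a polyglot validator.

```python
import xml.etree.ElementTree as ET

# The HTML example, using shorthand syntax that XML does not allow.
HTML_SHORTHAND = """<!DOCTYPE html>
<html>
  <head><title>An HTML Document</title></head>
  <body class=example>
    <h1>Example</h1>
    <p>This is an example HTML document.
  </body>
</html>"""

# The XHTML example: quoted attribute values, every element explicitly closed.
XHTML_STRICT = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>An HTML Document</title></head>
  <body class="example">
    <h1>Example</h1>
    <p>This is an example HTML document.</p>
  </body>
</html>"""


def is_well_formed_xml(source):
    """Return True if an XML parser accepts the source, False otherwise."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml(HTML_SHORTHAND))
print(is_well_formed_xml(XHTML_STRICT))
```

A document intended to be polyglot would at least need to pass this kind of check while also avoiding constructs that only the XHTML syntax permits.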
The Syntax
A number of basic components make up the syntax of HTML and are used
throughout any document. These include the DOCTYPE declaration,
elements, attributes, comments, text and CDATA sections.
DOCTYPE Declaration
The Document Type Declaration needs to be present at the beginning of a document that uses the HTML syntax. It may optionally be used within the XHTML syntax, but it is not required. The canonical DOCTYPE that most HTML documents should use is as follows:
<!DOCTYPE html>
An alternative DOCTYPE is available for use by systems that are unable
to output the DOCTYPE given above. This limitation occurs in software
that expects a DOCTYPE to include either a PUBLIC or SYSTEM identifier,
and is unable to omit them.
The canonical form of this DOCTYPE is as follows:
<!DOCTYPE html SYSTEM "about:legacy-compat">
Note: The term "legacy-compat" refers to compatibility with legacy
producers only. In particular, it does not refer to compatibility with
legacy browsers, which, in practice, ignore SYSTEM identifiers and DTDs.
In HTML, the DOCTYPE is case insensitive, except for the quoted string
"about:legacy-compat", which must be written in lower case. This quoted
string may, however, be delimited with single quotes rather than double quotes.
In the canonical forms below, everything outside the quoted string is case insensitive.
HTML Example:
<!DOCTYPE html>
<!DOCTYPE html SYSTEM "about:legacy-compat">
<!DOCTYPE html SYSTEM 'about:legacy-compat'>
The following are also valid alternatives in the HTML syntax:
HTML Example:
<!doctype html>
<!DOCTYPE HTML>
<!doctype html system 'about:legacy-compat'>
<!Doctype HTML System "about:legacy-compat">
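The case rules above can be expressed as a regular expression. The following Python sketch is an assumption-laden illustration rather than a conformance checker: it covers only the DOCTYPE forms shown in this section, and the names `HTML_DOCTYPE` and `is_html_doctype` are hypothetical. The keywords are matched case-insensitively using scoped inline flags, while the quoted string is matched exactly and must be closed by the same kind of quote that opened it.

```python
import re

# Keywords are wrapped in (?i:...) so that only they ignore case.
# The quoted string stays case sensitive, and \1 requires the closing
# quote to be the same character as the opening quote.
HTML_DOCTYPE = re.compile(
    r"""(?i:<!doctype)\s+(?i:html)"""
    r"""(?:\s+(?i:system)\s+(["'])about:legacy-compat\1)?"""
    r"""\s*>"""
)


def is_html_doctype(text):
    """Return True if text is one of the DOCTYPE forms described above."""
    return HTML_DOCTYPE.fullmatch(text) is not None


examples = [
    "<!DOCTYPE html>",
    "<!doctype html>",
    "<!Doctype HTML System \"about:legacy-compat\">",
    "<!doctype html system 'about:legacy-compat'>",
    "<!DOCTYPE html SYSTEM \"ABOUT:LEGACY-COMPAT\">",  # upper-case string
]
for example in examples:
    print(example, is_html_doctype(example))
```

Every example except the last one is accepted; the last is rejected because the quoted string is not written in lower case.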
For XHTML, it is recommended that the DOCTYPE be omitted because it is
unnecessary. However, should you wish to use a DOCTYPE, note that the
DOCTYPE is case sensitive, and only the canonical versions of the
DOCTYPEs given above may be used.
XHTML Example:
<!DOCTYPE html>
<!DOCTYPE html SYSTEM "about:legacy-compat">
<!DOCTYPE html SYSTEM 'about:legacy-compat'>
However, there are no restrictions placed on the use of alternative
DOCTYPEs in XHTML. You may, if you wish, use a custom DOCTYPE referring
to a custom DTD, typically for validation purposes. Be advised, though,
that DTDs have a number of limitations compared with other schema
languages and validation techniques.
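The case sensitivity of the DOCTYPE in the XHTML syntax follows from XML itself, and can be observed with Python's standard XML parser. In this sketch, the helper `parses_as_xml` and the minimal `ROOT` element are hypothetical names chosen for illustration. Note that a non-validating parser such as this one checks only that the declaration is well formed; it does not load or apply any DTD.

```python
import xml.etree.ElementTree as ET

# A minimal root element to follow the DOCTYPE under test.
ROOT = '<html xmlns="http://www.w3.org/1999/xhtml"/>'


def parses_as_xml(doctype):
    """Return True if an XML parser accepts the DOCTYPE followed by ROOT."""
    try:
        ET.fromstring(doctype + "\n" + ROOT)
    except ET.ParseError:
        return False
    return True


print(parses_as_xml("<!DOCTYPE html>"))
print(parses_as_xml('<!DOCTYPE html SYSTEM "about:legacy-compat">'))
print(parses_as_xml("<!doctype html>"))  # lower-case keyword is not valid XML
```

The two canonical forms are accepted, while the lower-case form, which is fine in the HTML syntax, is rejected by the XML parser.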
Historical Notes
This section needs revising and may be moved to an external document and simply referred to.
The DOCTYPE originates from HTML’s SGML lineage and, in previous levels
of HTML, was used to refer to a Document Type Definition (DTD), a formal
declaration of the elements, attributes and syntactic features that
could be used within the document. Those who are familiar with previous
levels of HTML will notice that there is no PUBLIC identifier present in
this DOCTYPE; such identifiers were used to refer to the DTD. Also, note
that the about: URI scheme in the SYSTEM identifier
"about:legacy-compat" is used specifically because it cannot be
resolved to any specific DTD.
As HTML5 is no longer formally based upon SGML, the DOCTYPE no longer
serves this purpose, and thus no longer needs to refer to a DTD.
However, due to legacy constraints, it has gained another very
important purpose: triggering no quirks mode in browsers.
HTML5 defines three modes: quirks mode, limited quirks mode and no
quirks mode, of which only the last is considered conforming to use. The
other two exist for backwards compatibility. The important thing to
understand is that documents are visually rendered differently in each
of the modes; to get the most standards-compliant rendering, it is
important to ensure that no quirks mode is used.