COMP60370
A Tale of Two Formats
Bijan Parsia
{bparsia@cs.man.ac.uk}
Things "on the Web"
- Documents
- Home pages, documentation, papers, magazines, manuals,
novels....
- Data
- Addresses, bank records, appointments
- Applications
- "User agents"
- Non-user agents
- Web hosted/native applications
- For end users
- For other programs
- There are at least 45.4 billion web
pages indexed in search engines!
Making such things
- Who creates documents?
- Who creates data?
- Who creates applications?
- How do we make such things for (more or less) arbitrary
(re)use?
- Web applications are, create,
and consume Web documents and data!
The most basic picture

A less basic picture

The data
- Initial URI, say
<http://ex.org/test/example.html>
- The HTML:
- Consider how complex this is!
Today: Authoring Web Stuff
- Validation & error handling
- Semantic Markup & Styling
Case to Study
- Consider weblogs
- Chronologically reversed series of "items"
- Each item has an author and a timestamp
- Items are generally short, but can contain all sorts of
hypermedia
- Generally intended to be read by people
- Closer to a magazine than to a stock ticker
- Different aspects
- Writing
- Reading
- Publishing
- As a web site
- As a "feed" for syndication
- Aggregating
A Weblog Workflow

Weblog Data Formats
- For writing
- HTML (directly or by a Web App)
- "Markdown" languages
- Reading
- Publishing
- HTML for web sites
- RSSx or Atom for syndication
- Aggregation
A Brief History of (X)HTML
- Original HTML "inspired by" some SGML formats
- But no DTD; first browser didn't use SGML parser
- At least somewhat LaTeXish
- HTML 2.0-3.2 had DTDs
- HTML 4.0x
- DTD (again, not used by browsers)
- Deprecation of presentational features
- XHTML
- XMLized version of HTML 4.0x (with DTD)
- (X)HTML5
HTML as SSD
- HTML files tend to correspond to documents
- Text/narrative heavy
- Complex, irregular (treelike) structure
- Lots of features (doc structure, formatting, tables,
forms, etc.)
- HTML is Not XML
- No need for well formedness
- Tags don't need to be closed
- Attributes don't need to be quoted (etc. etc.)
- Many others
- HTML is Not SGML
A simple HTML weblog (1)
Authentic Voice of a Person. Reverse Chronological
Order. On the web. These are essential
characteristics of a online Journal or weblog.
Given the statements above, a well formed log entry would
contain at a minimum an author, a creationDate, and a
permaLink. And, of course, content. -- Sam Ruby
<h1>My Weblog</h1>
<h2>What I Did Today</h2>
<h3>Feb. 11, 2008; Bijan Parsia</h3>
<p>Taught a class and it went <i>very</i> well.</p>
A simple HTML weblog (2)
We can radically change the markup.
<h1>My Weblog</h1>
<ul>
<li>
<b>What I Did Today</b><br/>
<i>Feb. 11, 2008; Bijan Parsia</i><br/>
<p>Taught a class and it went <em>very</em> well.
</li>
</ul>
A simple Atom entry
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>My Weblog</title>
<updated>2008-02-13T18:30:02Z</updated>
<id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
<entry>
<author>
<name>Bijan Parsia</name>
</author>
<title>What I Did Today</title>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2008-02-13T18:30:02Z</updated>
<content type="xhtml" xml:lang="en">
<div xmlns="http://www.w3.org/1999/xhtml">
<p>Taught a class and it went <em>very</em> well.</p>
</div>
</content>
</entry>
</feed>
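A minimal Python sketch of how a feed consumer might pull the author, title, and timestamp out of such an entry, assuming only the standard library's xml.etree.ElementTree. FEED is a trimmed copy of the example above, and the function name read_entries is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced names as "{namespace-uri}local-name".
ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed copy of the slide's feed (ids and content left out for brevity).
FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Weblog</title>
  <updated>2008-02-13T18:30:02Z</updated>
  <entry>
    <author><name>Bijan Parsia</name></author>
    <title>What I Did Today</title>
    <updated>2008-02-13T18:30:02Z</updated>
  </entry>
</feed>"""


def read_entries(feed_text):
    """Return a (title, author, updated) tuple for each atom:entry."""
    root = ET.fromstring(feed_text.encode("utf-8"))
    entries = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title")
        author = entry.findtext(ATOM + "author/" + ATOM + "name")
        updated = entry.findtext(ATOM + "updated")
        entries.append((title, author, updated))
    return entries


for item in read_entries(FEED):
    print(item)
```

Note that the three things Sam Ruby asks for (author, date, link or id) are directly addressable here, which is not true of either HTML version of the weblog.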
Validation in the Wild
- HTML
- 1%-5% of web pages are valid
- All sorts of breakage
- E.g., overlapping tags
<b>hi <i>there</b>, my good friend</i>
- Syndication Formats
- 10% of feeds are not well-formed
- Where do the problems come from?
- Hand authoring
- Generation bugs
- String concat based generation
- Composition from random sources
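A small Python sketch of why string-concatenation-based generation is a common source of breakage, assuming the standard library's xml.etree.ElementTree. The function names and the sample title are made up for illustration.

```python
import xml.etree.ElementTree as ET


def entry_by_concat(title):
    # Naive generation: the title is pasted into the markup unescaped.
    return "<entry><title>" + title + "</title></entry>"


def entry_by_tree(title):
    # Tree-based generation: the serializer escapes &, < and > in text.
    entry = ET.Element("entry")
    ET.SubElement(entry, "title").text = title
    return ET.tostring(entry, encoding="unicode")


def is_well_formed(text):
    # An XML parser either accepts the whole document or rejects it.
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


title = "Fish & Chips <yum>"
print(is_well_formed(entry_by_concat(title)))
print(is_well_formed(entry_by_tree(title)))
```

The concatenated version breaks as soon as a title contains a markup character; building a tree and serializing it avoids that whole class of bug.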
Seeking Validation
- Validation and conformance criteria
- Schema language expressible
- Expressible at all
- Painfully expressible
- Not usefully expressible
- Machine checkable
- But might require arbitrary programming
- Requires human judgment
- E.g., for alt text: "Do not specify meaningless alternate
text (e.g., 'dummy text')."
- Reporting and repairing
- Hard failure (alone) not helpful
- Forbidding harmless stuff pointless
- Requiring the impossible pointless
Lesson #1
- We are dealing with socio-political (and economic) phenomena
- Complex ones!
- Many players; many sorts of player
- Lots of historical specifics
- Lots of interaction effects
- Human factors critical
- What do people do (and why?)
- Affordances and incentives
- Dealing with "bozos"
Error
- Be liberal in what you accept, and conservative in what
you send.
- Validation should help, not punish
- De facto XML motto
- Be strict about the well formedness of what you accept,
and strict in what you send
- Draconian error handling
- What about higher levels?
- Validity and other analysis?
- Most schema languages poor at error reporting
- Current thinking (some of the time)
- Deterministic error handling, i.e., specified error recovery
- Live DOM viewer
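The two attitudes to error can be seen side by side in a short Python sketch, assuming only the standard library. The XML parser is draconian and rejects the overlapping-tags example outright; html.parser is liberal and reports the tags it sees. Note that html.parser is only a tokenizer: it does not perform the specified HTML5 error recovery, so this illustrates acceptance, not repair. The class name TagLogger is hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

OVERLAP = "<b>hi <i>there</b>, my good friend</i>"


class TagLogger(HTMLParser):
    """Record start and end tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def xml_accepts(text):
    # Draconian: any well-formedness error rejects the whole input.
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


logger = TagLogger()
logger.feed(OVERLAP)
logger.close()

print(xml_accepts(OVERLAP))
print(logger.events)
```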
Schematron
- A different sort of schema language
- Not grammar or object based
- Rule based
- Test oriented
- Complementary
- Conceptually simple
- Patterns contain rules
- Rules set a context and contain
asserts and reports
- A&Rs contain tests and assertions
- Tests are XPath queries with the context as the current
node
- Assertions are natural language text describing the
condition
From HTML5: Exclusions
- HTML5 validator
- Relax NG schema
- Schematron assertions
- Custom code
- Often want contextual exclusions
- To break circles:
- Paragraphs contain footnotes
- Footnotes contain paragraphs
- Footnotes may not contain footnotes
- Without exclusions, would need many paragraph productions
Exclusions Examples
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
<ns prefix="h" uri="http://www.w3.org/1999/xhtml"/>
<pattern name='dfn cannot nest'>
<rule context="h:dfn">
<report test="ancestor::h:dfn">
The "dfn" element cannot contain any nested
"dfn" elements.</report>
</rule>
</pattern>
<pattern name='noscript cannot nest'>
<rule context="h:noscript">
<report test="ancestor::h:noscript">
The "noscript" element cannot contain any nested
"noscript" elements.</report>
</rule>
</pattern>
</schema>
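To make the semantics of the report concrete, here is a rough Python equivalent of the test ancestor::h:dfn, assuming the standard library's xml.etree.ElementTree. ElementTree has no ancestor axis, so the sketch builds a child-to-parent map first. The function names and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def has_ancestor(elem, tag, parents):
    # Walk up the parent map, mimicking the XPath ancestor axis.
    node = parents.get(elem)
    while node is not None:
        if node.tag == tag:
            return True
        node = parents.get(node)
    return False


def nested(root, local_name):
    """Return each element that has a same-named ancestor.

    Each returned element is one context node where the report fires.
    """
    tag = XHTML + local_name
    parents = {child: parent for parent in root.iter() for child in parent}
    return [e for e in root.iter(tag) if has_ancestor(e, tag, parents)]


GOOD = '<p xmlns="http://www.w3.org/1999/xhtml">A <dfn>term</dfn>.</p>'
BAD = ('<p xmlns="http://www.w3.org/1999/xhtml">A <dfn>term '
       '<em><dfn>inner</dfn></em></dfn>.</p>')

print(len(nested(ET.fromstring(GOOD), "dfn")))
print(len(nested(ET.fromstring(BAD), "dfn")))
```

The second document nests the inner dfn inside an em, which is exactly the case a grammar can only exclude by duplicating productions.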
Dfn Defined
From common.rnc:
common.elem.embedded = ( notAllowed )
common.elem.phrase = ( common.elem.embedded )
common.inner.phrase =( text & common.elem.phrase* )
From phrase.rnc:
dfn.elem = element dfn { dfn.inner & dfn.attrs }
dfn.attrs =
( common.attrs )
dfn.inner =
( common.inner.phrase )
common.elem.phrase |= dfn.elem
An Atom Example
<ns uri="http://www.w3.org/2005/Atom" prefix="atom"/>
<rule context="atom:feed">
<assert test="atom:author or not(atom:entry[not(atom:author)])">
An atom:feed must have an atom:author unless all
of its atom:entry children have an atom:author.
</assert>
</rule>
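The same assertion written as ordinary code, as a Python sketch assuming the standard library's xml.etree.ElementTree (whose XPath subset has no not() function, so the logic is spelled out by hand). The function name feed_author_ok and the sample feeds are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def feed_author_ok(feed):
    """True when the feed has an author, or every entry has its own.

    Mirrors: atom:author or not(atom:entry[not(atom:author)])
    """
    if feed.find(ATOM + "author") is not None:
        return True
    return all(entry.find(ATOM + "author") is not None
               for entry in feed.findall(ATOM + "entry"))


NS = 'xmlns="http://www.w3.org/2005/Atom"'
FEED_LEVEL = ET.fromstring(
    "<feed %s><author><name>A</name></author><entry/></feed>" % NS)
ENTRY_LEVEL = ET.fromstring(
    "<feed %s><entry><author><name>A</name></author></entry></feed>" % NS)
MISSING = ET.fromstring(
    "<feed %s><entry><author><name>A</name></author></entry>"
    "<entry/></feed>" % NS)

for feed in (FEED_LEVEL, ENTRY_LEVEL, MISSING):
    print(feed_author_ok(feed))
```

This is the kind of co-occurrence constraint that is awkward in a grammar-based schema but a one-liner as a Schematron test.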
Schematron Presumes...
- ...Well formed XML
- As do all XML schema languages
- So can't help with e.g., overlapping tags
- ...Authorial repair
- At least in the default case
- Thus, not the basis of a browser!
- Parse phase can handle both
- Parser generates (the moral equiv of) a DOM
- Parser can repair some problems
- E.g., elements in the wrong place
Structure and Style
- Both HTML and Atom have specific vocabularies
- Both have structural terms
- Indeed, except in content, Atom has nothing else
- HTML terms have specific default renderings
- And some (e.g., <font>) are purely presentational
- Or arguably presentational
- Atom terms have no default renderings
- Often rendered to/scraped from HTML
- HTML has two "purely structural" elements: <div> and <span>
- Instances distinguished by the class and id attributes
- Styled by CSS style rules or the style attribute
Why separate them?
- Presentation is more fluid than structure
- The "look" may need updating
- Presentation needs may vary
- What works for 21" screens doesn't for mobile phones
- Accessibility (content should be perceivable by people
with disabilities)
- Programmatic processing needs
CSS vs. XSL
- CSS is
- not an XML/angle brackets format
- annotative, not transformative
- mostly "formats" nodes
- ubiquitous on the Web, esp. client side
- XSL
- has two parts: XSLT and XSL-FO
- or really, any target language such as HTML
- generates a new tree
- so is free to rearrange things
- mostly server side
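What "transformative" means can be sketched in Python, assuming the standard library's xml.etree.ElementTree as a stand-in for an XSLT processor. The transformation builds a brand new HTML tree from an Atom entry and is free to reorder things: the author comes first in the source but last in the output. The function name entry_to_html and the output markup are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

ENTRY = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    "<author><name>Bijan Parsia</name></author>"
    "<title>What I Did Today</title>"
    "<updated>2008-02-13T18:30:02Z</updated>"
    "</entry>")


def entry_to_html(entry):
    """Generate a new tree: title first, then a date-and-author byline."""
    author = entry.findtext(ATOM + "author/" + ATOM + "name")
    updated = entry.findtext(ATOM + "updated")
    div = ET.Element("div", {"class": "entry"})
    ET.SubElement(div, "h2").text = entry.findtext(ATOM + "title")
    byline = ET.SubElement(div, "p", {"class": "byline"})
    byline.text = "{} by {}".format(updated, author)
    return div


print(ET.tostring(entry_to_html(ENTRY), encoding="unicode"))
```

CSS, by contrast, leaves the source tree as it is and only attaches formatting to its nodes.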
CSS Basics
- Rules
- Selectors
- Similar to XPath expressions
- But "forward" directed
- Special syntax for class attributes
- "Pseudo" classes and elements
- Declaration blocks
- @-Rules (esp.:)
CSS with <div>, <span>
<style type="text/css">
.title {font-weight: bold}
div.title {text-align: center; font-size: 24px}
div.entry div.title {text-align: left; font-variant: normal}
span.date {font-style: italic}
span.date:after{content:" by"}
div.content {font-style: italic}
div.content i {font-style: normal; font-weight: bold}
#one {color: red}
</style>
<div class=title>My Weblog</div>
<div class="entry">
<div class=title>What I Did Today</div>
<div class=byline>
<span class=date>Feb. 11, 2008</span> <span class=author>Bijan Parsia</span>
</div>
<div class="content" id="one">
<p>Taught a class and it went <i>very</i> well.</p>
</div>
</div>