Stupid me, I thought this would be relatively easy for someone of my skill set.
So I want to build an application which can display ePub files. Step 1, of course, is to look at the ePub specification, which is simple enough: an ePub file is basically a zip file with a particular format.
Building a zip file reader is relatively straightforward, and parsing through the ePub file specification really doesn’t take a lot. As verbose as the ePub file specification is, really you only need to do a handful of things to get at the contents of the file:
1. First, you build a piece of code which allows you to randomly access the contents of the zip file. If you are doing this in Java that’s pretty easy; in Objective C that requires a little more finesse. But at the end of the day you build a zip file scanner which scans for the table of contents, load that into a structure, and use the table of contents to find the data in the file containing the compressed file structure.
There are plenty of sources and examples of reading the zip file structure; Wikipedia gives a good overview, the file format is documented fairly well, and there are examples on how to parse and decompress the file.
2. You need to load the META-INF/container.xml file from the zip file, and parse the contents.
The contents are well-defined, and generally contain a single top level .opf file reference which you can then use to parse the contents of the ePub book. The file generally looks like this:
<?xml version="1.0" encoding="UTF-8"?> <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0"> <rootfiles> <rootfile full-path="OPS/epb.opf" media-type="application/oebps-package+xml"/> </rootfiles> </container>
So just find the rootfile object in the rootfiles object, grab the full-path attribute, and away you go. (Now ePub books can have multiple oebps package files, and there is a method for parsing through each of them and offering the “version” of the book to display to the user. This may be used, for example, with comic books where you may want a text version with images, and a SGML version with illustrator images.) But parsing and displaying that is beyond the scope of this post.
3. You then parse the .opf file, which is a <package> file that contains metadata, a manifest, a spine (that is, a list of the chapters in the book), and potentially a table of contents of some form.
For ePub 3 (which is what I’m interested in parsing), you can grab the author and title of the book from the metadata. (Generally the name of the book is under metadata/dc:title, and the author under metadata/dc:creator, though there are exceptions; [see the specs](http://www.idpf.org/epub/31/spec/epub-packages.html#sec-metadata-elem) on how to handle multiple authors or subtitles.)
For ePub 3, the table of contents can be found by scanning the manifest for an <item> which has the property “nav” set. So, for our example file with a manifest:
<manifest> <item href="toc.xhtml" id="toc" media-type="application/xhtml+xml" properties="nav"/> <item href="cover.xhtml" id="cover" media-type="application/xhtml+xml"/> <item href="chapter-1.xhtml" id="chapter-1" media-type="application/xhtml+xml"/> <item href="images/IMG_0087.jpg" id="dataItem1" media-type="image/jpeg"/> <item href="css/book.css" id="stylesheet" media-type="text/css"/> <item href="epb.ncx" id="ncx" media-type="application/x-dtbncx+xml"/> </manifest>
The table of contents is the file toc.xhtml, which, while it is in XHTML format, must honor the format specified in the ePub specification. This means the xhtml table of contents can be parsed easily into a table of contents data structure and presented as, for example, a drop-down menu, rather than being forced to display the table of contents as an HTML page.
The spine XML contains the actual contents of the ePub book, and presents a list of item references referring to files in the manifest for display. A spine looks like:
<spine toc="ncx"> <itemref idref="cover" linear="no"/> <itemref idref="chapter-1" linear="yes"/> </spine>
Each <itemref> item contains an idref attribute which refers to an id of an item in the manifest. The spine is in the presentation order of the book. And notice that each manifest item contains an href; that is a relative reference within the zip file for the contents of each chapter. The manifest also contains references for all the images and other contents, again, relative to the location of the original .opf file.
There are other elements of the ePub file, including an .ncx file and other non-standard parts; they exist either for ePub 2 compatibility or to provide extra metadata for certain readers, such as iBooks specific information.
At the bottom of the stack, an ePub file is essentially an HTML web site in compressed forms, where individual pages (generally, chapters) should be presented in a certain order. You would not be wrong, by the way, to unzip all of this into a directory, find the table of contents and open that in a browser. If you want to get fancy, you can also add a “previous chapter”/”next chapter” button somewhere in your UI to progress through the web pages in the spine, or you can synthesize a single page that the user can use for navigation.
However, suppose you’re me and you don’t want to display your book as a series of very long web pages the user scrolls through, but as a series of pages–like the Kindle app or the iBooks app.
Well, that’s where my adventure down the rabbit hole comes into play.
As it turns out, on iOS the UIWebView has the ability to display web sites in “page” mode. But this doesn’t quite do the right thing. You get formatting errors all over the place, and fundamentally you want to have greater control over the layout of individual pages than UIWebView (which is deprecated in favor of a new web browser technology which does not do page-level layout), you need to lay the pages out yourself.
And ePub says that its pages are XHTML, using the HTML 5 specification (meaning you can parse the XHTML pages using an XML parser), and formatted with CSS 3.
And thus, my adventure into the land of understanding CSS 3 begins.
If you’re like me, the first thing you do when you scan the CSS 3 specs is you look for a document which describes the syntax of the file, and you implement it.
Which is what I did: I found the CSS 3 syntax specification, implemented the tokenizer or lexical parser (which I did by hand and was no more complex than most tokenizers), and started in on the syntax parser.
It started out fine. Section 4 of the Syntax document turned out to be incredibly easy to implement: section 4.3 pretty much holds your hand and tells you in prose exactly what to implement in code. (I’m sure there are some optimizations that can be applied here, but if you’re like me, you want to see something work before you go back and optimize.)
Then I got to Section 5, and did the same thing.
And that’s where the ground started to give way into quicksand.
First, notice the multiple entry points. Okay, this makes sense since CSS can be in its own separate file, embedded in a <style> tag, and snippets of CSS can be in the style attribute of the HTML file.
Sure, fine. I can deal with that.
But then you start building the parser and you notice things like this:
Note: Despite the name, this actually parses a mixed list of declarations and at-rules, as CSS 2.1 does for @page. Unexpected at-rules (which could be all of them, in a given context) are invalid and should be ignored by the consumer.
An “at-rule,” by the way, is sort of like a preprocessor macro in C: it can be just about anything, though what it is legally allowed to be at any point in the parse tree is context-dependent.
And then you see things like this: Consume a component value:
- Consume the next input token.
- If the current input token is a <{-token>, <[-token>, or <(-token>, consume a simple block and return it.
- Otherwise, if the current input token is a , consume a function and return it.
- Otherwise, return the current input token.
Hang on a minute, Sparky; did I just see what I thought I saw?
We’re returning tokens?
At this point you realize the syntax parser algorithm does not actually fully specify the CSS language. It basically finds components of the CSS file, but then at later points you need separate parsers to parse the language in order to find out what’s going on. That is, at the end of this parsing session you’ll see things like “qualified rules” which contain a list of tokens; you later have to parse those tokens to figure out you have a list of declarations–and so forth.
You are not, in other words, left with an abstract syntax tree. You’re left with a partially parsed file which still needs further work.
And how do you deal with a component being either a “function”, a “block” or a “token?” (Personally I reached back into the token object and added synthesized token values for “block” and “function”; that way I could use a recursive parser for further parsing tasks.)
So why the hell don’t you get a full abstract syntax tree?
Well…
First, let me note that the CSS 3 specification does not exist.
By that I mean the CSS 3 specification is not a single specification like CSS 1 or 2. There is no single document stamped “CSS 3” which is a complete description of the CSS 3 specification.
Instead, CSS 3 is essentially a series of “deltas”; smaller documents which describe changes to the CSS 2 specification which make the specification “CSS 3” compliant. The standards committe calls this a “modular approach”:
However for CSS beyond Level 2, the CSS Working Group chose to adopt a modular approach, where each module defines a part of CSS, rather than to define a single monolithic specification. This breaks the specification into more manageable chunks and allows more immediate, incremental improvement to CSS.
In practice, what this means is that to understand CSS 3, you must first understand CSS 2, then update any of the “out of date” chapters in CSS 2 with their relevant CSS 3 documents.
Think that’s bullshit? Well, here’s the official definition.
In practice, that means the syntax parser specification cannot fully parse the CSS 3 language, since future modules may supersede the current specification.
It also means in practice there is no definitive list of CSS 3 attributes anywhere in the document. Now you are not required to implement all attributes; with the introduction of media types in CSS 2 you may want to only implement a subset of the attributes relevant for your media type. And some media types introduce attributes that are only relevant for that media; for example, paged media (that is, web sites rendered to printers or to eBooks with flippy pages) introduce formatting attributes only relevant to print media.
This, by the way, also means to fully parse CSS 3 you must understand which attributes you want to implement (and a full list of CSS 2 attributes which form the foundation of CSS 3 are listed here), then you must build specialized parsers to parse the tokens in the relevant declarations.
And it means you must expand supported shorthand properties; properties which are shorthand for multiple other properties, some of which can have some mind-bending syntax of their own.
And you must expand shorthand properties, because in the following:
border: black; border-top: red;
This should expand to:
border-top: red; border-right: black; border-bottom: black; border-left: black; border-image: none;
Meaning a shorthand property is exactly the same as if you wrote out the full properties–and later in the CSS you can override the individual properties without resetting all of the other components of the shorthand property.
So which properties are shorthand properties and what do they expand to? Well, that requires wading through the specification. And ignoring properties you don’t understand–which can be added later to the CSS specification, as it’s modular.
Once you realize the CSS specification is a hot mess, you find yourself writing a hell of a lot of code to handle all the various properties.
But wait, it gets worse! Not all properties promulgate. And that’s a good thing from a user’s perspective; just because you indicate a <div> tag should sport a nice sexy light-gray border doesn’t mean you want all the contained content to also display borders. On the other hand, if you specify the font of a <div> tag, you probably want the same font to be used inside every element of that tag unless specified otherwise.
But from an implementation standpoint that requires a table somewhere of all your properties and which ones promulgate and which ones don’t.
Resolving which attributes are associated with which elements is also well defined. But there are no good strategies given for how to do this quickly; a naive approach would be pretty brute force. (I’ve elected to build a cache of all selectors who have a final selector value that matches a particular cache key of the tag name (i.e., “<p>”), the id attribute and the class attribute. The theory is that the majority of HTML elements will share similar attributes, so the cache should be hit repeatedly. And if a tag hasn’t been seen yet, I construct a subset of the selectors which match the key. My hope is that for most HTML we’ll have zero lookups beyond the basic selection process.)
It just gives me a headache.
And I haven’t even gotten into page layout yet.
*sigh*