Anyone with a Mac who uses the command line knows that Apple is fond of storing things in directory structures, and that Apple’s UI uses the directory’s extension to decide how to interpret and navigate it. For example, Apple programs are actually stored in a directory following the <appname>.app convention. It appears that iWork stores its data the same way. Pages stores each document in a directory with a <name>.pages extension, where <name> is whatever you type in the “File Save” dialog.
For no particular reason, and certainly not with the intention of writing a blog post about ODF vs OOXML vs Pages XML structure, I ventured into my .pages directory. You can do that in Finder by right-clicking your <name>.pages and then choosing “Show Package Contents”, so I am not divulging any Apple trade secret either. What I got was something that looks like an <appname>.app directory. Not exactly a surprise, but it does confirm that Apple uses the same convention for applications and their data. Following Apple’s convention, the first place I looked was PkgInfo in the Contents subdirectory, and I was disappointed to find only eight “?” in it. I am sure each of the eight “?” means something to the operating system, but frankly I am not interested in what they mean. Looking back in the original <name>.pages directory, I found all the picture files used in the document, and a gzipped file named index.xml.gz.
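If you prefer a script to Finder, the same look around can be sketched in a few lines of Python. This is only a minimal sketch: the function name list_package is my own invention, and it assumes nothing about the package beyond it being an ordinary directory. It lists every file with its size, largest first, which also shows at a glance how much room the pictures take up.

```python
from pathlib import Path


def list_package(package_dir):
    """Return (relative_path, size_in_bytes) for every file in the
    package directory, sorted with the largest file first.

    list_package is a hypothetical helper name, not an Apple API.
    """
    root = Path(package_dir)
    entries = [
        (str(path.relative_to(root)), path.stat().st_size)
        for path in root.rglob("*")  # walk all subdirectories too
        if path.is_file()
    ]
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Example use (the path is a placeholder for your own document):
# for name, size in list_package("MyDocument.pages"):
#     print(f"{size:>12}  {name}")
```

On a document like mine, I would expect the picture files to dominate the top of the list, with index.xml.gz and Contents/PkgInfo further down.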
One important note I immediately made is to trim down my pictures before including them in a document, since all the pictures appear to be copied into the .pages directory as is. This kind of reduces the advantage of the very useful “mask” facility in iWork, because for large files you are better off cropping the image separately, or reducing its resolution, before importing it into your iWork document. Otherwise, you will find your document size increasing very quickly, as no compression is performed. Anyway, this is not the most interesting part.
The most interesting part is index.xml.gz. I ungzipped it using a standard tool and, as expected, it is Apple’s own XML description of the document. What is surprising and refreshing is that, by using words from my document and the filenames of the pictures as keywords, I quickly deciphered the structure of the file without having any Apple documentation on the format. What I really like is that, unlike the naive implementations of IDs that we see in OOXML and in some if not most ODF documents, the IDs are readable, along the lines of “STDrawItem-9999” rather than the useless and clueless “987654321”.
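For the curious, here is a rough Python sketch of that exercise. It decompresses index.xml.gz in memory and counts the readable IDs by their prefix. I am assuming only what I saw in my own file: that the IDs appear as attribute values shaped like a word, a hyphen and a number. The function name and the regular expression are mine, not Apple’s, and I have no documentation that guarantees this shape.

```python
import gzip
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: readable IDs look like "STDrawItem-9999" (letters, hyphen, digits).
READABLE_ID = re.compile(r"^([A-Za-z]+)-(\d+)$")


def readable_id_prefixes(gz_path):
    """Count ID prefixes (e.g. "STDrawItem") found in any attribute value
    of a gzipped XML file. Hypothetical helper, not part of any Apple tool.
    """
    # gzip.open returns a file-like object that ElementTree can parse directly.
    with gzip.open(gz_path, "rb") as handle:
        tree = ET.parse(handle)

    counts = Counter()
    for element in tree.iter():
        for value in element.attrib.values():
            match = READABLE_ID.match(value)
            if match:
                counts[match.group(1)] += 1
    return counts


# Example use (the path is a placeholder for your own document):
# print(readable_id_prefixes("MyDocument.pages/index.xml.gz").most_common())
```

The sketch deliberately ignores element and attribute names, since I do not want to claim knowledge of the schema; it only relies on the IDs being human readable, which is exactly the point.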
Like the ODF single-file format, index.xml is self-contained. It actually resembles ODF more closely than OOXML in this and many other respects. In particular, we do not see abbreviated XML element or attribute names, except those that are extremely obvious and borrowed from HTML. This, of course, is the key to why I could decipher the file so quickly. Nor do we see the “run” tags (rPr tags, for example) that MS insists it needs to fully document the properties of the text. Those properties are instead attributes of the element they refer to, the way it is done in ODF and the way I think it should be.
What Apple should have done is keep the root directory clean. All pictures should be moved into a subdirectory, which would make the directory structure clearer. As it stands, the file structure is difficult to see if, like me, you use a lot of pictures. And I do miss a manifest file. The reason I went straight to Contents/PkgInfo is that I thought it was the equivalent of the META-INF/MANIFEST.MF file in Java. Although small developers like me are unlikely to read such a file, it does give an overview of the directory content and is useful for machine readability and for cross-checking when programming. What I think Apple should not do is make a manifest file for every file, as in the case of OOXML with its proliferation of <name>.rels files.
Also, compared to OpenOffice.org and MS Office, iWork, while having an innovative UI, is still immature. I could not write a technical document or thesis using Pages because it lacks several vital elements, such as consistent display of heading numbering. This stops me from calling Apple’s Pages XML a work of art, as there may be cracks in the XML schema that make the advanced features required in a technical document or thesis hard to implement in Pages.
What does all this say about Apple? It has the competency to implement a good XML structure for office documents. I cannot help but use this to take a swipe at OOXML. While I can see why participating in ECMA TC45 on OOXML makes sense from a business point of view, from a technical point of view it tarnishes Apple’s reputation when one considers that it sat on ECMA TC45, deliberated over and approved that lousily written OOXML. Also, since Apple was already brewing such an XML for its own documents, this further confirms my suspicion that Apple is there only to ensure it can implement import/export filters.
I do not blame it for not using OOXML or ODF as its native format. If anything, Numbers proves that both are inadequate for Apple’s needs.
Now that I have had a preliminary look at Apple’s own XML design for office documents, what can I say about the debate from an XML beauty pageant point of view? (1) The use of abbreviated tags, except for tags borrowed from HTML, should be discouraged. Not only did ODF decide it is not a good idea, Apple thinks so as well. Obviously, both ODF and Apple considered the speed argument put forward for OOXML and decided the speed is worth sacrificing; (2) the “run” tag concept in OOXML is not a good idea. If it were, Apple would have done it. Moreover, it smells like a left-over from the binary file format era; and (3) please use human-readable strings in XML, not “98egs3gfe” but “STTextBox-109873”, even if you use a program to generate XML that is intended for another machine to read. You never know when a human will eyeball it. Do you really think Apple thought that I would actually eyeball the XML?
One final note: on closer reading of the iWork website and documentation, it appears that iWork can only read OOXML files but cannot write them. That’s a pity.