Subscribe to Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Tuesday, January 05, 2016

Some DITA and DocBook History: Common Origins, Very Different Results

The following was originally posted to the DITA Users' Yahoo Group 4 Jan 2016 in the context of a discussion of DITA vs. DocBook. My intent with this bit of history is to show how both DocBook and DITA (through its ancestor, IBM ID Doc) started development around the same time, more or less from a single meeting. Those of us at IBM took things in one direction, those in the Unix-focused community went in a different direction.

The original post (edited for typos):

If you look at their history, DocBook and DITA both descend from the same time period, the late 80’s, when the technical communication industry in particular (but not exclusively) was trying to figure out how to apply this new SGML technology to its particular information management and document production challenges.

In the case of DocBook the genesis was primarily standardizing Unix man pages. In the case of DITA it was IBM’s attempt to standardize the markup used across the many different divisions and product groups within IBM as well as satisfy the requirements of online delivery of hyperlinked documents, something IBM was doing in the 80’s, long before anyone else outside of hypertext research groups, as far as I know.

There was a meeting in the late 80’s, I think 1989, where representatives from the major software and hardware vendors, including IBM, HP, Digital Equipment, Groupe Bull, and one or more Unix vendors (the names escape me now; all except IBM and HP are long gone), met to discuss ways of standardizing the markup across their documentation in order to have some hope of interchange among them.

The meeting was hosted by Fred Dalrymple of the Open Software Foundation at offices in the Boston area. The work was led by Eve Maler, who was pioneering approaches to DTD design and modularization (she popularized the “pizza” model, adopted by the TEI and also reflected somewhat in DocBook and DITA). I was there with Wayne Wohler representing IBM. (Eve wrote the first book on SGML DTD design, “Developing SGML DTDs: From Text to Model to Markup”, with Jeanne El Andaloussi, who was at Groupe Bull at the time.)

One of the key things that Eve did was make a table that related the markup vocabularies of each participant to each other vocabulary. There was a row for “paragraph”, a row for “H1”, etc. [I’m sure I don’t have a copy of this table anywhere but it would be interesting to see it now—I have a clear picture of it in my mind but not clear enough to reproduce. But this table was, in many ways, the direct inspiration for my approach to markup design and set the direction of my technical career from then to now.]

What this table made clear was that all these languages had the same basic set of semantic elements but they all used different tag names and had different detailed rules for the content. But they all had some kind of paragraph element (each under its own tag name), headings, tables, lists, etc. (Remember that this was before HTML had been defined by Sir Tim Berners-Lee; he based HTML on the basic tag set in IBM’s GML Starter Set language, which predated SGML and was in use at CERN at the time Berners-Lee developed HTML.)

What Wayne and I got from this meeting was that A) there was this semantic correspondence and B) we needed a way to allow differences in markup details (tag names, content models) that still allowed interoperation. I realized that one could define a layered architecture with these base types as its foundation and, given a way to map specific element types to their bases, allow variety in the markup naming and content details while allowing interchange and common processing.
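To make the layered idea concrete, here is a minimal Python sketch of mapping vendor-specific element types to shared base types and then processing by base type. The vocabulary names, tag names, and functions are all hypothetical illustrations; they are not taken from Eve's table or from any actual IBM design.

```python
# Hypothetical vocabularies: these tag names are illustrative only and are
# not taken from the table described in the post.
BASE_OF = {
    "vendor_a": {"p": "paragraph", "h1": "heading", "ol": "list"},
    "vendor_b": {"para": "paragraph", "title": "heading", "orderedlist": "list"},
}


def to_base(vocabulary: str, tag: str) -> str:
    """Return the shared base type that a vendor-specific tag maps to."""
    return BASE_OF[vocabulary][tag]


def render(vocabulary: str, tag: str, text: str) -> str:
    """Common processing keyed on the base type, not on the local tag name."""
    base = to_base(vocabulary, tag)
    if base == "heading":
        return text.upper()
    if base == "list":
        return "* " + text
    return text  # paragraphs pass through unchanged


# Different local names, same base type, same processing.
print(to_base("vendor_a", "p") == to_base("vendor_b", "para"))
print(render("vendor_b", "title", "Introduction"))
```

The point of the sketch is only that the processor never needs to know the local tag names: adding a new vocabulary means adding a mapping, not changing the processing.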

Soon after this Wayne and I, along with Don Day, Simcha Gralla, and others, started working on IBM’s SGML replacement for the GML-based BookMaster language, which was used for most of IBM’s documentation and had more than 600 element types, reflecting a very broad range of requirements. BookMaster allowed for very efficient creation of documentation delivered in print and online on 5 different computer platforms using IBM’s BookManager tool, which provided electronic books starting in the mid 80’s. But BookMaster was also big and difficult to change or extend. It suffered the same problems that all large all-encompassing vocabularies suffer: it became a tarball that was difficult to adapt to new requirements. IBM had a committee that considered BookMaster change requests and it worked on a 6-month cycle at best. BookMaster was also based on proprietary IBM composition technology, the Document Composition Facility, which was becoming obsolete with the development of PCs and more modern processing languages and systems.

At this same time Dr. Charles Goldfarb, inventor of GML and SGML, was now working on HyTime, an SGML-based language for hypertext representation. Dr. Goldfarb knew that he couldn’t impose a specific tag set but had to have a way to allow any element type to indicate what kind of HyTime thing it was. His solution was “architectural forms”, a mechanism that relied on specific SGML features to allow elements to declare how they related to the HyTime-defined element types and attributes. It also imposed basically the same content model constraints that DITA specialization imposes, namely that the content models of the derived element types had to be consistent with those of their architectural bases, but HyTime was necessarily less restrictive.

For the SGML BookMaster replacement, which we called IBM ID Document Type (IDDoc), we needed robust linking and we needed something like architectural forms. So we adopted HyTime both for linking and for the architectural forms mechanism. [As a side effect I became involved with Dr. Goldfarb and Dr. Newcomb in the development of the HyTime standard itself. You can ask my wife about “No, Charles.” sometime…]

For IBM ID Doc we defined a base set of elements that reflected the 25-or-so basic semantic elements that Eve had identified at that meeting at the OSF. The rest of the vocabulary was then built up from those base types. This layered architecture allowed the implementation of common processing while allowing local creation of new vocabulary to meet new requirements. Interchange and interoperation were preserved but the overall system became more flexible. This design was completed in about 1993 and implementation and use proceeded and continues to this day, although I understand that use of IDDoc is almost completely replaced by use of DITA within IBM. I left IBM in 1994. Don Day stayed.

Thus DITA reflects one ancestral branch from those early days of SGML application design.

Soon after or at the same time as the OSF meeting, another group of people founded the Davenport group, focused on standardizing Unix man pages. I was not directly involved in these meetings so I can’t comment on the details but their work became the basis for DocBook. I did attend one DocBook meeting sometime in the early 90’s (I remember I was still wearing suits per the IBM dress code, so it had to be before ’92 or ’93). There I presented my attempt to use architectural forms to formally map DocBook to IDDoc and tried to plant the idea of architectural forms and layered architectures, but I was not successful. I think I was seen mostly as a disruptive crank, which I probably was to some degree.

[From Fred Dalrymple’s LinkedIn page, on his time at OSF: “Designed the book style and created formatting tools for all OSF technical publications, published by Prentice-Hall. Led migration of OSF technical publications from legacy format (UNIX nroff/troff) to SGML, including definition of the OSF DTD and development of transformation tools. This work led directly to the creation of DocBook and the Topic Maps standard, ISO/IEC 13250:2000.”]

Don and Michael Priestley can give the history of the development of DITA within IBM after I left at the end of ’93 but the result is apparent today: the DITA we know and love.

In the ensuing decade between ’93 and 2003 I became an editor of HyTime 2nd Edition and a founding member of the XML Working Group. I did a lot of client work developing custom SGML and XML vocabularies and tried to apply the same layered architectural model that we had defined at IBM. XML omitted the SGML features required for HyTime’s architectural forms mechanism (which is why DITA has the @class attribute it does) and the publication of the XML standard in 1997 made HyTime instantly obsolete (we published HyTime 2nd Edition in 1996, just in time for it to be completely ignored by most people, although its influence is still felt in newer applications, including DITA, XLink, TEI, JATS, and DocBook).
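For readers who have not looked closely at @class, here is a rough Python sketch of how a processor can use it to treat a specialized element as an instance of its base type. The sample is a simplified, hypothetical task fragment with the @class values written out explicitly (in real DITA they are normally supplied as attribute defaults by the grammar, and a real task has more structure), and the `is_a` helper is my own illustrative name, not part of any DITA toolkit.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment. Each @class value lists the element's
# ancestry as module/type tokens, most general first.
SAMPLE = """<task class="- topic/topic task/task ">
  <title class="- topic/title ">Install</title>
  <steps class="- topic/ol task/steps ">
    <step class="- topic/li task/step ">Unpack the archive.</step>
    <step class="- topic/li task/step ">Run the installer.</step>
  </steps>
</task>"""


def is_a(element: ET.Element, base: str) -> bool:
    """True if the element's @class lists `base` (e.g. 'topic/li')."""
    tokens = element.get("class", "").split()
    # tokens[0] is the leading '-' or '+' marker, not a type token.
    return base in tokens[1:]


root = ET.fromstring(SAMPLE)

# A processor that only knows the base vocabulary still finds the steps,
# because each <step> declares that it is a kind of topic/li.
list_items = [el.text for el in root.iter() if is_a(el, "topic/li")]
print(list_items)
```

Matching on whole tokens rather than on substrings keeps a type such as `topic/li` from accidentally matching a longer name that merely starts with the same characters.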

When Don approached me in 2000 or 2001 about this DITA standard thing he was starting I was very eager to participate because I saw it as a potential way to fully realize many of the ideas I’d been working with over the previous decade or so.

[This is the end of the original posting. Obviously there is lots more history here but I think this provides some insight into how DITA and DocBook came to be. Would definitely like to hear the DocBook side of this story as I'm sure I've either omitted important events or misrepresented important aspects.]
