Subscribe to Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or innaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, date 9 Feb 2006, titled "All Tools Suck".

Wednesday, May 18, 2016

Delivering HTML from DITA in The Face of Reuse

Delivering HTML from DITA in The Face of Reuse of Topics in a Single Publication

In DITA you can use the same topic multiple times from the same map. For example, the same user interface component might be used by several different parts of a program and you want to include that topic in the descriptions for each of those parts. Or you might have a common installation topic that uses conditional content to apply to different operating systems.
In monolithic output formats like PDF and EPUB this reuse does not present any particular practical problems: because the rendered publication is a single linear flow of content, each use simply occurs in the flow and reflects whatever conditions are in effect at that point in the publication.
However, with multi-file output formats like HTML, there are several practical problems.
The most obvious problem is the "one result file or multiple result files?" question: When a topic is used multiple times, do you want to have just one result HTML file or do you want one result HTML for each use? The DITA Open Toolkit, through version 1.8.5, only generates a single result HTML file unless the map author specifies @copy-to on the topicrefs to the topic. (The @copy-to attribute specifies the effective source path and filename for the referenced topic so that the processor then treats that use of the topic as though it was a copy of the real topic with the specified filename.)
The "every page is page one" philosophy says you should have just one result HTML file for a given topic. Likewise, searching is usually more effective if there is just one HTML file, otherwise you end up getting multiple search results for the same content, which confuses users and makes it hard to know which version to use (and may throw off search ranking algorithms that take the number of copies of a file into account in some way).
On the other hand, if a user comes to a topic that is used in multiple places in the publication, how do they know which use they care about in their current access session?
For the re-used installation topic example, if it reflects multiple operating systems and there is only one copy, you would appear to be required to show all the operating system versions and use flagging to distinguish them. On the other hand, if you have one HTML file for each copy of the topic, each HTML file only reflecting a single operating system, a search on installation will find all the copies, making it hard for the user to choose the right one.
DITA 1.3 adds an important new feature, key scopes, which allows keys to have different values in different parts of the same map. This lets you reuse the same topic in different contexts and have content references, hyperlinks, and key-defined text strings resolve to different values in the different use contexts.
For the installation example, you could have three key scopes, one for each of the operating systems Windows, OSX, and Linux.
DITA 1.3 also adds the new branch filtering feature. With branch filtering you can apply different filtering conditions to branches within a single map. This lets you use the same topic in different parts of the map with different filtering conditions applied.
For the installation topic you can now have a single topic as authored with content that is conditional to each operating system and then have only the operating system for the branch filtered in.
It should be obvious that this must result in either three result HTML files, one reflecting each different set of filtering conditions, or a single HTML file constructed so that the browser can do the filtering dynamically, such as through different CSS files for the different filtering conditions or through Javascript or some combination.
This all means that, with DITA 1.3, DITA-to-HTML processors must handle multiple uses of the same topic in a sophisticated way. The OT 1.x approach of generating a single HTML result will not work. Likewise, the OT 2.x approach of always generating a new result file works (in that it ensures a correct result) but does not necessarily satisfy requirements for minimizing content duplication in the result.
So basically there is a fundamental conflict between ensuring correct content in the generated HTML when branch filtering and key scopes are in effect and satisfying the "every page is page one" philosophy.
If every use of a topic results in a new HTML file then searching is impaired but HTML generation is as simple as it can be. 
In the context of the Open Toolkit, branch filtering (and @copy-to) is applied to create new intermediate topic files and then those intermediate topics are filtered to produce another set of intermediate topics which are then the input to the normal HTML generation process. All the data processing complexity is in the preprocessing.
In order to produce a single result HTML file the processor has to determine, for the conditional content in a given topic, which content would be filtered out of all uses and which content would be filtered in for any use context and produce an intermediate topic that omits the globally-excluded elements but retains the elements included in any use. It also has to somehow record each use and how it relates to the included conditional elements so that the final HTML generation stage can retain that information in the generated HTML so that CSS or Javascript can act on it. For example, the processor might translate each unique set of filtering conditions into a single value included in the conditional element's @class values or it might embed some JSON data structure that establishes the map context the element was referenced in.
Given this kind of information in the generated HTML it would then be possible to have the browser dynamically show or hide specific elements based on the active conditions selected by the reader. By default the content could be flagged as it would be in the normal flagged output result produced by the normal Open Toolkit flagging preprocessing.
However, with this dynamically-filtered HTML file there's still the problem of the reader knowing what use context they want to view the topic in terms of.
For example, if you do a search and find the installation HTML page and open it you then have to decide which operating system you want to view it in terms of. 
How is this decision presented to the reader? 
How does the Web site track access in order to establish this use context automatically when it can?
And of course the situation could be much more complicated: there could be a number of conditions against which the content is filtered, e.g., operating system, hardware platform, region, product features active, etc.
I think this is a delivery challenge that the DITA community needs to address generally by both establishing best practices around content authoring and delivery, by implementing the DITA-to-HTML processing that supports generating these more-sophisticated HTML pages, and by implementing general CSS and Javascript libraries for use in DITA-based Web sites.

Labels: , , , , , ,