Subscribe to Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Wednesday, May 18, 2016

Delivering HTML from DITA in The Face of Reuse

Delivering HTML from DITA in The Face of Reuse of Topics in a Single Publication

In DITA you can use the same topic multiple times from the same map. For example, the same user interface component might be used by several different parts of a program and you want to include that topic in the descriptions for each of those parts. Or you might have a common installation topic that uses conditional content to apply to different operating systems.
In monolithic output formats like PDF and EPUB this reuse does not present any particular practical problems: because the rendered publication is a single linear flow of content, each use simply occurs in the flow and reflects whatever conditions are in effect at that point in the publication.
However, with multi-file output formats like HTML, there are several practical problems.
The most obvious problem is the "one result file or multiple result files?" question: When a topic is used multiple times, do you want to have just one result HTML file or do you want one result HTML file for each use? The DITA Open Toolkit, through version 1.8.5, only generates a single result HTML file unless the map author specifies @copy-to on the topicrefs to the topic. (The @copy-to attribute specifies the effective source path and filename for the referenced topic so that the processor then treats that use of the topic as though it were a copy of the real topic with the specified filename.)
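As a rough sketch of that naming behavior (not the Open Toolkit's actual code), the following Python walks a simplified, hypothetical map and lists the result files it would produce. The map content and the `result_files` helper are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Hypothetical map: install.dita is used twice, the second time with @copy-to.
MAP = """
<map>
  <topicref href="overview.dita"/>
  <topicref href="install.dita"/>
  <topicref href="install.dita" copy-to="install-linux.dita"/>
</map>
"""

def result_files(map_xml):
    """Return the distinct HTML result files, in first-use order."""
    seen = []
    for ref in ET.fromstring(map_xml).iter("topicref"):
        # @copy-to, when present, is the effective source filename for this use.
        effective = ref.get("copy-to") or ref.get("href")
        html = str(PurePosixPath(effective).with_suffix(".html"))
        if html not in seen:
            seen.append(html)
    return seen

print(result_files(MAP))
```

Without the @copy-to attribute on the third topicref, the two references to install.dita would collapse into a single install.html.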
The "every page is page one" philosophy says you should have just one result HTML file for a given topic. Likewise, searching is usually more effective if there is just one HTML file, otherwise you end up getting multiple search results for the same content, which confuses users and makes it hard to know which version to use (and may throw off search ranking algorithms that take the number of copies of a file into account in some way).
On the other hand, if a user comes to a topic that is used in multiple places in the publication, how do they know which use they care about in their current access session?
For the re-used installation topic example, if it reflects multiple operating systems and there is only one copy, you would appear to be required to show all the operating system versions and use flagging to distinguish them. On the other hand, if you have one HTML file for each copy of the topic, each HTML file only reflecting a single operating system, a search on installation will find all the copies, making it hard for the user to choose the right one.
DITA 1.3 adds an important new feature, key scopes, which allows keys to have different values in different parts of the same map. This lets you reuse the same topic in different contexts and have content references, hyperlinks, and key-defined text strings resolve to different values in the different use contexts.
For the installation example, you could have three key scopes, one for each of the operating systems Windows, OSX, and Linux.
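To make the idea concrete, here is a minimal sketch in Python of how the same key name can resolve differently per scope. The map, the key name `os-name`, and the helper functions are all hypothetical, and the sketch deliberately ignores DITA's full key precedence rules and scope-qualified key names.

```python
import xml.etree.ElementTree as ET

# Hypothetical map: one branch per operating system, each with its own key scope.
MAP = """
<map>
  <topicgroup keyscope="windows">
    <keydef keys="os-name">
      <topicmeta><keywords><keyword>Windows</keyword></keywords></topicmeta>
    </keydef>
    <topicref href="install.dita"/>
  </topicgroup>
  <topicgroup keyscope="osx">
    <keydef keys="os-name">
      <topicmeta><keywords><keyword>OSX</keyword></keywords></topicmeta>
    </keydef>
    <topicref href="install.dita"/>
  </topicgroup>
  <topicgroup keyscope="linux">
    <keydef keys="os-name">
      <topicmeta><keywords><keyword>Linux</keyword></keywords></topicmeta>
    </keydef>
    <topicref href="install.dita"/>
  </topicgroup>
</map>
"""

def keys_by_scope(map_xml):
    """Collect, for each key scope, the key-to-text bindings defined inside it."""
    scopes = {}
    for branch in ET.fromstring(map_xml).findall("topicgroup[@keyscope]"):
        bindings = {}
        for keydef in branch.iter("keydef"):
            bindings[keydef.get("keys")] = keydef.findtext("topicmeta/keywords/keyword")
        scopes[branch.get("keyscope")] = bindings
    return scopes

def resolve(scopes, scope, key):
    """Resolve a key as seen from inside the named scope (simplified lookup)."""
    return scopes[scope][key]

scopes = keys_by_scope(MAP)
for scope in scopes:
    print(scope, "->", resolve(scopes, scope, "os-name"))
```

The single install.dita topic can then refer to the `os-name` key once and pick up a different value in each branch.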
DITA 1.3 also adds the new branch filtering feature. With branch filtering you can apply different filtering conditions to branches within a single map. This lets you use the same topic in different parts of the map with different filtering conditions applied.
For the installation topic you can now have a single topic as authored with content that is conditional to each operating system and then have only the operating system for the branch filtered in.
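The effect can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the topic content is made up, the `filter_topic` helper is hypothetical, and each branch's conditions are reduced to a single included @platform value rather than a real DITAVAL file.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical installation topic with per-platform conditional steps.
TOPIC = """
<task id="install">
  <title>Installing</title>
  <taskbody>
    <steps>
      <step platform="windows"><cmd>Run setup.exe.</cmd></step>
      <step platform="osx"><cmd>Open the disk image.</cmd></step>
      <step platform="linux"><cmd>Run install.sh.</cmd></step>
      <step><cmd>Restart the application.</cmd></step>
    </steps>
  </taskbody>
</task>
"""

def filter_topic(topic, include_platform):
    """Return a filtered copy: keep unconditional elements and those
    whose @platform lists the included platform."""
    result = copy.deepcopy(topic)
    for parent in list(result.iter()):
        for child in list(parent):
            platforms = (child.get("platform") or "").split()
            if platforms and include_platform not in platforms:
                parent.remove(child)
    return result

topic = ET.fromstring(TOPIC)
for platform in ("windows", "osx", "linux"):
    filtered = filter_topic(topic, platform)
    print(platform, [cmd.text for cmd in filtered.iter("cmd")])
```

Each branch ends up with its own effective version of the one authored topic, which is exactly what creates the result-file question for HTML.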
It should be obvious that this must result in either three result HTML files, one reflecting each different set of filtering conditions, or a single HTML file constructed so that the browser can do the filtering dynamically, such as through different CSS files for the different filtering conditions or through Javascript or some combination.
This all means that, with DITA 1.3, DITA-to-HTML processors must handle multiple uses of the same topic in a sophisticated way. The OT 1.x approach of generating a single HTML result will not work. Likewise, the OT 2.x approach of always generating a new result file works (in that it ensures a correct result) but does not necessarily satisfy requirements for minimizing content duplication in the result.
So basically there is a fundamental conflict between ensuring correct content in the generated HTML when branch filtering and key scopes are in effect and satisfying the "every page is page one" philosophy.
If every use of a topic results in a new HTML file then searching is impaired but HTML generation is as simple as it can be. 
In the context of the Open Toolkit, branch filtering (and @copy-to) is applied to create new intermediate topic files and then those intermediate topics are filtered to produce another set of intermediate topics which are then the input to the normal HTML generation process. All the data processing complexity is in the preprocessing.
In order to produce a single result HTML file the processor has to determine, for the conditional content in a given topic, which content would be filtered out of all uses and which content would be filtered in for any use context and produce an intermediate topic that omits the globally-excluded elements but retains the elements included in any use. It also has to somehow record each use and how it relates to the included conditional elements so that the final HTML generation stage can retain that information in the generated HTML so that CSS or Javascript can act on it. For example, the processor might translate each unique set of filtering conditions into a single value included in the conditional element's @class values or it might embed some JSON data structure that establishes the map context the element was referenced in.
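One way this merge step might look is sketched below in Python. Everything here is an assumption for illustration: the topic content, the `USES` table of use contexts, the `merge_uses` helper, and the `conditional` / `use-<name>` token convention. The sketch records the tokens in @outputclass on the understanding that HTML transforms conventionally carry that attribute through to the HTML class attribute, and it overwrites any existing @outputclass value for simplicity.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic fragment; the solaris step is not used by any branch.
TOPIC = """
<steps>
  <step platform="windows"><cmd>Run setup.exe.</cmd></step>
  <step platform="linux"><cmd>Run install.sh.</cmd></step>
  <step platform="solaris"><cmd>Run pkgadd.</cmd></step>
  <step><cmd>Restart the application.</cmd></step>
</steps>
"""

# Hypothetical use contexts: context name -> platforms filtered in for that use.
USES = {"win-branch": {"windows"}, "linux-branch": {"linux"}}

def merge_uses(topic_xml, uses):
    """Build one intermediate topic that omits content excluded from every use
    and tags the remaining conditional content with the uses that include it."""
    root = ET.fromstring(topic_xml)
    for parent in list(root.iter()):
        for child in list(parent):
            platforms = set((child.get("platform") or "").split())
            if not platforms:
                continue  # unconditional: present in every use
            active = sorted(name for name, included in uses.items()
                            if platforms & included)
            if not active:
                parent.remove(child)  # excluded from every use
            else:
                tokens = ["conditional"] + ["use-" + name for name in active]
                child.set("outputclass", " ".join(tokens))
    return root

merged = merge_uses(TOPIC, USES)
for step in merged.iter("step"):
    print(step.get("outputclass"), "|", step.findtext("cmd"))
```

An embedded JSON structure would serve the same purpose; class tokens simply have the advantage that plain CSS selectors can act on them.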
Given this kind of information in the generated HTML it would then be possible to have the browser dynamically show or hide specific elements based on the active conditions selected by the reader. By default the content could be flagged as it would be in the normal flagged output result produced by the normal Open Toolkit flagging preprocessing.
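As a sketch of the CSS side, the Python below generates one hiding rule per use context. It assumes a purely hypothetical convention: every conditional element carries the class `conditional` plus one `use-<name>` class per use context that includes it, and the page sets a `data-use` attribute on the body element once the reader's context is known. Neither convention comes from any existing toolkit.

```python
def visibility_css(use_names):
    """Return CSS that, for the active use context, hides conditional
    elements not tagged for that context."""
    rules = []
    for name in sorted(use_names):
        rules.append(
            f'body[data-use="{name}"] .conditional:not(.use-{name}) {{ display: none; }}'
        )
    return "\n".join(rules)

print(visibility_css(["win-branch", "linux-branch"]))
```

When no `data-use` value is set, none of the rules match, so all of the retained content stays visible and the flagged presentation remains the default.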
However, with this dynamically-filtered HTML file there's still the problem of the reader knowing which use context they want to view the topic in.
For example, if you do a search and find the installation HTML page and open it, you then have to decide which operating system you want to view it for.
How is this decision presented to the reader? 
How does the Web site track access in order to establish this use context automatically when it can?
And of course the situation could be much more complicated: there could be a number of conditions against which the content is filtered, e.g., operating system, hardware platform, region, product features active, etc.
I think this is a delivery challenge that the DITA community needs to address generally by establishing best practices around content authoring and delivery, by implementing the DITA-to-HTML processing that supports generating these more-sophisticated HTML pages, and by implementing general CSS and Javascript libraries for use in DITA-based Web sites.


3 Comments:

Blogger Don R. Day said...

The expeDITA approach is to provide a shorthand RESTful code for "topic in map context". Normally a resource name is preceded by its type, since things of a type are a collection (using the REST terminology). Hence:

- topic/ means a list of all topics
- map/ means a list of all maps
- topic/about designates a single topic resource: about.dita
- map/interocitor designates map resource: interocitor.ditamap
- interocitor/ means a list of all topics in the named collection "interocitor{.ditamap}"
- interocitor/about designates the about.dita resource in the interocitor.ditamap context

These addresses say nothing about how a map is to be processed. I have a default that maps are generated as navigators, but it is also possible that a user wants to see a map as an aggregate, or perhaps a map as aggregated at the first level (all first level topicrefs resolve to content) with links to branches from that level down, or any other imaginable rendition. I don't think the address has to provide the details on visualization, just a pointer to the node from which that rendition happens.

Obviously, a full map may be too much context for a single deliverable. I interpret EPPO as being to a thesis, not to the scope of a single topic. As an author, you've said enough when you assess that a user will be happy that their search got them the one scope of content that answers their curiosity. From a DITA perspective, I think this involves an analysis of what level of branching constitutes the appropriate scope for a self-sufficient EPPO page.

DITA offers several approaches. A compound topic is the most obvious way to author a nested thesis of appropriate internal structure. This can still be reused in other builds, but requires an eyes open awareness of how to reuse components at an internal level rather than at file level. Not a hard problem, just planning. The other approach is to craft submaps at that same level of scope and designate the submap as the resource that is processed into a single file (or other display variant as I mentioned before). An intriguing approach is to render the structure as an outline and let the user pick the branch they want to be dynamically rendered from that point into a single result.

On top of all this is that we can apply conditionality to the rendition based on user preferences or other context (how about a JS activity monitor that suggests a user's possible skill level, which modifies a transform's progressive disclosure policy). I think this suggests that the aggregation happens first, and then the conditional rendition is applied. And yes, this means that a single web address may produce subtly different renditions depending on who is asking, as opposed to pre-building different deliverables at different addresses, meaning that search facets must ensure that a user is steered to the right rendition (vs rendering as needed on request, applying 'facets as conditions' on the fly).

There is no single right answer, as with all things in life. I favor the single address that responds to a usage context to provide the appropriate render-on-demand metadata for conditionality. I don't think any of the scoping approaches I mentioned is right or wrong other than that it provides the behavior expected by the user in the manner they are requesting it from (casual search, out of an application's F1 contextual help, via a faceted search interface in an infocenter, whatever).

12:48 PM  
Blogger Unknown said...

Sometimes the correct way to do this is subject-matter specific.

For an example of how this can play out in practice, have a look at the PTC InService product. This allows one to utilise heavily componentised content (e.g. S1000D) but provides a viewing experience with search/filter/etc. built in, and is in active use by heavy equipment industries (mining, agriculture, airlines, defense, etc.). Another example is the Autodata product set from the automotive industry. This allows you to search, filter and view service information, procedures and data for any on-road vehicle. I guess my point is that careful consideration should be given to the various use-cases if looking for a generic solution to this problem.

A complicating factor on top of all this is multi-lingual information sets, but I think that is starting to stray from your original topic, but worthy of consideration in the proposed solution space.

6:30 PM  
Blogger Unknown said...

Very good article, and I fully agree with everything you have written.

1:15 AM  
