Semantic Documents: the Relevance of LibreOffice and OpenDocument

I love LaTeX and nothing will ever completely replace it, but let us be realistic: although LaTeX’s goal is to create documents in which the focus is on logical structure and semantic content, in practice most people use markup and commands that pertain more to presentation than to structure. Using semantic commands exclusively in LaTeX is unintuitive and complicated. What is more, a LaTeX document’s semantic markup is not easily exported to other formats, at least with the included tools. Of course, there is TeX4ht, but so far I have had little success with this package because of my custom commands. There is also Pandoc, which is absolutely magical and wonderful, but it does not digest custom commands very well either.

It is much easier to create semantic documents with the OpenDocument format. LibreOffice Writer, in particular, presents three tools of choice to create semantic markup: styles, sections and text fields.

As I understand it, at the level of the XML code, character styles and text fields present themselves in much the same way. However, their purposes are different. Styles are potentially just as semantic as text fields, but the latter allow automatic generation of content and can also be updated automatically. It is also possible to automatically update the contents of styled text, but this process is technically more demanding and no simple way of doing it is provided by the application’s interface (it is necessary to use scripts). On the other hand, text fields are not meant to apply a given visual style to the text contents, though it is possible to carry out this task manually or through scripting. From the OpenDocument 1.2 Specification, part 1, section 7.2:

The presentation of information in a field is determined by the style applied to the field.
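To make the comparison between character styles and text fields concrete, here is a minimal Python sketch that parses a hand-written paragraph of OpenDocument XML. The style name "SpeciesName", the user field "author" and the sample sentence are hypothetical; only the element and attribute names (text:span, text:user-field-get, text:style-name, text:name) are taken from the OpenDocument format. It is an illustration of how similar the two constructs look in the markup, not a description of how LibreOffice processes them.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical paragraph: one styled span and one user field, side by side.
FRAGMENT = f"""
<text:p xmlns:text="{TEXT_NS}" text:style-name="Text_20_body">The species
<text:span text:style-name="SpeciesName">Quercus robur</text:span> was described by
<text:user-field-get text:name="author">Linnaeus</text:user-field-get>.</text:p>
"""


def inline_semantics(paragraph_xml):
    """List (element, label, text) for each inline child of a paragraph."""
    root = ET.fromstring(paragraph_xml.strip())
    found = []
    for child in root:
        local_name = child.tag.split("}")[1]
        # A span is labelled by its style, a field by its name.
        label = child.get(f"{{{TEXT_NS}}}style-name") or child.get(f"{{{TEXT_NS}}}name")
        found.append((local_name, label, child.text))
    return found


for item in inline_semantics(FRAGMENT):
    print(item)
```

Both constructs appear as inline elements wrapping a run of text and carrying a label, which is why either one can serve as semantic markup; the field simply adds the possibility of regenerating its content from a declaration.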

The relationship between paragraph styles and sections is not as straightforward. A section does not contain a single block-level element, but rather an arbitrary number of objects of any kind, thus offering an additional level of organization for the document. Just like text fields, sections can be updated automatically, notably from external sources. In some ways, sections function like LaTeX’s environments and \input commands.
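The following sketch, again in Python and again on a hand-written fragment, shows the two roles of a section: grouping several block-level elements (like an environment) and pointing to an external source (like \input). The section names and the file name glossary.odt are hypothetical; the elements text:section and text:section-source and the xlink:href attribute come from the OpenDocument format.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical document body with a plain section and a linked section.
SAMPLE = f"""<office:text xmlns:office="{OFFICE}" xmlns:text="{TEXT}" xmlns:xlink="{XLINK}">
  <text:section text:name="Abstract">
    <text:p>First paragraph.</text:p>
    <text:p>Second paragraph.</text:p>
  </text:section>
  <text:section text:name="Glossary">
    <text:section-source xlink:href="glossary.odt"/>
    <text:p>Cached copy of the linked content.</text:p>
  </text:section>
</office:text>"""


def describe_sections(xml_text):
    """Summarize each section: name, number of content blocks, link target."""
    root = ET.fromstring(xml_text)
    summary = []
    for section in root.iter(f"{{{TEXT}}}section"):
        source = section.find(f"{{{TEXT}}}section-source")
        # Everything except the optional source element counts as content.
        blocks = [child for child in section if child is not source]
        summary.append({
            "name": section.get(f"{{{TEXT}}}name"),
            "blocks": len(blocks),
            "linked_to": None if source is None else source.get(f"{{{XLINK}}}href"),
        })
    return summary


for entry in describe_sections(SAMPLE):
    print(entry)
```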

A LibreOffice ODT document contains style, text field and section definitions. By employing those three types of definitions, one can probably cover all semantic needs. The target formats (generally, HTML and PDF) preserve those properties with highly variable degrees of fidelity. The HTML format and, by extension, ePub, can definitely preserve all the semantic properties, but in most cases it is necessary to modify the generated HTML, chiefly in order to change the classes and identifiers of div and span elements. In most cases, one must also edit the related CSS. By combining intelligent styling and adequate definition of fields and sections, one can preserve everything in this target format.
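The clean-up of classes in the generated HTML can be scripted. The sketch below is one possible approach in Python, not a feature of LibreOffice: it rewrites class attributes according to a mapping from exported names to meaningful ones. The class names P1, T3 and T4 and their replacements are hypothetical, and the regular expression assumes double-quoted class attributes; a real export should be inspected first, and an HTML parser would be more robust for messy input.

```python
import re

# Matches double-quoted class attributes only (an assumption about the export).
CLASS_ATTR = re.compile(r'class="([^"]*)"')


def rename_classes(html, mapping):
    """Replace class names found in `mapping`; leave all others untouched."""
    def swap(match):
        names = match.group(1).split()
        renamed = " ".join(mapping.get(name, name) for name in names)
        return f'class="{renamed}"'
    return CLASS_ATTR.sub(swap, html)


exported = '<p class="P1">A <span class="T3 T4">b</span></p>'
semantic = rename_classes(exported, {"P1": "abstract", "T3": "species-name"})
print(semantic)
```

The same mapping can then be applied by hand to the selectors of the accompanying CSS, so that markup and stylesheet stay in agreement.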

In the case of the PDF format, almost everything is lost, except the fields of tables of contents, hyperlinks and form fields. The appearance and style of sections are preserved, but nothing allows one to visually distinguish custom fields to which no particular style has been applied or sections that have no special formatting. Obviously, because PDF is meant for printing or simple reading as opposed to editing or data mining, none of this is alarming; the important thing is to always keep the source ODT file. In all cases, one must treat this source file just like program source code: it is more important than the compiled program, since reproducing the program from the source code is easily done, while reproducing the source code from the program is nearly impossible, except perhaps for extremely simple programs with very little structure. The analogy is relevant for documents that are mostly textual, and the situation is the same with TeX source files and their corresponding targets (DVI, PostScript, PDF), insofar as TeX files are properly considered to be program source code.
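To illustrate why the ODT source remains valuable, here is a small Python sketch that reads the declared user fields straight out of a document. It relies on the fact that an ODT file is a ZIP archive whose text lives in content.xml. The demo archive built in memory contains only that one member and is not a complete, valid ODT; the field name "author" and its value are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def user_fields(odt_file):
    """Return {field name: string value} for the user fields declared in an ODT."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    fields = {}
    for decl in root.iter(f"{{{TEXT}}}user-field-decl"):
        fields[decl.get(f"{{{TEXT}}}name")] = decl.get(f"{{{OFFICE}}}string-value")
    return fields


def demo_archive():
    """Build a stand-in archive in memory (content.xml only, for illustration)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE}" xmlns:text="{TEXT}">'
        "<office:body><office:text>"
        "<text:user-field-decls>"
        '<text:user-field-decl office:value-type="string" '
        'office:string-value="Linnaeus" text:name="author"/>'
        "</text:user-field-decls>"
        "<text:p>Body text.</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


print(user_fields(demo_archive()))
```

With a real document, the path to the .odt file can be passed to user_fields in place of the in-memory buffer. Nothing comparable is possible with the exported PDF, where the field declarations no longer exist.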

The conclusion is simple: OpenDocument Text is the ideal starting format, while HTML, TeX and PDF are more properly target formats. Of course, HTML and TeX are also excellent starting formats, but ODT seems to be a relatively better choice for semantic and structured documents, depending on the producer of the document and their intention. In “real life”, most people work with a word processor and, notwithstanding the relevance of the many criticisms targeted at such pieces of software, the fact that they are useful and indispensable is undeniable. The average person will probably never want to learn TeX, nor write HTML. Visual editors are indispensable. The solution is not to hate them and to wish for their disappearance (this will never happen), but to improve them and to make sure that we use the best one available. In this day and age, that is LibreOffice (or, of course, OpenOffice, but I have developed a habit of using LibreOffice specifically). The format is excellent, the software is complete and surprising (in a good way), the API is fascinating. Admittedly, sometimes LibreOffice hurts, sometimes it disappoints, but one could say the same of all such visual editors and, actually, LibreOffice is probably the least aggravating of them all. In a perfect world, maybe books and Web documents would still be systematically produced by professionals and specialists, but that is a beautiful utopia. In our world, what matters most is to give everyone powerful and liberating tools to produce reliable, useful, accessible and efficient documents. Since almost everyone produces documents nowadays, is this not luminously evident?
