Jul 26 2012

There are some tools available to convert RTF files to text on Linux.



OpenSource Unoconv is a frontend written in Python for OpenOffice (“it needs a recent OpenOffice with UNO bindings”) to convert between many file formats.

Note: LibreOffice (a fork of OpenOffice), and probably OpenOffice itself, can also be invoked from the command line.

libreoffice --invisible --convert-to pdf file1.doc
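For batch jobs, the same call can be wrapped in a small script. What follows is only a minimal Python sketch, not a tested recipe: the helper names `libreoffice_command` and `convert` are made up for illustration, the `libreoffice` binary is assumed to be on the PATH, and `txt` as a target format is an assumption about the export filters of your installation. It reuses the flags from the command above and adds `--outdir` to choose the output directory.

```python
import subprocess
from pathlib import Path


def libreoffice_command(source, target_format="txt", outdir="."):
    """Build the argument list for a LibreOffice command-line conversion."""
    return [
        "libreoffice",
        "--invisible",
        "--convert-to", target_format,
        "--outdir", str(outdir),
        str(source),
    ]


def convert(source, target_format="txt", outdir="."):
    """Run the conversion and return the path where the result is expected."""
    subprocess.run(libreoffice_command(source, target_format, outdir), check=True)
    return Path(outdir) / (Path(source).stem + "." + target_format)


if __name__ == "__main__":
    # Prints the command only; call convert("file1.doc") to actually run it.
    print(" ".join(libreoffice_command("file1.doc")))
```

Building the argument list separately from running it makes it easy to check what would be executed before touching any files.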



OpenSource AbiWord can be used from the command line to export to HTML and text.

abiword --to=txt --to-name=output.txt myfile.doc



The OpenSource DocToText by Silvercoders is available for Windows (where it runs out of the box) and Linux, where some adjustments are necessary.

  • error: ./doctotext: error while loading shared libraries: libxlsreader.so.0: cannot open shared object file: No such file or directory
    • solution:

      ldconfig /path/to/directory/of/doctotext

      This registers the dynamically linked libraries that ship with DocToText with your system.

  • error: ./doctotext: error while loading shared libraries: libgsf-1.so.114: cannot open shared object file: No such file or directory
    • solution:

      apt-get install libgsf-bin

      This is for Debian / Ubuntu. Unfortunately, it also installs the X Window System.



DocSplit is an OpenSource project by the folks at DocumentCloud. It offers a wide array of conversion facilities, including OCR to UTF-8.

It will OCR the text of each page for which it fails to extract the text (using Tesseract as the backend for that).

It uses JODConverter, which in turn uses OpenOffice.

DocSplit is both a Ruby gem and a command-line tool.

“Because documents need to be in PDF format before any metadata, text, or images are extracted, it's faster to use docsplit pdf to convert it up front, if you're planning to run more than one extraction. Otherwise Docsplit will write out the PDF version to a temporary file before proceeding with each command.”



CatDoc reads Microsoft Word files and outputs text to the standard output.
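Because the text goes to standard output, it is easy to pick up from a script. Here is a minimal Python sketch under a few assumptions: `catdoc` is installed and on the PATH, its output can be decoded as UTF-8 (this depends on your locale and catdoc settings), and the function name `doc_to_text` is just an illustrative choice.

```python
import subprocess


def doc_to_text(path, tool="catdoc"):
    """Run a converter that writes text to stdout and return that text."""
    result = subprocess.run(
        [tool, str(path)],
        capture_output=True,  # collect stdout instead of printing it
        check=True,           # raise CalledProcessError if the tool fails
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as `doc_to_text("myfile.doc")` would then return the document text as a string. The `tool` parameter is only there so another program that takes a file name and prints text can be swapped in.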



PyODConverter is a Python script frontend to OpenOffice conversion. According to the author, it is meant as an easier command-line option than JODConverter.



JODConverter, the Java OpenDocument Converter, uses OpenOffice as its backend and also includes command-line tools. It is from the same author as PyODConverter. It is no longer maintained; the author would be happy for someone to fork it on GitHub.



AntiWord exists for a huge number of platforms. Unfortunately, it opens .doc documents only.


Apache Tika

OpenSource Apache Tika is a Java-based content analysis toolkit. It is not a ready-to-use program, though – it’s a toolkit for other software applications. It can be scripted with Python.



OpenSource wvWare reads Word formats. There are some tools for command-line usage, but the author recommends using AbiWord for conversion tasks. AbiWord uses the wvWare libraries internally to handle Word files.



GNU’s UnRTF converts RTF to HTML, which in turn can be converted into other formats.


RTF to HTML converter

This platform-independent tool converts an RTF file to HTML (in ISO-8859-2 encoding).


Sources: SuperUser question

Sep 14 2011

You want to use fixed positioning with CSS

If you want to use the simple CSS position:fixed; to keep an element static relative to the viewport, you will run into problems with Internet Explorer.

position:fixed is only supported since Internet Explorer 7

I tested this solution with IE 7.0.5730.13 on Windows XP. IE 6 does not support fixed positioning.

You need to enforce strict mode

Internet Explorer will default to "quirks mode" if you don't add special tags to your HTML document:

<!DOCTYPE html>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
<title>Progress Test</title>
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<script language="JavaScript" src="../head.min.js"></script>

Both the DOCTYPE and the X-UA-Compatible meta tag are essential! If you try IE 9 on the page without the X-UA-Compatible meta tag, it will render the page in quirks mode. And in quirks mode, it will ignore your position:fixed and render your element where you have put it in the flow of the document.
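If you maintain many pages, it can help to check for these two prerequisites automatically. The following is a rough Python sketch using only the standard library; the class and function names are made up for illustration. It only reports whether a DOCTYPE declaration and an X-UA-Compatible meta tag are present. It does not check where they appear in the document, and it cannot tell you which mode a given Internet Explorer version will actually pick.

```python
from html.parser import HTMLParser


class StandardsModeCheck(HTMLParser):
    """Record whether a document has a doctype and an X-UA-Compatible meta tag."""

    def __init__(self):
        super().__init__()
        self.has_doctype = False
        self.has_compat_meta = False

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>.
        if decl.lower().startswith("doctype"):
            self.has_doctype = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() == "x-ua-compatible":
            self.has_compat_meta = True


def check_document(html):
    """Return (has_doctype, has_compat_meta) for an HTML string."""
    checker = StandardsModeCheck()
    checker.feed(html)
    checker.close()
    return checker.has_doctype, checker.has_compat_meta
```

Calling `check_document(open("page.html").read())` on a page (the file name is a placeholder) returns two booleans, so a missing tag is easy to spot before testing in the browser.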

Please note: I have read that this meta tag actually is not valid for this DOCTYPE.

It still does not work? I am testing with local documents!

Internet Explorer also takes into account where the document is being served from. This solution will work if you upload your test documents to a server and access them over the Internet. It will not work (see note below) for local files, even if you use a local web server to serve the files (e.g. XAMPP). In this case, Internet Explorer will render your document in quirks mode.

Note: it actually does work offline (opening a document from the hard drive) in Internet Explorer 9 after fiddling somewhat with the document. I removed an IE compatibility script (not shown above) which I had previously included from Google. Check if you have one of those, and remove it!

Apr 03 2011

Lectora inserts anchors (<A name=""> elements) which sometimes can break things!

I have programmed exercises with a sliding animation from "page" to "page" of exercises using the Scrollable from jQuery Tools. It worked fine in an older Firefox, but apparently broke in the newest version. It never worked in Chrome, but it works in IE. The exercises would jump to the last page, skipping over everything in between.

After one hour of debugging I found out that an additional <A></A> element had been inserted in each "page" of the exercises. (I generate the pages programmatically using PHP.) Now all items show up … with an additional item (depending on the browser).

This additional item is an anchor inserted by Lectora. There's no use running code on page load to take it out again, as Lectora's code seems to run later. You have to run your cleanup code as soon as the user triggers an action, right at the beginning of it.

//fixing Lectora interfering with our code …
$("#exercise > .items > a").remove();

A simple fix for a complex problem … Lectora does require a lot of workarounds.

May 23 2010

There's the possibility to add an "external HTML object". If you choose "header scripting", your content is added to the <script> section of Lectora in the header. In most cases that's not what you want to achieve. There's another strange option, "Top of file scripting", which simply inserts your content just before the <!DOCTYPE> tag. The only use I can see for this is PHP scripting which needs to be done before any output is sent to the browser (e.g. header modification).

To include arbitrary content in the <HEAD> section of Lectora-generated HTML documents, choose the "META Tag" option. Unfortunately, this option does not allow you to include external .txt files, so you have to paste your code into Lectora's input window. (You would have needed to update the .txt file anyway, as Lectora stays with the initial .txt file version instead of reloading it from the hard disk every time. So it's not really a big loss that you can't use .txt files.)

And there IS a way to include a file's contents in the header. Just drop a PHP file somewhere on your server and include it using the following as a "meta tag":

<?php include('your/path/to/header.php');?>