There are some tools available to convert RTF files to text on Linux.
Unoconv
Unoconv is an open-source frontend for OpenOffice, written in Python, that converts between many file formats (“it needs a recent OpenOffice with UNO bindings”).
Note: LibreOffice (a fork of OpenOffice), and probably OpenOffice itself, can also be invoked directly from the command line.
libreoffice --invisible --convert-to pdf file1.doc
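The command above produces a PDF. For plain text, a minimal sketch could look like the following. It assumes the binary is named libreoffice (some systems only ship soffice) and that the txt:Text export filter is available. The function names expected_txt and rtf_to_txt are made up for this example.

```shell
# Path of the text file LibreOffice is expected to write for a given input:
# same base name, .txt extension, inside the output directory.
expected_txt() {
    base=$(basename "$1")
    printf '%s/%s.txt\n' "${2:-.}" "${base%.*}"
}

# Convert one document to plain text and print the resulting path.
# Assumption: a binary called "libreoffice" is on the PATH.
rtf_to_txt() {
    command -v libreoffice >/dev/null 2>&1 || {
        echo "libreoffice not found" >&2
        return 127
    }
    libreoffice --headless --convert-to txt:Text --outdir "${2:-.}" "$1" >/dev/null &&
        expected_txt "$1" "${2:-.}"
}
```

Calling rtf_to_txt file1.rtf out should leave the text in out/file1.txt, provided LibreOffice accepts the input format.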
Abiword
OpenSource AbiWord can be used from the command line to export to HTML and text.
abiword --to=txt --to-name=output.txt myfile.doc
DocToText
The OpenSource DocToText by Silvercoders is available for Windows (where it runs out of the box) and Linux, where some adjustments are necessary.
- error: ./doctotext: error while loading shared libraries: libxlsreader.so.0: cannot open shared object file: No such file or directory
- solution: add the dynamically linked libraries that ship with DocToText to your system's library search path:
ldconfig /path/to/directory/of/doctotext
- error: ./doctotext: error while loading shared libraries: libgsf-1.so.114: cannot open shared object file: No such file or directory
- solution: under Debian / Ubuntu, install the missing library. Unfortunately, this also installs the X Window System:
apt-get install libgsf-bin
DocSplit
DocSplit is an open-source project by the folks at DocumentCloud. It offers a wide array of conversion facilities, including OCR to UTF-8.
It will OCR every page from which it fails to extract text directly, using Tesseract as the backend.
It uses JODConverter, which in turn uses OpenOffice.
DocSplit is both a Ruby gem and a command-line tool.
“Because documents need to be in PDF format before any metadata, text, or images are extracted, it's faster to use docsplit pdf to convert it up front, if you're planning to run more than one extraction. Otherwise Docsplit will write out the PDF version to a temporary file before proceeding with each command.”
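The advice quoted above can be sketched as a small shell function that converts to PDF once and then runs each extraction against that PDF. This is only an illustration. The function name extract_all is invented, and the sketch assumes that docsplit pdf writes a file with the same base name and a .pdf extension into the directory given with -o.

```shell
# Convert to PDF up front, then reuse that PDF for every extraction,
# so docsplit does not have to create a temporary PDF for each command.
# Assumption: "docsplit pdf" writes <basename>.pdf into the -o directory.
extract_all() {
    src=$1
    outdir=${2:-.}
    base=$(basename "$src")
    pdf="$outdir/${base%.*}.pdf"
    docsplit pdf -o "$outdir" "$src" &&
    docsplit text -o "$outdir" "$pdf" &&
    docsplit length "$pdf"
}
```

For example, extract_all myfile.doc out would run the three docsplit commands in order and stop at the first one that fails.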
CatDoc
CatDoc reads Microsoft Word files and outputs text to the standard output.
PyODConverter
A Python script that acts as a frontend to OpenOffice conversion. According to the author, it is meant as an easier command-line option than JODConverter.
JODConverter
Java OpenDocument Converter, which uses OpenOffice as its backend. It also includes command-line tools and is from the same author as PyODConverter. It is no longer maintained; the author would be happy for someone to fork it on GitHub.
AntiWord
AntiWord exists for a huge number of platforms; unfortunately, it opens .doc documents only.
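Since AntiWord and CatDoc both read .doc files and print text to standard output, a script can simply use whichever one is installed. The following is a rough sketch. The helper names first_available and doc_to_txt are hypothetical, and the output file name is derived by swapping the extension for .txt.

```shell
# Print the name of the first command from the argument list that is installed.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Convert a .doc file to text with whichever reader is present,
# writing the result next to the input file.
doc_to_txt() {
    tool=$(first_available antiword catdoc) || {
        echo "no .doc reader found" >&2
        return 127
    }
    "$tool" "$1" > "${1%.*}.txt"
}
```

With this in place, doc_to_txt myfile.doc would write myfile.txt using AntiWord if it is present and CatDoc otherwise.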
Apache Tika
Apache Tika is an open-source, Java-based content analysis toolkit. It is not a ready-to-use program, though; it is a toolkit for other software applications. It can be scripted with Python.
wvWare
The open-source wvWare reads Word formats. It includes some tools for command-line usage, but the author recommends using AbiWord for conversion tasks. AbiWord uses the wvWare libraries internally to handle Word files.
UnRTF
GNU’s UnRTF converts RTF to HTML, which in turn can be converted into other formats.
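Besides HTML, UnRTF can also write plain text directly when called with --text. A batch conversion of a whole directory might look like the sketch below. It assumes the installed unrtf supports --text, and the function name rtf_dir_to_txt is made up. Note that UnRTF may prepend a few header lines to its text output.

```shell
# Convert every .rtf file in a directory to plain text next to the original.
# Assumption: the installed unrtf accepts the --text option.
rtf_dir_to_txt() {
    dir=${1:-.}
    for f in "$dir"/*.rtf; do
        [ -e "$f" ] || continue    # the glob matched nothing
        unrtf --text "$f" > "${f%.rtf}.txt" || echo "failed: $f" >&2
    done
}
```

Running rtf_dir_to_txt /path/to/documents leaves one .txt file beside each .rtf file and reports any file that UnRTF could not convert.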
RTF to HTML converter
This platform-independent tool converts RTF files to HTML (in ISO-8859-2 encoding).
Sources: SuperUser question