catdoc(1)

NAME

   catdoc - reads an MS-Word file and writes its content as plain text to
   standard output

SYNOPSIS

   catdoc [-vlu8btawxV] [-m number] [ -s  charset]  [  -d  charset]  [  -f
   output-format] file

DESCRIPTION

   catdoc behaves much like cat(1), but it reads an MS-Word file and
   produces human-readable text on standard output. Optionally it can use
   latex(1) escape sequences for characters that have special meaning in
   LaTeX. It also makes some effort to recognize MS-Word tables, although
   it never tries to write correct headers for the LaTeX tabular
   environment. Additional output formats, such as HTML, can be easily
   defined.

   catdoc doesn't attempt to extract formatting information other than
   tables from an MS-Word document, so the output modes differ mainly in
   which characters are escaped and in how characters missing from the
   output charset are represented. See CHARACTER SUBSTITUTION below.

   catdoc uses an internal unicode(4) representation of text, so it is
   able to convert text when the charset of the source document doesn't
   match the charset of the target system. See CHARACTER SETS below.

   If no file names are supplied, catdoc processes its standard input
   unless it is a terminal. It is unlikely that somebody would type a Word
   document from the keyboard, so if catdoc is invoked without arguments
   and stdin is not redirected, it prints a brief usage message and exits.
   Processing of standard input (even among other files) can be forced by
   using a dash '-' as a file name.

   By default, catdoc wraps lines that are more than 72 characters long
   and separates paragraphs by blank lines. This behavior can be turned
   off with the -w switch. In wide mode catdoc prints each paragraph as
   one long line, suitable for import into word processors that perform
   word wrapping.
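
   The wrapping behavior described above can be approximated with a short
   Python sketch. This is not catdoc's own code: the function name is
   hypothetical, Python's textwrap module stands in for catdoc's wrapping
   logic, and joining wide-mode paragraphs with a single newline is an
   assumption.

```python
import textwrap


def wrap_paragraphs(paragraphs, margin=72):
    """Illustrative only: wrap paragraphs roughly the way the page describes.

    margin=0 mimics wide mode (each paragraph stays one long line).
    Assumption: wide-mode paragraphs are joined by a single newline.
    """
    if margin == 0:
        return "\n".join(paragraphs)
    # Wrapped mode: lines no longer than `margin`, blank line between paragraphs.
    return "\n\n".join(textwrap.fill(p, width=margin) for p in paragraphs)
```

   Calling wrap_paragraphs(paras, margin=0) corresponds to the effect the
   page attributes to the -w switch.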

OPTIONS

   -a      - shortcut  for  -f  ascii.  Produces  ASCII  text  as  output.
           Separates table columns with TAB

   -b      - process broken MS-Word file. Normally, catdoc checks whether
           the first 8 bytes of the file are the Microsoft OLE signature.
           If so, it processes the file; otherwise it just copies it to
           stdout. This is intended for using catdoc as a filter for
           viewing all files with a .doc extension.

   -dcharset
           - specifies destination charset name. The charset file has the
           format described in CHARACTER SETS below, should have a .txt
           extension, and should reside in the catdoc library directory
           (${prefix}/lib/x86_64-linux-gnu/catdoc). By default, the current
           locale charset is used if langinfo support is compiled in.

   -fformat
           -  specifies  output   format   as   described   in   CHARACTER
           SUBSTITUTION  below.   catdoc  comes  with two output formats -
           ascii and tex. You can add your own if you wish.

   -l      Causes catdoc to list names of available charsets to the stdout
           and exit successfully.

   -mnumber
           Specifies right margin for text (default 72). -m 0 is
           equivalent to -w.

   -scharset
           Specifies source charset (the one used in the Word document),
           for Word documents that don't contain UTF-16 text. When reading
           RTF documents it is typically not necessary, because RTF
           documents contain an ansicpg specification. But that can be set
           wrongly by Word (I've seen RTF documents in Russian where
           cp1252 was specified). In this case this option takes
           precedence over the charset specified in the document. The
           source_charset statement in the configuration file, however,
           has lower priority than the charset in the document.

   -t      - shortcut for -f tex. Converts all printable chars that have
           special meaning for LaTeX(1) into appropriate control
           sequences. Separates table columns by &.

   -u      - declares that the Word document contains a UNICODE (UTF-16)
           representation of text (as some Word-97 documents do). If
           catdoc fails to convert a Word document correctly with the
           default charset, try this option.

   -8      - declares that the Word document is 8-bit. Use it in case
           catdoc recognizes the file format incorrectly.

   -w      disables word wrapping. By default catdoc output is split into
           lines not longer than 72 (or the number specified by the -m
           option) characters, and paragraphs are separated by a blank
           line. With this option each paragraph is one long line.

   -x      causes catdoc to output unknown UNICODE  character  as  \xNNNN,
           instead of question marks.

   -v      causes  catdoc  to  print  some  useless information about word
           document structure to stdout before actual start of text.

   -V      outputs catdoc version
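
   The signature test mentioned under -b can be sketched in Python. The
   eight bytes below are the header signature of the OLE2 compound file
   format; the function name is hypothetical, and the sketch shows only
   the check itself, not what catdoc does afterwards.

```python
# Header signature of an OLE2 compound file (the container used by .doc files).
OLE_SIGNATURE = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])


def looks_like_ole(data: bytes) -> bool:
    """Return True if the first 8 bytes match the OLE signature."""
    return data[:8] == OLE_SIGNATURE
```

   A caller would read the first 8 bytes of the file in binary mode and
   pass them to looks_like_ole().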

CHARACTER SETS

   When processing an MS-Word file, catdoc uses information about two
   character sets, typically different: input and output. They are stored
   in plain text files in the catdoc library directory. Each line of a
   character set file should contain two whitespace-separated hexadecimal
   numbers: the 8-bit code in the character set and the 16-bit Unicode
   code. Anything from a hash mark to the end of the line is ignored, as
   are blank lines.
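
   As an illustration of the file format just described, the following
   Python sketch parses charset-file text into a lookup table. The
   function name is hypothetical and the parser is a reading of the
   description above, not catdoc's own implementation.

```python
def parse_charset(text):
    """Map each 8-bit code to its Unicode code point.

    Everything after '#' is ignored, as are blank lines.
    Lines with fewer than two fields are skipped (an assumption).
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # int(..., 16) accepts hexadecimal with or without a 0x prefix.
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table
```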

   catdoc  distribution  includes some of these character sets. Additional
   character set definitions, directly usable by catdoc  can  be  obtained
   from  ftp.unicode.org.  Charset files have .txt suffix, which shouldn't
   be specified in command-line or configuration files.

   Note that catdoc is distributed with Cyrillic charsets as the default.
   If you are not Russian, you probably don't want that, and should
   reconfigure catdoc at compile time or in the runtime configuration file.

   When dealing with documents in charsets other than the default,
   remember that Microsoft never uses ISO charsets. While letters in, say,
   cp1252 are at the same positions as in ISO-8859-1, some punctuation
   signs would be lost if you specified ISO-8859-1 as the input charset.
   If you use cp1252, catdoc deals with those signs as described in
   CHARACTER SUBSTITUTION below.

CHARACTER SUBSTITUTION

   catdoc converts an MS-Word file into the following internal Unicode
   representation:

   1. Paragraphs are separated by ASCII Line Feed symbol (0x000A)

   2. Table cells within row are separated by ASCII Field Separator symbol
       (0x001C)

   3. Table rows are separated by ASCII Record Separator (0x001E)

   4. All printable characters, including whitespace, are represented with
       their respective UNICODE codes.
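
   The separator scheme above can be illustrated with a small Python
   sketch that splits a run of table text into rows and cells. The names
   are hypothetical, and dropping empty rows is an assumption.

```python
PARAGRAPH_SEP = "\u000a"  # ASCII Line Feed
CELL_SEP = "\u001c"       # ASCII Field Separator
ROW_SEP = "\u001e"        # ASCII Record Separator


def split_rows(text):
    """Split table text into a list of rows, each a list of cell strings."""
    return [row.split(CELL_SEP) for row in text.split(ROW_SEP) if row]


def rows_to_tsv(rows):
    """Render rows with TAB between columns, one row per line."""
    return "\n".join("\t".join(cells) for cells in rows)
```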

   This  UNICODE  representation is subsequently converted into 8-bit text
   in target character set using following four-step algorithm:

   1. List of special characters is searched for given Unicode character.
       If found,  then  appropriate  multi-character  sequence  is  output
       instead of character.

   2. If there is an equivalent in target character set, it is output.

   3. Otherwise, the replacement list is searched and, if there is a
       multi-character substitution for this UNICODE char, it is output.

   4. If all above fails, "Unknown char" symbol (question mark) is output.
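
   The four steps above can be sketched in Python as follows. The
   function and parameter names are hypothetical: special and replacement
   map a Unicode code point to the byte string to emit, and target maps a
   Unicode code point to its 8-bit code in the target character set.

```python
def convert_char(code, special, target, replacement, unknown=b"?"):
    """Convert one Unicode code point to 8-bit output."""
    if code in special:          # step 1: special characters are escaped
        return special[code]
    if code in target:           # step 2: direct equivalent in target charset
        return bytes([target[code]])
    if code in replacement:      # step 3: multi-character substitution
        return replacement[code]
    return unknown               # step 4: "unknown char" symbol


def convert(text, special, target, replacement):
    """Convert a whole string, one character at a time."""
    return b"".join(
        convert_char(ord(ch), special, target, replacement) for ch in text
    )
```

   Note that step 1 runs before step 2, so a character such as & is
   escaped even though it exists in the target character set.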

   The lists of special characters and of substitutions are independent
   of the character set: special chars should be escaped regardless of
   whether they exist in the target character set (usually they are part
   of US-ASCII, and therefore exist in any character set), and the
   replacement list is searched only for those characters that are not
   found in the target character set.

   These lists are stored in the catdoc library directory in files whose
   names are prefixed with the format name. These files have the following
   format:

   Each line can be either a comment (starting with a hash mark) or
   contain a hexadecimal UNICODE value, separated by whitespace from the
   string that is substituted for it. If the string contains no
   whitespace it can be used as is; otherwise it should be enclosed in
   single or double quotes. The usual backslash sequences such as '\n' and
   '\t' can be used in these strings.
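
   A line of such a map file could be parsed along these lines. This is a
   Python sketch with hypothetical names; the set of recognized backslash
   sequences and the choice to keep unrecognized ones verbatim are
   assumptions, not documented catdoc behavior.

```python
# Assumed set of escapes; the page only names '\n' and '\t' explicitly.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\"}


def unescape(s):
    """Expand known backslash sequences; keep unknown ones unchanged."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(ESCAPES.get(s[i + 1], s[i:i + 2]))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def parse_map_line(line):
    """Return (code point, substitution string), or None for non-entries."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    parts = stripped.split(None, 1)
    if len(parts) < 2:
        return None
    code, value = parts[0], parts[1].strip()
    # Strings containing whitespace are enclosed in single or double quotes.
    if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
        value = value[1:-1]
    return int(code, 16), unescape(value)
```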

RUNTIME CONFIGURATION

   Upon startup catdoc reads its system-wide configuration file (catdocrc
   in the catdoc library directory) and then the user-specific
   configuration file ${HOME}/.catdocrc.

   These files can contain following directives:

   source_charset = charset-name
           Sets the default source charset, which is used if no -s option
           is specified. Consult the configuration of a nearby Windows
           workstation to find the one you need.

   target_charset = charset-name
           Sets the default output charset. You probably know which one
           you use.

   charset_path = directory-list
           Colon-separated list of directories that are searched for
           charset files. This allows you to install additional charsets
           in your home directory. If the first directory component of a
           path is ~, it is replaced by the contents of the HOME
           environment variable. On the MS-DOS platform, if a directory
           name starts with %s, it is replaced with the directory of the
           executable file. An empty element in the list (i.e. two
           consecutive colons) is considered the current directory.

   map_path = directory-list
           colon-separated  list  of  directories,  which are searched for
           special character map and replacement map.   Same  substitution
           rules as in charset_path are applied.

   format = format name
           Output  format  which  would  be used by default.  catdoc comes
           with two formats - ascii and tex but nothing prevents you  from
           writing  your own format (set two map files - special character
           map and replacement map).

   unknown_char = character specification
           Sets the character to output instead of an unknown Unicode
           character (default '?'). The character specification can take
           one of two forms: a character enclosed in single quotes or a
           hexadecimal code.

   use_locale = (yes|no)
           Enables or disables automatic selection of the output charset
           (default yes), based on system locale settings (if enabled at
           compile time). If automatic detection is enabled, then output
           charset settings in the configuration files (but not on the
           command line) are ignored, and the current system locale
           charset is used instead. There is no automatic choice of input
           charset based on the locale language, because most modern Word
           files (since Word 97) are Unicode anyway.
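
   The substitution rules given for charset_path and map_path can be
   sketched in Python. The function name is hypothetical, the MS-DOS %s
   rule is left out, and the sketch covers only the ~ and empty-element
   rules stated above.

```python
import os


def expand_path_list(value, home=None):
    """Expand a colon-separated directory list into a list of directories."""
    if home is None:
        home = os.environ.get("HOME", "")
    dirs = []
    for item in value.split(":"):
        if item == "":
            # An empty element (two consecutive colons) is the current directory.
            dirs.append(".")
        elif item == "~" or item.startswith("~/"):
            # A leading ~ component is replaced by the contents of HOME.
            dirs.append(home + item[1:])
        else:
            dirs.append(item)
    return dirs
```

   For example, with HOME set to /home/u, the value
   "~/charsets::/usr/share/catdoc" expands to /home/u/charsets, the
   current directory, and /usr/share/catdoc, in that order.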

BUGS

   Doesn't  handle  fast-saves  properly.  Prints  footnotes  as  separate
   paragraphs at the end of  file,  instead  of  producing  correct  LaTeX
   commands.  Cannot distinguish between empty table cell and end of table
   row.

SEE ALSO

   xls2csv(1), catppt(1), cat(1), strings(1), utf(4), unicode(4)

AUTHOR

   V.B.Wagner <[email protected]>


