dbacl(1)

NAME

   dbacl - a digramic Bayesian classifier for text recognition.

SYNOPSIS

   dbacl [-01dvnirmwMNDXW] [-T type ] -l category [-h size] [-H gsize] [-x
          decim] [-q quality] [-w max_order] [-e deftok] [-o  online]  [-L
          measure] [-g regex]...  [FILE]...

   dbacl  [-vnimNRX] [-h size] [-T type] -c category [-c category]...  [-f
          keep]...  [FILE]...

   dbacl -V

OVERVIEW

   dbacl is a Bayesian text  and  email  classifier.  When  using  the  -l
   switch, it learns a body of text and produces a file named category
   which summarizes the text. When using the -c  switch,  it  compares  an
   input  text  stream  with any number of category files, and outputs the
   name of the closest  match,  or  optionally  various  numerical  scores
   explained below.

   Whereas  this manual page is intended as a reference, there are several
   tutorials and documents you can read to  get  specialized  information.
   Specific  documentation  about  the design of dbacl and the statistical
   models that it uses can be found in dbacl.ps.  For a basic overview  of
   text   classification  using  dbacl,  see  tutorial.html.  A  companion
   tutorial geared towards email filtering  is  email.html.  If  you  have
   trouble  getting  dbacl  to classify reliably, read is_it_working.html.
   The USAGE section of this manual page also has some examples.

   /usr/share/doc/dbacl/dbacl.ps

   /usr/share/doc/dbacl/tutorial.html

   /usr/share/doc/dbacl/email.html

   /usr/share/doc/dbacl/is_it_working.html

   dbacl uses  a  maximum  entropy  (minimum  divergence)  language  model
   constructed  with  respect  to  a  digramic  reference measure (unknown
   tokens are predicted from digrams, i.e. pairs of letters). Practically,
   this  means  that a category is constructed from tokens in the training
   set, while previously unseen tokens can be predicted automatically from
   their  letters.  A  token  here  is  either  a  word  (fragment)  or  a
   combination  of  words  (fragments),  selected  according  to   various
   switches.  Learning roughly works by tweaking token probabilities until
   the training data is least surprising.

EXIT STATUS

   The normal shell exit conventions aren't followed (sorry!). When  using
   the -l command form, dbacl returns zero on success, nonzero if an error
   occurs. When using the  -c  form,  dbacl  returns  a  positive  integer
   corresponding  to  the category with the highest posterior probability.
   In case of a tie, the first most probable category  is  chosen.  If  an
   error occurs, dbacl returns zero.

DESCRIPTION

   When  using the -l command form, dbacl learns a category when given one
   or more FILE names, which should contain readable  ASCII  text.  If  no
   FILE  is  given, dbacl learns from STDIN. If FILE is a directory, it is
   opened and all its files are read,  but  not  its  subdirectories.  The
   result  is  saved  in  the  binary  file named category, and completely
   replaces any previous contents. As a convenience,  if  the  environment
   variable DBACL_PATH contains a directory, then that is prepended to the
   file path, unless category starts with a '/' or a '.'.

   The input text for learning is assumed to be unstructured plain text by
   default.  This  is  not  suitable  for  learning  email,  because email
   contains various transport encodings and formatting instructions  which
   can  reduce classification effectiveness. You must use the -T switch in
   that case so that dbacl knows it should perform decoding and  filtering
   of MIME and HTML as appropriate. Appropriate switch values are "-T
   email" for RFC2822 email input, "-T html" for HTML input, "-T xml" for
   generic XML style input, and "-T text" for the default plain text
   format.
   There are other values of the -T switch that also allow fine tuning  of
   the decoding capabilities.
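
   For example, to learn a category from a mailbox in mbox format (a
   sketch; file and category names are illustrative):

   % dbacl -T email -l spam $HOME/mail/stored_spam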

   When  using  the  -c  command form, dbacl attempts to classify the text
   found in FILE, or STDIN if no FILE is  given.  Each  possible  category
   must  be  given separately, and should be the file name of a previously
   learned text corpus. As  a  convenience,  if  the  variable  DBACL_PATH
   contains  a  directory, it is prepended to each file path which doesn't
   start with a '/' or a '.'. The visible  output  of  the  classification
   depends  on  the  combination  of  extra switches used. If no switch is
   used, then no output is shown on STDOUT. However, dbacl always produces
   an exit code which can be tested.
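
   For instance, assuming two categories were learned earlier into
   $HOME/.dbacl (a sketch; names are illustrative):

   % DBACL_PATH=$HOME/.dbacl dbacl -v -c spam -c notspam message.txt
   notspam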

   To see any output for a classification, you must use at least one of
   the -v, -U, -n, -N, -D, -d switches. Some of these can be combined to
   produce natural variations of their individual outputs. Where
   applicable, dbacl also prints warnings on STDERR.

   The -v switch outputs the name of  the  best  category  among  all  the
   choices given.

   The  -U  switch  outputs  the  name  of the best category followed by a
   confidence percentage. Normally, this is the switch that  you  want  to
   use. A percentage of 100% means that dbacl is sure of its choice, while
   a percentage of 0% means that some other category  is  equally  likely.
   This  is  not  the  model probability, but measures how unambiguous the
   classification is, and can be used to tag unsure classifications  (e.g.
   if the confidence is 25% or less).

   The  -N  switch  prints  each category name followed by its (posterior)
   probability, expressed as a percentage. The percentages always  sum  to
   100%.  This  is  intuitive,  but  only  valuable  if the document being
   classified contains a handful of tokens (ten or fewer). In the common
   case  with  many  more  tokens,  the probabilities are always extremely
   close to 100% and 0%.

   The -n switch prints  each  category  name  followed  by  the  negative
   logarithm  of  its  probability.  This  is  equivalent  to using the -N
   switch, but much more  useful.  The  smallest  number  gives  the  best
   category.  A more convenient form is to use both -n and -v which prints
   each category name followed by the cross  entropy  and  the  number  of
   tokens  analyzed.  The  cross  entropy  measures  (in bits) the average
   compression rate which is achievable, under the given  category  model,
   per token of input text. If you use all three of -n,-v,-X then an extra
   value is output for each category, representing a kind of  p-value  for
   each  category  score. This indicates how typical the score is compared
   to the training documents, but only works if the  -X  switch  was  used
   during learning, and only for some types of models (e.g. email).  These
   p-values are uniformly distributed and independent (if  the  categories
   are independent), so can be combined using Fisher's chi squared test to
   obtain composite p-values for groupings of categories.

   The -v and -X switches together print each category name followed by
   a detailed decomposition of the category score, factored into
   (divergence rate + Shannon entropy rate) * token count @ p-value.
   Again, this only works in some types of models.

   The -v and -U switches print each category name followed by a
   decomposition of the category score into (divergence rate + Shannon
   entropy rate # score variance) * token count.

   The -D switch prints out the input text as modified internally by dbacl
   prior to tokenization. For example, if a MIME encoded email document is
   classified, then this prints the decoded text that will actually be
   tokenized and classified. This switch is mainly useful for debugging.

   The -d switch dumps tokens and scores while they are being read. It  is
   useful   for   debugging,   or   if   you   want  to  create  graphical
   representations of the classification. A detailed  explanation  of  the
   output  is beyond the scope of this manual page, but is straightforward
   if you've read dbacl.ps.  Possible variations include -d together  with
   -n or -N.
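
   For example, to inspect what an email classification actually sees
   and scores (a sketch; category names are illustrative):

   % dbacl -T email -c spam -c notspam -D message.txt | head
   % dbacl -T email -c spam -c notspam -d message.txt | head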

   Classification can be done with one or several categories in principle.
   When  two  or  more  categories  are  used,  the   Bayesian   posterior
   probability  is  used,  given  the  input  text,  with  a uniform prior
   distribution on  categories.  For  other  choices  of  prior,  see  the
   companion   utility  bayesol(1).   When  a  single  category  is  used,
   classification can be done by comparing the score with a threshold. In
   practice  however,  much  better  results  are  obtained  with  several
   categories.

   Learning and classifying cannot be mixed in the same command
   invocation; however, there are no locking issues, and separate dbacl
   processes can operate simultaneously with the obvious results, because
   file operations are designed to be atomic.

   Finally,  note that dbacl does not manage your document corpora or your
   computed categories, and in particular it does not allow you to  extend
   an  existing  category file with new documents.  This is unlike various
   current spam filters, which can learn new  emails  incrementally.  This
   limitation of dbacl is partially due to the nonlinear procedure used in
   the  learning  algorithm,  and  partially  a   desire   for   increased
   flexibility.

   You  can  simulate  the  effect  of incremental learning by saving your
   training documents into archives and  adding  to  these  archives  over
   time, relearning from scratch periodically. Learning is actually faster
   if these archives are compressed  and  decompressed  on  the  fly  when
   needed.  By  keeping  control  of your archives, you can never lose the
   information in your categories, and  you  can  easily  experiment  with
   different  switches  or  tokenizations or sets of training documents if
   you like.
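
   A minimal sketch of such an archive workflow, assuming training mail
   is kept in compressed mbox archives (file names are illustrative):

   % cat new_spam.mbox | gzip -c >> archive/spam.mbox.gz
   % zcat archive/spam.mbox.gz | dbacl -T email -l spam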

SECONDARY SWITCHES

   By default, dbacl classifies the input text as a whole.  However,  when
   using  the  -f  option,  dbacl  can  be  used to filter each input line
   separately, printing only those lines which match one  or  more  models
   identified  by  keep  (use  the  category  name or number to refer to a
   category). This is useful if you want to filter  out  some  lines,  but
   note that if the lines are short, then the error rate can be high.

   The -e, -w, -g and -j switches are used for selecting an appropriate
   tokenization scheme. A token is a word or word fragment or  combination
   of  words  or  fragments.  The  shape of tokens is important because it
   forms the basis of the language models used by dbacl.   The  -e  switch
   selects  a predefined tokenization scheme, which is speedy but limited.
   The -w switch specifies composite tokens derived from  the  -e  switch.
   For  example,  "-e alnum -w 2" means that tokens should be alphanumeric
   word fragments combined into overlapping pairs (bigrams). Unless the -j
   switch is used, all tokens are converted to lowercase, which reduces
   the number of possible tokens and therefore memory consumption.
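
   For example, to learn a category from lowercased alphanumeric bigrams
   (a sketch; file and category names are illustrative):

   % dbacl -e alnum -w 2 -l mycat corpus.txt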

   If the -g switch is used, you can completely specify  what  the  tokens
   should look like using a regular expression. Several -g switches can be
   used to construct complex tokenization schemes, and parentheses  within
   each  expression  can be used to select fragments and combine them into
   n-grams. The cost of such flexibility  is  reduced  classification  and
   learning speed. When experimenting with tokenization schemes, try using
   the -d or -D switches while learning or classifying, as they will print
   the  tokens explicitly so you can see what text fragments are picked up
   or missed out. For regular expression syntax, see regex(7).

   The -h and -H switches regulate how  much  memory  dbacl  may  use  for
   learning.  Text  classification can use a lot of memory, and by default
   dbacl limits itself even at the expense of learning accuracy.  In  many
   cases  if  a  limit  is  reached,  a warning message will be printed on
   STDERR with some advice.
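
   For instance, to start with 2^16 possible tokens and let the table
   grow up to 2^20 tokens if needed (sizes are illustrative):

   % dbacl -h 16 -H 20 -l mycat corpus.txt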

   When relearning the same category several times, a significant  speedup
   can  be  obtained by using the -1 switch, as this allows the previously
   learned probabilities to be read from the category and reused.
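
   For example, when relearning an existing category:

   % dbacl -1 -l mycat corpus.txt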

   Note that classification accuracy depends foremost on the amount and
   quality of the training samples, and only secondarily on the amount of
   tweaking.

OPTIONS

   -0     When  learning,  prevents  weight  preloading.  Normally,  dbacl
          checks if the category file already exists, and if so, tries  to
          use   the  existing  weights  as  a  starting  point.  This  can
          dramatically speed up learning.  If the -0 (zero) switch is set,
          then  dbacl  behaves as if no category file already exists. This
          is mainly useful for testing.  This switch  is  now  enabled  by
          default,  to  protect  against  weight  drift  which  can reduce
          accuracy  over  many  learning  iterations.  Use  -1  to   force
          preloading.

   -1     Force weight preloading if the category file already exists. See
          discussion of the -0 switch.

   -a     Append scores. Every input line is written  to  STDOUT  and  the
          dbacl  scores  are  appended.  This is useful for postprocessing
          with bayesol(1).  For ease of processing, every  original  input
          line is indented by a single space (to distinguish them from the
          appended scores), and the line with the scores (if -n  is  used)
          is prefixed with the string "scores ". If a second copy of dbacl
          needs to read this output later, it should be invoked  with  the
          -A switch.
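
           For example, scores appended by one invocation can be read
           back by a second invocation (a sketch; category names are
           illustrative):

           % dbacl -a -n -c spam -c notspam message.txt |
                  dbacl -A -v -c spam -c notspam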

   -d     Dump  the model parameters to STDOUT. In conjunction with the -l
          option, this produces a human-readable summary  of  the  maximum
          entropy  model.  In conjunction with the -c option, displays the
          contribution of each token to the final  score.  Suppresses  all
          other normal output.

   -e     Select   character   class   for   default   (not   regex-based)
          tokenization. By default, tokens are  alphabetic  strings  only.
          This  corresponds  to  the case when deftok is "alpha". Possible
          values for deftok are "alpha", "alnum", "graph",  "char",  "cef"
          and  "adp".   The  last  two  are custom tokenizers intended for
          email messages.  See  also  isalpha(3).   The  "char"  tokenizer
          picks  up single printable characters rather than bigger tokens,
          and is intended for testing only.

   -f     Filter each line of input separately,  passing  to  STDOUT  only
          lines  which match the category identified as keep.  This option
          should be used repeatedly for each category which must be  kept.
          keep can be either the category file name, or a positive integer
          representing the required category in the same order it  appears
          on the command line.

          Output  lines  are  flushed  as soon as they are written. If the
          input file is a pipe or character device,  then  an  attempt  is
          made  to  use  line buffering mode, otherwise the more efficient
          block buffering is used.

   -g     Learn only features described by the extended regular expression
          regex.  This overrides the default feature selection method (see
          -w option) and learns, for  each  line  of  input,  only  tokens
          constructed  from  the  concatenation of strings which match the
          tagged subexpressions within the supplied regex.  All substrings
          which match regex within a suffix of each input line are treated
          as features, even if they overlap on the input line.

          As an optional convenience, regex can include the  suffix  ||xyz
          which  indicates  which  parenthesized  subexpressions should be
          tagged. In this case, xyz should consist exclusively of digits 1
          to  9,  numbering  exactly  those subexpressions which should be
          tagged. Alternatively, if no  parentheses  exist  within  regex,
          then it is assumed that the whole expression must be captured.

   -h     Set  the  size  of the hash table to 2^size elements. When using
          the -l option, this refers  to  the  total  number  of  features
          allowed  in  the maximum entropy model being learned. When using
           the -c option together with the -M switch and multinomial
           type categories, this refers to the maximum number of features
           taken into account during classification. When classifying
           without the -M switch, this option has no effect.

   -i     Fully  internationalized mode. Forces the use of wide characters
          internally, which is necessary in some locales.  This  incurs  a
          noticeable performance penalty.

   -j     Make   features  case  sensitive.  Normally,  all  features  are
          converted to lower case during processing, which reduces storage
          requirements   and  improves  statistical  estimates  for  small
          datasets. With this option, the original capitalization is  used
          for each feature. This can improve classification accuracy.

   -m     Aggressively maps categories into memory and locks them into RAM
          to prevent swapping, if possible. This is useful when  speed  is
          paramount  and memory is plentiful, for example when testing the
          classifier on large datasets.

          Locking may require relaxing user limits  with  ulimit(1).   Ask
          your  system  administrator.  Beware  when  using  the -m switch
          together with the -o switch, as  only  one  dbacl  process  must
          learn  or  classify  at a time to prevent file corruption. If no
          learning takes place, then the  -m  switch  for  classifying  is
          always safe to use. See also the discussion for the -o switch.

   -n     Print  scores  for  each category.  Each score is the product of
          two numbers, the cross entropy and the complexity of  the  input
           text under each model. Multiplied together, they represent
           the negative log probability that the input resembles the
           model. To see these
          numbers  separately, use also the -v option. In conjunction with
          the -f option,  stops  filtering  but  prints  each  input  line
          prepended with a list of scores for that line.

   -q     Select quality of learning, where quality can be 1,2,3,4. Higher
          values take  longer  to  learn,  and  should  be  slightly  more
          accurate.  The default quality is 1 if the category file doesn't
          exist or weights cannot be preloaded, and 2 otherwise.

   -o     When learning, reads/writes partial token counts so they can  be
          reused.  Normally,  category  files are learned from exactly the
          input data given, and don't contain extraneous information. When
          this option is in effect, some extra information is saved in the
          file online, after all input was read. This information  can  be
          reread the next time that learning occurs, to continue where the
          previous dataset left  off.  If  online  doesn't  exist,  it  is
          created.  If  online  exists,  it  is  read before learning, and
           updated afterwards. The file is at least about 3 times bigger
           than the learned category.

          In  dbacl,  file updates are atomic, but if using the -o switch,
          two or more processes should not learn simultaneously,  as  only
          one  process  will write a lasting category and memory dump. The
          -m switch can also speed  up  online  learning,  but  beware  of
          possible  corruption.   Only  one process should read or write a
          file. This option is  intended  primarily  for  controlled  test
          runs.
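
           For example, to learn one category over two sessions, carrying
           partial token counts across runs (file names are
           illustrative):

           % dbacl -l mycat -o mycat.onl batch1.txt
           % dbacl -l mycat -o mycat.onl batch2.txt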

   -r     Learn  the  digramic reference model only. Skips the learning of
          extra features in the text corpus.

   -v     Verbose mode. When learning, print out details of the
           computation; when classifying, print out the name of the most
           probable category. In conjunction with the -n option, prints
          the  scores  as an explicit product of the cross entropy and the
          complexity.

   -w     Select default features to be n-grams up to max_order.  This  is
          incompatible  with the -g option, which always takes precedence.
          If no -w or -g options are given, dbacl assumes -w 1. Note  that
          n-grams  for  n  greater  than  1 do not straddle line breaks by
          default.  The -S switch enables line straddling.

   -x     Set decimation probability to 1 - 2^(-decim).  To reduce  memory
          requirements  when  learning,  some inputs are randomly skipped,
          and only a few are added to the model.  Exact behaviour  depends
          on  the  applicable  -T option (default is -T "text").  When the
           type is not "email" (e.g. "text"), then individual input features
          are added with probability 2^(-decim). When the type is "email",
          then full input messages are added with probability  2^(-decim).
          Within each such message, all features are used.

   -A     Expect  indented  input  and  scores.  With  this  switch, dbacl
          expects input lines to be indented by a single  space  character
          (which   is  then  skipped).   Lines  starting  with  any  other
          character are ignored. This is the counterpart to the -a  switch
          above.  When used together with the -a switch, dbacl outputs the
          skipped lines as they are, and reinserts the space at the  front
          of each processed input line.

   -D     Print debug output. Not intended for normal use, but can be
           very useful for displaying the list of features picked up
           while learning.

   -H     Allow hash table to grow up to a  maximum  of  2^gsize  elements
          during learning. Initial size is given by -h option.

   -L     Select the digramic reference measure for character transitions.
          The measure can be one of "uniform",  "dirichlet"  or  "maxent".
          Default is "uniform".

   -M     Force  multinomial calculations. When learning, forces the model
          features to be treated multinomially. When classifying, corrects
          entropy   scores  to  reflect  multinomial  probabilities  (only
          applicable to multinomial type models, if present).  Scores will
          always be lower, because the ordering of features is lost.

   -N     Print  posterior  probabilities for each category.  This assumes
          the   supplied   categories   form   an   exhaustive   list   of
          possibilities.    In  conjunction  with  the  -f  option,  stops
          filtering but prints each input line prepended with a summary of
          the posterior distribution for that line.

   -R     Include  an  extra category for purely random text. The category
          is called "random".  Only makes sense when using the -c option.

   -S     Enable line straddling. This is  useful  together  with  the  -w
          option  to  allow  n-grams for n > 1 to ignore line breaks, so a
          complex token can continue past the end of the line. This is not
          recommended for email.

   -T     Specify  nonstandard text format. By default, dbacl assumes that
          the input text is a purely ASCII text file. This corresponds  to
          the case when type is "text".

          There  are several types and subtypes which can be used to clean
          the input text of extraneous tokens before  actual  learning  or
          classifying  takes place. Each (sub)type you wish to use must be
          indicated with a separate -T option on  the  command  line,  and
          automatically implies the corresponding type.

          The  "text"  type  is for unstructured plain text. No cleanup is
          performed. This is the default if no  types  are  given  on  the
          command line.

          The "email" type is for mbox format input files or single RFC822
          emails.  Headers are recognized and most are skipped. To include
          extra  RFC822  standard  headers (except for trace headers), use
          the "email:headers" subtype.  To include trace headers, use  the
          "email:theaders"  subtype.  To include all headers in the email,
          use the "email:xheaders" subtype. To skip  all  headers,  except
          the  subject,  use "email:noheaders". To scan binary attachments
          for strings, use the "email:atts" subtype.

          When the "email" type is in effect, HTML markup is automatically
          removed  from text attachments except text/plain attachments. To
          also  remove  HTML  markup  from  plain  text  attachments,  use
          "email:noplain".  To  prevent  HTML  markup  removal in all text
          attachments, use "email:plain".

          The "html" type is for removing HTML markup (between <html>  and
          </html>  tags)  and  surrounding  text. Note that if the "email"
          type is  enabled,  then  "html"  is  automatically  enabled  for
          compatible message attachments only.

          The  "xml"  type  is  like "html", but doesn't honour <html> and
          </html>, and doesn't interpret tags  (so  this  should  be  more
          properly  called  "angle  markup" removal, and has nothing to do
          with actual XML semantics).

          When "html" is enabled, most markup  attributes  are  lost  (for
          values  of  'most'  close  to  'all').  The "html:links" subtype
          forces link urls to be parsed and learned, which would otherwise
          be ignored. The "html:alt" subtype forces parsing of alternative
          text  in  ALT   attributes   and   various   other   tags.   The
          "html:scripts"  subtype forces parsing of scripts, "html:styles"
          forces parsing of styles, "html:forms" forces  parsing  of  form
          values, while "html:comments" forces parsing of HTML comments.
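
           For example, to learn email while also reading the standard
           headers and any urls in HTML links (a sketch; paths are
           illustrative):

           % dbacl -T email -T email:headers -T html:links -l work
                  $MAILDIR/work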

   -U     Print  (U)nambiguity.   When  used  in  conjunction  with the -v
          switch, prints  scores  followed  by  their  empirical  standard
          deviations.  When used alone, prints the best category, followed
          by  an  estimated  probability  that  this  category  choice  is
          unambiguous.  More  precisely,  the probability measures lack of
          overlap of CLT confidence intervals for each category score  (If
          there is overlap, then there is ambiguity).

          This estimated probability can be used as an "unsure" flag, e.g.
          if the estimated probability is  lower  than  50%.  Formally,  a
          score of 0% means another category is equally likely to apply to
          the input, and a score of 100% means no other category is likely
          to  apply  to  the  input.  Note that this type of confidence is
          unrelated to the -X switch. Also, the  probability  estimate  is
          usually low if the document is short, or if the message contains
          many tokens that have never been seen before  (only  applies  to
          uniform digramic measure).

   -V     Print the program version number and exit.

   -W     Like -w, but prevents features from straddling newlines. See the
          description of -w.

   -X     Print the confidence in the score calculated for each  category,
          when  used together with the -n or -N switch. Prepares the model
          for confidence scores,  when  used  with  the  -l  switch.   The
          confidence  is  an  estimate  of  the  typicality  of the score,
          assuming the null hypothesis that the given category is correct.
          When  used with the -v switch alone, factorizes the score as the
           empirical divergence plus the Shannon entropy, multiplied by
          complexity, in that order. The -X switch is not supported in all
          possible models, and displays a percentage of "0.0" if it  can't
          be  calculated.  Note  that  for  unknown documents, it is quite
          common to have confidences close to zero.

USAGE

   To create two category files in the current directory  from  two  ASCII
   text    files    named   Mark_Twain.txt   and   William_Shakespeare.txt
   respectively, type:

   % dbacl -l twain Mark_Twain.txt
   % dbacl -l shake William_Shakespeare.txt

   Now you can classify input text, for example:

   % echo "howdy" | dbacl -v -c twain -c shake
   twain
   % echo "to be or not to be" | dbacl -v -c twain -c shake
   shake

   Note that at least the -v option is necessary; otherwise dbacl does
   not print anything. The return value is 1 in the first case, 2 in the
   second.

   % echo "to be or not to be" | dbacl -v -N -c twain -c shake
   twain 22.63% shake 77.37%
   % echo "to be or not to be" | dbacl -v -n -c twain -c shake
   twain  7.04 * 6.0 shake  6.74 * 6.0

   These invocations are equivalent. The numbers 6.74 and  7.04  represent
   how  close the average token is to each category, and 6.0 is the number
   of tokens observed. If you want to  print  a  simple  confidence  value
   together with the best category, replace -v with -U.

   % echo "to be or not to be" | dbacl -U -c twain -c shake
   shake # 34%

   Note  that the true probability of category shake versus category twain
   is 77.37%, but the calculation is somewhat ambiguous, and  34%  is  the
   confidence out of 100% that the calculation is qualitatively correct.

   Suppose  a  file  document.txt contains English text lines interspersed
   with noise lines. To filter out the noise lines from the English lines,
   assuming you have an existing category shake, say, type:

   % dbacl -c shake -f shake -R document.txt > document.txt_eng
   % dbacl -c shake -f random -R document.txt > document.txt_rnd

   Note  that  the  quality of the results will vary depending on how well
   the categories shake and random  represent  each  input  line.   It  is
   sometimes  useful  to  see  the  posterior  probabilities for each line
   without filtering:

   % dbacl -c shake -f shake -RN document.txt > document.txt_probs

   You can now postprocess the posterior probabilities for  each  line  of
   text  with  another script, to replicate an arbitrary Bayesian decision
   rule of your choice.

   In the special case of exactly two  categories,  the  optimal  Bayesian
   decision  procedure can be implemented for documents as follows: let p1
   be  the  prior  probability  that  the  input  text  is  classified  as
   category1.   Consequently,  the  prior  probability  of  classifying as
   category2 is 1 - p1.  Let u12 be the cost of misclassifying a category1
   input text as belonging to category2 and vice versa for u21.  We assume
   there is no cost for classifying correctly.  Then the following command
   implements the optimal Bayesian decision:

   % dbacl -n -c category1 -c category2 | awk '{ if($2 * p1 * u12 > $4 *
          (1 - p1) * u21) { print $1; } else { print $3; } }'
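
   For instance, with equal priors (p1 = 0.5) and misclassification
   costs u12 = 1 and u21 = 2 (values chosen purely for illustration),
   this becomes:

   % dbacl -n -c category1 -c category2 | awk '{ if($2 * 0.5 * 1 > $4 *
          0.5 * 2) { print $1; } else { print $3; } }'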

   dbacl can also be used in conjunction with procmail(1) to  implement  a
   simple  Bayesian email classification system. Assume that incoming mail
   should be automatically delivered to one of three mail folders  located
   in  $MAILDIR and named work, personal, and spam.  Initially, these must
   be created and filled with appropriate  sample  emails.   A  crontab(1)
   file can be used to learn the three categories once a day, e.g.

   CATS=$HOME/.dbacl
   5  0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
   10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
   15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam

   To  automatically  deliver  each  incoming  email  into the appropriate
   folder, the following procmailrc(5) recipe fragment could be used:

   CATS=$HOME/.dbacl

   # run the spam classifier
   :0 c
   YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam

   # send to the appropriate mailbox
   :0:
   * ? test -n "$YAY"
   $MAILDIR/$YAY

   :0:
   $DEFAULT

   Sometimes, dbacl will send the email to  the  wrong  mailbox.  In  that
   case,  the  misclassified  message  should  be  removed  from its wrong
   destination and placed in the  correct  mailbox.   The  error  will  be
   corrected  the  next  time your messages are learned.  If it is left in
   the wrong category, dbacl will learn the wrong corpus statistics.

   The default text features (tokens) read by dbacl are purely  alphabetic
   strings,  which minimizes memory requirements but can be unrealistic in
   some cases. To construct models based on alphanumeric tokens,  use  the
   -e  switch.  The  example below also uses the optional -D switch, which
   prints a list of actual tokens found in the document:

   % dbacl -e alnum -D -l twain Mark_Twain.txt | less

   It is also possible to override the default  feature  selection  method
   used  to  learn the category model by means of regular expressions. For
   example, the following duplicates the default feature selection  method
   in the C locale, while being much slower:

   % dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)'
          Mark_Twain.txt

   The category twain which is obtained depends only on single  alphabetic
   words  in  the text file Mark_Twain.txt (and computed digram statistics
   for prediction).  For a second example, the following command builds  a
   smoothed  Markovian  (word  bigram)  model  which  depends  on pairs of
   consecutive words within each line (but pairs cannot  straddle  a  line
   break):

   % dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' -g
          '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23'
          Mark_Twain.txt

   More generally, line-based n-gram models of all orders (up to 7) can be
   built in a similar way.   To  construct  paragraph  based  models,  you
   should  reformat  the input corpora with awk(1) or sed(1) to obtain one
   paragraph per line. Line size is limited by available memory, but  note
   that regex performance will degrade quickly for long lines.
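
   One possible reformatting sketch uses awk's paragraph mode to join
   each paragraph onto a single line before learning (file and category
   names are illustrative):

   % awk 'BEGIN { RS=""; } { gsub(/\n/, " "); print; }' corpus.txt |
          dbacl -l para_model -w 2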

PERFORMANCE

   The  underlying assumption of statistical learning is that a relatively
   small number of training documents can represent a much larger  set  of
   input  documents.  Thus  in  the long run, learning can grind to a halt
   without serious impact on classification accuracy. While  not  true  in
   reality,  this assumption is surprisingly accurate for problems such as
   email filtering.  In practice, this means that a well chosen corpus  on
   the  order  of ten thousand documents is sufficient for highly accurate
   results for years.  Continual  learning  after  such  a  critical  mass
   results  in  diminishing  returns.   Of  course,  when real world input
   document patterns change dramatically,  the  predictive  power  of  the
   models  can  be lost. At the other end, a few hundred documents already
   give acceptable results in most cases.

   dbacl is heavily optimized for the case of frequent classifications but
   infrequent  batch  learning.  This  is  the  long run optimum described
   above. Under ideal conditions, dbacl can classify a hundred emails  per
   second on low end hardware (500MHz Pentium III). Learning itself is
   not much slower, but in practice takes much longer for large document
   collections, for various reasons. When using the -m switch, data
   structures are aggressively mapped into memory  if  possible,  reducing
   overheads for both I/O and memory allocations.

   dbacl  throws  away its input as soon as possible, and has no limits on
   the input document size. Both classification  and  learning  speed  are
   directly  proportional  to  the  number  of  tokens  in  the input, but
   learning also needs a nonlinear  optimization  step  which  takes  time
   proportional to the number of unique tokens discovered. At the time of
   writing, dbacl is one of the fastest open source mail filters given its
   optimal  usage  scenario,  but uses more memory for learning than other
   filters.

MULTIPLE PROCESSES AND DATA CORRUPTION

   When saving category files, dbacl first writes out a temporary file  in
   the  same  location,  and  renames it afterwards. If a problem or crash
   occurs during  learning,  the  old  category  file  is  therefore  left
   untouched.  This  ensures  that  categories  can never be corrupted, no
   matter how many processes try to simultaneously learn or classify,  and
   means  that  valid  categories  are available for classification at any
   time.

   When using the -m switch, file contents are memory  mapped  for  speedy
   reading  and  writing.  This,  together with the -o switch, is intended
   mainly for testing purposes, when tens of thousands of messages must be
   learned and scored in a laboratory to measure dbacl's accuracy. Because
   no file locking is attempted for performance reasons,  corruptions  are
   possible,  unless  you  make  sure that only one dbacl process reads or
   writes any file at any given time. This is the only  case  (-m  and  -o
   together) when corruption is possible.

MEMORY USE

   When  classifying a document, dbacl loads all indicated categories into
   RAM, so the total  memory  needed  is  approximately  the  sum  of  the
   category file sizes plus a fixed small overhead.  The input document is
   consumed while being read, so its size doesn't matter,  but  very  long
   lines  can take up space.  When using the -m switch, the categories are
   read using mmap(2) as available.

   When learning, dbacl keeps a large structure in memory  which  contains
   many objects which won't be saved into the output category. The size of
   this structure is proportional to the number of unique tokens read, but
   not  the  size  of  the input documents, since they are discarded while
   being read. As a rough guide, this structure is 4x-5x the size  of  the
   final category file that is produced.

   To  prevent unchecked memory growth, dbacl allocates by default a fixed
   smallish amount of memory for tokens.  When  this  space  is  used  up,
   further tokens are discarded, which has the effect of skewing the
   learned category, making it less usable as more tokens are dropped. A
   warning is printed on STDERR in such a case.

   The  -h  switch  lets  you  fix  the initial size of the token space in
   powers of 2, i.e. "-h 17" means 2^17 = 131072 possible tokens. If you
   type  "dbacl -V", you can see the number of bytes needed for each token
   when either learning  or  classifying.  Multiply  this  number  by  the
   maximum  number  of  possible  tokens to estimate the memory needed for
   learning. The -H switch lets dbacl grow its tables automatically if and
   when  needed,  up  to a maximum specified. So if you type "-H 21", then
   the initial size  will  be  doubled  repeatedly  if  necessary,  up  to
   approximately two million unique tokens.
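
   For example, if "dbacl -V" reported (hypothetically) 18 bytes per
   token for learning, then "-H 21" would allow up to roughly 2^21 * 18
   bytes, i.e. about 38MB, for the token table.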

   When learning with the -X switch, a handful of input documents are also
   kept in RAM throughout.

ENVIRONMENT

   DBACL_PATH
          When this variable is set,  its  value  is  prepended  to  every
          category filename which doesn't start with a '/' or a '.'.

SIGNALS

   INT    If  this  signal is caught, dbacl simply exits without doing any
          cleanup or other operations. This signal can often  be  sent  by
          pressing Ctrl-C on the keyboard. See stty(1).

   HUP, QUIT, TERM
          If one of these signals is caught, dbacl stops reading input and
          continues its operation as if no more input was available.  This
          is a way of quitting gracefully, but note that in learning mode,
          a category file will be written based on the  incomplete  input.
           The QUIT signal can often be sent by pressing Ctrl-\ on the
          keyboard. See stty(1).

   USR1   If this signal is caught, dbacl reloads the  current  categories
          at  the  earliest  feasible  opportunity.  This  is not normally
          useful at all, but might be in special cases, such as if the  -f
          switch is invoked together with input from a long running pipe.

NOTES

   dbacl generated category files are in binary format, and may or may not
   be portable to systems using a different byte order architecture  (this
   depends  on  how  dbacl was compiled). The -V switch prints out whether
   categories are portable, or else you can just experiment.

   dbacl does not recognize functionally equivalent  regular  expressions,
   and in this case duplicate features will be counted several times.

   With  every  learned  category, the command line options that were used
   are saved.  When classifying, make sure that  every  relevant  category
   was  learned  with  the  same  set  of  options (regexes are allowed to
   differ), otherwise behaviour is undefined. There is no need  to  repeat
   all the switches when classifying.

   If you get many digitization warnings, then you are trying to learn too
   much data at once, or your model is too complex.  dbacl is compiled  to
   save   memory   by  digitizing  final  weights,  but  you  can  disable
   digitization by editing dbacl.h and recompiling.

   dbacl offers several built-in tokenizers (see -e switch) with  more  to
   come in future versions, as the author invents them.  While the default
   tokenizer may evolve, no tokenizer should ever be removed, so that  you
   can  always  simulate previous dbacl behaviour subject to bug fixes and
   architectural changes.

   The confidence estimates obtained through the -X switch are
   underestimates, i.e. more conservative than they should be.

BUGS

   "Ya  know,  some  day  scientists  are gonna invent something that will
   outsmart a rabbit." (Robot Rabbit, 1953)

SOURCE

   The source code for the latest version of this program is available  at
   the following locations:

   http://www.lbreyer.com/gpl.html
   http://dbacl.sourceforge.net

AUTHOR

   Laird A. Breyer <[email protected]>

SEE ALSO

   awk(1),    bayesol(1),   crontab(1),   hmine(1),   hypex(1),   less(1),
   mailcross(1), mailfoot(1), mailinspect(1),  mailtoe(1),  procmailex(5),
   regex(7), stty(1), sed(1)


