
WordIndex: create a line-number cross-referenced list of words

The WordIndex program creates an alphabetically sorted list of the words present in the input files and reports the line numbers where those words occur. It can also report the number of occurrences of each word and its relative frequency as a percentage. In addition, WordIndex can accept a file with keywords (stopwords) that it must skip.
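The core idea described above can be sketched in a few lines of C++. This is a minimal illustration, not the WordIndex source: the names `Index`, `build_index`, `count_references` and `frequency_percent` are hypothetical, a "word" is assumed to be a run of alphanumeric characters, and C++11 is assumed.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types, not taken from the WordIndex sources.
using LineNumbers = std::vector<int>;
using Index = std::map<std::string, LineNumbers>;  // std::map keeps its keys sorted

// Read text line by line and record, for each word, the lines it occurs on.
// Words found in `stopwords` are skipped.
// Assumption: a word is a run of alphanumeric characters.
Index build_index(std::istream& in, const std::set<std::string>& stopwords)
{
    Index index;
    std::string line;
    int lineno = 0;
    while (std::getline(in, line)) {
        ++lineno;
        std::string word;
        // The appended space acts as a sentinel that flushes the last word.
        for (char ch : line + ' ') {
            if (std::isalnum(static_cast<unsigned char>(ch))) {
                word += ch;
            } else if (!word.empty()) {
                if (stopwords.count(word) == 0)
                    index[word].push_back(lineno);
                word.clear();
            }
        }
    }
    return index;
}

// Total number of references (word occurrences) in the index.
std::size_t count_references(const Index& index)
{
    std::size_t total = 0;
    for (const auto& entry : index)
        total += entry.second.size();
    return total;
}

// Relative frequency of `word` as a percentage of all references.
double frequency_percent(const Index& index, const std::string& word)
{
    const std::size_t total = count_references(index);
    const Index::const_iterator it = index.find(word);
    if (total == 0 || it == index.end())
        return 0.0;
    return 100.0 * static_cast<double>(it->second.size()) / static_cast<double>(total);
}
```

Iterating over the resulting `Index` visits the words in alphabetical order, because `std::map` orders its keys; each entry carries the line numbers needed for the cross-reference.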

I created the WordIndex project to discover how I would write such a program in C++, twenty years after its predecessor, xref, which was written in C. A notable aspect is the treatment of the input files as STL-like containers, or collections, as they are called in [Wilson, 2007].
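To give an impression of what treating input as an STL-like collection can mean, here is a small sketch. It is not the design used in WordIndex: the class `WordCollection` and the helper `count_word` are hypothetical, and the collection simply exposes the whitespace-separated words of a stream through `begin()` and `end()` by means of `std::istream_iterator`.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical collection: presents the whitespace-separated words of an
// input stream through the familiar begin()/end() interface.
class WordCollection {
public:
    using iterator = std::istream_iterator<std::string>;

    explicit WordCollection(std::istream& in) : in_(in) {}

    // Input iterators are single-pass: the stream can be traversed only once,
    // and constructing the begin iterator already reads the first word.
    iterator begin() const { return iterator(in_); }
    iterator end() const { return iterator(); }

private:
    std::istream& in_;
};

// Because the collection offers iterators, standard algorithms apply directly.
std::ptrdiff_t count_word(std::istream& in, const std::string& word)
{
    WordCollection words(in);
    return std::count(words.begin(), words.end(), word);
}
```

The point of such a wrapper is that client code no longer deals with reading and parsing: it iterates over words as it would over the elements of a `std::vector`, with the restriction that the traversal is single-pass.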

The WordIndex program works, but the project is in its alpha stage. WordIndex compiles with Microsoft Visual C++ 8.0 and with GNU C 3.4.2 on MS Windows, but it is probably not too difficult to support other platforms as well. The single-session make setup needs further work.

WordIndex is free software and is distributed under the GPL. WordIndex uses the open source unit test framework Fructose and the Templatized C++ Command Line Parser Library (TCLAP).

wordindex.exe · Readme · Doxumented source · WordIndex.chm · WordIndex-0.0.2-alpha.tar.gz · Subversion trunk

WordIndex's help screen.

Usage: bin\wordindex.exe [option...] [file...]

  -h, --help          display this help and exit
  -a, --author        report authors name and e-mail [no]
      --version       report program and compiler versions [no]
  -v, --verbose       report ... [none]

  -f, --frequency     also report word frequency as d.dd% (n) [no]
  -l, --lowercase     transform words to lowercase [no]
  -r, --reverse       only collect keyword occurrences, see --keywords [no]
  -s, --summary       also report number of (key)words and references [no]

  -i, --input=file    read filenames from given file [standard input or given filenames]
  -o, --output=file   write output to given file [standard output]
  -k, --keywords=file read keywords to skip (stopwords) from given file [none]

Long options also may start with a plus, like: +help.

wordindex creates an alphabetically sorted index of words present in the
input files and it reports the lines where those words occur.
Words that are marked as keywords are excluded (see option --keywords).
Use option --reverse to only show the occurrences of keywords.

Words can be read from standard input, or from files specified on the command
line and from files that are specified in another file (see option --input).

A file that specifies input filenames may look as follows:
   # comment that extends to the end of the line ( ; also starts comment line)
   file1.txt file2.txt

   echo hello world | wordindex +summary +frequency
       keywords  0
          words  2
     references  2

          hello  50% (1)  1
          world  50% (1)  1

   wordindex +lowercase file.txt | sort -n -k2 -r
This creates a list of lowercase words, sorted on frequency of occurrence.
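The filename list format mentioned in the help screen (see option --input) is simple enough to parse in a few lines. The following sketch is an illustration only: the function `read_filenames` is hypothetical, and it assumes that both `#` and `;` start a comment wherever they appear on a line and that filenames are separated by whitespace. The actual WordIndex rules may differ, for example by recognizing `;` only at the start of a line.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect the filenames listed in a stream.
// Assumptions: '#' and ';' start a comment that runs to the end of the line;
// filenames are separated by whitespace and contain neither character.
std::vector<std::string> read_filenames(std::istream& in)
{
    std::vector<std::string> names;
    std::string line;
    while (std::getline(in, line)) {
        // Cut the line at the first comment character, if there is one.
        const std::string::size_type pos = line.find_first_of("#;");
        if (pos != std::string::npos)
            line.erase(pos);

        // Split what remains on whitespace; blank lines yield nothing.
        std::istringstream fields(line);
        std::string name;
        while (fields >> name)
            names.push_back(name);
    }
    return names;
}
```

Applied to the example list shown in the help screen, this yields the two names `file1.txt` and `file2.txt`.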

[Wilson, 2007] Matthew Wilson. Extended STL, Volume 1: Collections and Iterators. Addison-Wesley Professional, 2007. ISBN-10 0-321-30550-7, ISBN-13 978-0-321-30550-3.