Martin Moene | WordIndex: a C++ program to create a linenumber cross-referenced list of words

WordIndex: create a linenumber cross-referenced list of words

The WordIndex program creates an alphabetically sorted list of the words present in its input files and reports the line numbers where each word occurs. It can also report the number of occurrences of each word and its relative occurrence as a percentage (word frequency). In addition, WordIndex can accept a file with keywords (or stopwords) that it must skip.
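The core idea described above can be sketched in a few lines of C++. This is only an illustration, not the WordIndex source: the names `Index`, `build_index`, `reference_count` and `frequency_percent` are hypothetical, and the definition of a "word" (a run of letters, digits or underscores) is an assumption. A `std::map` is used because it keeps its keys sorted, which gives the alphabetical order for free.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for this sketch; not taken from the WordIndex sources.
using LineNumbers = std::vector<int>;
using Index       = std::map<std::string, LineNumbers>;  // std::map keeps keys sorted

// Build an index of word -> line numbers, skipping the given stopwords.
// Assumption: a word is a maximal run of letters, digits or underscores.
Index build_index(std::istream& in, const std::set<std::string>& stopwords = {})
{
    Index index;
    std::string line;
    for (int lineno = 1; std::getline(in, line); ++lineno)
    {
        std::string word;
        // A sentinel space is appended so the last word on the line is flushed too.
        for (char c : line + ' ')
        {
            if (std::isalnum(static_cast<unsigned char>(c)) || c == '_')
            {
                word += c;
            }
            else if (!word.empty())
            {
                if (stopwords.count(word) == 0)
                    index[word].push_back(lineno);  // one entry per occurrence
                word.clear();
            }
        }
    }
    return index;
}

// Total number of references (word occurrences) in the index.
std::size_t reference_count(const Index& index)
{
    std::size_t total = 0;
    for (const auto& entry : index)
        total += entry.second.size();
    return total;
}

// Relative occurrence of one word as a percentage of all references.
double frequency_percent(const Index& index, const std::string& word)
{
    const auto pos = index.find(word);
    const std::size_t total = reference_count(index);
    if (pos == index.end() || total == 0)
        return 0.0;
    assert(pos->second.size() <= total);  // a word cannot outnumber all words together
    return 100.0 * pos->second.size() / total;
}
```

Feeding the text "hello world" through `build_index` (for example via a `std::istringstream`) gives two words with one reference each, so `frequency_percent` yields 50 for both. In this sketch a word that occurs twice on the same line is recorded twice; whether WordIndex does the same is not stated here.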

I created the WordIndex project to discover how I would write such a program in C++, twenty years after its predecessor xref, which was written in C. A notable aspect is that the input files are treated as STL-like containers, or collections as they are called in [Wilson, 2007].
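To give an impression of what treating a file as an STL-like collection can mean, here is a minimal sketch of a line-oriented view over an input stream. It is an assumption about the general technique, not the classes WordIndex actually uses: `LineRange` and `collect_lines` are hypothetical names, and the iterator is a plain single-pass input iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical class for this sketch: presents a text stream as a range of lines.
class LineRange
{
public:
    class iterator
    {
    public:
        using iterator_category = std::input_iterator_tag;
        using value_type        = std::string;
        using difference_type   = std::ptrdiff_t;
        using pointer           = const std::string*;
        using reference         = const std::string&;

        iterator() : in_(nullptr) {}                          // end-of-range marker
        explicit iterator(std::istream& in) : in_(&in) { advance(); }

        reference operator*() const
        {
            assert(in_ != nullptr);                           // must not dereference end()
            return line_;
        }
        pointer operator->() const { return &line_; }

        iterator& operator++()    { advance(); return *this; }
        iterator  operator++(int) { iterator old(*this); advance(); return old; }

        // Iterators compare equal when both are at the end or both read the same stream.
        friend bool operator==(const iterator& a, const iterator& b) { return a.in_ == b.in_; }
        friend bool operator!=(const iterator& a, const iterator& b) { return !(a == b); }

    private:
        void advance()
        {
            if (in_ && !std::getline(*in_, line_))
                in_ = nullptr;                                // become the end marker
        }

        std::istream* in_;
        std::string   line_;
    };

    explicit LineRange(std::istream& in) : in_(in) {}

    iterator begin() const { return iterator(in_); }
    iterator end()   const { return iterator(); }

private:
    std::istream& in_;
};

// The range works with standard containers and algorithms, e.g. vector's range constructor.
std::vector<std::string> collect_lines(std::istream& in)
{
    LineRange lines(in);
    return std::vector<std::string>(lines.begin(), lines.end());
}
```

With such a view, a loop like `for (const std::string& line : LineRange(file))` reads the file line by line. Because the underlying stream is consumed while iterating, the range can be traversed only once.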

The WordIndex program works, but the project is in its alpha stage. WordIndex compiles with Microsoft Visual C++ 8.0 and with GNU C 3.4.2 on MS Windows, but it is probably not too difficult to support other platforms as well. The single-session make setup needs further work.

WordIndex's help screen.

    Usage: bin\wordindex.exe [option...] [file...]

      -h, --help            display this help and exit
      -a, --author          report authors name and e-mail [no]
          --version         report program and compiler versions [no]
      -v, --verbose         report ... [none]
      -f, --frequency       also report word frequency as d.dd% (n) [no]
      -l, --lowercase       transform words to lowercase [no]
      -r, --reverse         only collect keyword occurrences, see --keywords [no]
      -s, --summary         also report number of (key)words and references [no]
      -i, --input=file      read filenames from given file [standard input or given filenames]
      -o, --output=file     write output to given file [standard output]
      -k, --keywords=file   read keywords to skip (stopwords) from given file [none]

    Long options also may start with a plus, like: +help.

    wordindex creates an alphabetically sorted index of words present in the
    input files and it reports the lines where those words occur. Words that
    are marked as keywords are excluded (see option --keywords). Use option
    --reverse to only show the occurrences of keywords.

    Words can be read from standard input, or from files specified on the
    command line and from files that are specified in another file (see
    option --input).

    A file that specifies input filenames may look as follows:
       # comment that extends to the end of the line ( ; also starts comment line)
       file1.txt
       file2.txt
       file3.txt

    Example:
       echo hello world | wordindex +summary +frequency
         keywords 0
            words 2
       references 2

       hello 50% (1) 1
       world 50% (1) 1

    Example:
       wordindex +lowercase file.txt | sort -n -k2 -r

    This creates a list of lowercase words, sorted on frequency of occurrence.
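The filename list accepted by option --input is simple enough to read with a few lines of C++. The sketch below is a guess at the rules, based only on the help text: a line whose first non-blank character is # or ; is taken to be a comment, blank lines are ignored, and every other line is one filename. The helpers `trim` and `read_filenames` are hypothetical, and comments that follow a filename on the same line are not handled.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: strip leading and trailing blanks, tabs and a stray '\r'.
std::string trim(const std::string& s)
{
    const char* blanks = " \t\r";
    const auto first = s.find_first_not_of(blanks);
    if (first == std::string::npos)
        return "";                      // line is empty or only whitespace
    const auto last = s.find_last_not_of(blanks);
    assert(last >= first);              // holds once a non-blank character exists
    return s.substr(first, last - first + 1);
}

// Read one filename per line, skipping blank lines and comment lines.
// Assumption: '#' or ';' as first non-blank character marks a comment line.
std::vector<std::string> read_filenames(std::istream& in)
{
    std::vector<std::string> names;
    std::string line;
    while (std::getline(in, line))
    {
        const std::string name = trim(line);
        if (name.empty() || name[0] == '#' || name[0] == ';')
            continue;
        names.push_back(name);
    }
    return names;
}
```

Applied to the example list from the help screen, `read_filenames` returns file1.txt, file2.txt and file3.txt in that order.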

[Wilson, 2007] Matthew Wilson. Extended STL, Volume 1: Collections and Iterators. Addison-Wesley Professional, 2007. ISBN-10 0-321-30550-7, ISBN-13 978-0-321-30550-3.