

Common command-line options

Certain options, and regular expression syntax, are shared by various groupings of the ID utilities. We describe these in the sections below, rather than repeating them for each program.

Options Common to All Programs

`--help'
Print a usage message listing all available options, then exit successfully.
`--version'
Print the version number, then exit successfully.

Options for Programs that Read ID Databases

`-f filename'
`--file=filename'
Filename is the ID database to read when processing queries. At present, only a single `--file' option is processed, but in future releases, more than one ID database may be named on the command line.
`$IDPATH'
`IDPATH' is an environment variable that contains a colon-separated list of ID database names. If this variable is set, and no `--file' options are given on the command line, the ID databases named in `IDPATH' are implied.(1)

If no ID databases are specified either on the command line or via the `IDPATH' environment variable, then the ID utilities search for a file named `ID' in the current working directory, and then in successive parent directories.
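The lookup order described above can be sketched in a few lines of Python. This is only an illustration, not code from the ID utilities: the function name `find_id_databases' and its parameters are hypothetical, and it assumes that an empty `IDPATH' counts as unset.

```python
import os

def find_id_databases(file_opts, environ, start_dir):
    """Return the ID database paths implied by the documented lookup order."""
    # 1. Explicit --file options win (current releases process only one).
    if file_opts:
        return list(file_opts)
    # 2. Otherwise use the colon-separated IDPATH environment variable.
    idpath = environ.get("IDPATH")
    if idpath:  # assumption: an empty value is treated like an unset one
        return [name for name in idpath.split(":") if name]
    # 3. Otherwise look for a file named ID in start_dir, then in each parent.
    directory = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(directory, "ID")
        if os.path.isfile(candidate):
            return [candidate]
        parent = os.path.dirname(directory)
        if parent == directory:  # reached the file system root
            return []
        directory = parent
```

Calling `find_id_databases([], os.environ, os.getcwd())' mirrors the case in which no `--file' option is given.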

Options for Programs that Write ID Databases

`-o filename'
`--output=filename'
The `--output' option names the file in which to write a new ID database. If no `--output' (or `--file') option is present, an output file named `ID' is implied.
`-f filename'
`--file=filename'
This is a synonym for `--output'.

Options for Programs that Walk File and Directory Trees

The programs `mkid' and `xtokid' accept the names of files and directories on the command line. Files are scanned if there is a scanner available and enabled for the file's source language. Directories are recursively descended, searching for files whose names match the rules listed in the language map file (see section Mapping file names to source languages).

The following option controls the file tree walker:

`-p names'
`--prune=names'
One or more file or directory names may appear in names. The file tree walker will stop short at these files and directories and their contents will not be scanned.
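As a rough model of what pruning means, the following Python sketch walks a tree and stops short at pruned names. It is not the walker used by `mkid'; the function name `walk_pruned' is hypothetical, and the sketch assumes that each pruned name is compared against the base name of a file or directory.

```python
import os

def walk_pruned(root, prune):
    """Yield file paths under root, stopping short at any name listed in prune."""
    prune = set(prune)
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place keeps os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in prune)
        for name in sorted(filenames):
            if name not in prune:  # assumption: match on the base name only
                yield os.path.join(dirpath, name)
```

For instance, `list(walk_pruned(".", ["RCS", "SCCS"]))' would skip any directory named `RCS' or `SCCS' together with its contents.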

Options for Programs that List File Names

The programs `lid' and `fnid' can print lists of file names as the result of queries. The following option controls how these lists are formatted:

`-S style'
`--separator=style'
Style may be one of `braces', `space' or `newline'. The style of `braces' means that file names with a common directory prefix and a common suffix are printed using the shell's brace notation in order to compress the output. For example, `../src/foo.c ../src/bar.c' can be printed in brace notation as `../src/{foo,bar}.c'. The styles of `space' and `newline' mean that file names are separated by spaces or by newlines, respectively. If the list of files is being printed on a terminal, brace notation is the default. If not, file names are separated by spaces if the key is included in the output, and by newlines if the key style is `none' (see section lid: Querying an ID Database by Token).
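To make the brace notation concrete, here is a small Python sketch that compresses a list of file names sharing a directory prefix and a suffix. It is an illustration only and is not taken from `lid' or `fnid'; the function name `brace_compress' is hypothetical, and the rule that the shared suffix starts at a `.' is an assumption.

```python
import os

def brace_compress(paths):
    """Collapse paths that share a directory prefix and a suffix into braces."""
    if len(paths) < 2:
        return " ".join(paths)
    # Common directory prefix: cut the shared leading text back to the last slash.
    prefix = os.path.commonprefix(paths)
    prefix = prefix[: prefix.rfind("/") + 1]
    rests = [p[len(prefix):] for p in paths]
    # Common suffix: the shared leading text of the reversed remainders.
    suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]
    # Assumption: only an extension-like suffix (starting at a dot) is factored out.
    dot = suffix.find(".")
    suffix = suffix[dot:] if dot >= 0 else ""
    middles = [r[: len(r) - len(suffix)] for r in rests]
    return prefix + "{" + ",".join(middles) + "}" + suffix
```

With the example from the text, `brace_compress(["../src/foo.c", "../src/bar.c"])' returns `../src/{foo,bar}.c'.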

Options for Programs that Scan Source Files

`mkid' and `xtokid' walk file trees, select source files by name, and extract tokens from source files. They accept the following options:

`-m mapfile'
`--lang-map=mapfile'
Mapfile contains rules for determining the source languages from file names. See section Mapping file names to source languages.
`-i languages'
`--include=languages'
The `--include' option names languages whose source files should be scanned and incorporated into the ID database. By default, all languages known to the ID utilities are enabled.
`-x languages'
`--exclude=languages'
The `--exclude' option names languages whose source files should not be scanned. The default list of excluded languages is empty. Note that only one of `--include' or `--exclude' may be specified on the command line for a single run.
`-l language:options'
`--lang-option=language:options'
Language-specific scanners also accept options. Language denotes the desired scanner, and options are the command-line options that should be passed through to it. For example, to pass the -x --coke-bottle options to the scanner for the language swizzle, pass this: -l swizzle:"-x --coke-bottle", or this: --lang-option=swizzle:"-x --coke-bottle", or this: -l swizzle:-x -l swizzle:--coke-bottle. Use the `--help' option to see the command-line option summary for each language scanner.
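The equivalence of the spellings above can be illustrated with a short Python sketch that groups `language:options' values by language. This is not how `mkid' parses its arguments internally; the function name `collect_lang_options' is hypothetical, and the input is assumed to be the option values as they arrive after the shell has removed the quotes.

```python
import shlex

def collect_lang_options(values):
    """Group the values of repeated -l/--lang-option flags by language."""
    per_language = {}
    for value in values:
        # Split on the first colon only, in case the options contain colons.
        language, _, options = value.partition(":")
        per_language.setdefault(language, []).extend(shlex.split(options))
    return per_language
```

Both `collect_lang_options(["swizzle:-x --coke-bottle"])' and `collect_lang_options(["swizzle:-x", "swizzle:--coke-bottle"])' yield the same option list for swizzle.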

To determine which tokens to extract from a file and store in the database, `mkid' calls a scanner; we say a scanner recognizes a particular language. Scanners for several languages are built into `mkid'; you can add your own scanners as well, as explained in section Defining New Scanners in the Source Code.

The ID utilities determine which scanner to use for a particular file by consulting the language-map file. You can see which languages have built-in scanners, and examine their language-specific options, by invoking `mkid --help' or `xtokid --help'.

Mapping file names to source languages

The file `id-lang.map', installed by default in `$(prefix)/share/id-lang.map', contains rules for mapping file names to source languages. Each rule comprises three parts: a shell glob pattern, a language name, and language-specific scanner options.

The special pattern `**' denotes the default source language. This is the language that's assigned to file names that don't match any other pattern.

The special pattern `***' should be followed by a file name. The named file should contain more language-map rules and is included at this point.

The order in which rules are presented in a language-map file is significant. This order influences the order in which files are displayed as the result of queries. For example, the distributed language-map file places all rules for C .h files ahead of .c files, so that in general, declarations will precede definitions in query output. The same thing is done for C++ and its many different source file name extensions.

Here is a pared-down version of the `id-lang.map' file distributed with the ID utilities:


# Default language
**			IGNORE	# Although this is listed first,
				# the default language pattern is
				# logically matched last.

# Backup files
*~			IGNORE
*.bak			IGNORE
*.bk[0-9]		IGNORE

# SCCS files
[sp].*			IGNORE

# list header files before code files
*.h			C
*.h.in			C
*.H			C++
*.hh			C++
*.hpp			C++
*.hxx			C++

# list C `meta' files next
*.l			C
*.lex			C
*.y			C
*.yacc			C

# list C code files after header files
*.c			C
*.C			C++
*.cc			C++
*.cpp			C++
*.cxx			C++

# list assembly language after C
*.[sS]			asm --comment=;
*.asm			asm --comment=;

# [nt]roff
*.[0-9]			roff
*.ms			roff
*.me			roff
*.mm			roff

# TeX and friends
*.tex			TeX
*.ltx			TeX
*.texi			texinfo
*.texinfo		texinfo
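The matching rules can be modeled with a brief Python sketch. It is a simplified illustration rather than the parser used by the ID utilities: the function names are hypothetical, `#' is assumed to start a comment anywhere on a line, patterns are assumed to be matched against the base name of a file with the first matching rule winning, and `***' include lines are skipped instead of being followed.

```python
import os
from fnmatch import fnmatchcase

def parse_lang_map(text):
    """Parse language-map text into (rules, default) using the three-part layout."""
    rules, default = [], None
    for line in text.splitlines():
        # Drop any comment, then split into pattern, language, options.
        fields = line.split("#", 1)[0].split(None, 2)
        if len(fields) < 2:
            continue  # blank or comment-only line
        pattern, language = fields[0], fields[1]
        options = fields[2].strip() if len(fields) > 2 else ""
        if pattern == "***":
            continue  # include directive: not followed in this sketch
        if pattern == "**":
            default = (language, options)  # logically matched last
        else:
            rules.append((pattern, language, options))
    return rules, default

def language_for(path, rules, default):
    """Return (language, options) for path, falling back to the default rule."""
    name = os.path.basename(path)
    for pattern, language, options in rules:
        if fnmatchcase(name, pattern):  # case-sensitive shell-style glob
            return language, options
    return default
```

Under these assumptions, a file named `s.main.c' is assigned IGNORE by the SCCS rule before the `*.c' rule is ever consulted.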

C/C++ Language Scanner

The C scanner is the most commonly used. Files that match the glob patterns `*.h' or `*.c', as well as `yacc' files that match `*.y' or `*.yacc', and `lex' files that match `*.l' or `*.lex', are processed with this scanner.

Scanner-specific options (Note, these options are presented without the required `-l' or `--lang-option=' prefix):

`-k character-class'
`--keep=character-class'
Consider the characters in character-class as valid constituents of identifier names. For example, if you are indexing C code that contains `$' in some of its identifiers, you can include these by using `--lang-option=C:--keep=$', or `-l C:"-k $"' (if you don't like to type so much).
`-i character-class'
`--ignore=character-class'
Consider the characters in character-class as valid constituents of identifier names, but discard all tokens containing these characters. For example, if some C code has identifiers containing `$', but you don't want these cluttering up your ID database, use `--lang-option=C:--ignore=$', or the terser equivalent `-l C:"-i $"'.
`-u'
`--strip-underscore'
Strip one leading underscore from C identifiers encapsulated as character strings. This option is useful if you are indexing C code that contains symbol-table name strings for systems that prepend an underscore to external symbols. By default, the leading underscore is retained.
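The difference between `--keep' and `--ignore' can be shown with a deliberately rough Python sketch. It is not the C scanner from the ID utilities: it knows nothing about comments, strings, numbers or keywords, the function name `scan_identifiers' is hypothetical, and the character class is assumed to be a plain string of literal characters.

```python
import re

def scan_identifiers(source, keep="", ignore=""):
    """Extract identifier-like tokens, honoring --keep and --ignore style classes."""
    # Both classes extend the set of identifier constituents.
    extra = re.escape(keep + ignore)
    token_re = re.compile("[A-Za-z_" + extra + "][A-Za-z0-9_" + extra + "]*")
    tokens = token_re.findall(source)
    # Tokens containing an ignored character are discarded entirely.
    return [t for t in tokens if not any(ch in ignore for ch in t)]
```

For the input `x = sys$open(fd);', the default yields `sys' and `open' as separate tokens, `keep="$"' yields the single token `sys$open', and `ignore="$"' drops that token altogether.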

Assembly Language Scanner

Assembly languages use a variety of commenting conventions, and allow a variety of special characters to dirty up local symbols, preventing name space conflicts with symbols defined by higher-level languages. Also, some compilation systems prepend an underscore to external symbols. The options listed below are designed to address these differences.

`-c character-class'
`--comment=character-class'
The characters in character-class are considered left delimiters for comments that extend until the end of the current line.
`-k character-class'
`--keep=character-class'
Consider the characters of character-class as valid constituents of identifier names. For example, if you are indexing assembly code that prepends `.' to assembler directives, and prepends `%' to register names, you can keep these characters in the tokens by specifying `--lang-option=asm:--keep=.%', or `-l asm:"-k .%"'.
`-i character-class'
`--ignore=character-class'
Consider the characters of character-class as valid constituents of identifier names, but discard all tokens containing these characters. For example, if you don't want to clutter your ID database with assembler directives that begin with a leading `.' or with assembler labels that contain `@', use `--lang-option=asm:--ignore=.@', or `-l asm:"-i .@"'.
`-u'
`--strip-underscore'
Strip one leading underscore from identifiers. This option is useful if your compilation system prepends an underscore to external symbols. By stripping the underscore, you can canonicalize such names and bring them into conformance with the way they are expressed in the C language. By default, the leading underscore is retained.
`-n'
`--no-cpp'
Do not recognize C preprocessor directives. By default, such lines are handled in the same way as they are by the C language scanner.
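Two of these options are simple enough to model directly. The Python sketch below is illustrative only and does not come from the assembly scanner: both function names are hypothetical, and the comment handling assumes that a comment character never appears inside a string literal.

```python
def strip_line_comment(line, comment_chars):
    """Cut line at the first character that opens an end-of-line comment."""
    cut = len(line)
    for ch in comment_chars:
        pos = line.find(ch)
        if pos != -1:
            cut = min(cut, pos)
    return line[:cut]

def strip_underscore(token):
    """Remove exactly one leading underscore, as --strip-underscore describes."""
    return token[1:] if token.startswith("_") else token
```

With `--comment=;', the line `mov r0, r1 ; copy' is scanned as `mov r0, r1 ', and with `--strip-underscore' the symbol `_main' is recorded as `main'.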

Text Scanner

The plain text scanner is intended for human-language documents, or as the scanner of last resort for files that have no scanner that is more specific. It is customizable to the extent that character classes can be designated as token constituents or as token delimiters. The default token constituents are the alpha-numerics; all other characters are considered token delimiters.

`-i character-class'
`--include=character-class'
Include characters belonging to character-class in tokens.
`-x character-class'
`--exclude=character-class'
Exclude characters belonging to character-class from tokens, i.e., treat them as token delimiters.
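The effect of the two character classes on tokenization can be sketched as follows. This is a minimal model, not the text scanner itself: the function name `text_tokens' is hypothetical, and each character class is assumed to be a plain string of literal characters.

```python
def text_tokens(text, include="", exclude=""):
    """Split text into tokens of alphanumerics plus any included characters."""
    def is_constituent(ch):
        if ch in exclude:
            return False  # excluded characters always act as delimiters
        return ch.isalnum() or ch in include

    tokens, current = [], []
    for ch in text:
        if is_constituent(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:  # flush the token that runs to the end of the text
        tokens.append("".join(current))
    return tokens
```

By default `e-mail' is split into `e' and `mail'; with `include="-"' it is kept as the single token `e-mail'.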

Defining New Scanners in the Source Code

To add a new scanner in source code, you should add a new section to the file `scanners.c'. It might be easiest to clone one of the existing scanners and modify it as necessary. For the hypothetical language foo, you must define the functions get_token_foo, parse_args_foo, help_me_foo, as well as the tables long_options_foo and args_foo. If your scanner is modeled after one of the existing scanners, you'll also need a character-attribute table ctype_foo.

This is not a terribly difficult programming task, but it requires recompiling and installing the new version of `mkid' and `xtokid'. You should use `xtokid' to test the operation of the new scanner.

Once these functions and tables are ready, add function prototypes and an entry to the languages_0 table near the beginning of the file.

Be warned that the existing scanners are built for speed, not elegance or readability. You might wish to create a new scanner that's easier to read and understand if you don't feel that speed is so important.

