Certain options, and regular expression syntax, are shared by various
groupings of the ID utilities. We describe these in the sections below,
rather than repeating them for each program.
- `--help'
-
Print a usage message listing all available options, then exit successfully.
- `--version'
-
Print the version number, then exit successfully.
- `-f filename'
-
- `--file=filename'
-
Filename is the ID database to read when processing queries. At
present, only a single `--file' option is processed, but in future
releases, more than one ID database may be named on the command line.
- `$IDPATH'
-
`IDPATH' is an environment variable that contains a
colon-separated list of ID database names. If this variable is present,
and no `--file' options are presented on the command line, the ID
databases named in `IDPATH' are implied.
If no ID databases are specified either on the command line or via the
`IDPATH' environment variable, then the ID utilities search for a
file named `ID' in the current working directory, and then in
successive parent directories.
- `-o filename'
-
- `--output=filename'
-
The `--output' option names the file in which to write a new ID
database. If no `--output' (or `--file') option is present,
an output file named `ID' is implied.
- `-f filename'
-
- `--file=filename'
-
This is a synonym for `--output'.
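The lookup order described above for reading an ID database (`--file' first, then `IDPATH', then a search for `ID' in the current directory and its parents) can be sketched as follows. This is only an illustration of the documented rules, not the ID utilities' own code; the function name `locate_id_databases' and its parameters are hypothetical, and an empty `IDPATH' is assumed to count as absent.

```python
import os

def locate_id_databases(file_option=None, environ=None, start_dir="."):
    """Return the ID database paths implied by the documented lookup rules."""
    environ = os.environ if environ is None else environ
    # 1. An explicit --file option takes precedence.
    if file_option:
        return [file_option]
    # 2. Otherwise fall back to the colon-separated IDPATH variable.
    #    (Assumption: an empty value is treated like an unset variable.)
    id_path = environ.get("IDPATH")
    if id_path:
        return [name for name in id_path.split(":") if name]
    # 3. Otherwise look for a file named ID here, then in each parent.
    directory = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(directory, "ID")
        if os.path.isfile(candidate):
            return [candidate]
        parent = os.path.dirname(directory)
        if parent == directory:  # reached the file system root
            return []
        directory = parent
```

Calling `locate_id_databases(None, {"IDPATH": "a/ID:b/ID"})' returns both names from the variable, while passing a file option ignores `IDPATH' entirely.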
The programs `mkid' and `xtokid' accept the names of files and
directories on the command line. Files are scanned if there is a
scanner available and enabled for the file's source language.
Directories are recursively descended, searching for files whose names
match the rules listed in the language map file (see section Mapping file names to source languages).
The following option controls the file tree walker:
- `-p names'
-
- `--prune=names'
-
One or more file or directory names may appear in names. The file
tree walker will stop short at these files and directories and their
contents will not be scanned.
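A pruned tree walk of this kind might look like the sketch below. It is not taken from `mkid'; the function name `walk_with_prune' is hypothetical, and it assumes the pruned names are given as paths relative to the same base as the root being walked.

```python
import os

def walk_with_prune(root, prune=()):
    """Yield file paths under root, skipping anything named in prune."""
    pruned = {os.path.normpath(p) for p in prune}
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending further.
        dirnames[:] = sorted(
            d for d in dirnames
            if os.path.normpath(os.path.join(dirpath, d)) not in pruned
        )
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.normpath(path) not in pruned:
                yield path
```

Pruning a directory removes its whole subtree from the walk, whereas pruning a single file only skips that file.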
The programs `lid' and `fnid' can print lists of file names as
the result of queries. The following option controls how these lists
are formatted:
- `-S style'
-
- `--separator=style'
-
Style may be one of `braces', `space' or `newline'.
The style of `braces' means that file names with common
directory prefix and common suffix are printed using the shell's brace
notation in order to compress the output. For example,
`../src/foo.c ../src/bar.c' can be printed in brace notation as
`../src/{foo,bar}.c'.
The styles of `space' and `newline' mean that file names
are separated by spaces or by newlines, respectively.
If the list of files is being printed on a terminal, brace notation is
the default. If not, file names are separated by spaces if the
key is included in the output, and by newlines if the key style
is `none' (see section lid: Querying an ID Database by Token).
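To make the `braces' style concrete, here is a minimal sketch that folds a list of file names into brace notation. It illustrates the idea only and is not the algorithm `lid' uses; the function name `brace_compress' is hypothetical, and the sketch assumes `/' as the directory separator and treats the whole list as one group.

```python
import os

def brace_compress(paths):
    """Fold paths sharing a directory prefix and a suffix into brace notation."""
    if len(paths) < 2:
        return " ".join(paths)
    # Common leading text, cut back to the last directory separator.
    prefix = os.path.commonprefix(paths)
    prefix = prefix[: prefix.rfind("/") + 1]
    rests = [p[len(prefix):] for p in paths]
    # Common trailing text, found by comparing the reversed remainders.
    suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]
    middles = [r[: len(r) - len(suffix)] for r in rests]
    if not prefix and not suffix:
        return " ".join(paths)  # nothing shared, so nothing to compress
    return prefix + "{" + ",".join(middles) + "}" + suffix

print(brace_compress(["../src/foo.c", "../src/bar.c"]))  # ../src/{foo,bar}.c
```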
`mkid' and `xtokid' walk file trees, select source files by
name, and extract tokens from source files. They accept the following
options:
- `-m mapfile'
-
- `--lang-map=mapfile'
-
mapfile contains rules for determining the source languages from
file names. See section Mapping file names to source languages.
- `-i languages'
-
- `--include=languages'
-
The `--include' option names languages whose source files
should be scanned and incorporated into the ID database. By default,
all languages known to the ID utilities are enabled.
- `-x languages'
-
- `--exclude=languages'
-
The `--exclude' option names languages whose source files
should not be scanned. The default list of excluded languages is
empty. Note that only one of `--include' or `--exclude' may
be specified on the command line for a single run.
- `-l language:options'
-
- `--lang-option=language:options'
-
Language-specific scanners also accept options. Language denotes
the desired scanner, and options are the command-line options that
should be passed through to it. For example, to pass the -x
--coke-bottle options to the scanner for the language swizzle,
pass this: -l swizzle:"-x --coke-bottle", or this:
--lang-option=swizzle:"-x --coke-bottle", or this: -l
swizzle:-x -l swizzle:--coke-bottle. Use the `--help' option to
see the command-line option summary for each scanner.
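The language:options syntax, and the fact that several `-l' flags for the same language accumulate, can be illustrated with a short sketch. The function name `collect_lang_options' is hypothetical and the code is not from the ID utilities; it assumes each value has already had its shell quoting removed.

```python
import shlex

def collect_lang_options(values):
    """Group the arguments of repeated -l/--lang-option flags by language."""
    per_language = {}
    for value in values:
        # Split at the first colon only; the options may contain colons too.
        language, _, options = value.partition(":")
        per_language.setdefault(language, []).extend(shlex.split(options))
    return per_language

# One combined flag and two separate flags yield the same result.
print(collect_lang_options(["swizzle:-x --coke-bottle"]))
print(collect_lang_options(["swizzle:-x", "swizzle:--coke-bottle"]))
```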
To determine which tokens to extract from a file and store in the
database, `mkid' calls a scanner; we say a scanner
recognizes a particular language. Scanners for several languages
are built into `mkid'; you can add your own scanners as well, as
explained in section Defining New Scanners in the Source Code.
The ID utilities determine which scanner to use for a particular file by
consulting the language-map file. Scanners for several languages are already
built into the ID utilities. You can see which languages have built-in
scanners, and examine their language-specific options by invoking
`mkid --help' or `xtokid --help'.
The file `id-lang.map', installed by default in
`$(prefix)/share/id-lang.map', contains rules for mapping file
names to source languages. Each rule comprises three parts: a shell
glob pattern, a language name, and language-specific scanner
options.
The special pattern `**' denotes the default source language. This is
the language that's assigned to file names that don't match any other
pattern.
The special pattern `***' should be followed by a file name. The
named file should contain more language-map rules and is included at
this point.
The order in which rules are presented in a language-map file is
significant. This order influences the order in which files are
displayed as the result of queries. For example, the distributed
language-map file places all rules for C .h files ahead of
.c files, so that in general, declarations will precede
definitions in query output. The same thing is done for C++ and its
many different source file name extensions.
Here is a pared-down version of the `id-lang.map' file distributed
with the ID utilities:
# Default language
**              IGNORE  # Although this is listed first,
                        # the default language pattern is
                        # logically matched last.
# Backup files
*~              IGNORE
*.bak           IGNORE
*.bk[0-9]       IGNORE
# SCCS files
[sp].*          IGNORE
# list header files before code files
*.h             C
*.h.in          C
*.H             C++
*.hh            C++
*.hpp           C++
*.hxx           C++
# list C `meta' files next
*.l             C
*.lex           C
*.y             C
*.yacc          C
# list C code files after header files
*.c             C
*.C             C++
*.cc            C++
*.cpp           C++
*.cxx           C++
# list assembly language after C
*.[sS]          asm     --comment=;
*.asm           asm     --comment=;
# [nt]roff
*.[0-9]         roff
*.ms            roff
*.me            roff
*.mm            roff
# TeX and friends
*.tex           TeX
*.ltx           TeX
*.texi          texinfo
*.texinfo       texinfo
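How such a map could be applied to a file name is sketched below. This is an illustration under stated assumptions rather than the ID utilities' implementation: it assumes patterns are matched against the base name only, that the first matching rule wins, that `#' always starts a comment, and it skips `***' include rules. The function names `parse_lang_map' and `language_for' are hypothetical.

```python
from fnmatch import fnmatchcase
import os

def parse_lang_map(text):
    """Return (rules, default): ordered (pattern, language, options) rules
    plus the rule for the special ** pattern, if any."""
    rules, default = [], None
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then split
        if len(fields) < 2 or fields[0] == "***":
            continue  # blank line, or an include rule (not handled here)
        rule = (fields[0], fields[1], fields[2:])
        if fields[0] == "**":
            default = rule
        else:
            rules.append(rule)
    return rules, default

def language_for(path, rules, default):
    """Pick the first rule whose glob pattern matches the file's base name."""
    name = os.path.basename(path)
    for rule in rules:
        if fnmatchcase(name, rule[0]):
            return rule
    return default  # the ** rule is consulted only after all others fail
```

Matching is case-sensitive here, which is what lets `*.c' and `*.C' select different languages as in the map above.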
The C scanner is the most commonly used. Files that match the glob
patterns `*.h' or `*.c', as well as `yacc' files that match
`*.y' or `*.yacc', and `lex' files that match `*.l'
or `*.lex', are processed with this scanner.
The C scanner accepts the following options (note that these options
are presented without the required `-l' or `--lang-option=' prefix):
- `-k character-class'
-
- `--keep=character-class'
-
Consider the characters in character-class as valid constituents of
identifier names. For example, if you are indexing C code that contains
`$' in some of its identifiers, you can include these by using
`--lang-option=C:--keep=$', or `-l C:"-k $"' (if you don't like
to type so much).
- `-i character-class'
-
- `--ignore=character-class'
-
Consider the characters in character-class as valid constituents of
identifier names, but discard all tokens containing these characters.
For example, if some C code has identifiers containing `$', but you
don't want these cluttering up your ID database, use
`--lang-option=C:--ignore=$', or the terser equivalent `-l
C:"-i $"'.
- `-u'
-
- `--strip-underscore'
-
Strip one leading underscore from C identifiers encapsulated as
character strings. This option is useful if you are indexing C code
that contains symbol-table name strings for systems that prepend an
underscore to external symbols. By default, the leading underscore is
retained.
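The difference between `--keep' and `--ignore' can be shown with a deliberately simplified tokenizer. It is a sketch only: unlike a real C scanner it does not skip comments, strings, or keywords, it treats character-class as a plain list of characters, and the function name `identifier_tokens' is hypothetical.

```python
import re

def identifier_tokens(text, keep="", ignore=""):
    """Extract identifier-like tokens, honoring keep and ignore classes."""
    extra = re.escape(keep + ignore)
    # Both classes extend what counts as an identifier character.
    pattern = re.compile(r"[A-Za-z_%s][A-Za-z0-9_%s]*" % (extra, extra))
    # Tokens touching an ignored character are recognized, then discarded.
    return [tok for tok in pattern.findall(text)
            if not any(ch in ignore for ch in tok)]

source = "sys$open(fd, buf); count = 0;"
print(identifier_tokens(source))              # ['sys', 'open', 'fd', 'buf', 'count']
print(identifier_tokens(source, keep="$"))    # ['sys$open', 'fd', 'buf', 'count']
print(identifier_tokens(source, ignore="$"))  # ['fd', 'buf', 'count']
```

Without either option, `$' acts as a delimiter and splits the name in two, which is why both options first have to treat it as a constituent.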
Assembly languages use a variety of commenting conventions, and allow a
variety of special characters to dirty up local symbols,
preventing name space conflicts with symbols defined by higher-level
languages. Also, some compilation systems prepend an underscore to
external symbols. The options listed below are designed to address
these differences.
- `-c character-class'
-
- `--comment=character-class'
-
The characters in character-class are considered left delimiters
for comments that extend until the end of the current line.
- `-k character-class'
-
- `--keep=character-class'
-
Consider the characters of character-class as valid constituents of
identifier names. For example, if you are indexing assembly code that
prepends `.' to assembler directives, and prepends `%' to
register names, you can keep these characters in the tokens by specifying
`--lang-option=asm:--keep=.%', or `-l asm:"-k .%"'.
- `-i character-class'
-
- `--ignore=character-class'
-
Consider the characters of character-class as valid constituents
of identifier names, but discard all tokens containing these characters.
For example, if you don't want to clutter your ID database with
assembler directives that begin with a leading `.' or with
assembler labels that contain `@', use
`--lang-option=asm:--ignore=.@', or `-l asm:"-i .@"'.
- `-u'
-
- `--strip-underscore'
-
Strip one leading underscore from identifiers. This option is useful if
your compilation system prepends an underscore to external symbols. By
stripping the underscore, you can canonicalize such names and bring them
into conformance with the way they are expressed in the C language. By
default, the leading underscore is retained.
- `-n'
-
- `--no-cpp'
-
Do not recognize C preprocessor directives. By default, such lines are
handled in the same way as they are by the C language scanner.
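The effect of `--comment' and `--strip-underscore' on a single line of assembly can be sketched as follows. This is an illustration, not the scanner's code: the function name `asm_tokens' is hypothetical, character-class is treated as a plain list of characters, and comment characters inside string literals are not given special treatment.

```python
import re

def asm_tokens(line, comment=";", strip_underscore=False):
    """Tokenize one assembly line, dropping a trailing comment first."""
    # Cut the line at the first comment-introducing character, if any.
    cut = min((i for i in (line.find(c) for c in comment) if i >= 0),
              default=len(line))
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line[:cut])
    if strip_underscore:
        # Remove exactly one leading underscore from each token.
        tokens = [t[1:] if t.startswith("_") and len(t) > 1 else t
                  for t in tokens]
    return tokens

print(asm_tokens("call _printf ; print it"))                         # ['call', '_printf']
print(asm_tokens("call _printf ; print it", strip_underscore=True))  # ['call', 'printf']
```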
The plain text scanner is intended for human-language documents, or as the
scanner of last resort for files that have no scanner that is more
specific. It is customizable to the extent that character classes can
be designated as token constituents or as token delimiters. The default
token constituents are the alpha-numerics; all other characters are
considered token delimiters.
- `-i character-class'
-
- `--include=character-class'
-
Include characters belonging to character-class in tokens.
- `-x character-class'
-
- `--exclude=character-class'
-
Exclude characters belonging to character-class from tokens, i.e., treat
them as token delimiters.
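A rough sketch of this constituent/delimiter model follows. It is not the plain text scanner itself: the function name `text_tokens' is hypothetical, only ASCII alphanumerics are treated as default constituents, character-class is taken as a plain list of characters, and exclusion is assumed to take precedence over inclusion.

```python
import string

def text_tokens(text, include="", exclude=""):
    """Split text into tokens using adjustable constituent characters."""
    # Default constituents are the alphanumerics; everything else delimits.
    constituents = (set(string.ascii_letters + string.digits)
                    | set(include)) - set(exclude)
    tokens, current = [], []
    for ch in text:
        if ch in constituents:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(text_tokens("re-use the well-known file_name"))
print(text_tokens("re-use the well-known file_name", include="-_"))
```

With the defaults, hyphens and underscores split words apart; including them keeps `re-use' and `file_name' as single tokens.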
To add a new scanner in source code, you should add a new section to the
file `scanners.c'. It might be easiest to clone one of the
existing scanners and modify it as necessary. For the hypothetical
language foo, you must define the functions get_token_foo,
parse_args_foo, and help_me_foo, as well as the tables
long_options_foo and args_foo. If your scanner is
modeled after one of the existing scanners, you'll also need a
character-attribute table ctype_foo.
This is not a terribly difficult programming task, but it requires
recompiling and installing the new version of `mkid' and `xtokid'.
You should use `xtokid' to test the operation of the new scanner.
Once these functions and tables are ready, add function prototypes and
an entry to the languages_0 table near the beginning of the file.
Be warned that the existing scanners are built for speed, not elegance
or readability. You might wish to create a new scanner that's easier to
read and understand if you don't feel that speed is so important.