uwildmat(3)
NAME
uwildmat, uwildmat_simple, uwildmat_poison - Perform wildmat matching
SYNOPSIS
#include <libinn.h>
bool uwildmat(const char *text, const char *pattern);
bool uwildmat_simple(const char *text, const char *pattern);
enum uwildmat uwildmat_poison(const char *text, const char *pattern);
DESCRIPTION
uwildmat compares text against the wildmat expression pattern, returning
true if and only if the expression matches the text. "@" has no special
meaning in pattern when passed to uwildmat. Both text and pattern are
assumed to be in the UTF-8 character encoding, although malformed UTF-8
sequences are treated in a way that attempts to be mostly compatible with
single-octet character sets like ISO 8859-1. (In other words, if you try
to match ISO 8859-1 text with these routines everything should work as
expected unless the ISO 8859-1 text contains valid UTF-8 sequences, which
thankfully is somewhat rare.)
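The fallback described above can be pictured with a small decoder. This is
only a sketch under stated assumptions, not the code used by libinn: the
name sketch_decode is hypothetical, and the policy of treating any
malformed sequence as one single-octet character is a guess at what
"mostly compatible" means in practice.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libinn.  Decode one character from the
   NUL-terminated string s.  For a well-formed UTF-8 sequence, store the
   code point in *cp and return its length in bytes (1 to 4).  For anything
   malformed, store the first byte in *cp and return 1, i.e. treat it as a
   single-octet character.  Overlong encodings and surrogates are not
   rejected in this sketch. */
size_t
sketch_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;
    uint32_t value;

    if (s[0] < 0x80) {                      /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* lead byte of 2-byte form */
        len = 2;
        value = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* lead byte of 3-byte form */
        len = 3;
        value = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* lead byte of 4-byte form */
        len = 4;
        value = s[0] & 0x07;
    } else {                                /* stray continuation byte etc. */
        *cp = s[0];
        return 1;
    }
    for (i = 1; i < len; i++) {
        /* A non-continuation byte (including the terminating NUL) means
           the sequence is malformed; fall back to a single octet. */
        if ((s[i] & 0xC0) != 0x80) {
            *cp = s[0];
            return 1;
        }
        value = (value << 6) | (uint32_t) (s[i] & 0x3F);
    }
    *cp = value;
    return len;
}
```

With this policy, the ISO 8859-1 octet 0xE9 followed by an ASCII letter and
the UTF-8 sequence 0xC3 0xA9 both decode to code point 0xE9, which is why
ISO 8859-1 text usually behaves as expected.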
uwildmat_simple is identical to uwildmat except that neither "!" nor ","
has any special meaning and pattern is always treated as a single pattern.
This function exists solely to support legacy interfaces like NNTP's XPAT
command, and should be avoided when implementing new features.
uwildmat_poison works similarly to uwildmat, except that "@" as the first
character of one of the patterns in the expression (see below) "poisons"
the match if it matches. uwildmat_poison returns UWILDMAT_MATCH if the
expression matches the text, UWILDMAT_FAIL if it doesn't, and
UWILDMAT_POISON if the expression doesn't match because a poisoned pattern
matched the text. These enumeration constants are defined in the libinn.h
header.
WILDMAT EXPRESSIONS
A wildmat expression follows rules similar to those of shell filename
wildcards but with some additions and changes. A wildmat expression is
composed of one or more wildmat patterns separated by commas. Each
character in the wildmat pattern matches a literal occurrence of that same
character in the text, with the exception of the following metacharacters:
?       Matches any single character (including a single UTF-8 multibyte
        character, so "?" can match more than one byte).
*       Matches any sequence of zero or more characters.
\       Turns off any special meaning of the following character; the
        following character will match itself in the text. "\" will escape
        any character, including another backslash or a comma that
        otherwise would separate a pattern from the next pattern in an
        expression. Note that "\" is not special inside a character set
        (no metacharacters are).
[...]   A character set, which matches any single character that falls
        within that set. The presence of a character between the brackets
        adds that character to the set; for example, "[amv]" specifies the
        set containing the characters "a", "m", and "v". A range of
        characters may be specified using "-"; for example, "[0-5abc]" is
        equivalent to "[012345abc]". The order of characters is as defined
        in the UTF-8 character set, and if the start character of such a
        range falls after the ending character of the range in that ranking
        the results of attempting a match with that pattern are undefined.
        In order to include a literal "]" character in the set, it must be
        the first character of the set (possibly following "^"); for
        example, "[]a]" matches either "]" or "a". To include a literal
        "-" character in the set, it must be either the first or the last
        character of the set. Backslashes have no special meaning inside a
        character set, nor do any other of the wildmat metacharacters.
[^...]  A negated character set. Follows the same rules as a character set
        above, but matches any character not contained in the set. So, for
        example, "[^]-]" matches any character except "]" and "-".
In addition, "!" (and possibly "@") have special meaning as the first
character of a pattern; see below.
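The metacharacter rules for a single pattern can be illustrated with a
short recursive matcher. This is a sketch only, not the libinn
implementation: the names sketch_in_set and sketch_match are hypothetical,
the code works on bytes rather than UTF-8 characters (so it is faithful
only for ASCII text), it assumes every "[" has a closing "]", and it does
not handle "!", "@", or commas.

```c
#include <stdbool.h>

/* Hypothetical helper: test whether byte c is in the character set whose
   body starts at *pp (just after the opening "[").  On return, *pp points
   just past the closing "]". */
bool
sketch_in_set(const char **pp, unsigned char c)
{
    const unsigned char *p = (const unsigned char *) *pp;
    bool negated = false, found = false, first = true;

    if (*p == '^') {
        negated = true;
        p++;
    }
    /* A "]" in the first position is a member of the set, not its end. */
    while (*p != '\0' && (first || *p != ']')) {
        first = false;
        if (p[1] == '-' && p[2] != ']' && p[2] != '\0') {
            if (p[0] <= c && c <= p[2])     /* range such as 0-5 */
                found = true;
            p += 3;
        } else {
            if (*p == c)                    /* single member, maybe "-" */
                found = true;
            p++;
        }
    }
    if (*p == ']')
        p++;
    *pp = (const char *) p;
    return found != negated;
}

/* Hypothetical matcher for one pattern, anchored at both ends. */
bool
sketch_match(const char *text, const char *p)
{
    for (; *p != '\0'; text++, p++) {
        switch (*p) {
        case '*':
            /* Try every possible length for the run that "*" consumes. */
            for (;; text++) {
                if (sketch_match(text, p + 1))
                    return true;
                if (*text == '\0')
                    return false;
            }
        case '?':
            if (*text == '\0')
                return false;
            break;
        case '\\':
            if (p[1] != '\0')
                p++;                        /* next character is literal */
            if (*text != *p)
                return false;
            break;
        case '[':
            if (*text == '\0')
                return false;
            p++;
            if (!sketch_in_set(&p, (unsigned char) *text))
                return false;
            p--;                            /* loop increment passes "]" */
            break;
        default:
            if (*text != *p)
                return false;
            break;
        }
    }
    return *text == '\0';
}
```

The naive retry loop for "*" is exponential in the worst case; it is used
here only because it keeps the rules easy to read.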
When matching a wildmat expression against some text, each comma-separated
pattern is matched in order from left to right. In order to match, the
pattern must match the whole text; in regular expression terminology, it's
implicitly anchored at both the beginning and the end. For example, the
pattern "a" matches only the text "a"; it doesn't match "ab" or "ba" or
even "aa". If none of the patterns match, the whole expression doesn't
match. Otherwise, whether the expression matches is determined entirely by
the rightmost matching pattern; the expression matches the text if and only
if the rightmost matching pattern is not negated.
For example, consider the text "news.misc". The expression "*" matches
this text, of course, as does "comp.*,news.*" (because the second pattern
matches). "news.*,!news.misc" does not match this text because both
patterns match, meaning that the rightmost takes precedence, and the
rightmost matching pattern is negated. "news.*,!news.misc,*.misc" does
match this text, since the rightmost matching pattern is not negated.
Note that the expression "!news.misc" can't match anything. Either the
pattern doesn't match, in which case no patterns match and the expression
doesn't match, or the pattern does match, in which case the expression
doesn't match because the pattern is negated. "*,!news.misc", on the other
hand, is a useful expression that matches anything except "news.misc".
"!" has significance only as the first character of a pattern; anywhere
else in the pattern, it matches a literal "!" in the text like any other
non-metacharacter.
If the uwildmat_poison interface is used, then "@" behaves the same as "!"
except that if an expression fails to match because the rightmost matching
pattern began with "@", UWILDMAT_POISON is returned instead of
UWILDMAT_FAIL.
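The rightmost-match rule, including negation and poisoning, can be sketched
as follows. Everything here is an assumption made for illustration: the
names sketch_result, sketch_glob, and sketch_expression are hypothetical
stand-ins (the real constants are defined in libinn.h), and the
single-pattern matcher is deliberately minimal, working on bytes and
supporting only "*", "?", "\", and literal characters.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the three possible outcomes. */
enum sketch_result { SKETCH_FAIL, SKETCH_MATCH, SKETCH_POISON };

/* Minimal matcher for the single pattern stored in [p, end). */
bool
sketch_glob(const char *text, const char *p, const char *end)
{
    for (; p < end; text++, p++) {
        if (*p == '*') {
            for (;; text++) {
                if (sketch_glob(text, p + 1, end))
                    return true;
                if (*text == '\0')
                    return false;
            }
        }
        if (*text == '\0')
            return false;
        if (*p == '\\' && p + 1 < end)
            p++;                    /* escaped character is literal */
        else if (*p == '?')
            continue;               /* "?" accepts any one byte */
        if (*text != *p)
            return false;
    }
    return *text == '\0';
}

/* Evaluate a comma-separated expression left to right.  If poison is
   false, "@" is an ordinary character. */
enum sketch_result
sketch_expression(const char *text, const char *expr, bool poison)
{
    enum sketch_result result = SKETCH_FAIL;    /* nothing matched yet */
    const char *start = expr, *p = expr;

    for (;;) {
        if (*p == '\\' && p[1] != '\0') {
            p += 2;                 /* an escaped comma does not split */
            continue;
        }
        if (*p == ',' || *p == '\0') {
            const char *pat = start;
            enum sketch_result verdict = SKETCH_MATCH;

            if (pat < p && *pat == '!') {
                verdict = SKETCH_FAIL;
                pat++;
            } else if (poison && pat < p && *pat == '@') {
                verdict = SKETCH_POISON;
                pat++;
            }
            /* A later matching pattern overrides any earlier one. */
            if (sketch_glob(text, pat, p))
                result = verdict;
            if (*p == '\0')
                return result;
            start = p + 1;
        }
        p++;
    }
}
```

Because each matching pattern simply overwrites the running result, the
value left at the end is the verdict of the rightmost matching pattern, and
SKETCH_FAIL remains if no pattern matched at all.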
If the uwildmat_simple interface is used, the matching rules are the same
as above except that none of "!", "@", or "," have any special meaning at
all and only match those literal characters.
BUGS
All of these functions internally convert the passed arguments to const
unsigned char pointers. The only reason why they take regular char
pointers instead of unsigned char is for the convenience of INN and other
callers that may not be using unsigned char everywhere they should. In a
future revision, the public interface should be changed to just take
unsigned char pointers.
HISTORY
Written by Rich $alz <rsalz@uunet.uu.net> in 1986, and posted to Usenet
several times since then, most notably in comp.sources.misc in March, 1991.
Lars Mathiesen <thorinn@diku.dk> enhanced the multi-asterisk failure mode
in early 1991.
Rich and Lars increased the efficiency of star patterns and reposted it to
comp.sources.misc in April, 1991.
Robert Elz <kre@munnari.oz.au> added minus sign and close bracket handling
in June, 1991.
Russ Allbery <rra@stanford.edu> added support for comma-separated patterns
and the "!" and "@" metacharacters to the core wildmat routines in July,
2000. He also added support for UTF-8 characters, changed the default
behavior to assume that both the text and the pattern are in UTF-8, and
largely rewrote this documentation to expand and clarify the description of
how a wildmat expression matches.
Please note that the interfaces to these functions are named uwildmat and
the like rather than wildmat to distinguish them from the wildmat function
provided by Rich $alz's original implementation. While this code is
heavily based on Rich's original code, it has substantial differences,
including the extension to support UTF-8 characters, and has noticeable
functionality changes. Any bugs present in it aren't Rich's fault.
$Id: uwildmat.3,v 1.2 2002/08/24 17:25:23 vinocur Exp $
SEE ALSO
grep(1), fnmatch(3), regex(3), regexp(3).