9083 replace regex implementation with tre
--- old/usr/src/man/man3c/regcomp.3c.man.txt
+++ new/usr/src/man/man3c/regcomp.3c.man.txt
1 1 REGCOMP(3C) Standard C Library Functions REGCOMP(3C)
2 2
3 3 NAME
4 4 regcomp, regexec, regerror, regfree - regular-expression library
5 5
6 6 LIBRARY
7 7 Standard C Library (libc, -lc)
8 8
9 9 SYNOPSIS
10 10 #include <regex.h>
11 11
12 12 int
13 13 regcomp(regex_t *restrict preg, const char *restrict pattern,
14 14 int cflags);
15 15
16 16 int
17 17 regexec(const regex_t *restrict preg, const char *restrict string,
18 18 size_t nmatch, regmatch_t pmatch[restrict], int eflags);
19 19
20 20 size_t
21 21 regerror(int errcode, const regex_t *restrict preg,
22 22 char *restrict errbuf, size_t errbuf_size);
23 23
24 24 void
25 25 regfree(regex_t *preg);
26 26
27 27 DESCRIPTION
28 28 These routines implement IEEE Std 1003.2 ("POSIX.2") regular expressions;
29 29 see regex(5). The regcomp() function compiles an RE written as a string
30 30 into an internal form, regexec() matches that internal form against a
31 31 string and reports results, regerror() transforms error codes from either
32 32 into human-readable messages, and regfree() frees any dynamically-
33 33 allocated storage used by the internal form of an RE.
34 34
35 35 The header <regex.h> declares two structure types, regex_t and
36 36 regmatch_t, the former for compiled internal forms and the latter for
37 37 match reporting. It also declares the four functions, a type regoff_t,
38 38 and a number of constants with names starting with "REG_".
39 39
40 40 regcomp()
41 41 The regcomp() function compiles the regular expression contained in the
42 42 pattern string, subject to the flags in cflags, and places the results in
43 43 the regex_t structure pointed to by preg. The cflags argument is the
44 44 bitwise OR of zero or more of the following flags:
45 45
46 46 REG_EXTENDED Compile extended regular expressions (EREs), rather than
47 47 the basic regular expressions (BREs) that are the default.
48 48
49 49 REG_BASIC This is a synonym for 0, provided as a counterpart to
50 50 REG_EXTENDED to improve readability.
51 51
52 52 REG_NOSPEC Compile with recognition of all special characters turned
53 53 off. All characters are thus considered ordinary, so the
54 54 RE is a literal string. This is an extension, compatible
55 55 with but not specified by IEEE Std 1003.2 ("POSIX.2"), and
56 56 should be used with caution in software intended to be
57 57 portable to other systems. REG_EXTENDED and REG_NOSPEC may
58 58 not be used in the same call to regcomp().
59 59
60 + REG_LITERAL An alias of REG_NOSPEC.
61 +
60 62 REG_ICASE Compile for matching that ignores upper/lower case
61 63 distinctions. See regex(5).
62 64
63 65 REG_NOSUB Compile for matching that need only report success or
64 66 failure, not what was matched.
65 67
66 68 REG_NEWLINE Compile for newline-sensitive matching. By default,
67 69 newline is a completely ordinary character with no special
68 70 meaning in either REs or strings. With this flag, "[^"
69 71 bracket expressions and "." never match newline, a "^"
70 72 anchor matches the null string after any newline in the
71 73 string in addition to its normal function, and the "$"
72 74 anchor matches the null string before any newline in the
73 75 string in addition to its normal function.
74 76
75 77 REG_PEND The regular expression ends, not at the first NUL, but just
76 78 before the character pointed to by the re_endp member of
77 79 the structure pointed to by preg. The re_endp member is of
78 80 type const char *. This flag permits inclusion of NULs in
79 81 the RE; they are considered ordinary characters. This is
80 82 an extension, compatible with but not specified by IEEE Std
81 83 1003.2 ("POSIX.2"), and should be used with caution in
82 84 software intended to be portable to other systems.
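
A minimal sketch of how the portable cflags above change matching. The helper name matches_with_cflags() is hypothetical and not part of libc; REG_NOSPEC, REG_LITERAL and REG_PEND are left out because they are extensions that may be missing on other systems.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern with the given cflags and report
 * whether it matches string. Returns 1 for a match, 0 for no match and
 * -1 if the pattern does not compile.
 */
int
matches_with_cflags(const char *pattern, const char *string, int cflags)
{
    regex_t re;
    int status;

    assert(pattern != NULL && string != NULL);
    /* Only success or failure is needed here, so REG_NOSUB is added. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return (-1);
    status = regexec(&re, string, (size_t)0, NULL, 0);
    regfree(&re);
    return (status == 0 ? 1 : 0);
}
```

With REG_NEWLINE, "^b" can match after the newline in "a\nb" and "." stops matching the newline itself; without it, newline is an ordinary character.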
83 85
84 86 When successful, regcomp() returns 0 and fills in the structure pointed
85 87 to by preg. One member of that structure (other than re_endp) is
86 88 publicized: re_nsub, of type size_t, contains the number of parenthesized
87 89 subexpressions within the RE (except that the value of this member is
88 90 undefined if the REG_NOSUB flag was used).
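
As an illustration of the re_nsub member described above, the following sketch compiles an ERE and returns the subexpression count. count_subexpressions() is a hypothetical name chosen for this example, and the -1 error convention is an assumption of the sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the number of parenthesized subexpressions
 * in the ERE pattern, or -1 if the pattern does not compile.
 */
long
count_subexpressions(const char *pattern)
{
    regex_t re;
    long nsub;

    assert(pattern != NULL);
    /* REG_NOSUB is deliberately omitted: it leaves re_nsub undefined. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-1);    /* preg is undefined on failure; nothing to free */
    nsub = (long)re.re_nsub;
    regfree(&re);
    return (nsub);
}
```

A caller could use the result to size the pmatch array as re_nsub + 1 entries, one for the whole match plus one per subexpression.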
89 91
90 92 regexec()
91 93 The regexec() function matches the compiled RE pointed to by preg against
92 94 the string, subject to the flags in eflags, and reports results using
93 95 nmatch, pmatch, and the returned value. The RE must have been compiled
94 96 by a previous invocation of regcomp(). The compiled form is not altered
95 97 during execution of regexec(), so a single compiled RE can be used
96 98 simultaneously by multiple threads.
97 99
98 100 By default, the NUL-terminated string pointed to by string is considered
99 101 to be the text of an entire line, minus any terminating newline. The
100 102 eflags argument is the bitwise OR of zero or more of the following flags:
101 103
102 104 REG_NOTBOL The first character of the string is treated as the
103 105 continuation of a line. This means that the anchors "^",
104 106 "[[:<:]]", and "\<" do not match before it; but see
105 107 REG_STARTEND below. This does not affect the behavior of
106 108 newlines under REG_NEWLINE.
107 109
108 110 REG_NOTEOL The NUL terminating the string does not end a line, so the
109 111 "$" anchor does not match before it. This does not affect
110 112 the behavior of newlines under REG_NEWLINE.
111 113
112 114 REG_STARTEND The string is considered to start at string +
113 115 pmatch[0].rm_so and to end before the byte located at
114 116 string + pmatch[0].rm_eo, regardless of the value of
115 117 nmatch. See below for the definition of pmatch and nmatch.
116 118 This is an extension, compatible with but not specified by
117 119 IEEE Std 1003.2 ("POSIX.2"), and should be used with
118 120 caution in software intended to be portable to other
119 121 systems.
120 122
121 123 Without REG_NOTBOL, the position rm_so is considered the
122 124 beginning of a line, such that "^" matches before it, and
123 125 the beginning of a word if there is a word character at
124 126 this position, such that "[[:<:]]" and "\<" match before
125 127 it.
126 128
127 129 With REG_NOTBOL, the character at position rm_so is treated
128 130 as the continuation of a line, and if rm_so is greater than
129 131 0, the preceding character is taken into consideration. If
130 132 the preceding character is a newline and the regular
131 133 expression was compiled with REG_NEWLINE, "^" matches
132 134 before the string; if the preceding character is not a word
133 135 character but the string starts with a word character,
134 136 "[[:<:]]" and "\<" match before the string.
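
The two portable eflags can be exercised with a sketch like the one below. fragment_matches() is a hypothetical helper; REG_STARTEND is not used because it is an extension and may be absent elsewhere.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match an ERE against a fragment of a larger line,
 * passing eflags through to regexec(). Returns 1 for a match, 0 for no
 * match and -1 if the pattern does not compile.
 */
int
fragment_matches(const char *pattern, const char *fragment, int eflags)
{
    regex_t re;
    int status;

    assert(pattern != NULL && fragment != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return (-1);
    /* REG_NOTBOL disables "^" at the start, REG_NOTEOL disables "$" at the end. */
    status = regexec(&re, fragment, (size_t)0, NULL, eflags);
    regfree(&re);
    return (status == 0 ? 1 : 0);
}
```

Passing REG_NOTBOL | REG_NOTEOL suits a fragment taken from the middle of a line, where neither end is a real line boundary.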
135 137
136 138 See regex(5) for a discussion of what is matched in situations where an
137 139 RE or a portion thereof could match any of several substrings of string.
138 140
139 141 If REG_NOSUB was specified in the compilation of the RE, or if nmatch is
140 142 0, regexec() ignores the pmatch argument (but see below for the case
141 143 where REG_STARTEND is specified). Otherwise, pmatch points to an array
142 144 of nmatch structures of type regmatch_t. Such a structure has at least
143 145 the members rm_so and rm_eo, both of type regoff_t (a signed arithmetic
144 146 type at least as large as an off_t and a ssize_t), containing
145 147 respectively the offset of the first character of a substring and the
146 148 offset of the first character after the end of the substring. Offsets
147 149 are measured from the beginning of the string argument given to
148 150 regexec(). An empty substring is denoted by equal offsets, both
149 151 indicating the character following the empty substring.
150 152
151 153 The 0th member of the pmatch array is filled in to indicate what
152 154 substring of string was matched by the entire RE. Remaining members
153 155 report what substring was matched by parenthesized subexpressions within
154 156 the RE; member i reports subexpression i, with subexpressions counted
155 157 (starting at 1) by the order of their opening parentheses in the RE, left
156 158 to right. Unused entries in the array (corresponding either to
157 159 subexpressions that did not participate in the match at all, or to
158 160 subexpressions that do not exist in the RE (that is, i > preg->re_nsub))
159 161 have both rm_so and rm_eo set to -1. If a subexpression participated in
160 162 the match several times, the reported substring is the last one it
161 163 matched. (Note, as an example in particular, that when the RE "(b*)+"
162 164 matches "bbb", the parenthesized subexpression matches each of the three
163 165 "b"s and then an infinite number of empty strings following the last "b",
164 166 so the reported substring is one of the empties.)
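
The pmatch conventions above (entry 0 for the whole match, entry i for subexpression i, and -1 offsets for unused entries) might be read back as in this sketch. group_length() and the fixed limit of 10 entries are assumptions made for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the length of the substring matched by
 * subexpression idx of the ERE pattern (idx 0 is the whole match).
 * Returns -1 if the pattern does not compile, nothing matches, or the
 * subexpression did not take part in the match.
 */
long
group_length(const char *pattern, const char *string, size_t idx)
{
    enum { NGROUPS = 10 };      /* arbitrary limit for this sketch */
    regex_t re;
    regmatch_t pm[NGROUPS];
    long len = -1;

    assert(pattern != NULL && string != NULL && idx < NGROUPS);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-1);
    /* Unused entries have rm_so and rm_eo set to -1. */
    if (regexec(&re, string, NGROUPS, pm, 0) == 0 && pm[idx].rm_so != -1)
        len = (long)(pm[idx].rm_eo - pm[idx].rm_so);
    regfree(&re);
    return (len);
}
```

The matched text itself starts at string + pm[idx].rm_so and is not NUL-terminated, so a caller that wants a copy needs the length computed here.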
165 167
166 168 If REG_STARTEND is specified, pmatch must point to at least one
167 169 regmatch_t (even if nmatch is 0 or REG_NOSUB was specified), to hold the
168 170 input offsets for REG_STARTEND. Use for output is still entirely
169 171 controlled by nmatch; if nmatch is 0 or REG_NOSUB was specified, the
170 172 value of pmatch[0] will not be changed by a successful regexec().
171 173
172 174 regerror()
173 175 The regerror() function maps a non-zero errcode from either regcomp() or
174 176 regexec() to a human-readable, printable message. If preg is non-NULL,
175 177 the error code should have arisen from use of the regex_t pointed to by
176 178 preg, and if the error code came from regcomp(), it should have been the
177 179 result from the most recent regcomp() using that regex_t.
178 180 (regerror() may be able to supply a more detailed message using
179 181 information from the regex_t.) The regerror() function places the NUL-
180 182 terminated message into the buffer pointed to by errbuf, limiting the
181 183 length (including the NUL) to at most errbuf_size bytes. If the whole
182 184 message will not fit, as much of it as will fit before the terminating
183 185 NUL is supplied. In any case, the returned value is the size of buffer
184 186 needed to hold the whole message (including terminating NUL). If
185 187 errbuf_size is 0, errbuf is ignored but the return value is still
186 188 correct.
187 189
188 - If the errcode given to regerror() is first ORed with REG_ITOA, the
189 - "message" that results is the printable name of the error code, e.g.
190 - "REG_NOMATCH", rather than an explanation thereof. If errcode is
191 - REG_ATOI, then preg shall be non-NULL and the re_endp member of the
192 - structure it points to must point to the printable name of an error code;
193 - in this case, the result in errbuf is the decimal digits of the numeric
194 - value of the error code (0 if the name is not recognized). REG_ITOA and
195 - REG_ATOI are intended primarily as debugging facilities; they are
196 - extensions, compatible with but not specified by IEEE Std 1003.2
197 - ("POSIX.2"), and should be used with caution in software intended to be
198 - portable to other systems.
199 -
200 190 regfree()
201 191 The regfree() function frees any dynamically-allocated storage associated
202 192 with the compiled RE pointed to by preg. The remaining regex_t is no
203 193 longer a valid compiled RE and the effect of supplying it to regexec() or
204 194 regerror() is undefined.
205 195
206 -IMPLEMENTATION NOTES
207 - There are a number of decisions that IEEE Std 1003.2 ("POSIX.2") leaves
208 - up to the implementor, either by explicitly saying "undefined" or by
209 - virtue of them being forbidden by the RE grammar. This implementation
210 - treats them as follows.
211 -
212 - There is no particular limit on the length of REs, except insofar as
213 - memory is limited. Memory usage is approximately linear in RE size, and
214 - largely insensitive to RE complexity, except for bounded repetitions.
215 -
216 - A backslashed character other than one specifically given a magic meaning
217 - by IEEE Std 1003.2 ("POSIX.2") (such magic meanings occur only in BREs)
218 - is taken as an ordinary character.
219 -
220 - Any unmatched "[" is a REG_EBRACK error.
221 -
222 - Equivalence classes cannot begin or end bracket-expression ranges. The
223 - endpoint of one range cannot begin another.
224 -
225 - RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is
226 - 255.
227 -
228 - A repetition operator ("?", "*", "+", or bounds) cannot follow another
229 - repetition operator. A repetition operator cannot begin an expression or
230 - subexpression or follow "^" or "|".
231 -
232 - "|" cannot appear first or last in a (sub)expression or after another
233 - "|", i.e., an operand of "|" cannot be an empty subexpression. An empty
234 - parenthesized subexpression, "()", is legal and matches an empty
235 - (sub)string. An empty string is not a legal RE.
236 -
237 - A "{" followed by a digit is considered the beginning of bounds for a
238 - bounded repetition, which must then follow the syntax for bounds. A "{"
239 - not followed by a digit is considered an ordinary character.
240 -
241 - "^" and "$" beginning and ending subexpressions in BREs are anchors, not
242 - ordinary characters.
243 -
244 196 RETURN VALUES
245 197 On successful completion, the regcomp() function returns 0. Otherwise,
246 198 it returns an integer value indicating an error as described in
247 199 <regex.h>, and the content of preg is undefined.
248 200
249 201 On successful completion, the regexec() function returns 0. Otherwise it
250 - returns REG_NOMATCH to indicate no match, or REG_ENOSYS to indicate that
251 - the function is not supported.
202 + returns REG_NOMATCH to indicate no match.
252 203
253 204 Upon successful completion, the regerror() function returns the number of
254 - bytes needed to hold the entire generated string. Otherwise, it returns
255 - 0 to indicate that the function is not implemented.
205 + bytes needed to hold the entire generated string.
256 206
257 207 The regfree() function returns no value.
258 208
259 209 The following constants are defined as error return values:
260 210
261 211 REG_NOMATCH The regexec() function failed to match.
262 212 REG_BADPAT Invalid regular expression.
263 213 REG_ECOLLATE Invalid collating element referenced.
264 214 REG_ECTYPE Invalid character class type referenced.
265 215 REG_EESCAPE Trailing "\" in pattern.
266 216 REG_ESUBREG Number in "\digit" invalid or in error.
267 217 REG_EBRACK "[]" imbalance.
268 218 REG_ENOSYS The function is not supported.
269 219 REG_EPAREN "\(\)" or "()" imbalance.
270 220 REG_EBRACE "\{\}" imbalance.
271 221 REG_BADBR Content of "\{\}" invalid: not a number, number too large,
272 222 more than two numbers, first larger than second.
273 223 REG_ERANGE Invalid endpoint in range expression.
274 224 REG_ESPACE Out of memory.
275 225 REG_BADRPT "?", "*" or "+" not preceded by valid regular expression.
226 + REG_EMPTY Empty (sub)expression.
227 + REG_INVARG Invalid argument, e.g. negative-length string.
276 228
277 229 USAGE
278 230 An application could use:
279 231
280 232 regerror(code, preg, (char *)NULL, (size_t)0)
281 233
282 234 to find out how big a buffer is needed for the generated string, malloc()
283 235 a buffer to hold the string, and then call regerror() again to get the
284 236 string (see malloc(3C)). Alternately, it could allocate a fixed, static
285 237 buffer that is big enough to hold most strings, and then use malloc()
286 238 to allocate a larger buffer if it finds that this is too small.
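
The two-call approach described above could look like the following sketch. regex_strerror() is a hypothetical name; the caller is assumed to free() the returned buffer.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for code,
 * or NULL if memory cannot be allocated. preg may be NULL.
 */
char *
regex_strerror(int code, const regex_t *preg)
{
    size_t need;
    char *msg;

    assert(code != 0);
    /* With errbuf_size 0, errbuf is ignored and the size is returned. */
    need = regerror(code, preg, NULL, (size_t)0);
    if (need == 0)
        return (NULL);
    msg = malloc(need);
    if (msg == NULL)
        return (NULL);
    /* Second call fills the buffer, including the terminating NUL. */
    (void) regerror(code, preg, msg, need);
    return (msg);
}
```

The fixed-buffer alternative avoids the allocation in the common case at the cost of a second code path for long messages.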
287 239
288 240 EXAMPLES
289 241 Matching string against the extended regular expression in pattern.
290 242
291 243 #include <regex.h>
292 244
293 245 /*
294 246 * Match string against the extended regular expression in
295 247 * pattern, treating errors as no match.
296 248 *
297 249 * return 1 for match, 0 for no match
298 250 */
299 251 int
300 252 match(const char *string, char *pattern)
301 253 {
302 254 int status;
303 255 regex_t re;
304 256
305 257 if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
306 258 return(0); /* report error */
307 259 }
308 260 status = regexec(&re, string, (size_t) 0, NULL, 0);
309 261 regfree(&re);
310 262 if (status != 0) {
311 263 return(0); /* report error */
312 264 }
313 265 return(1);
314 266 }
315 267
316 268 The following demonstrates how the REG_NOTBOL flag could be used with
317 269 regexec() to find all substrings in a line that match a pattern supplied
318 270 by a user. (For simplicity of the example, very little error checking is
319 271 done.)
320 272
321 273 (void) regcomp(&re, pattern, 0);
322 274 /* this call to regexec() finds the first match on the line */
323 275 error = regexec(&re, &buffer[0], 1, &pm, 0);
324 276 while (error == 0) { /* while matches found */
325 277 /* substring found between pm.rm_so and pm.rm_eo */
326 278 /* This call to regexec() finds the next match */
327 279 error = regexec(&re, buffer + pm.rm_eo, 1, &pm, REG_NOTBOL);
328 280 }
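
Because offsets are measured from the string passed to each regexec() call, a loop that makes more than two calls needs to advance a running pointer rather than add pm.rm_eo to the start of the buffer each time. The sketch below is one way to do that; count_matches() is a hypothetical helper, and stepping one byte past an empty match is a choice made for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the non-overlapping matches of the BRE
 * pattern in line. Returns -1 if the pattern does not compile.
 */
int
count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t pm;
    const char *cur = line;     /* start of the unsearched remainder */
    int eflags = 0;
    int count = 0;

    assert(pattern != NULL && line != NULL);
    if (regcomp(&re, pattern, 0) != 0)
        return (-1);
    while (regexec(&re, cur, 1, &pm, eflags) == 0) {
        /* The match spans cur + pm.rm_so up to cur + pm.rm_eo. */
        count++;
        if (pm.rm_eo == pm.rm_so) {
            /* Empty match: step one byte so the loop makes progress. */
            if (cur[pm.rm_eo] == '\0')
                break;
            cur += pm.rm_eo + 1;
        } else {
            cur += pm.rm_eo;
        }
        /* Later calls no longer start at the beginning of the line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return (count);
}
```

The offset of a match within the original line is (cur - line) + pm.rm_so, evaluated before cur is advanced.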
329 281
330 282 ERRORS
331 283 No errors are defined.
332 284
333 285 CODE SET INDEPENDENCE
334 286 Enabled
335 287
336 288 INTERFACE STABILITY
337 289 Standard
338 290
339 291 MT-LEVEL
340 292 MT-Safe with exceptions
341 293
342 294 The regcomp() function can be used safely in a multithreaded application
343 295 as long as setlocale(3C) is not being called to change the locale.
344 296
345 297 SEE ALSO
346 298 attributes(5), regex(5), standards(5)
347 299
348 300 IEEE Std 1003.2 ("POSIX.2"), sections 2.8 (Regular Expression Notation)
349 301 and B.5 (C Binding for Regular Expression Matching).
350 302
351 -illumos June 14, 2017 illumos
303 +illumos February 3, 2018 illumos