Tester's Story.

Regular Expressions (Perl Style)


UltraEdit/UEStudio now support Perl-style regular expressions using the Boost C++ Libraries. The syntax is based on that of the Perl programming language.


Perl Regular Expression Syntax

In Perl regular expressions, all characters match themselves except for the following special characters:


.[{()\*+?|^$


Wildcard

The single character '.' when used outside of a character set will match any single character.
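As a rough illustration of the wildcard, here is a minimal sketch using Python's `re` module. This is an assumption made for convenience: Python's engine is not the Boost engine used by UltraEdit/UEStudio, and details such as whether '.' matches a line break may differ. The sample words are made up.

```python
import re

words = ["cat", "cot", "c.t", "ct"]

# Outside a character set, "." matches any single character.
wildcard = re.compile(r"c.t")
# Inside a character set, "." is just a literal dot.
literal_dot = re.compile(r"c[.]t")

wildcard_hits = [w for w in words if wildcard.fullmatch(w)]
literal_hits = [w for w in words if literal_dot.fullmatch(w)]

print(wildcard_hits)  # ['cat', 'cot', 'c.t']
print(literal_hits)   # ['c.t']
```

Note that "ct" is rejected by both patterns, because the wildcard must consume exactly one character.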


Anchors

A '^' character shall match the start of a line.


A '$' character shall match the end of a line.
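The anchors can be tried out with a small sketch in Python's `re` module (an assumption: this is a different engine from Boost, and in Python the line-by-line behavior described above has to be requested with `re.MULTILINE`). The sample text is made up.

```python
import re

text = "alpha one\nbeta two\nalpha three"

# With re.MULTILINE, ^ matches at the start of every line
# and $ matches at the end of every line.
line_starts = re.findall(r"^alpha", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(line_starts)  # ['alpha', 'alpha']
print(line_ends)    # ['one', 'two', 'three']
```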


Marked sub-expressions

A section beginning ( and ending ) acts as a marked sub-expression.  Whatever matched the sub-expression is split out in a separate field by the matching algorithms.  Marked sub-expressions can also be repeated, or referred to by a back-reference.
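The three uses of a marked sub-expression (separate field, repetition, back-reference) might look like this in Python's `re` module. Python is used here only as an illustrative stand-in for the Boost engine, and the sample strings are invented.

```python
import re

# Two marked sub-expressions: each match is available as its own field.
pair = re.search(r"(\w+)=(\w+)", "mode=fast")
key, value = pair.group(1), pair.group(2)
print(key, value)  # mode fast

# A repeated sub-expression: "ab" one or more times.
repeated = re.fullmatch(r"(ab)+", "ababab")
print(repeated is not None)  # True

# A back-reference: \1 must match the same text as sub-expression 1,
# which finds a doubled word.
doubled = re.search(r"(\w+) \1", "it is is fine")
print(doubled.group(0))  # is is
```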


Alternation

The | operator will match either of its arguments, so for example: abc|def will match either "abc" or "def".  


Parentheses can be used to group alternations, for example: ab(d|ef) will match either of "abd" or "abef".
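The two alternation examples above can be checked with a short sketch in Python's `re` module (an assumption: Python stands in for the Boost engine here, and the candidate strings are made up).

```python
import re

either = re.compile(r"abc|def")      # "abc" or "def"
grouped = re.compile(r"ab(d|ef)")    # "abd" or "abef"

either_hits = [s for s in ["abc", "def", "abd"] if either.fullmatch(s)]
grouped_hits = [s for s in ["abd", "abef", "abdef"] if grouped.fullmatch(s)]

print(either_hits)   # ['abc', 'def']
print(grouped_hits)  # ['abd', 'abef']
```

Without the parentheses, ab(d|ef) would read as abd|ef, which matches "abd" or "ef" instead.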


Empty alternatives are not allowed (these are almost always a mistake), but if you really want an empty alternative use (?:) as a placeholder, for example:


"|abc" is not a valid expression, but

"(?:)|abc" is and is equivalent, also the expression:

"(?:abc)??" has exactly the same effect.


Character sets

A character set is a bracket-expression starting with [ and ending with ]. It defines a set of characters and matches any single character that is a member of that set.

A bracket expression may contain any combination of the following:


Single characters

For example, [abc] will match any of the characters 'a', 'b', or 'c'.


Character ranges

For example [a-c] will match any single character in the range 'a' to 'c'.  By default, for POSIX-Perl regular expressions, a character x is within the range y to z, if it collates within that range; this results in locale specific behavior.  


Negation

If the bracket-expression begins with the ^ character, then it matches the complement of the characters it contains, for example [^a-c] matches any character that is not in the range a-c.


Character classes

An expression of the form [[:name:]] matches the named character class "name", for example [[:lower:]] matches any lower case character.  The following character class names are always supported:


Name      POSIX-standard  Description
alnum     Yes             Any alpha-numeric character.
alpha     Yes             Any alphabetic character.
blank     Yes             Any whitespace character that is not a line separator.
cntrl     Yes             Any control character.
d         No              Any decimal digit.
digit     Yes             Any decimal digit.
graph     Yes             Any graphical character.
l         No              Any lower case character.
lower     Yes             Any lower case character.
print     Yes             Any printable character.
punct     Yes             Any punctuation character.
s         No              Any whitespace character.
space     Yes             Any whitespace character.
unicode   No              Any extended character whose code point is above 255 in value.
u         No              Any upper case character.
upper     Yes             Any upper case character.
w         No              Any word character (alphanumeric characters plus the underscore).
word      No              Any word character (alphanumeric characters plus the underscore).
xdigit    Yes             Any hexadecimal digit character.
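The single-character, range, and negation forms of a bracket expression can be sketched with Python's `re` module. Two caveats, both assumptions of this sketch: Python is not the Boost engine, so its ranges follow code point order rather than locale collation, and it does not support the [[:name:]] class syntax, which is therefore left out. The sample text is made up.

```python
import re

text = "abcdxyz"

singles = re.findall(r"[abc]", text)    # any of 'a', 'b', 'c'
in_range = re.findall(r"[a-c]", text)   # any character from 'a' to 'c'
negated = re.findall(r"[^a-c]", text)   # any character NOT from 'a' to 'c'

print(singles)   # ['a', 'b', 'c']
print(in_range)  # ['a', 'b', 'c']
print(negated)   # ['d', 'x', 'y', 'z']
```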


Escapes

Any special character preceded by an escape shall match itself.   The following escape sequences are also supported:


Escapes matching a specific character

The following escape sequences are all synonyms for single characters:


Escape      Character
\a          \a
\e          0x1B
\f          \f
\n          \n
\r          \r
\t          \t
\v          \v
\b          \b (but only inside a character class declaration).
\cX         An ASCII escape sequence - the character whose code point is X % 32.
\xdd        A hexadecimal escape sequence - matches the single character whose code point is 0xdd.
\x{dddd}    A hexadecimal escape sequence - matches the single character whose code point is 0xdddd.
\0ddd       An octal escape sequence - matches the single character whose code point is 0ddd.
\N{name}    Matches the single character which has the symbolic name "name".  For example \N{newline} matches the single character \n.
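A few of these escapes can be illustrated in Python. This sketch makes two assumptions: Python's `re` module stands in for the Boost engine (it supports \xdd, \t and \n, but not \cX or \x{dddd}), and `control_code` is a hypothetical helper that only reproduces the "X % 32" arithmetic from the table.

```python
import re

# \xdd: the character whose code point is 0xdd (0x41 is "A").
hex_match = re.fullmatch(r"\x41", "A")
print(hex_match is not None)  # True

# \t is a synonym for the tab character.
fields = re.split(r"\t", "a\tb")
print(fields)  # ['a', 'b']

def control_code(letter: str) -> int:
    """Hypothetical helper: code point that \\cX would stand for (X % 32)."""
    return ord(letter) % 32

print(control_code("I"))  # 9  (tab)
print(control_code("M"))  # 13 (carriage return)
```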


"Single character" character classes

Any escaped character x, if x is the name of a character class, shall match any character that is a member of that class, and any escaped upper case character X, if the lower case x is the name of a character class, shall match any character not in that class.  The following are supported by default:


Escape sequence  Equivalent to
\d               [[:digit:]]
\l               [[:lower:]]
\s               [[:space:]]
\u               [[:upper:]]
\w               [[:word:]]
\D               [^[:digit:]]
\L               [^[:lower:]]
\S               [^[:space:]]
\U               [^[:upper:]]
\W               [^[:word:]]
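The lower case/upper case pairing can be seen in a short sketch with Python's `re` module. This is an assumption for illustration only: Python is not the Boost engine, and it supports \d, \s, \w and their negations but not \l, \u, \L or \U, so only the former are shown. The sample text is made up.

```python
import re

text = "Room 42, Floor_3"

digits = re.findall(r"\d", text)       # members of the digit class
words = re.findall(r"\w+", text)       # runs of word characters
non_words = re.findall(r"\W", text)    # everything NOT in the word class
non_spaces = re.findall(r"\S+", text)  # runs of non-whitespace

print(digits)      # ['4', '2', '3']
print(words)       # ['Room', '42', 'Floor_3']
print(non_words)   # [' ', ',', ' ']
print(non_spaces)  # ['Room', '42,', 'Floor_3']
```

The underscore in "Floor_3" stays inside one \w+ run, since the word class is alphanumeric characters plus the underscore.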


Word Boundaries

The following escape sequences match the boundaries of words:


\<    Matches the start of a word.
\>    Matches the end of a word.
\b    Matches a word boundary (the start or end of a word).
\B    Matches only when not at a word boundary.
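The difference between \b and \B can be sketched with Python's `re` module. As assumptions of this sketch: Python stands in for the Boost engine, and it has no \< or \> escapes, so only \b and \B are shown. The sample text is made up.

```python
import re

text = "cat concat cats"

whole_word = re.findall(r"\bcat\b", text)  # boundary on both sides
at_start = re.findall(r"\bcat", text)      # boundary before "cat"
at_end = re.findall(r"cat\b", text)        # boundary after "cat"
inside = re.findall(r"\Bcat", text)        # NOT at a boundary before "cat"

print(len(whole_word))  # 1: the standalone "cat"
print(len(at_start))    # 2: "cat" and the start of "cats"
print(len(at_end))      # 2: "cat" and the end of "concat"
print(len(inside))      # 1: the "cat" inside "concat"
```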


For further information/options please see the Boost Libraries Perl Regular Expression syntax pages.


Use, modification and distribution are subject to the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt).

Posted by Tester