RegexKitLite

Lightweight Objective-C Regular Expressions for Mac OS X using the ICU Library

Introduction to RegexKitLite

This document introduces RegexKitLite for Mac OS X. RegexKitLite enables easy access to regular expressions by providing a number of additions to the standard Foundation NSString class. RegexKitLite acts as a bridge between the NSString class and the regular expression engine in the International Components for Unicode, or ICU, dynamic shared library that is shipped with Mac OS X.

Highlights

Documentation Overview

RegexKitLite Overview

Although RegexKitLite is not a descendant of the RegexKit.framework source code, it provides a small subset of RegexKit's NSString methods for performing various regular expression tasks. These include determining the range that a regular expression matches within a string, creating a new string from the results of a match, splitting a string into an NSArray with a regular expression, and performing search and replace operations using the common $n substitution syntax.

RegexKitLite uses the regular expression engine provided by the ICU library that ships with Mac OS X. All that is required are the two files RegexKitLite.h and RegexKitLite.m, plus linking against the /usr/lib/libicucore.dylib ICU shared library. Adding RegexKitLite to your project adds only a few kilobytes of overhead to your application's size and typically requires only a few kilobytes of memory at runtime. Since a regular expression must first be compiled by the ICU library before it can be used, RegexKitLite keeps a small pseudo Least Recently Used cache of the compiled regular expressions.

Warning:

Apple does not officially support linking to the libicucore.dylib library.

Compiled Regular Expression Cache

The NSString that contains the regular expression must be compiled into an ICU URegularExpression. This can be an expensive, time-consuming step, but the compiled regular expression can be reused in later searches, even if the strings to be searched are different. Therefore, RegexKitLite keeps a small cache of recently compiled regular expressions.

This cache is a simple hash table, the size of which can be tuned with the pre-processor define RKL_CACHE_SIZE. The default cache size, which should always be a prime number, is set to 23. The NSString regexString is mapped to a cache slot using modular arithmetic: Cache slot ≡ [regexString hash] mod RKL_CACHE_SIZE, i.e. cacheSlot = [regexString hash] % 23;. Since RegexKitLite uses Core Foundation, this is actually coded as cacheSlot = CFHash(regexString) % RKL_CACHE_SIZE;.

If the cache slot currently contains a compiled URegularExpression, checks are made to ensure that the current regexString is identical to the regular expression used to create the compiled URegularExpression. If they are a match, the cached compiled regular expression is used. If they are not a match, the current compiled regular expression for the selected cache slot is ejected and all of its resources are freed. Then the regexString that caused the ejection is compiled and fills the cache slot. Only one compiled regular expression can reside in a cache slot at a time.
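The slot selection and ejection rule described above can be modeled in a few lines of plain C. This is only a sketch under stated assumptions, not RegexKitLite's implementation: sketch_hash() is a hypothetical stand-in for CFHash(), the CacheSlot type and cache_lookup() are invented for illustration, and an owned copy of the regular expression text stands in for the compiled URegularExpression.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RKL_CACHE_SIZE 23u  /* prime, matching the documented default */

/* One slot. The owned copy of the regex text stands in for the compiled regex. */
typedef struct { char *regexCopy; } CacheSlot;

/* Hypothetical stand-in for CFHash(): a simple djb2 string hash. */
static unsigned long sketch_hash(const char *s) {
  unsigned long h = 5381ul;
  while (*s != '\0') { h = (h * 33ul) + (unsigned char)*s++; }
  return h;
}

/* Returns 1 on a cache hit, 0 when the slot had to be filled or refilled. */
static int cache_lookup(CacheSlot cache[], const char *regex) {
  assert(cache != NULL && regex != NULL);
  CacheSlot *slot = &cache[sketch_hash(regex) % RKL_CACHE_SIZE];

  if (slot->regexCopy != NULL && strcmp(slot->regexCopy, regex) == 0) {
    return 1;                        /* identical regex text: reuse the entry */
  }
  free(slot->regexCopy);             /* eject the previous occupant, if any */
  slot->regexCopy = (char *)malloc(strlen(regex) + 1u);
  if (slot->regexCopy != NULL) { strcpy(slot->regexCopy, regex); }
  return 0;
}
```

With a zero-initialized array of RKL_CACHE_SIZE slots, the first lookup of a given regular expression misses and fills its slot, and an immediate second lookup of the same text hits. Two different expressions that hash to the same slot evict each other, which is the trade-off of keeping one entry per slot.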

Regular Expressions in Mutable Strings

When a regular expression is compiled, an immutable copy of the string is kept. For immutable NSString objects, the copy is usually the same object with its reference count increased by one. Only NSMutableString objects will cause a new, immutable NSString to be created.

If the regular expression being used is stored in an NSMutableString, the cached regular expression will continue to be used as long as the NSMutableString remains unchanged. Once mutated, the changed NSMutableString will no longer be a match for the cached compiled regular expression that it used previously. Even if the newly mutated string's hash is congruent to the previous unmutated string's hash modulo RKL_CACHE_SIZE, that is to say they share the same cache slot (i.e., ([mutatedString hash] % RKL_CACHE_SIZE) == ([unmutatedString hash] % RKL_CACHE_SIZE)), the immutable copy of the regular expression string used to create the compiled regular expression is used to ensure true equality. The newly mutated string will have to go through the whole cache slot entry creation process and be compiled into a URegularExpression.

This means that NSMutableString objects can be safely used as regular expressions, and any mutations to those objects will immediately be detected and reflected in the regular expression used for matching.

Searching Mutable Strings

Unfortunately, the ICU regular expression API requires that the compiled regular expression be "set" to the string to be searched. To search a different string, the compiled regular expression must be "set" to the new string. Therefore, RegexKitLite tracks the last NSString that each compiled regular expression was set to, recording the pointer to the NSString object, its hash, and its length. If any of these parameters are different from the last parameters used for a compiled regular expression, the compiled regular expression is "set" to the new string. Since mutating a string will likely change its hash value, it's generally safe to search NSMutableString objects, and in most cases the mutation will reset the compiled regular expression to the updated contents of the NSMutableString.
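The "has the string to search changed?" test described above amounts to comparing three recorded values. The following C sketch illustrates that comparison only. The LastSetTo type and needs_set() function are hypothetical names invented here, and the caller is assumed to supply the hash and length of the string.

```c
#include <assert.h>
#include <stddef.h>

/* What the sketch remembers about the string a compiled regex was last set to. */
typedef struct {
  const void   *object;  /* pointer identity of the string object */
  unsigned long hash;
  size_t        length;
} LastSetTo;

/* Returns 1 when the regex must be "set" again, and records the new values. */
static int needs_set(LastSetTo *last, const void *object, unsigned long hash, size_t length) {
  assert(last != NULL);
  if (last->object == object && last->hash == hash && last->length == length) {
    return 0;  /* same pointer, hash, and length: keep the current setting */
  }
  last->object = object;
  last->hash   = hash;
  last->length = length;
  return 1;
}
```

A mutation that happens to leave the pointer, hash, and length all unchanged would not be noticed by a check of this kind, which is why mutable strings still call for care.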

Caution:

Care must be taken when mutable strings are searched and there exists the possibility that the string has mutated between searches. See NSString RegexKitLite Additions Reference - Cached Information and Mutable Strings for more information.

Last Match Information

When performing a match, the arguments used to perform the match are kept. If those same arguments are used again, the actual matching operation is skipped because the compiled regular expression already contains the results for the given arguments. This is mostly useful when a regular expression contains multiple capture groups, and the results for different capture groups for the same match are needed. This means that there is only a small penalty for iterating over all the capture groups in a regular expression for a match, and essentially becomes the direct ICU regular expression API equivalent of uregex_start() and uregex_end().

UTF-16 Conversion Cache

RegexKitLite is ideal when the string being matched is a non-ASCII, Unicode string. This is because the regular expression engine used, ICU, can only operate on UTF-16 encoded strings. Since Cocoa keeps essentially all non-ASCII strings encoded in UTF-16 form internally, RegexKitLite can operate directly on the string's buffer without having to make a temporary copy and transcode the string into ICU's required format.

As is usual in object oriented programming, the internal representation of an object's information is private. However, the ICU regular expression engine requires that the text to be searched be encoded as a UTF-16 string. For pragmatic purposes, Core Foundation has several public functions that can provide direct access to the buffer used to hold the contents of the string, but such direct access is only available if the private buffer is already encoded in the requested direct access format. As a rough rule of thumb, 8-bit simple strings, such as ASCII, are kept in their 8-bit format. Non 8-bit simple strings are stored as UTF-16 strings. Of course, this is a private implementation detail, so this behavior should never be relied upon. It is mentioned because of the tremendous impact on matching performance and efficiency it can have if a string must be converted to UTF-16.

For strings in which direct access to the UTF-16 string is available, RegexKitLite uses that buffer. This is the ideal case, as no extra work needs to be performed, such as converting the string into a UTF-16 string and allocating memory to hold the temporary conversion. Of course, direct access is not always available, and occasionally the string to be searched will need to be converted into a UTF-16 string.

RegexKitLite has two conversion buffer caches. Each buffer can only hold the contents of a single NSString at a time. If the selected buffer does not contain the contents of the NSString that is currently being searched, the previous occupant is ejected from the buffer and the current NSString takes its place. The first conversion buffer is fixed in size and set by the C pre-processor define RKL_FIXED_LENGTH, which defaults to 2048. Any string whose length is less than RKL_FIXED_LENGTH will use the fixed size conversion buffer. The second conversion buffer, used for strings whose length is longer than RKL_FIXED_LENGTH, is dynamically sized. Its memory allocation is resized for each conversion with realloc() to the size needed to hold the entire contents of the UTF-16 converted string.
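The choice between the two conversion buffers can be sketched in plain C as follows. This is an illustration rather than RegexKitLite's code: the ConversionBuffers type and buffer_for_length() are hypothetical names, and sending a string of exactly RKL_FIXED_LENGTH units to the dynamic buffer is an assumption, since the text above does not say which buffer handles that boundary case.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define RKL_FIXED_LENGTH 2048u  /* documented default */

typedef struct {
  uint16_t  fixed[RKL_FIXED_LENGTH];  /* fixed-size conversion buffer */
  uint16_t *dynamic;                  /* the single heap pointer to keep track of */
} ConversionBuffers;

/* Returns a buffer with room for `length` UTF-16 code units, or NULL on failure. */
static uint16_t *buffer_for_length(ConversionBuffers *buffers, size_t length) {
  assert(buffers != NULL);
  if (length < RKL_FIXED_LENGTH) {
    return buffers->fixed;            /* short strings never touch the allocator */
  }
  /* Assumption: a length equal to RKL_FIXED_LENGTH also takes this path. */
  uint16_t *resized = (uint16_t *)realloc(buffers->dynamic, length * sizeof(uint16_t));
  if (resized == NULL) { return NULL; }  /* the old allocation is still owned by buffers */
  buffers->dynamic = resized;
  return resized;
}
```

Because realloc() either resizes the one existing allocation or leaves it untouched on failure, there is never more than one dynamically allocated block to release.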

This strategy was chosen for its relative simplicity. Keeping track of dynamically created resources is required to prevent memory leaks. As designed, there is only a single pointer to dynamically allocated memory: the pointer to hold the conversion contents of strings whose length is larger than RKL_FIXED_LENGTH. Because realloc() is used to manage that memory allocation, it becomes very difficult to accidentally leak the buffer. Having the fixed size buffer means that the memory allocation system isn't bothered with many small requests, most of which are transient in nature to begin with. The current strategy tries to strike the best balance between performance and simplicity.

Mutable Strings

When a string is converted into a UTF-16 string, the hash of the NSString is recorded, along with the pointer to the NSString object and the string's length. In order for RegexKitLite to use the cached conversion, all of these parameters must be equal to the corresponding values of the NSString to be searched. If there is any difference, the cached conversion is discarded and the current NSString, or NSMutableString as the case may be, is reconverted into a UTF-16 string.

Caution:

Care must be taken when mutable strings are searched and there exists the possibility that the string has mutated between searches. See NSString RegexKitLite Additions Reference - Cached Information and Mutable Strings for more information.

Multithreading Safety

RegexKitLite is also multithreading safe. Access to the compiled regular expression cache and the conversion cache is protected by a single OSSpinLock to ensure that only one thread has access at a time. The lock remains held while the regular expression match is performed since the compiled regular expression returned by the ICU library is not safe to use from multiple threads. Once the match has completed, the lock is released, and another thread is free to lock the cache and perform a match.

Important:

While it is safe to use the same regular expression from any thread at any time, the usual multithreading caveats apply. For example, it is not safe to mutate a NSMutableString in one thread while performing a match in another.

Using RegexKitLite

The goal of RegexKitLite is not to be a comprehensive Objective-C regular expression framework, but to provide a set of easy to use primitives from which additional functionality can be created. To this end, RegexKitLite provides the following two core primitives from which everything else is built:

RegexKitLite 2.0 adds the ability to split strings by dividing them with a regular expression, and the ability to perform search and replace operations using common $n substitution syntax. replaceOccurrencesOfRegex:withString: is used to modify the contents of NSMutableString objects directly and stringByReplacingOccurrencesOfRegex:withString: will create a new, immutable NSString from the receiver.

There are no additional classes that supply the regular expression matching functionality, everything is accomplished with the two methods above. These methods are added to the existing NSString class via an Objective-C category extension. See NSString RegexKitLite Additions Reference for a complete list of methods.

The real workhorse is the rangeOfRegex:options:inRange:capture:error: method. The receiver of the message is an ordinary NSString class member that you wish to perform a regular expression match on. The parameters of the method are an NSString containing the regular expression regex, any RKLRegexOptions match options, the NSRange range of the receiver that is to be searched, the capture number from the regular expression regex that you would like the result for, and an optional error parameter that, if a problem occurs, will contain an NSError object with the details of the error.

Important:

The C language assigns special meaning to the \ character when inside a quoted " " string in your source code. The \ character is the escape character, and the character that follows has a different meaning than normal. The most common example of this is \n, which translates into the new-line character. Because of this, you are required to 'escape' any use of \ by preceding it with another \. In practical terms, this means doubling every \ in a regular expression that appears inside a quoted " " string in your source code, which unfortunately is quite common. Failure to do so will result in numerous warnings from the compiler about unknown escape sequences.
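The effect of doubling can be checked directly in C. In this small sketch, the names kWordRegex and count_backslashes() are invented for illustration. It counts the backslashes that remain in the compiled string, which is what the regular expression engine actually receives.

```c
#include <assert.h>
#include <string.h>

/* The source text has doubled backslashes; the compiled string has single ones. */
static const char *const kWordRegex = "(\\w+)\\s+(\\w+)";

/* Counts the backslashes that actually reach the regular expression engine. */
static size_t count_backslashes(const char *s) {
  assert(s != NULL);
  size_t count = 0u;
  for (; *s != '\0'; s++) {
    if (*s == '\\') { count++; }
  }
  return count;
}
```

The six backslashes typed in the literal become three in memory, so the engine sees (\w+)\s+(\w+). By contrast, "\n" is a single new-line character, while "\\n" is the two characters \ and n, which the regular expression engine then interprets as its own LINE FEED metacharacter.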

A simple example:

    NSString *searchString = @"This is neat.";
    NSString *regexString  = @"(\\w+)\\s+(\\w+)\\s+(\\w+)";
    // Search the entire string.
    NSRange   searchRange  = NSMakeRange(0, [searchString length]);
    NSRange   matchedRange = NSMakeRange(NSNotFound, 0);
    NSError  *error        = NULL;

    matchedRange = [searchString rangeOfRegex:regexString options:RKLNoOptions inRange:searchRange capture:2 error:&error];
    NSLog(@"matchedRange : %@", NSStringFromRange(matchedRange));
    // 2008-03-18 03:51:16.530 test[51583:813] matchedRange : {5, 2}

In the previous example, the NSRange that capture number 2 matched is {5, 2}, which corresponds to the word 'is' in searchString. Once the NSRange is known, you can create a new string containing just the matching text:

    NSString *matchedString = [searchString substringWithRange:matchedRange];
    NSLog(@"matchedString: '%@'", matchedString);
    // 2008-03-18 03:51:16.532 test[51583:813] matchedString: 'is'

Search and Replace

You can perform search and replace operations on NSString objects and use common $n capture group substitution in the replacement string:

    NSString *searchString      = @"This is neat.";
    NSString *regexString       = @"\\b(\\w+)\\b";
    NSString *replaceWithString = @"{$1}";
    NSString *replacedString    = NULL;

    replacedString = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];
    NSLog(@"replaced string: '%@'", replacedString);
    // 2008-07-01 19:03:03.195 test[68775:813] replaced string: '{This} {is} {neat}.'

Important:
Search and replace methods will raise a RKLICURegexException if the replacementString contains $n capture references where n is greater than the number of capture groups in the regular expression.

In this example, the regular expression \b(\w+)\b has a single capture group, which is created with the use of () parentheses. The text that was matched inside the parentheses is available for use in the replacement text by using $n, where n is the parenthesized capture group you would like to use. Additional capture groups are numbered sequentially in the order that they appear from left to right. Capture group 0 (zero) is also available and is equivalent to all the text that the regular expression matched.

Mutable strings can be manipulated directly:

    NSMutableString *mutableString     = [NSMutableString stringWithString:@"This is neat."];
    NSString        *regexString       = @"\\b(\\w+)\\b";
    NSString        *replaceWithString = @"{$1}";
    NSUInteger       replacedCount     = 0;

    replacedCount = [mutableString replaceOccurrencesOfRegex:regexString withString:replaceWithString];
    // Cast the NSUInteger so the format is correct on both 32-bit and 64-bit builds.
    NSLog(@"count: %lu string: '%@'", (unsigned long)replacedCount, mutableString);
    // 2008-07-01 21:25:43.433 test[69689:813] count: 3 string: '{This} {is} {neat}.'

Splitting Strings

Strings can be split with a regular expression using the componentsSeparatedByRegex: methods. This functionality is nearly identical to the preexisting NSString method componentsSeparatedByString:, except instead of only being able to use a fixed string as a separator, you can use a regular expression:

    NSString *searchString = @"This is neat.";
    NSString *regexString  = @"\\s+";
    NSArray  *splitArray   = NULL;

    splitArray = [searchString componentsSeparatedByRegex:regexString];
    // splitArray == { @"This", @"is", @"neat." }
    NSLog(@"splitArray: %@", splitArray);

The output from NSLog() when run from a shell:

    shell% ./splitArray
    2008-07-01 20:58:39.025 test[69618:813] splitArray: (
        This,
        is,
        "neat."
    )
    shell%

Unfortunately our example string @"This is neat." doesn't allow us to show off the power of regular expressions. As you can probably imagine, splitting the string with the regular expression \s+ allows for one or more white space characters to be matched. This can be much more flexible than just a fixed string of @" ", which will split on a single space only. If our example string contained extra spaces, say @"This   is     neat.", the result would have been the same.
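The difference between a fixed single-space separator and a "one or more white space characters" separator can be demonstrated with a small C sketch. Note the substitution: this uses the POSIX regex.h functions and POSIX extended syntax ([[:space:]]+ in place of \s+), not ICU or RegexKitLite, and split_count() is a hypothetical helper that only counts the pieces a split would produce.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Counts the pieces produced by splitting `text` on each non-empty match of
   `pattern` (POSIX extended syntax). Returns 0 if the pattern does not compile. */
static size_t split_count(const char *pattern, const char *text) {
  assert(pattern != NULL && text != NULL);
  regex_t re;
  if (regcomp(&re, pattern, REG_EXTENDED) != 0) { return 0u; }

  size_t pieces = 1u;    /* no separators still leaves one piece */
  int flags = 0;
  regmatch_t m;
  /* Stop at the first miss, or at an empty match, which cannot separate anything. */
  while (regexec(&re, text, 1, &m, flags) == 0 && m.rm_eo > m.rm_so) {
    pieces++;
    text += m.rm_eo;     /* resume just after the separator */
    flags = REG_NOTBOL;  /* later searches no longer begin at the start of a line */
  }
  regfree(&re);
  return pieces;
}
```

For the two-space input "a  b", the separator " " yields three pieces, one of them empty, while "[[:space:]]+" treats the whole run as one separator and yields two.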

Creating A Match Enumerator

As a practical example of how to use the simple primitives provided by RegexKitLite, consider the common need of having to enumerate all the matches of a regular expression in a target string. The following example creates a simple NSEnumerator based enumerator for all the matches of a regular expression in a target string, returning a NSString of the text matched by the regular expression (capture 0) for each call to nextObject until the end of the string is reached. Each match begins searching where the last match ended.

The match enumerator is divided into two parts. The public part is defined in the header RKLMatchEnumerator.h, below. The second part is a private subclass of NSEnumerator whose interface resides only in the file RKLMatchEnumerator.m. Match enumerators are instantiated by sending an NSString class member the message matchEnumeratorWithRegex:. An NSString with the regular expression is passed as the only argument, and an NSEnumerator is returned.

File name: RKLMatchEnumerator.h

    #import <Foundation/NSEnumerator.h>
    #import <Foundation/NSString.h>
    #import <stddef.h>

    @interface NSString (RegexKitLiteEnumeratorAdditions)
    - (NSEnumerator *)matchEnumeratorWithRegex:(NSString *)regex;
    @end

Next, in RKLMatchEnumerator.m, we define our private sub-class of NSEnumerator. In it we declare three instance variables, string, regex, and location. The string ivar holds the string to search, while regex holds the regular expression string. To guard against mutations to either, an immutable copy is made. The location ivar is used to keep track of the current location from which to begin matching. Finally, we declare our designated initializer which initializes the instantiated RKLMatchEnumerator object with the string to search and the regular expression to use.

File name: RKLMatchEnumerator.m

    #import <Foundation/NSArray.h>
    #import <Foundation/NSRange.h>
    #import "RegexKitLite.h"
    #import "RKLMatchEnumerator.h"

    @interface RKLMatchEnumerator : NSEnumerator {
      NSString   *string;
      NSString   *regex;
      NSUInteger  location;
    }

    - (id)initWithString:(NSString *)initString regex:(NSString *)initRegex;

    @end

The following begins the implementation section of RKLMatchEnumerator and a fairly standard initialization method, initWithString:regex:.

RKLMatchEnumerator.m

    @implementation RKLMatchEnumerator

    - (id)initWithString:(NSString *)initString regex:(NSString *)initRegex
    {
      // The designated initializer calls through to the superclass.
      if((self = [super init]) == NULL) { return(NULL); }
      // Immutable copies guard against later mutations of either argument.
      string = [initString copy];
      regex  = [initRegex copy];
      return(self);
    }

The following implements the heart of any NSEnumerator, the nextObject method. If all of the matches have been enumerated, location will be set to NSNotFound, and the body of the if statement won't be evaluated and NULL will be returned.

If there are still matches to be found, searchRange is created to begin at the value of the location ivar, with the NSRange length set to the remaining length of the string to be searched, or [string length] - location.

Then, the match is performed using the RegexKitLite method rangeOfRegex:inRange: and the result stored in the variable matchedRange.

Next, the location ivar is updated to point to the location at the end of the matchedRange. Since it is possible to have a match with a length of zero, it must handle that special case by adding one, otherwise it will loop endlessly, always matching the same location of zero length. If there was no match, matchedRange.location will be NSNotFound and matchedRange.length will be 0, and the location ivar will be set to NSNotFound.

If the matched range location is not NSNotFound, then a substring of the matched range will be returned. Otherwise, we will exit the if body and return NULL, indicating that the NSEnumerator has no more matches to enumerate.

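RKLMatchEnumerator.m

    - (id)nextObject
    {
      if(location != NSNotFound) {
        NSRange searchRange  = NSMakeRange(location, [string length] - location);
        NSRange matchedRange = [string rangeOfRegex:regex inRange:searchRange];

        // No match: mark the enumerator as exhausted.
        if(matchedRange.location == NSNotFound) { location = NSNotFound; return(NULL); }

        // Resume after this match, stepping past a zero-length match.
        location = NSMaxRange(matchedRange) + ((matchedRange.length == 0) ? 1 : 0);
        // A zero-length match at the end of the string would step past the end.
        if(location > [string length]) { location = NSNotFound; }

        return([string substringWithRange:matchedRange]);
      }

      return(NULL);
    }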
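The zero-length match rule is easy to get wrong, so here is the same advance logic in isolation as a plain C sketch. It swaps in the POSIX regex.h functions and POSIX extended syntax for ICU and RegexKitLite, and count_matches() is a hypothetical helper that only counts the matches it steps through.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Counts successive matches of `pattern` (POSIX extended syntax) in `text`,
   stepping one character past any zero-length match so the loop terminates. */
static size_t count_matches(const char *pattern, const char *text) {
  assert(pattern != NULL && text != NULL);
  regex_t re;
  if (regcomp(&re, pattern, REG_EXTENDED) != 0) { return 0u; }

  size_t length = strlen(text), location = 0u, count = 0u;
  regmatch_t m;
  while (location <= length &&
         regexec(&re, text + location, 1, &m, (location > 0u) ? REG_NOTBOL : 0) == 0) {
    count++;
    /* m is relative to text + location; a zero-length match advances by one. */
    location += (size_t)m.rm_eo + ((m.rm_eo == m.rm_so) ? 1u : 0u);
  }
  regfree(&re);
  return count;
}
```

Without the extra one-character step, a pattern such as x* would match the same empty range at the same location forever. The location <= length test plays the role of the NSNotFound sentinel: once a zero-length match at the very end of the text pushes the location past the end, the loop stops.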

A standard dealloc, releasing the string and regex ivar objects created during initialization.

RKLMatchEnumerator.m

    - (void) dealloc
    {
      [string release];
      [regex release];
      [super dealloc];
    }

    @end

And finally, the NSString category addition that returns our match enumerator. This simply creates an instance of our private NSEnumerator sub-class RKLMatchEnumerator, initializes it with the string to match, self, using the regular expression regex, then sends the instantiated object autorelease, which is finally returned. Since this is a NSString category addition, this message will be sent to an instance of an object that is a member of the NSString class, which includes any objects whose super class is ultimately NSString. Therefore, the string to match is the instance receiving the message, self.

RKLMatchEnumerator.m

    @implementation NSString (RegexKitLiteEnumeratorAdditions)

    - (NSEnumerator *)matchEnumeratorWithRegex:(NSString *)regex
    {
      return([[[RKLMatchEnumerator alloc] initWithString:self regex:regex] autorelease]);
    }

    @end

The following piece of code is a simple demonstration of the match enumerator which will use a regular expression to enumerate all the lines in the string to be searched.

The variable searchString contains the string to search. The example string includes several embedded \n, or new-line characters. There are a total of four lines of text, with the third line containing no characters.

The variable regex contains the regular expression to be used for matching. This regular expression begins with the sequence (?m) which is used to enable the RKLMultiline regular expression option from the text of the regular expression itself. This enables the metacharacters ^ and $ to match the start of and end of a line, respectively. The remaining characters .* will match any character '.' zero or more times '*'. The prose translation would be:

Enable the RKLMultiline option and match all of the characters from the beginning of a line until the end of a line.

The match enumerator is then instantiated and the results are enumerated with a standard while loop, setting matchedString to the object returned by nextObject. For each line that is returned, the current line number, length of the matched string, and the matched string are printed.

File name: main.m

    #import <Foundation/NSAutoreleasePool.h>
    #import "RegexKitLite.h"
    #import "RKLMatchEnumerator.h"

    int main(int argc, char *argv[]) {
      NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

      NSString     *searchString    = @"one\ntwo\n\nfour\n";
      NSEnumerator *matchEnumerator = NULL;
      NSString     *regexString     = @"(?m)^.*$";

      NSLog(@"searchString: '%@'", searchString);
      NSLog(@"regexString : '%@'", regexString);

      matchEnumerator = [searchString matchEnumeratorWithRegex:regexString];

      NSUInteger  line          = 0;
      NSString   *matchedString = NULL;

      while((matchedString = [matchEnumerator nextObject]) != NULL) {
        // Cast the NSUInteger values so the format is correct on 32-bit and 64-bit builds.
        NSLog(@"%lu: %lu '%@'", (unsigned long)++line, (unsigned long)[matchedString length], matchedString);
      }

      [pool release];
      return(0);
    }

The following shell transcript demonstrates compiling the example and executing it. Line number three clearly demonstrates that matches of zero length are possible. Without the additional logic in nextObject to handle this special case, the enumerator would never advance past the match.

Note:

In the shell transcript below, the NSLog() line that prints searchString has been annotated with the '⏎' character to help visually identify the corresponding \n new-line characters in searchString.

    shell% cd examples
    shell% gcc -I.. -g -o main main.m RKLMatchEnumerator.m ../RegexKitLite.m -framework Foundation -licucore
    shell% ./main
    2008-03-21 15:56:17.469 main[44050:807] searchString: 'one⏎
    two⏎
    ⏎
    four⏎
    '
    2008-03-21 15:56:17.520 main[44050:807] regexString : '(?m)^.*$'
    2008-03-21 15:56:17.575 main[44050:807] 1: 3 'one'
    2008-03-21 15:56:17.580 main[44050:807] 2: 3 'two'
    2008-03-21 15:56:17.584 main[44050:807] 3: 0 ''
    2008-03-21 15:56:17.590 main[44050:807] 4: 4 'four'
    shell%

ICU Syntax

ICU Regular Expression Syntax

For your convenience, the regular expression syntax from the ICU documentation is included below. When in doubt, you should refer to the official ICU User Guide - Regular Expressions documentation page.

Metacharacters
Character                   Description
\a                          Match a BELL, \u0007.
\A                          Match at the beginning of the input. Differs from ^ in that \A will not match after a new-line within the input.
\b, outside of a [Set]      Match if the current position is a word boundary. Boundaries occur at the transitions between word \w and non-word \W characters, with combining marks ignored. See also: RKLUnicodeWordBoundaries.
\b, within a [Set]          Match a BACKSPACE, \u0008.
\B                          Match if the current position is not a word boundary.
\cx                         Match a Control-x character.
\d                          Match any character with the Unicode General Category of Nd (Number, Decimal Digit).
\D                          Match any character that is not a decimal digit.
\e                          Match an ESCAPE, \u001B.
\E                          Terminates a \Q...\E quoted sequence.
\f                          Match a FORM FEED, \u000C.
\G                          Match if the current position is at the end of the previous match.
\n                          Match a LINE FEED, \u000A.
\N{Unicode Character Name}  Match the named Unicode Character.
\p{Unicode Property Name}   Match any character with the specified Unicode Property.
\P{Unicode Property Name}   Match any character not having the specified Unicode Property.
\Q                          Quotes all following characters until \E.
\r                          Match a CARRIAGE RETURN, \u000D.
\s                          Match a white space character. White space is defined as [\t\n\f\r\p{Z}].
\S                          Match a non-white space character.
\t                          Match a HORIZONTAL TABULATION, \u0009.
\uhhhh                      Match the character with the hex value hhhh.
\Uhhhhhhhh                  Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\w                          Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
\W                          Match a non-word character.
\x{hhhh}                    Match the character with hex value hhhh. From one to six hex digits may be supplied.
\xhh                        Match the character with two digit hex value hh.
\X                          Match a Grapheme Cluster.
\Z                          Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z                          Match if the current position is at the end of input.
\n, where n is a number     Back Reference. Match whatever the nth capturing group matched. n must be a number >= 1 and <= the total number of capture groups in the pattern. Note: Octal escapes, such as \012, are not supported.
[pattern]                   Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern.
.                           Match any character.
^                           Match at the beginning of a line.
$                           Match at the end of a line.
\                           Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /

Operators
Operator          Description
|                 Alternation. A|B matches either A or B.
*                 Match zero or more times. Match as many times as possible.
+                 Match one or more times. Match as many times as possible.
?                 Match zero or one times. Prefer one.
{n}               Match exactly n times.
{n,}              Match at least n times. Match as many times as possible.
{n,m}             Match between n and m times. Match as many times as possible, but not more than m.
*?                Match zero or more times. Match as few times as possible.
+?                Match one or more times. Match as few times as possible.
??                Match zero or one times. Prefer zero.
{n}?              Match exactly n times.
{n,}?             Match at least n times, but no more than required for an overall pattern match.
{n,m}?            Match between n and m times. Match as few times as possible, but not less than n.
*+                Match zero or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails. Possessive match.
++                Match one or more times. Possessive match.
?+                Match zero or one times. Possessive match.
{n}+              Match exactly n times. Possessive match.
{n,}+             Match at least n times. Possessive match.
{n,m}+            Match between n and m times. Possessive match.
()                Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
(?:)              Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
(?>)              Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the (?> .
(?#)              Free-format comment (?#comment).
(?=)              Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?!)              Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<=)             Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?<!)             Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?ismwx-ismwx:)   Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
(?ismwx-ismwx)    Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. See also: Regular Expression Options.

ICU Replacement Text Syntax

Replacement Text Syntax
Character   Description
$n          The text of capture group n will be substituted for $n. n must be >= 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, a $. Important: Methods will raise a RKLICURegexException if n is greater than the number of capture groups in the regular expression.
\           Treat the character following the backslash as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for $ and \, but may precede any character. The backslash itself will not be copied to the substitution text.
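
The replacement text rules above can be modeled with a short C sketch. It is an illustration only, not the ICU or RegexKitLite implementation: expand_replacement() is a hypothetical function, it reads just one digit after $ as a simplification, and it returns -1 where the real methods would raise RKLICURegexException.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expands `tmpl` into `out`. captures[0] is the whole match and
   captures[1..groupCount] are the capture groups. Returns 0 on success,
   or -1 on an out-of-range $n reference or a full output buffer. */
static int expand_replacement(const char *tmpl, const char *const captures[],
                              size_t groupCount, char *out, size_t outSize) {
  assert(tmpl != NULL && captures != NULL && out != NULL && outSize > 0u);
  size_t used = 0u;
  for (const char *p = tmpl; *p != '\0'; p++) {
    const char *piece = p;           /* default: copy this character as-is */
    size_t pieceLength = 1u;
    if (*p == '$' && p[1] >= '0' && p[1] <= '9') {
      size_t n = (size_t)(p[1] - '0');
      if (n > groupCount) { return -1; }  /* stands in for RKLICURegexException */
      piece = captures[n];
      pieceLength = strlen(piece);
      p++;                           /* consume the digit */
    } else if (*p == '\\' && p[1] != '\0') {
      piece = ++p;                   /* copy the escaped character, drop the backslash */
    }
    if (used + pieceLength >= outSize) { return -1; }
    memcpy(out + used, piece, pieceLength);
    used += pieceLength;
  }
  out[used] = '\0';
  return 0;
}
```

With a single capture group holding "neat", the template "{$1}" expands to "{neat}", the template "\$1 costs $" (written "\\$1 costs $" in C source) expands to the literal text "$1 costs $", and "$2" is rejected because only one capture group exists.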

Adding RegexKitLite to your Project

Note:

The following outlines a typical set of steps that one would perform. This is neither the only way nor the required way to add RegexKitLite to your application. These steps may not be correct for your project, as each project is unique. They are an overview for those unfamiliar with adding additional shared libraries to the list of libraries your application links against.

Outline of Required Steps

The following outlines the steps required to use RegexKitLite in your project.


Adding RegexKitLite using Xcode

Important:
These instructions apply to Xcode versions 2.4.1 and 3.0. Other versions should be similar, but may vary for specific details.

Unfortunately, adding additional dynamic shared libraries that your application links to is not a straightforward process in Xcode, nor is there any recommended standard way. Two options are presented below. The first is the 'easy' way, which alters your application's Xcode build settings to pass an additional command line argument directly to the linker. The second option adds the ICU dynamic shared library to the list of resources for your project and configures your executable to link against the added resource.

The 'easy' way is the recommended way to link against the ICU dynamic shared library.

The Easy Way To Add The ICU Library
  1. First, determine which build settings layer of your project the linking change should be applied to. The build settings in Xcode are divided into layers, and each layer inherits the build settings from the layer above it. The top, global layer is Project Settings, followed by Target Settings, and finally the most specific layer, Executable Settings. If your project is large enough to have multiple targets and executables, you probably have an idea which layer is appropriate. If you are unsure or unfamiliar with the different layers, Target Settings is recommended.

  2. Select the appropriate layer from the Project menu. If you are unsure, Project ► Edit Active Target is recommended.

  3. Select Build from the tab near the top of the Target Info window. Find the Other Linker Flags build setting from the many build settings available and edit it. Add -licucore [dash ell icucore as a single word, without spaces]. If there are already other flags present, it is recommended that you add -licucore to the end of the existing flags.

    Important:
    If other linker flags are present, there must be at least one space separating -licucore from the other linker flags. For example, -flag1 -licucore -flag2
    Note:
    The Configuration drop down menu controls which build configuration the changes you make are applied to. All Configurations should be selected if this is the first time you are making these changes.
  4. Follow the Add The RegexKitLite Source Files To Your Project steps below.
The Hard Way To Add The ICU Library
  1. First, add the ICU dynamic shared library to your Xcode project. You may add the library to any group in your project. Which groups are created by default depends on the template type you chose when you created your project. For a typical Cocoa application project, a good choice is the Frameworks group. To add the ICU dynamic shared library, control/right-click on the Frameworks group and choose Add ► Existing Files…

  2. Next, you will need to choose the ICU dynamic shared library file to add. Exactly which file to choose depends on your project, but a fairly safe choice is to select /Developer/SDKs/MacOSX10.5.sdk/usr/lib/libicucore.dylib. You may have installed your developer tools in a different location than the default /Developer directory, and the Mac OS X SDK version should be the one your project is targeting, typically the latest one available.

  3. Then, in the dialog that follows, make sure that Copy items into… is unselected. Select the targets you will be using RegexKitLite in and then click Add to add the ICU dynamic shared library to your project.

  4. Once the ICU dynamic shared library is added to your project, you will need to add it to the libraries that your executable is linked with. To do so, expand the Targets group, and then expand the executable targets you will be using RegexKitLite in. You will then need to select the libicucore.dylib file that you added in the previous step and drag it in to the Link Binary With Libraries group for each executable target that you will be using RegexKitLite in. The order of the files within the Link Binary With Libraries group is not important, and for a typical Cocoa application the group will contain the Cocoa.framework file.

Add The RegexKitLite Source Files To Your Project
  1. Next, add the RegexKitLite source files to your Xcode project. In the Groups & Files outline view on the left, control/right-click on the group that you would like to add the files to, then select Add ► Existing Files…

    Note:

    You can perform the following steps once for each file (RegexKitLite.h and RegexKitLite.m), or once by selecting both files from the file dialog.

  2. Select the RegexKitLite.h and / or RegexKitLite.m file from the file chooser dialog.

  3. The next dialog will present you with several options. If you have not already copied the RegexKitLite files into your project's directory, you may want to click on the Copy items into… option. Select the targets that you would like to add the RegexKitLite functionality to.

  4. Finally, you will need to include the RegexKitLite.h header file. The best way to do this is very dependent on your project. If your project consists of only half a dozen source files, you can add:

    #import "RegexKitLite.h"

    manually to each source file that makes use of RegexKitLite's features. If your project has grown beyond this, you have probably already organized a common "master" header that pulls in the headers required by nearly all source files, and the #import can be added there instead.

Adding RegexKitLite using the Shell

Using RegexKitLite from the shell is also easy. Again, you need to add the header #import to the appropriate source files. Then, to link to the ICU library, you typically only need to add -licucore, just as you would any other library. Consider the following example:

File name: link_example.m

#import <Foundation/NSObjCRuntime.h>
#import <Foundation/NSAutoreleasePool.h>
#import "RegexKitLite.h"

int main(int argc, char *argv[]) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

  // Copyright COPYRIGHT_SIGN APPROXIMATELY_EQUAL_TO 2008
  // Copyright \u00a9 \u2245 2008
  char *utf8CString = "Copyright \xC2\xA9 \xE2\x89\x85 2008";
  NSString *regexString = @"Copyright (.*) (\\d+)";
  NSString *subjectString = [NSString stringWithUTF8String:utf8CString];
  NSString *matchedString = [subjectString stringByMatching:regexString capture:1];

  NSLog(@"subject: \"%@\"", subjectString);
  NSLog(@"matched: \"%@\"", matchedString);

  [pool release];
  return(0);
}

Compiled and run from the shell:

shell% cd examples
shell% gcc -g -I.. -o link_example link_example.m ../RegexKitLite.m -framework Foundation -licucore
shell% ./link_example
2008-03-14 03:52:51.187 test[15283:807] subject: "Copyright © ≅ 2008"
2008-03-14 03:52:51.269 test[15283:807] matched: "© ≅"
shell%

NSString RegexKitLite Additions Reference

Extends by category: NSString, NSMutableString
RegexKitLite 2.0
Declared in
  • RegexKitLite.h

Overview

RegexKitLite is not meant to be a full featured regular expression framework. Because of this, it provides only the basic primitives needed to create additional functionality. It is ideal for developers who:

RegexKitLite consists of only two files, the header file RegexKitLite.h and the implementation file RegexKitLite.m. The only other requirement is to link with the ICU library that comes with Mac OS X. No new classes are created; all functionality is provided as a category extension to the NSString and NSMutableString classes.


Compile Time Preprocessor Tuneables

The settings listed below are implemented using the C Preprocessor. Some of the settings are simple boolean enabled or disabled settings, while others specify a value, such as the number of cache slot entries. There are several ways to alter these settings, but if you are not familiar with this style of compile time configuration and how to alter settings using the C Preprocessor, it is recommended that you use the default values provided.

Setting | Default | Description
NS_BLOCK_ASSERTIONS | n/a | RegexKitLite contains a number of extra assertion checks that can be disabled with this flag. The standard NSException.h assertion macros are not used because of the multithreading lock.
RKL_CACHE_SIZE | 23 | Controls the number of compiled regular expressions that are cached. This should always be a prime number to maximize the use of the available cache slots.
RKL_FAST_MUTABLE_CHECK | Disabled | Enables the use of the undocumented, private Core Foundation __CFStringIsMutable() function to determine if the string to be searched is immutable. This can significantly increase the number of matches per second that can be performed on immutable strings since a number of mutation checks can be safely skipped.
RKL_FIXED_LENGTH | 2048 | Sets the size of the fixed length UTF-16 conversion cache buffer. Strings that need to be converted to UTF-16 that are smaller than this size will use this buffer. Using a single fixed buffer for all small strings means less malloc() overhead, less heap fragmentation, and a reduced chance of a memory leak occurring.
RKL_STACK_LIMIT | 131072 | The maximum amount of stack space that will be used before switching to heap based allocations. This can be useful for multithreaded programs where the stack size of secondary threads is much smaller than the main thread.
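
The recommendation to keep RKL_CACHE_SIZE prime can be illustrated with the following C sketch. It makes two assumptions that the text does not state. First, that the default is supplied with the usual #ifndef idiom, so it could be overridden at compile time with a flag such as -DRKL_CACHE_SIZE=37. Second, that a cache slot is chosen as the hash value modulo the number of slots. The helper names slot_for_hash and distinct_slots are hypothetical.

```c
#include <stddef.h>

/* Assumed idiom: use the default only if the setting was not already
 * defined, for example on the compiler command line. */
#ifndef RKL_CACHE_SIZE
#define RKL_CACHE_SIZE 23
#endif

/* Assumed slot selection for this sketch: hash modulo slot count. */
size_t slot_for_hash(unsigned long hash, size_t slot_count) {
    return (size_t)(hash % slot_count);
}

/* Count how many distinct slots are hit by `count` hash values that
 * are spaced `stride` apart (0, stride, 2 * stride, ...). */
size_t distinct_slots(unsigned long stride, size_t count, size_t slot_count) {
    unsigned char seen[256] = {0};
    size_t distinct = 0;

    if (slot_count == 0 || slot_count > sizeof seen) return 0;

    for (size_t i = 0; i < count; i++) {
        size_t slot = slot_for_hash((unsigned long)i * stride, slot_count);
        if (!seen[slot]) {
            seen[slot] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Under these assumptions, hash values that are all multiples of 8 reach every one of 23 slots, but only 3 of 24 slots, because 8 and 24 share a common factor. A prime slot count avoids this kind of clustering for any stride that is not a multiple of the prime itself.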

Fast Mutable Checks

Setting RKL_FAST_MUTABLE_CHECK allows RegexKitLite to quickly check if a string to search is immutable or not. Every call to RegexKitLite requires checking a string's hash and length values to guard against a string mutating and using invalid cached data. If the same string is searched repeatedly and it is immutable, these checks aren't necessary since the string can never change while in use. While these checks are fairly quick, they can add approximately 15 to 20 percent of extra overhead, and skipping them is always faster.

Since checking a string's mutability requires calling an undocumented, private Core Foundation function, RegexKitLite takes extra precautions and does not use the function directly. Instead, an internal, local stub function is created and called to determine if a string is mutable. The first time this function is called, RegexKitLite uses dlsym() to look up the address of the __CFStringIsMutable() function. If the function is found, RegexKitLite will use it from that point on to determine if a string is immutable. However, if the function is not found, RegexKitLite has no way to determine if a string is mutable or not, so it assumes the worst case that all strings are potentially mutable. This means that the private Core Foundation __CFStringIsMutable() function can go away at any time and RegexKitLite will continue to work, although with slightly less performance.
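
The lookup-with-fallback pattern described above might look roughly like the C sketch below. It is not RegexKitLite's code. The function name string_may_be_mutable, the is_mutable_fn prototype, and the symbol name passed in by the caller are all hypothetical, since the text does not give the private function's signature. The sketch uses dlopen(NULL, ...) to obtain a handle for the running program and is not thread safe.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical prototype for the optional function; the real private
 * function's signature is not given in the text. */
typedef int (*is_mutable_fn)(const void *string);

static is_mutable_fn cached_fn = NULL;  /* result of the one-time lookup */
static int lookup_done = 0;

/* Returns 1 if the string must be treated as potentially mutable and
 * 0 if it is known to be immutable. The symbol is looked up once, on
 * the first call, and the result is reused afterwards. */
int string_may_be_mutable(const void *string, const char *symbol_name) {
    if (!lookup_done) {
        void *self = dlopen(NULL, RTLD_LAZY);  /* handle for this program */
        if (self != NULL) {
            /* POSIX-recommended way to store a dlsym() result in a
             * function pointer. */
            *(void **)(&cached_fn) = dlsym(self, symbol_name);
        }
        lookup_done = 1;
    }

    if (cached_fn == NULL) {
        return 1;  /* symbol missing: assume the worst case */
    }
    return cached_fn(string) != 0;
}
```

If the symbol cannot be found, every string is reported as potentially mutable, which matches the fallback behavior described above: the checks are simply never skipped.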

This feature is disabled by default, but should be fairly safe to enable due to the extra precautions that are taken. If this feature is enabled and the __CFStringIsMutable() function is not found for some reason, RegexKitLite falls back to its default behavior which is the same as if this feature was not enabled.

Xcode 3 Integrated Documentation

This documentation is available in the Xcode DocSet format. To add this documentation to Xcode, select Help ► Documentation. In the lower left hand corner of the documentation window, there should be a gear icon with a drop down menu indicator. Select it, choose New Subscription…, and enter the following URL:

feed://regexkit.sourceforge.net/RegexKitLiteDocSets.atom

Once you have added the URL, a new group should appear, inside which will be the RegexKitLite documentation with a Get button. Click on the Get button and follow the prompts. Xcode will ask you to enter an administrator's password to install the documentation for the first time.

Cached Information and Mutable Strings

While RegexKitLite takes steps to ensure that the information it has cached is valid for the strings it searches, there exists the possibility that out of date cached information may be used when searching mutable strings. For each compiled regular expression, RegexKitLite caches the following information about the last NSString that was searched:

  • The pointer to the string object.
  • The hash value of the string.
  • The length of the string.
  • The pointer to the UTF-16 buffer.

An ICU compiled regular expression must be "set" to the text to be searched. Before a compiled regular expression is used, the pointer to the string object to search, its hash, its length, and the pointer to the UTF-16 buffer are compared with the values that the compiled regular expression was last "set" to. If any of these values are different, the compiled regular expression is reset and "set" to the new string.

If an NSMutableString is mutated between two uses of the same compiled regular expression and its hash, length, or UTF-16 buffer changes between uses, RegexKitLite will automatically reset the compiled regular expression with the new values of the mutated string. The results returned will correctly reflect the mutations that have taken place between searches.

It is possible that the mutations to a string can go undetected, however. If the mutation keeps the length the same, then the only way a change can be detected is if the string's hash value changes. For most mutations the hash value will change, but it is possible for two different strings to share the same hash. This is known as a hash collision. Should this happen, the results returned by RegexKitLite may not be correct.
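
The comparison described above, and the way a hash collision defeats it, can be sketched in plain C. This is an illustration only and not RegexKitLite's implementation. The names last_set_info, toy_hash, snapshot, and needs_reset are hypothetical, a C character buffer stands in for both the string object and its UTF-16 buffer, and the hash is a deliberately weak sum of bytes so that a collision is easy to produce.

```c
#include <stddef.h>
#include <string.h>

/* What the sketch remembers about the string a compiled regular
 * expression was last "set" to. */
typedef struct {
    const void   *string;  /* pointer to the string object */
    unsigned long hash;    /* hash of the string contents */
    size_t        length;  /* length of the string */
    const void   *buffer;  /* pointer to the converted buffer */
} last_set_info;

/* Toy hash (sum of bytes), chosen so that collisions are easy to
 * construct. Real string hashes collide far less often. */
unsigned long toy_hash(const char *s, size_t length) {
    unsigned long h = 0;
    for (size_t i = 0; i < length; i++) {
        h += (unsigned char)s[i];
    }
    return h;
}

/* Record the current pointer, hash, and length of s. */
last_set_info snapshot(const char *s) {
    size_t length = strlen(s);
    last_set_info info = { s, toy_hash(s, length), length, s };
    return info;
}

/* Returns 1 if any cached value differs from the string's current
 * state, meaning the compiled expression must be reset, else 0. */
int needs_reset(const last_set_info *cached, const char *s) {
    last_set_info now = snapshot(s);
    return cached->string != now.string ||
           cached->hash   != now.hash   ||
           cached->length != now.length ||
           cached->buffer != now.buffer;
}
```

With this toy hash, changing a buffer in place from "ab" to "abc" or to "ac" is detected, but changing it from "ab" to "ba" is not, because the pointer, length, and hash are all unchanged. That is the situation in which stale cached information could be used.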

Therefore, if you are using RegexKitLite to search NSMutableString objects, and those strings may have mutated in such a way that RegexKitLite is unable to detect that the string has changed, you must manually clear the internal cache to ensure that the results accurately reflect the mutations. You can clear the cache by calling the following class method:

[NSString clearStringCache];
Warning:

When searching NSMutableString objects that have mutated between searches, failure to clear the cache may result in undefined behavior.

Exceptions Raised

Methods will raise an exception if their arguments are invalid, such as passing NULL for a required parameter. An invalid regular expression or RKLRegexOptions parameter will not raise an exception. Instead, an NSError object with information about the error will be created and returned via the address given with the optional error argument. If information about the problem is not required, error may be NULL. For convenience methods that do not have an error argument, the primary method is invoked with NULL passed as the argument for error.

Important:
Methods raise NSInvalidArgumentException if regex is NULL, or if capture < 0 or is not valid for regex.
Important:
Methods raise NSRangeException if range exceeds the bounds of the receiver.
Important:
Search and replace methods raise RKLICURegexException if replacement contains $n capture references where n is greater than the number of capture groups in the regular expression regex.
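
The kind of argument validation that leads to these exceptions can be sketched in C as two small predicates. This is only an illustration of the conditions listed above, not RegexKitLite's code. The function names are hypothetical, and where RegexKitLite raises an exception the sketch simply returns 0.

```c
#include <stddef.h>

/* Returns 1 if the range starting at `location` and spanning `length`
 * units lies within a string of `string_length` units, 0 otherwise.
 * The subtraction form avoids overflow in location + length. */
int range_is_within_bounds(size_t location, size_t length,
                           size_t string_length) {
    return location <= string_length &&
           length <= string_length - location;
}

/* Returns 1 if `capture` is usable for a regular expression with
 * `capture_count` capture groups (0 refers to the whole match). */
int capture_is_valid(long capture, long capture_count) {
    return capture >= 0 && capture <= capture_count;
}
```

A caller following the rules above would reject a range for which range_is_within_bounds returns 0 in the way NSRangeException does, and a capture index for which capture_is_valid returns 0 in the way NSInvalidArgumentException does.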

Tasks

Clearing Cached Information
Determining the Number of Captures
Dividing Strings
Identifying Matches
Determining the Range of a Match
Modifying Mutable Strings
Creating Temporary Strings from a Match
Replacing Substrings