NSString RegexKit Additions Reference

Extends by category: NSString
RegexKit: 0.6.0
PCRE: 7.6
Availability: Available in Mac OS X v10.4 or later.
Declared in:
  • RegexKit/NSString.h
Overview

NSRange Compatibility with Foundation

As of version 0.4, all NSRange arguments and results of the RegexKit NSString additions and RKEnumerator class use the same method as Foundation for calculating character index values. Specifically, Foundation treats all strings as if they were UTF16 encoded, whereas PCRE requires all strings to be UTF8 encoded. The details of converting between UTF8 and UTF16 character index values, and therefore NSRange values, are now handled automatically by the framework, providing a consistent UTF16 character index API for all Foundation additions and RKEnumerator class objects.

For UTF8 encoded strings, only character values below 128 (0x80) can be represented with a single byte, which is a one to one mapping to ASCII characters. To represent non-ASCII characters, UTF8 uses a special encoding format with a variable number of additional bytes to represent a single Unicode character: from one to three under the current Unicode standard (older definitions of UTF8 allowed up to five). Therefore, when dealing with international text, there is no longer a one to one relationship between a 'character' and a byte as there is in ASCII text.

UTF16 encoded strings use a two byte unit as their fundamental data type. The vast majority of text can be encoded with a single UTF16 unit per character, but some Unicode characters (those outside the Basic Multilingual Plane) require two UTF16 units, known as a surrogate pair. Only UTF32 has a fixed length of four bytes per character, which simplifies 'character index' issues, but is not space efficient for the majority of text. Even ASCII, which has only 128 possible characters, requires four bytes to encode one character in UTF32.
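
The difference between the three encodings can be made concrete with a small amount of plain C. The following is a minimal sketch, not part of RegexKit: the names text_lengths and measure_utf8 are hypothetical, and the function assumes its input is valid, NUL-terminated UTF8.

```c
#include <stddef.h>

/* Length of one string measured three ways. */
typedef struct {
    size_t utf8_bytes;   /* what a UTF8 index counts */
    size_t utf16_units;  /* what a Foundation NSRange counts */
    size_t code_points;  /* what a UTF32 index would count */
} text_lengths;

/* Hypothetical helper. Assumes s is valid UTF8; no validation is performed. */
text_lengths measure_utf8(const char *s)
{
    text_lengths n = {0, 0, 0};
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        n.utf8_bytes++;
        if ((*p & 0xC0) == 0x80)
            continue;  /* continuation byte of a character already counted */
        n.code_points++;
        /* A lead byte of 0xF0 or above starts a four byte sequence, which
           encodes a character that UTF16 stores as a surrogate pair. */
        n.utf16_units += (*p >= 0xF0) ? 2 : 1;
    }
    return n;
}
```

For the string Straße, written in C as "Stra\xc3\x9f" "e", this reports 7 UTF8 bytes but 6 UTF16 units, which is why a UTF8 based index and a Foundation index disagree for every character after the ß.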

As of version 0.4, these issues are now abstracted by the framework. All strings are treated logically as if they were encoded as UTF16 for NSRange calculation purposes. Previous versions calculated NSRange values based on PCRE's native UTF8 encoding format. This meant that NSRange results from RegexKit could be used as arguments by another RegexKit method correctly for international text, but could not be used as arguments for native Foundation methods and vice-versa. In short, NSRange results and arguments should 'just work' as expected with NSRange results from native Foundation methods.

Important:
The RKRegex class continues to use NSRange values calculated as UTF8 character indexes. See RKConvertUTF8ToUTF16RangeForString and RKConvertUTF16ToUTF8RangeForString if you need to perform conversions.
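
To illustrate what such a range conversion involves, the following plain C sketch converts a UTF8 byte range to the equivalent UTF16 range. It is not the implementation of RKConvertUTF8ToUTF16RangeForString; the names text_range, utf16_units_in, and utf8_range_to_utf16 are hypothetical, and the code assumes valid UTF8 and a range that lies within the string and begins and ends on character boundaries.

```c
#include <stddef.h>

typedef struct { size_t location; size_t length; } text_range;

/* Number of UTF16 units needed for the first nbytes bytes of valid UTF8. */
static size_t utf16_units_in(const unsigned char *p, size_t nbytes)
{
    size_t units = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((p[i] & 0xC0) == 0x80)
            continue;                      /* continuation byte */
        units += (p[i] >= 0xF0) ? 2 : 1;   /* surrogate pair above U+FFFF */
    }
    return units;
}

/* Hypothetical helper: maps a byte range in s to a UTF16 unit range. */
text_range utf8_range_to_utf16(const char *s, text_range utf8)
{
    const unsigned char *p = (const unsigned char *)s;
    text_range r;
    r.location = utf16_units_in(p, utf8.location);
    r.length   = utf16_units_in(p + utf8.location, utf8.length);
    return r;
}
```

In the string Straße, the final e occupies the UTF8 range {6, 1} but the UTF16 range {5, 1}, because the ß before it takes two bytes in UTF8 and one unit in UTF16.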

Reference and Replacement String Syntax

Several methods are available that allow you to create new strings, or replace the matched text, with a string that contains references to the capture subpatterns. The syntax is modeled after Perl's dollar sign variable references, i.e. $1. In addition to capture subpattern references, you can also convert the case of the text. Case conversion is also modeled after Perl's \u, \l, \U, \L, and \E escape sequences.

Capture Subpattern Reference Syntax

Capture subpattern references have the following three forms:

\subpatternNumber
  • subpatternNumber
    A single digit number of a capture subpattern.
Example: \1

Or,

$subpatternNumber
  • subpatternNumber
    A single digit number of a capture subpattern.
Example: $1

Or,

${subpattern}
  • subpattern
    The name or number of a capture subpattern.
Examples: ${1}, ${date}

Important:
An RKRegexCaptureReferenceException will be raised if the subpattern number exceeds the number of parenthesized subpatterns in the regular expression, or the named subpattern is not defined by the regular expression.

Order of Format Specifier Argument Substitution and Expansion of Capture Subpattern Match References

In order to prevent the accidental interpretation of the expanded text of a capture subpattern match reference as a format specifier, the order of substitution and expansion is as follows:

Expansion of Capture Subpattern Match References in Strings

When a capture subpattern reference is expanded in a string, the text of the capture subpattern reference is replaced with the text of the subpattern that was matched by the regular expression.

Important:
An RKRegexCaptureReferenceException will be raised if a type conversion is specified and only capture subpattern reference expansion is allowed.

Preventing the interpretation of $

To include a $ from the reference string in the final replacement text without the possibility of additional interpretation, you may specify two dollar signs consecutively. Only a single dollar sign will be copied to the final expanded text. For example: You owe: $$1234.56 will be copied as You owe: $1234.56. Another example: Does not reference the $${0} capture subpattern is copied as Does not reference the ${0} capture subpattern.
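
The escape rule on its own can be sketched in a few lines of plain C. This is only an illustration of the $$ rule: the function name collapse_double_dollar is hypothetical, and unlike the framework it does not expand capture subpattern references, so a single $ is simply copied through.

```c
#include <stddef.h>

/* Hypothetical helper: copies src to dst, replacing each "$$" with a
   single "$". dst must have room for strlen(src) + 1 bytes.
   Returns the number of characters written, excluding the terminator. */
size_t collapse_double_dollar(const char *src, char *dst)
{
    size_t n = 0;
    while (*src != '\0') {
        if (src[0] == '$' && src[1] == '$')
            src++;           /* skip the first dollar sign of the pair */
        dst[n++] = *src++;   /* copy the current character */
    }
    dst[n] = '\0';
    return n;
}
```

Given You owe: $$1234.56 the sketch produces You owe: $1234.56, matching the first example above.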

Case Conversion Syntax

The following escape sequences allow you to convert the text in the reference and replacement string. The NSString uppercaseString and lowercaseString methods are used to perform the actual case conversion.

Case Conversion Escape Sequences
Sequence | Description
\u | Convert the next character to upper case.
\l | Convert the next character to lower case.
\U | Convert the following characters to upper case until \E.
\L | Convert the following characters to lower case until \E.
\E | End case conversion.

The following is the example given in Apple's NSString lowercaseString method documentation, modified to illustrate the case conversion escape sequences. As the example given in Apple's documentation illustrates, Unicode case conversion is not necessarily symmetrical.

example = [NSString stringWithUTF8String:"Stra\xc3\x9f" "e"]; // Straße
upper = [example stringByMatching:@"(.*)" replace:RKReplaceAll withReferenceString:@"\\U$1\\E"]; // STRASSE
lower = [upper stringByMatching:@"(.*)" replace:RKReplaceAll withReferenceString:@"\\L$1\\E"]; // strasse != Straße

The following example uses a regular expression that matches all the characters between a and e to replace the matched text with \U$1\E.

upper = [example stringByMatching:@"(?<=a)(.*?)(?=e)" replace:RKReplaceAll withReferenceString:@"\\U$1\\E"]; // StraSSe
lower = [upper stringByMatching:@"(?<=a)(.*?)(?=e)" replace:RKReplaceAll withReferenceString:@"\\L$1\\E"]; // Strasse

The same regular expression is used in the following example, but the replacement reference string is now \u$1. The \u escape sequence causes the next character to be converted to upper case, and in this case ß is converted to SS. The result string, StraSSe, has the same regular expression applied to it and the replacement reference string is changed to \l$1, which converts the next character to lower case. The final result is StrasSe.

upper = [example stringByMatching:@"(?<=a)(.*?)(?=e)" replace:RKReplaceAll withReferenceString:@"\\u$1"]; // StraSSe
lower = [upper stringByMatching:@"(?<=a)(.*?)(?=e)" replace:RKReplaceAll withReferenceString:@"\\l$1"]; // StrasSe

Capture Subpattern Reference and Type Conversion Syntax

Capture subpattern references and type conversions extend regular capture subpattern references by allowing an optional type argument:

\subpatternNumber
  • subpatternNumber
    A single digit number of a capture subpattern.
Example: \1

Or,

$subpatternNumber
  • subpatternNumber
    A single digit number of a capture subpattern.
Example: $1

Or,

${subpattern:type}
  • subpattern
    The name or number of a capture subpattern.
  • type
    The optional conversion type to perform.
Examples: ${1}, ${date}, ${1:%d}

Important:
An RKRegexCaptureReferenceException will be raised if the subpattern number exceeds the number of parenthesized subpatterns in the regular expression, or the named subpattern is not defined by the regular expression.

Conversion Type Syntax

There are two major classes of conversion types:

The first form, conversion to basic C data types, uses the familiar percent (character '%') style specification. For example, %d converts the string "12345" to an int with the value of 12345.

The second form uses an at sign (character '@') to specify conversion to an Objective-C object. The rest of the conversion specification, such as what type to convert to and conversion options, is similar to the percent style conversion syntax.

C Data Type Conversions

Conversions to basic C data types are handled by the scanf standard library function. If the conversion specification begins with a % (percent), then the remaining text up to the end of the conversion is passed to scanf unaltered. Any conversion types or conversion options supported by the library's scanf are available.

Warning:
The conversion format specification is passed unaltered to the scanf function. No validation of the conversion format is performed.

Since the conversion specification is passed unaltered to scanf, care should be taken to ensure that only a single conversion is specified. Requesting two or more conversions will result in a crash since the framework will only supply scanf a single valid pointer in which to store the conversion result. It is also not safe to specify a different pointer argument position using the %n$ syntax.
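
The shape of a safe conversion, one specifier paired with one destination pointer of the matching type, can be sketched in plain C with sscanf. The helper names capture_to_int and capture_to_hex are hypothetical and are not part of the framework; they only show what a single %d or %x conversion of already matched text amounts to.

```c
#include <stdio.h>

/* Hypothetical helpers. Each performs exactly one conversion into exactly
   one destination and returns 1 if the text was converted, 0 otherwise. */

int capture_to_int(const char *text, int *out)
{
    return sscanf(text, "%d", out) == 1;
}

int capture_to_hex(const char *text, unsigned int *out)
{
    /* %x accepts an optional leading 0x, as in "0xb1223dd8". */
    return sscanf(text, "%x", out) == 1;
}
```

A specification such as %d%d would make scanf look for a second destination pointer that was never supplied, which is the crash described above.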

The following table lists some of the common percent C data type conversion specifiers. Consult the scanf man page for additional information.

Common Conversion Specifiers
Conversion | Syntax | Converted Type | Example String Forms | Description
Positive or Negative Integers, Non-Floating Point Numbers | %d | int | 1234, -4321 | Numbers without a decimal point.
Positive Only Integers, Non-Floating Point Numbers | %u | unsigned int | 1234 | Non-negative numbers without a decimal point.
Hex Values | %x | unsigned int | 0x1234abcd | Hex values starting with 0x.
Floating Point Numbers | %f | float | 1234.567, -4321.987, 2.657e+7, nan, infinity | IEEE-754 Single Precision Floating Point numbers.
Floating Point Numbers | %lf | double | 1234.567, -4321.987, 2.657e+7, nan, infinity | IEEE-754 Double Precision Floating Point numbers. The middle l must be included for doubles.

On Mac OS X the floating point conversions will actually accept and convert a wide variety of string formats. For example, hex values such as 0x1234abcd will be correctly converted to 305441741.0 for the requested floating point type. Even signed hex values, such as -0xfedc9876 are correctly converted. Additionally, C99 style hex floats, such as 0x3.fe69149f758p+45, are also correctly converted.

Note:
A common mistake when converting to a double is omitting the middle l size specifier. When converting from a double to a string, i.e. with printf, the conversion specifier %f can be used even though it strictly refers to a single precision float type. This is because a value of type float passed to printf is automatically promoted to type double, so %f and %lf refer to the same type, double, for output conversions. No such promotion applies to the pointers passed to scanf, so %lf is required when the destination is a double.
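
The asymmetry between output and input conversions can be seen in a short plain C sketch. The function name round_trip_double is hypothetical; it formats a double with %f and reads the text back with %lf.

```c
#include <stdio.h>

/* Hypothetical helper: formats value as text, then parses the text back.
   Returns 1 on success and stores the parsed value in *out. */
int round_trip_double(double value, double *out)
{
    char text[512];  /* large enough for any finite double printed with %f */

    /* Output: %f is correct for a double argument. */
    snprintf(text, sizeof text, "%f", value);

    /* Input: the destination is a double *, so the l is required.
       Using "%f" here would store a float into memory sized for a double. */
    return sscanf(text, "%lf", out) == 1;
}
```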

Objective-C Object Conversions

Ownership of objects returned by a conversion is the same as ownership of objects returned by convenience methods such as [NSString stringWithFormat:]. Specifically, they have been sent an autorelease message and they do not require any additional release messages.

However, if you require use of a converted object past the current NSAutoreleasePool context, you must take ownership of it by sending a retain message.

Conversions to NSNumber
Important:
NSNumber conversions are not available under GNUstep.

NSNumber object conversions are specified with @n and the conversions are performed with a thread private NSNumberFormatter. A thread's NSNumberFormatter is created when the first NSNumber conversion is requested and is released when the thread exits.

The different types of NSNumberFormatterStyle are made available using certain option characters. For example, the NSNumberFormatterCurrencyStyle is specified as @$n.

Note:
All NSNumber conversions are performed with the NSNumberFormatter behavior set to NSNumberFormatterBehavior10_4.

NSNumber type conversion syntax
Conversion | Syntax | NSNumberFormatterStyle | Example String Forms | NSNumber value | Description
General Numbers | @n | NSNumberFormatterNoStyle | 1234, -4321 | 1234.0, -4321.0 | Numbers without a decimal point.
Decimal Numbers | @.n | NSNumberFormatterDecimalStyle | 1234.5, -4321.9 | 1234.5, -4321.9 | Numbers with a decimal point.
Currency | @$n | NSNumberFormatterCurrencyStyle | $1,234.56, -$4321.00 | 1234.56, -4321.0 | Uses the current locale settings for currency symbols and number separation style.
Percentages | @%n | NSNumberFormatterPercentStyle | 100%, 99.99% | 1.0, 0.9999 | Percentage values. For example, 100% is converted to 1.0, 23% becomes 0.23.
Scientific Notation | @sn | NSNumberFormatterScientificStyle | 9.342124e+06 | 9342124.0 | Numbers specified with an exponent. Equivalent to 9.342124 * 10^6.
Spelled out words | @wn | NSNumberFormatterSpellOutStyle | two hundred forty-three, fifty-seven point nine five | 243.0, 57.95 | Numbers that are spelled out with words.

Conversion to NSCalendarDate

NSCalendarDate object conversions are specified with @d and the conversions are performed with the NSDate class method dateWithNaturalLanguageString:. Since the conversions are performed by a class method, conversions are serialized with a lock to ensure correct multithreading behavior.

Common conversion specifiers
Conversion | Syntax | Converted Type | Example String Forms | Description
Dates and Times | @d | NSCalendarDate | '07/20/2007', '6:44 PM', 'Feb 5th', '6/20/2007, 11:34PM EST' | Parses a wide range of date and time formats.

Specifying a Regular Expression

When specifying a regular expression, the regular expression can be either an RKRegex object or an NSString containing the text of a regular expression. When specified as an NSString, as determined by sending isKindOfClass:, the receiver will convert the string to an RKRegex object via regexWithRegexString:options:.

Important:
The method will raise NSInvalidArgumentException if the regular expression is nil. If passed as an NSString, the method will raise RKRegexSyntaxErrorException if the regular expression is not valid.

Determining if an Object Matches a Regular Expression

Objects are sent isMatchedByRegex: to determine whether or not they are matched by the specified regular expression.

Tasks

Capture Extraction and Conversion
Determining the Range of a Match
Enumerating Matches
Identifying Matches
Creating Temporary Strings from Match Results
Search and Replace

Instance Methods

Takes a regular expression and range of the receiver to search, followed by a variable length list of capture subpattern reference and pointer to a pointer type conversion specification pairs.
- (BOOL)getCapturesWithRegex:(id)aRegex inRange:(const NSRange)range references:(NSString * const)firstReference, ... ;
Return Value

If aRegex matches the receiver within range, the supplied pointer to a pointer arguments are updated with the match results and YES is returned. If aRegex matches the receiver multiple times, only the first match within range is used.

If aRegex does not match the receiver within range, none of the supplied pointer to a pointer arguments are altered and NO is returned.

Takes a regular expression followed by a variable length list of capture subpattern reference and pointer to a pointer type conversion specification pairs.
- (BOOL)getCapturesWithRegexAndReferences:(id)aRegex, ... ;
Parameters
  • aRegex
    A regular expression string or RKRegex object.
  • ...
    A comma-separated list of capture subpattern reference and pointer to a pointer type conversion specification pairs, terminated with a nil.
    Warning:
    Failure to terminate the argument list with a nil will result in a crash.
Discussion

Following the regular expression aRegex, a variable length list of capture subpattern reference / pointer to a pointer type conversion specification pairs is given, terminated with a nil. The calling sequence is similar to NSDictionary dictionaryWithObjectsAndKeys: except that the reference precedes the pointer to a pointer, whereas in the NSDictionary pair ordering the object pointer precedes the key.

The order in which the capture subpattern reference arguments appear does not matter, nor does the number of times that a capture subpattern reference appears.

See Capture Subpattern Reference and Type Conversion Syntax for information on how to specify capture subpatterns and the different types of conversions that can be performed on the matched text. If the optional type conversion is not specified then the default conversion to an NSString containing the text of the requested capture subpattern will be returned via pointer to a pointer.
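
The calling convention itself, reference first, destination second, with a terminating nil, can be sketched in plain C with stdarg. This is only an illustration of how such an argument list is walked and why the terminator is required; the function assign_references is hypothetical and stores the reference text in each destination, whereas the framework stores the converted match result.

```c
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical helper: walks (reference, destination) pairs until a NULL
   reference is found. Each destination is a pointer to a pointer, which is
   why callers pass &variable. Returns the number of pairs processed. */
size_t assign_references(const char *first_reference, ...)
{
    va_list args;
    size_t pairs = 0;

    va_start(args, first_reference);
    for (const char *ref = first_reference; ref != NULL;
         ref = va_arg(args, const char *)) {
        const char **destination = va_arg(args, const char **);
        *destination = ref;  /* stand-in for storing a match result */
        pairs++;
    }
    va_end(args);
    return pairs;
}
```

A call might look like assign_references("${1}", &first, "${0}", &whole, (const char *)NULL). Without the final NULL the loop has no way to know where the list ends and reads past the supplied arguments.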

Examples

NSString *capture0 = nil, *capture1 = nil, *capture2 = nil;
NSString *subjectString = @"This is the subject string to be matched";

// Note the use of &, referring to the address containing the pointer, not the value of the pointer.
[subjectString getCapturesWithRegexAndReferences:@"(is the).*(to be)", @"${1}", &capture1, @"${2}", &capture2, @"${0}", &capture0, nil];

// capture0 == @"is the subject string to be";
// capture1 == @"is the";
// capture2 == @"to be";

The same example, demonstrating that an RKRegex object and an NSString containing a regular expression may be used interchangeably. Regular expressions passed as an NSString are automatically converted to RKRegex objects before use.

NSString *capture0 = nil, *capture1 = nil, *capture2 = nil;
NSString *subjectString = @"This is the subject string to be matched";
RKRegex *aRegex = [[RKRegex alloc] initWithRegexString:@"(is the).*(to be)" options:RKCompileNoOptions];

[subjectString getCapturesWithRegexAndReferences:aRegex, @"${1}", &capture1, @"${2}", &capture2, @"${0}", &capture0, nil];

An example demonstrating a hex string converted to the equivalent unsigned int value.

unsigned int convertedHex = 0;
NSString *subjectString = @"Convert this value: 0xb1223dd8";

[subjectString getCapturesWithRegexAndReferences:@"value: (0x[0-9a-f]+)", @"${1:%x}", &convertedHex, nil];

// convertedHex == 0xb1223dd8 (decimal 2971811288)
Return Value

If aRegex matches the receiver, the supplied pointer to a pointer arguments are updated with the match results and YES is returned. If aRegex matches the receiver multiple times, only the first match is used.

If aRegex does not match the receiver, none of the supplied pointer to a pointer arguments are altered and NO is returned.

Returns a Boolean value that indicates whether the receiver is matched by aRegex.
- (BOOL)isMatchedByRegex:(id)aRegex;

Returns a Boolean value that indicates whether the receiver is matched by aRegex within range.
- (BOOL)isMatchedByRegex:(id)aRegex inRange:(const NSRange)range;

Returns an enumerator object that lets you access every match of aRegex in the receiver.
- (RKEnumerator *)matchEnumeratorWithRegex:(id)aRegex;
Discussion
Returns an RKEnumerator object that begins at location 0 of the receiver and enumerates every match of aRegex in the receiver.

Returns an enumerator object that lets you access every match of aRegex within range of the receiver.
- (RKEnumerator *)matchEnumeratorWithRegex:(id)aRegex inRange:(const NSRange)range;
Parameters
  • range
    The range of the receiver to enumerate matches.
Discussion
Returns an RKEnumerator object that enumerates every match of aRegex within range of the receiver.

Returns the range of the first occurrence within the receiver of aRegex.
- (NSRange)rangeOfRegex:(id)aRegex;
Return Value
An NSRange structure giving the location and length of the first match of aRegex in the receiver. Returns {NSNotFound, 0} if the receiver is not matched by aRegex.

Returns the range of aRegex capture number capture for the first match within range of the receiver.
- (NSRange)rangeOfRegex:(id)aRegex inRange:(const NSRange)range capture:(const RKUInteger)capture;
Parameters
  • aRegex
    A regular expression string or RKRegex object.
  • range
    The range of the receiver to search.
  • capture
    The matching range of aRegex capture number to return. Use 0 for the entire range that aRegex matched.
Return Value
An NSRange structure giving the location and length of aRegex capture number capture for the first match within range of the receiver. Returns {NSNotFound, 0} if the receiver is not matched by aRegex.

Returns a pointer to an array of NSRange structures of the capture subpatterns of aRegex for the first match in the receiver.
- (NSRange *)rangesOfRegex:(id)aRegex;
Discussion
See rangesForCharacters:length:inRange:options: for details regarding the returned NSRange array memory allocation.
Return Value
Returns a pointer to an array of NSRange structures of the capture subpatterns of aRegex for the first match in the receiver, or NULL if aRegex does not match.

Returns a pointer to an array of NSRange structures of the capture subpatterns of aRegex for the first match in the receiver within range.
- (NSRange *)rangesOfRegex:(id)aRegex inRange:(const NSRange)range;
Discussion
See rangesForCharacters:length:inRange:options: for details regarding the returned NSRange array memory allocation.
Return Value
Returns a pointer to an array of NSRange structures of the capture subpatterns of aRegex for the first match in the receiver within range, or NULL if aRegex does not match.

Returns a new NSString containing the results of repeatedly searching within range of the receiver with aRegex and replacing up to count matches with the evaluated and expanded text of referenceFormatString.
- (NSString *)stringByMatching:(id)aRegex inRange:(const NSRange)range replace:(const RKUInteger)count withReferenceFormat:(NSString * const)referenceFormatString, ...;
Parameters
  • aRegex
    A regular expression string or RKRegex object.
  • range
    The range of the receiver to perform the search and replace.
  • count
    The maximum number of replacements to perform, or RKReplaceAll to replace all matches.
  • referenceFormatString
    A format string containing format specifiers and capture subpattern references.
  • ...
    A comma-separated list of format specifier arguments to substitute into referenceFormatString.

Returns a new NSString containing the results of repeatedly searching within range of the receiver with aRegex and replacing up to count matches with the text of referenceString after capture references have been expanded.
- (NSString *)stringByMatching:(id)aRegex inRange:(const NSRange)range replace:(const RKUInteger)count withReferenceString:(NSString * const)referenceString;
Parameters
  • aRegex
    A regular expression string or RKRegex object.
  • range
    The range of the receiver to perform the search and replace.
  • count
    The maximum number of replacements to perform, or RKReplaceAll to replace all matches.
  • referenceString
    The string used to replace the matched text. May include references to aRegex captures with Perl style ${NUMBER} notation. Refer to Capture Subpattern Reference Syntax for additional information.
Discussion
RKReplaceAll can be used for count to specify that all matches should be replaced.
Return Value
An NSString containing the results of repeatedly searching within range of the receiver with aRegex and replacing up to count matches with the text of the replacement string after match references have been expanded.

Returns a new NSString containing the results of expanding the capture references and substituting the format specifiers in referenceFormatString with the text of the first match of aRegex within range of the receiver and the variable length list of format arguments.
- (NSString *)stringByMatching:(id)aRegex inRange:(const NSRange)range withReferenceFormat:(NSString * const)referenceFormatString, ...;

Returns a new NSString containing the results of expanding the capture references in referenceString with the text of the first match of aRegex within range of the receiver.
- (NSString *)stringByMatching:(id)aRegex inRange:(const NSRange)range withReferenceString:(NSString * const)referenceString;

Returns a new NSString containing the results of repeatedly searching the receiver with aRegex and replacing up to count matches with the evaluated and expanded text of referenceFormatString.
- (NSString *)stringByMatching:(id)aRegex replace:(const RKUInteger)count withReferenceFormat:(NSString * const)referenceFormatString, ...;

Returns a new NSString containing the results of repeatedly searching the receiver with aRegex and replacing up to count matches with the text of referenceString after capture references have been expanded.
- (NSString *)stringByMatching:(id)aRegex replace:(const RKUInteger)count withReferenceString:(NSString * const)referenceString;
Discussion
Equivalent to stringByMatching:inRange:replace:withReferenceString: with range specified as the entire range of the receiver.
Return Value
An NSString containing the results of repeatedly searching the receiver with aRegex and replacing up to count matches with the text of the replacement string after match references have been expanded.

Returns a new NSString containing the results of expanding the capture references and substituting the format specifiers in referenceFormatString with the text of the first match of aRegex in the receiver and the variable length list of format arguments.
- (NSString *)stringByMatching:(id)aRegex withReferenceFormat:(NSString * const)referenceFormatString, ...;

Returns a new NSString containing the results of expanding the capture references in referenceString with the text of the first match of aRegex in the receiver.
- (NSString *)stringByMatching:(id)aRegex withReferenceString:(NSString * const)referenceString;
Discussion
Equivalent to stringByMatching:inRange:withReferenceString: with range specified as the entire range of the receiver.