RKRegex Class Reference
PCRE 7.6
Availability: Available in Mac OS X v10.4 or later.
Overview
The RKRegex class declares the programmatic interface for the RKRegex framework to the PCRE regular expression pattern matching library.
Some of the noteworthy features provided by the RKRegex class are:
- Multithreading safe.
- The PCRE compiled regular expression is cached. See RKCache for more information.
- Makes extensive use of the stack for temporary results, avoiding expensive and time consuming memory allocations.
Note:
Since the regular expression is cached and reused again and again, the regular expression is always studied. See Studying a Pattern and pcre_study for more information.
The RKRegex class provides the low level primitives necessary to perform regular expression matching. The matching functions perform their work on raw byte buffers and provide match results in the form of NSRange structures containing the range of a match and the range of any matching subpatterns of a regular expression.
In addition to the low level matching primitives, the RKRegex class provides information about the underlying PCRE library, such as the version with the method PCREVersionString and PCRE compile time options via PCREBuildConfig. Methods are also provided for obtaining information about the named capture subpatterns, if any, of an instantiated RKRegex object's compiled regular expression.
In general, the RKRegex class is not used by end-user applications directly. Since the RKRegex class only provides low level primitives, end-user functionality is provided via various category extensions to common Foundation objects, such as the RegexKit framework additions to NSArray, NSDictionary, NSSet, NSString, and their mutable variants. Match enumeration is provided by the RKEnumerator class.
Unicode strings are fully supported by RegexKit, both in the regular expression pattern and the string to search.
The Foundation additions use Unicode strings exclusively for the buffer that RKRegex performs matches against. If a NSString has an encoding format other than ASCII, it is first converted to UTF8 before any matching can occur. Because of this, RKRegex objects must have the RKCompileOption flags RKCompileUTF8 and RKCompileNoUTF8Check set. Bytes with the most significant bit set in UTF8 encoded strings have special meaning that must be interpreted. Without these options set, PCRE treats the buffer as a collection of 8 bit bytes without the required UTF8 decoding.
The various Foundation additions will accept either a RKRegex or a NSString for the regex argument. If the supplied object is a NSString, it is automatically converted to a RKRegex object via the regexWithRegexString:options: method with an option argument of (RKCompileUTF8 | RKCompileNoUTF8Check).
If you are supplying an instantiated RKRegex object instead of using the NSString auto-compile functionality, the RKCompileOption options RKCompileUTF8 and RKCompileNoUTF8Check must be set.
Important:
Failure to set the required RKCompileOption options will cause the supplied RKRegex to be discarded and a new RKRegex object to be created from the discarded regex, with the required UTF8 flags bitwise ORed with any existing options.
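The flag handling described above can be sketched in plain C, which also compiles as Objective-C. This is only an illustration: the ex_ names and the numeric flag values are placeholders invented for the example, and the real RKCompileOption constants come from the RegexKit headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder option type and flag values, for illustration only.
   The real RKCompileOption constants are defined by RegexKit. */
typedef uint32_t ex_compile_option;

enum {
    EX_COMPILE_UTF8          = 1 << 11,
    EX_COMPILE_NO_UTF8_CHECK = 1 << 13,
    EX_REQUIRED_UTF8_OPTIONS = EX_COMPILE_UTF8 | EX_COMPILE_NO_UTF8_CHECK
};

/* True only when both UTF8 flags are already present in options. */
bool ex_has_required_utf8_options(ex_compile_option options) {
    return (options & EX_REQUIRED_UTF8_OPTIONS) == EX_REQUIRED_UTF8_OPTIONS;
}

/* Adds the required UTF8 flags while preserving every existing flag. */
ex_compile_option ex_with_required_utf8_options(ex_compile_option options) {
    return options | EX_REQUIRED_UTF8_OPTIONS;
}
```

Under these assumptions, a regex whose options fail the first check would be recompiled with the result of the second function, which matches the "discard and recreate" behavior described in the note.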
For the purposes of calculating character indexes, Foundation treats all strings as if they were UTF16 encoded. PCRE, on the other hand, uses UTF8 exclusively. This has important consequences when using strings that are encoded in anything but ASCII. It is important to understand that all of the Foundation additions, and the RKEnumerator class, calculate all character index values as UTF16 character indexes. Since PCRE can only operate on UTF8 encoded strings, this requires any NSRange values to be converted between the two character index spaces. This provides transparent interoperability with the rest of Foundation at the expense of having to perform the character index conversion.
However, the RKRegex methods use UTF8 character indexes for all NSRange values. This is an important distinction as NSRange values returned by RKRegex objects will result in undefined behavior if passed to NSString objects without converting to the equivalent UTF16 character indexes. The functions RKConvertUTF8ToUTF16RangeForString and RKConvertUTF16ToUTF8RangeForString can be used to perform the necessary conversions, if required.
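To make the difference between the two index spaces concrete, the following plain C sketch converts a UTF8 byte range into the equivalent UTF16 code unit range. It is not the implementation of RKConvertUTF8ToUTF16RangeForString; the ex_ names are hypothetical, ex_range is a stand-in for NSRange, and the buffer is assumed to be valid UTF8 with the range falling on character boundaries.

```c
#include <stddef.h>

/* Hypothetical stand-in for NSRange. */
typedef struct {
    size_t location;
    size_t length;
} ex_range;

/* Counts the UTF16 code units needed to represent the first byteCount
   bytes of a valid UTF8 buffer. */
size_t ex_utf16_units_for_utf8_bytes(const unsigned char *utf8, size_t byteCount) {
    size_t units = 0;
    for (size_t i = 0; i < byteCount; i++) {
        unsigned char b = utf8[i];
        if ((b & 0xC0) == 0x80) {
            continue;                  /* continuation byte, already counted */
        }
        units += (b >= 0xF0) ? 2 : 1;  /* 4 byte sequences need a surrogate pair */
    }
    return units;
}

/* Converts a range expressed in UTF8 bytes to one expressed in UTF16 units. */
ex_range ex_utf8_range_to_utf16(const unsigned char *utf8, ex_range utf8Range) {
    ex_range result;
    size_t end = utf8Range.location + utf8Range.length;
    result.location = ex_utf16_units_for_utf8_bytes(utf8, utf8Range.location);
    result.length = ex_utf16_units_for_utf8_bytes(utf8, end) - result.location;
    return result;
}
```

For the string "héllo", where é occupies two bytes in UTF8, the UTF8 range {3, 2} covering "ll" becomes the UTF16 range {2, 2}. Passing the unconverted range to a NSString method would select the wrong characters.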
The reasoning behind this is that the RKRegex class provides low level access to the PCRE engine, while the Foundation additions provide abstracted access to the underlying pattern matching engine. There are still many useful tasks that can be performed with low-level access, such as not enabling RKCompileUTF8 and matching raw binary byte buffers. Therefore, the RKRegex class tries to provide unadulterated access to the PCRE matching engine for those users who have special requirements.
The RKRegex class fully supports the NSCoding protocol. When a RKRegex is archived, the regular expression string used to create the receiver is coded, along with any RKCompileOption options. In addition to these two main items, the version and RKBuildConfig flags are also encoded to aid in debugging any unarchiving issues.
If problems are encountered when attempting to initialize a coded RKRegex regular expression, a NSInvalidUnarchiveOperationException is raised. The userInfo portion of the exception contains additional information regarding the failed attempt. Some of the additional information includes any difference in the archiving RKRegex PCRE version, any unknown or unsupported RKCompileOption flags for the current RegexKit, and any differences in RKBuildConfig flags.
Adopted Protocols
Tasks
Class Methods
Returns a RKBuildConfig mask representing features and configuration settings of the PCRE library when it was initially built.
A mask of RKBuildConfig flags combined with the C bitwise OR operator representing features or defaults of the PCRE library that were set when the library was built.
Returns the PCRE library major version.
+ (int32_t)PCREMajorVersion;
Returns the PCRE library minor version.
+ (int32_t)PCREMinorVersion;
Returns a NSString of the PCRE library version.
+ (NSString *)PCREVersionString;
The underlying PCRE library will typically return a version string similar to "7.0 18-Dec-2006".
Returns a NSString encapsulated copy of the characters returned by the pcre_version() library function.
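As an illustration of the version string format shown above, the following plain C sketch extracts the leading major and minor numbers. The function name ex_parse_pcre_version is hypothetical, and the sketch assumes the string begins with "major.minor" as in the example; PCREMajorVersion and PCREMinorVersion remain the direct way to obtain these values.

```c
#include <stdio.h>

/* Parses the leading "major.minor" of a version string such as
   "7.0 18-Dec-2006". Returns 1 on success, 0 otherwise. */
int ex_parse_pcre_version(const char *versionString, int *major, int *minor) {
    return sscanf(versionString, "%d.%d", major, minor) == 2;
}
```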
Returns a Boolean value that indicates whether regexString and options are valid.
+ (BOOL)isValidRegexString:(NSString * const)regexString options:(const RKCompileOption)options;
Returns YES if valid, NO otherwise.
Returns the current regular expression cache.
Convenience method for an autoreleased RKRegex object.
+ (id)regexWithRegexString:(NSString * const restrict)regexString library:(NSString * const restrict)libraryString options:(const RKCompileOption)libraryOptions error:(NSError **)error;
Returns an autoreleased RKRegex object if successful, nil otherwise.
Convenience method for an autoreleased RKRegex object.
+ (id)regexWithRegexString:(NSString * const)regexString options:(const RKCompileOption)options;
Returns an autoreleased RKRegex object if successful, nil otherwise.
Instance Methods
Returns the number of captures that the receiver's regular expression contains.
Every regular expression has at least one capture representing the entire range that the regular expression matched. Additional subcaptures are created with () pairs.
Returns the capture index for captureNameString, or the first capture index of captureNameString if compiled with RKCompileDupNames.
- (RKUInteger)captureIndexForCaptureName:(NSString * const)captureNameString;
Returns the capture index for captureNameString from a match operation, or the capture index of the first successful match for captureNameString if RKCompileDupNames is used and there are multiple instances of captureNameString in the receiver's regular expression.
- (RKUInteger)captureIndexForCaptureName:(NSString * const restrict)captureNameString inMatchedRanges:(const NSRange * const restrict)matchedRanges;
Used primarily when a regular expression is compiled with RKCompileDupNames, or when the (?J) option has been set, to determine the capture index for the first successful match in the matchedRanges result from getRanges:withCharacters:length:inRange:options:. If none of the captures named captureNameString matched, NSNotFound is returned.
May also be used when a regular expression is not compiled with RKCompileDupNames or there is only a single instance of captureNameString, in which case the result is the capture index of captureNameString only if captureNameString successfully matched, otherwise NSNotFound is returned.
The first capture index that matched in matchedRanges for captureNameString, otherwise NSNotFound if there were no successful matches for any of the capture indexes of captureNameString.
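The selection rule described above can be sketched in plain C. This is an illustration only, not the RegexKit implementation: the ex_ names are hypothetical, ex_range stands in for NSRange, EX_NOT_FOUND stands in for NSNotFound, and the sketch assumes that a capture which did not participate in the match has its location set to the not-found value.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_NOT_FOUND SIZE_MAX  /* hypothetical stand-in for NSNotFound */

/* Hypothetical stand-in for NSRange. */
typedef struct {
    size_t location;
    size_t length;
} ex_range;

/* Returns the first index in nameIndexes whose entry in matchedRanges
   records a match, or EX_NOT_FOUND when none of them matched.
   nameIndexes lists every capture index that shares one capture name. */
size_t ex_first_matched_capture(const size_t *nameIndexes, size_t nameIndexCount,
                                const ex_range *matchedRanges) {
    for (size_t i = 0; i < nameIndexCount; i++) {
        size_t captureIndex = nameIndexes[i];
        if (matchedRanges[captureIndex].location != EX_NOT_FOUND) {
            return captureIndex;
        }
    }
    return EX_NOT_FOUND;
}
```

With a duplicated name that maps to capture indexes 1 and 2, and a match in which only capture 2 participated, the function returns 2.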
Returns the capture index for captureNameString from a match operation, or the capture index of the first successful match for captureNameString if RKCompileDupNames is used and there are multiple instances of captureNameString in the receiver's regular expression.
- (RKUInteger)captureIndexForCaptureName:(NSString * const restrict)captureNameString inMatchedRanges:(const NSRange * const restrict)matchedRanges error:(NSError **)error;
This method is similar to captureIndexForCaptureName:inMatchedRanges: except that it optionally returns a NSError object for error conditions instead of throwing an exception. The error parameter may be set to nil if information about the error is not required.
Important:
Exceptions are still thrown for invalid argument conditions, such as passing nil for captureNameString or matchedRanges.
Returns a NSArray which maps the capture names in the receiver's regular expression to their equivalent capture index values.
- (NSArray *)captureNameArray;
If the regular expression of the receiver uses named subcaptures (i.e., (?<year>(\d\d)?\d\d) ), then for each capture name there exists a corresponding capture index. A NSArray is created with captureCount elements and for every capture name the corresponding array index is set to a NSString of the capture name. If there is no capture name for an index, a NSNull is used instead.
This method returns nil if the receiver's regular expression does not contain any named subcaptures.
Returns a NSArray which maps the capture names in the receiver's regular expression to their equivalent capture index values, or nil if the receiver's regular expression does not contain any capture names.
Returns the capture name for the captured index.
- (NSString *)captureNameForCaptureIndex:(const RKUInteger)captureIndex;
Returns the capture name for captureIndex, otherwise nil if captureIndex does not have a name associated with it.
A mask of RKCompileOption flags combined with the C bitwise OR operator representing the options used in compiling the regular expression of the receiver.
Low level regular expression matching method.
- (RKMatchErrorCode)getRanges:(NSRange * const restrict)ranges withCharacters:(const void * const restrict)charactersBuffer length:(const RKUInteger)length inRange:(const NSRange)searchRange options:(const RKMatchOption)options;
- ranges
  Caller supplied pointer to an array of at least captureCount NSRange structures.
  Warning:
  Failure to provide a correctly sized ranges array will result in memory corruption.
- charactersBuffer
  Pointer to the start of characters to search.
- length
  Length of charactersBuffer.
- searchRange
  The range within charactersBuffer to match.
  Important:
  Raises a NSRangeException if length or searchRange is invalid or represents an invalid combination.
- options
  A mask of options specified by combining RKMatchOption flags with the C bitwise OR operator.
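The kind of check implied by the NSRangeException condition above can be sketched in plain C. The exact validation RKRegex performs is not described here, so this is only one plausible reading: the function name is hypothetical, and it treats a combination as valid when searchRange lies entirely within a buffer of the given length.

```c
#include <stdbool.h>
#include <stddef.h>

/* True when [location, location + rangeLength) lies within a buffer of
   bufferLength characters. The comparison is arranged so that
   location + rangeLength is never computed and cannot overflow. */
bool ex_search_range_is_valid(size_t bufferLength, size_t location, size_t rangeLength) {
    return location <= bufferLength && rangeLength <= bufferLength - location;
}
```

For a buffer of length 10, the range {0, 10} passes this check while {5, 6} does not.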
This method is the low level matching primitive to the PCRE library.
getRanges:withCharacters:length:inRange:options: allocates all of the memory needed to perform the regular expression matching and store any temporary results on the stack. The match results, if any, are translated from the PCRE library format to the equivalent NSRange format and stored in the caller supplied ranges NSRange array. For nearly all cases this means that there is no associated malloc() overhead involved. See rangesForCharacters:length:inRange:options:, which creates an autorelease buffer to store the results, if the caller is unable to provide a suitable buffer.
It is important to note that setting searchRange.location and adding the equivalent offset to charactersBuffer are not the same thing. The value of charactersBuffer marks the hard start of the buffer, whereas a positive searchRange.location makes the characters from charactersBuffer up to searchRange.location available to the matching engine. This is an important distinction for some types of regular expressions, such as those that use lookbehind (i.e., (?<=)), which may require examining characters that are not strictly within searchRange.
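The distinction can be demonstrated without a regular expression engine. The plain C sketch below hand codes a toy equivalent of the pattern (?<=a)b; it is not PCRE or RegexKit code, and the ex_ names are hypothetical. The search may look back at characters before the search start, but never before the hard start of the buffer.

```c
#include <stddef.h>

#define EX_NOT_FOUND ((size_t)-1)  /* hypothetical stand-in for NSNotFound */

/* Toy stand-in for the pattern (?<=a)b: finds the first 'b' at or after
   searchStart that is immediately preceded by 'a'. The look back may read
   characters before searchStart, but never before the start of buffer. */
size_t ex_find_b_preceded_by_a(const char *buffer, size_t length, size_t searchStart) {
    for (size_t i = searchStart; i < length; i++) {
        if (buffer[i] == 'b' && i > 0 && buffer[i - 1] == 'a') {
            return i;
        }
    }
    return EX_NOT_FOUND;
}
```

Given the buffer "ab", searching from a start of 1 finds the 'b' at index 1 because the preceding 'a' is still visible. Passing the buffer advanced by one character with a start of 0 finds nothing, since the 'a' now lies before the hard start of the buffer.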
Returns the number of captures matched (>0) on success, otherwise a RKMatchErrorCode (<0) on failure. The values in ranges are only modified on a successful match.
Returns a RKRegex object initialized with the regular expression regexString using the regular expression pattern matching library with RKCompileOption options.
- (id)initWithRegexString:(NSString * const restrict)regexString library:(NSString * const restrict)library options:(const RKCompileOption)libraryOptions error:(NSError **)error;
- regexString
  The regular expression to compile.
- library
  The regular expression pattern matching library to use. See Regular Expression Libraries for a list of valid constants.
  Note:
  Currently the only supported regular expression matching library is the RKRegexPCRELibrary PCRE library.
- libraryOptions
  A mask of options specified by combining RKCompileOption flags with the C bitwise OR operator.
- error
  An optional parameter that, if set and an error occurs, will contain a NSError object that describes the problem. This may be set to NULL if information about any errors is not required.
Unlike initWithRegexString:options:, this method does not throw an exception on errors. Instead, a NSError object is created and returned via the optional error parameter.
Important:
Exceptions are still thrown for invalid argument conditions, such as passing nil for regexString or library.
Returns a RKRegex object if successful, nil otherwise.
Returns a RKRegex object initialized with the regular expression regexString with RKCompileOption options.
- (id)initWithRegexString:(NSString * const restrict)regexString options:(const RKCompileOption)options;
Raises RKRegexSyntaxErrorException if regexString in combination with options is not a valid regular expression. The exception provides a userInfo dictionary containing the following keys and information:
Table 1 RKRegexSyntaxErrorException userInfo dictionary information.
Key | Object Type | Description |
regexString | NSString | The regexString regular expression that caused the exception. |
regexStringErrorLocation | NSNumber | The location of the character that caused the syntax error. |
regexAttributedString | NSAttributedString | The regexString regular expression with a NSBackgroundColorAttributeName set to [NSColor redColor] for the character that caused the error along with the NSToolTipAttributeName attribute (if supported) set to errorString. |
errorString | NSString | The error string that the PCRE library returned. |
RKCompileOption | NSNumber | The RKCompileOption that was passed with regexString. |
RKCompileOptionString | NSString | A human readable C bitwise OR equivalent string of RKCompileOption options. |
RKCompileOptionArray | NSArray | The human readable equivalent of the individual C bitwise RKCompileOption options flags in a NSArray. |
RKCompileErrorCode | NSNumber | The RKCompileErrorCode that the PCRE library returned. |
RKCompileErrorCodeString | NSString | A human readable equivalent of the RKCompileErrorCode name that the PCRE library returned. |
Currently creates a regular expression using the RKRegexPCRELibrary PCRE library.
Returns a RKRegex object if successful, nil otherwise.
Returns a Boolean value that indicates whether captureNameString is a valid capture name for the receiver.
- (BOOL)isValidCaptureName:(NSString * const)captureNameString;
Returns a Boolean value that indicates whether matchCharacters of length in searchRange with options is matched by the receiver.
- (BOOL)matchesCharacters:(const void * const restrict)matchCharacters length:(const RKUInteger)length inRange:(const NSRange)searchRange options:(const RKMatchOption)options;
- matchCharacters
  The characters to match against. This value must not be NULL.
- length
  The number of characters in matchCharacters.
- searchRange
  The range within matchCharacters to match against.
  Important:
  Raises a NSRangeException if any part of searchRange lies beyond the end of matchCharacters.
- options
  A mask of options specified by combining RKMatchOption flags with the C bitwise OR operator.
YES if the receiver matches matchCharacters of length length within searchRange with options, otherwise NO.
Returns the range of captureIndex for the first match in matchCharacters of length length inside searchRange with options matched by the receiver.
- (NSRange)rangeForCharacters:(const void * const restrict)matchCharacters length:(const RKUInteger)length inRange:(const NSRange)searchRange captureIndex:(const RKUInteger)captureIndex options:(const RKMatchOption)options;
- matchCharacters
  The characters to match against. This value must not be NULL.
- length
  The number of characters in matchCharacters.
- searchRange
  The range within matchCharacters to match against.
  Important:
  Raises a NSRangeException if any part of searchRange lies beyond the end of matchCharacters.
- captureIndex
  The index of the capture subpattern of the receiver's regular expression whose matched range is to be returned.
- options
  A mask of options specified by combining RKMatchOption flags with the C bitwise OR operator.
A NSRange structure giving the location and length of captureIndex for the first match in matchCharacters of length length inside searchRange with options that is matched by the receiver. Returns {NSNotFound, 0} if the receiver does not match matchCharacters.
Returns a pointer to an array of NSRange structures that correspond to the capture indexes of the receiver for the first match in matchCharacters of length length in searchRange with options.
- (NSRange *)rangesForCharacters:(const void * const restrict)matchCharacters length:(const RKUInteger)length inRange:(const NSRange)searchRange options:(const RKMatchOption)options;
- matchCharacters
  The characters to match against. This value must not be NULL.
- length
  The number of characters in matchCharacters.
- searchRange
  The range within matchCharacters to match against.
  Important:
  Raises a NSRangeException if any part of searchRange lies beyond the end of matchCharacters.
- options
  A mask of options specified by combining RKMatchOption flags with the C bitwise OR operator.
The returned pointer to an array of captureCount NSRange structures is automatically freed just as an autoreleased object would be released; you should copy any values that are required past the autorelease context in which they were created.
There is no need to free() the returned result as it will automatically be deallocated at the end of the current NSAutoreleasePool context.
Example code
// Assumes that regexObject and characters exist
size_t charactersLength = strlen(characters);
NSRange *captureRanges = [regexObject rangesForCharacters:characters length:charactersLength inRange:NSMakeRange(0, charactersLength) options:RKMatchNoOptions];
// A NULL result means that the receiver did not match
if(captureRanges != NULL) {
  RKUInteger x;
  for(x = 0; x < [regexObject captureCount]; x++) {
    // Cast to unsigned long so the format specifiers are correct on both 32 and 64 bit builds
    NSLog(@"Capture index %lu location %lu, length %lu", (unsigned long)x, (unsigned long)captureRanges[x].location, (unsigned long)captureRanges[x].length);
    NSLog(@"NSRange string %@", NSStringFromRange(captureRanges[x]));
  }
}
A pointer to an autoreleased allocation of memory that is sizeof(NSRange) * [self captureCount] bytes long and contains captureCount NSRange structures with the location and length for the capture indexes of the first match in matchCharacters of length length within the range searchRange using options.
Returns NULL if the receiver does not match matchCharacters using the supplied arguments.
Returns the regular expression used to create the receiver.
- (NSString *)regexString;