As of version 0.4, all NSRange arguments and results of the RegexKit NSString additions and the RKEnumerator class use the same method as Foundation for calculating character index values. Specifically, Foundation treats all strings as if they were UTF16 encoded, whereas PCRE requires all strings to be UTF8 encoded. The details of converting between UTF8 and UTF16 character index values, and therefore NSRange values, are now handled automatically by the framework, providing a consistent UTF16 character index API for all Foundation additions and RKEnumerator class objects.
In UTF8 encoded strings, only character values below 128 (0x80) can be represented with a single byte, and these map one to one to ASCII characters. To represent non-ASCII characters, UTF8 uses a variable number of additional bytes, from one to three under the current definition of UTF8 (RFC 3629), to represent a single Unicode character. Therefore, when dealing with international text, there is no longer a one to one relationship between a 'character' and a byte as there is in ASCII text.
UTF16 encoded strings use two bytes as their fundamental data type. The vast majority of text can be encoded with a single UTF16 character per Unicode character, but characters outside the Basic Multilingual Plane require two UTF16 characters (a surrogate pair) to represent a single Unicode character. Only UTF32 has a fixed length, at four bytes per character, which simplifies 'character index' issues but is not space efficient for the majority of text. Even ASCII characters, of which there are only 128, require four bytes each in UTF32.
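The size differences between the encodings can be sketched in C. This is only an illustration, not part of RegexKit: the function names are hypothetical, and it assumes the RFC 3629 definition of UTF8, which caps a sequence at four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes needed to encode one Unicode code point in UTF8.
   Returns 0 for values that are not valid Unicode characters. */
size_t utf8_length(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not characters */
    if (cp < 0x80)      return 1;               /* ASCII: one byte */
    if (cp < 0x800)     return 2;               /* one additional byte */
    if (cp < 0x10000)   return 3;               /* two additional bytes */
    if (cp <= 0x10FFFF) return 4;               /* three additional bytes */
    return 0;                                   /* beyond the Unicode range */
}

/* Number of 16-bit units needed to encode one code point in UTF16. */
size_t utf16_length(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not characters */
    if (cp < 0x10000)   return 1;               /* Basic Multilingual Plane */
    if (cp <= 0x10FFFF) return 2;               /* surrogate pair */
    return 0;                                   /* beyond the Unicode range */
}
```

For example, ß (U+00DF) takes two bytes in UTF8 but a single UTF16 unit, while a character such as U+1D11E takes four bytes in UTF8 and two UTF16 units.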
As of version 0.4, these issues are abstracted by the framework. All strings are treated logically as if they were encoded as UTF16 for NSRange calculation purposes. Previous versions calculated NSRange values based on PCRE's native UTF8 encoding format. This meant that NSRange results from RegexKit could be used correctly as arguments to another RegexKit method for international text, but could not be used as arguments to native Foundation methods, and vice versa. In short, as of version 0.4, NSRange results and arguments should 'just work' interchangeably with NSRange values from native Foundation methods.
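The kind of index conversion the framework now performs can be illustrated with a minimal C sketch. It is not RegexKit's actual implementation: the function name is hypothetical, and it assumes the input is valid UTF8 and that the byte offset falls on a character boundary within the string.

```c
#include <stddef.h>

/* Convert a byte offset into a UTF8 string to the equivalent UTF16
   character index. Assumes `s` is valid UTF8, holds at least
   `byte_offset` bytes, and `byte_offset` is on a character boundary. */
size_t utf16_index_from_utf8_offset(const char *s, size_t byte_offset)
{
    size_t units = 0;

    for (size_t i = 0; i < byte_offset; i++) {
        unsigned char b = (unsigned char)s[i];

        if ((b & 0xC0) == 0x80)
            continue;                 /* continuation byte: already counted */

        /* A lead byte of 0xF0 or above starts a four byte sequence,
           which needs a surrogate pair (two units) in UTF16. */
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

In the UTF8 encoding of Straße, the final e starts at byte offset 6, but its UTF16 index is 5, because ß occupies two bytes in UTF8 and only one UTF16 unit.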
Several methods are available that allow you to create new strings, or replace the matched text, with a string that contains references to the capture subpatterns. The syntax is modeled after Perl's dollar sign variable references, e.g. $1. In addition to capture subpattern references, you can also convert the case of the text. Case conversion is likewise modeled after Perl's \u, \l, \U, \L, and \E escape sequences.
Capture subpattern references have the following three forms:
Or,
Or,
In order to prevent the accidental interpretation of the expanded text of a capture subpattern match reference as a format specifier, the order of substitution and expansion is as follows:
When a capture subpattern reference is expanded in a string, the text of the capture subpattern reference is replaced with the text of the subpattern that was matched by the regular expression.
To include a $ from the reference string in the final replacement text without the possibility of additional interpretation, you may specify two dollar signs consecutively. Only a single dollar sign will be copied to the final expanded text. For example: You owe: $$1234.56 will be copied as You owe: $1234.56. Another example: Does not reference the $${0} capture subpattern is copied as Does not reference the ${0} capture subpattern.
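The doubled dollar sign rule on its own can be sketched in C as follows. This is a hypothetical helper, not RegexKit code: it handles only the $$ case and copies every other character, including single dollar signs that a real implementation would expand as references, unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Copy `src` to `dst`, replacing each "$$" with a single "$".
   Returns 0 on success, or -1 if `dst` (of `dst_size` bytes) might be
   too small. Single "$" characters are copied as-is in this sketch. */
int collapse_dollars(const char *src, char *dst, size_t dst_size)
{
    /* The output is never longer than the input. */
    if (strlen(src) + 1 > dst_size)
        return -1;

    while (*src != '\0') {
        if (src[0] == '$' && src[1] == '$') {
            *dst++ = '$';   /* two dollar signs become one */
            src += 2;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return 0;
}
```

Given You owe: $$1234.56, the sketch writes You owe: $1234.56 to the destination buffer.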
The following escape sequences allow you to convert the case of text in the reference and replacement string. The NSString uppercaseString and lowercaseString methods are used to perform the actual case conversion.
Sequence | Description |
---|---|
\u | Convert the next character to upper case. |
\l | Convert the next character to lower case. |
\U | Convert the following characters to upper case until \E. |
\L | Convert the following characters to lower case until \E. |
\E | End case conversion. |
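The behavior of the sequences in the table can be approximated with a small C sketch. It is an illustration only: the function name is hypothetical, it uses toupper and tolower and is therefore limited to ASCII (the framework uses the NSString methods, which handle Unicode), and it assumes that \u and \l take precedence over an active \U or \L for one character.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Apply \u, \l, \U, \L and \E to `src`, writing the result to `dst`.
   Returns 0 on success, or -1 if `dst` (of `dst_size` bytes) might be
   too small. ASCII only. */
int apply_case_escapes(const char *src, char *dst, size_t dst_size)
{
    int run = 0;   /* 'U' or 'L' while inside \U...\E or \L...\E, else 0 */
    int next = 0;  /* 'u' or 'l' when only the next character is converted */

    /* The output is never longer than the input. */
    if (strlen(src) + 1 > dst_size)
        return -1;

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] != '\0') {
            char e = src[1];
            if (e == 'u' || e == 'l') { next = e; src += 2; continue; }
            if (e == 'U' || e == 'L') { run = e;  src += 2; continue; }
            if (e == 'E')             { run = 0;  src += 2; continue; }
        }

        unsigned char c = (unsigned char)*src++;
        if (next == 'u' || (next == 0 && run == 'U'))
            c = (unsigned char)toupper(c);
        else if (next == 'l' || (next == 0 && run == 'L'))
            c = (unsigned char)tolower(c);
        next = 0;  /* \u and \l apply to one character only */
        *dst++ = (char)c;
    }
    *dst = '\0';
    return 0;
}
```

With this sketch, the input \Uhello\E world produces HELLO world, and \uhello \LWORLD\E produces Hello world.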
The following is the example given in Apple's NSString lowercaseString method documentation, modified to illustrate the case conversion escape sequences. As the example in Apple's documentation illustrates, Unicode case conversion is not necessarily symmetrical.
The following example uses a regular expression that matches all the characters between a and e to replace the matched text with \U$1\E.
The same regular expression is used in the following example, but the replacement reference string is now \u$1. The \u escape sequence causes the next character to be converted to upper case, and in this case ß is converted to SS. The result string, StraSSe, then has the same regular expression applied to it with the replacement reference string changed to \l$1, which converts the next character to lower case. The final result is StrasSe.
Capture subpattern references and type conversions extend regular capture subpattern references by allowing an optional type argument:
Or,
Or,
There are two major classes of conversion types:
The first form, conversion to basic C data types, uses the familiar percent (character '%') style specification. For example, %d converts the string "12345" to an int with the value of 12345.
The second form uses an at sign (character '@') to specify conversion to an Objective-C object. The rest of the conversion specification, such as what type to convert to and conversion options, is similar to the percent style conversion syntax.
Conversions to basic C data types are handled by the scanf standard library function. If the conversion specification begins with a % (percent), then the remaining text up to the end of the conversion is passed to scanf unaltered. Any conversion types or conversion options supported by the library's scanf are available.
Since the conversion specification is passed unaltered to scanf, care should be taken to ensure that only a single conversion is specified. Requesting two or more conversions will result in a crash since the framework will only supply scanf a single valid pointer in which to store the conversion result. It is also not safe to specify a different pointer argument position using the %n$ syntax.
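The single conversion rule can be seen in plain C. The sketch below is an assumption about how such a conversion might be wrapped, not RegexKit's code, and the function names are hypothetical: each call passes exactly one conversion and exactly one destination pointer to sscanf, and checks that one item was converted.

```c
#include <stdio.h>

/* Convert text to an int using a single %d conversion.
   Returns 1 on success, 0 if the text could not be converted. */
int convert_to_int(const char *text, int *out)
{
    /* One conversion in the format, one pointer to receive it. */
    return sscanf(text, "%d", out) == 1;
}

/* Convert hex text, with or without a leading 0x, to an unsigned int
   using a single %x conversion. Returns 1 on success, 0 otherwise. */
int convert_hex_to_uint(const char *text, unsigned int *out)
{
    return sscanf(text, "%x", out) == 1;
}
```

A format such as %d%d would ask scanf to store two results, which is why only one conversion may be requested when a single pointer is supplied.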
The following table lists some of the common percent C data type conversion specifiers. Consult the scanf man page for additional information.
Conversion | Syntax | Converted Type | Example String Forms | Description |
---|---|---|---|---|
Positive or Negative Integers, Non-Floating Point Numbers | %d | int | 1234, -4321 | Numbers without a decimal point. |
Positive Only Integers, Non-Floating Point Numbers | %u | unsigned int | 1234 | Non-negative numbers without a decimal point. |
Hex Values | %x | unsigned int | 0x1234abcd | Hex values starting with 0x. |
Floating Point Numbers | %f | float | 1234.567, -4321.987, 2.657e+7, nan, infinity | IEEE-754 Single Precision Floating Point numbers. |
Floating Point Numbers | %lf | double | 1234.567, -4321.987, 2.657e+7, nan, infinity | IEEE-754 Double Precision Floating Point numbers. The middle l must be included for doubles. |
On Mac OS X the floating point conversions will actually accept and convert a wide variety of string formats. For example, hex values such as 0x1234abcd will be correctly converted to 305441741.0 for the requested floating point type. Even signed hex values, such as -0xfedc9876 are correctly converted. Additionally, C99 style hex floats, such as 0x3.fe69149f758p+45, are also correctly converted.
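The floating point behavior described above can be tried directly with sscanf. This is a minimal sketch with a hypothetical function name. C99 specifies that the floating point conversions accept the same input forms as strtod, including hex, but whether a given C library does so is an assumption that should be checked on the target platform.

```c
#include <stdio.h>

/* Convert text to a double using a single %lf conversion.
   Returns 1 on success, 0 if the text could not be converted. */
int convert_to_double(const char *text, double *out)
{
    /* The l in %lf is required when the destination is a double. */
    return sscanf(text, "%lf", out) == 1;
}
```

On a C99 conforming library, both 2.657e+7 and 0x1234abcd are accepted by this function, the latter yielding 305441741.0.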
Ownership of objects returned by a conversion follows the same rules as for convenience methods such as [NSString stringWithFormat:]. Specifically, they have been sent an autorelease message and do not require any additional release messages.
However, if you require use of a converted object past the current NSAutoreleasePool context, you must take ownership of it by sending a retain message.
NSNumber object conversions are specified with @n and the conversions are performed with a thread-private NSNumberFormatter. A thread's NSNumberFormatter is created when the first NSNumber conversion is requested and is released when the thread exits.
The different types of NSNumberFormatterStyle are made available using certain option characters. For example, the NSNumberFormatterCurrencyStyle is specified as @$n.
Conversion | Syntax | NSNumberFormatterStyle | Example String Forms | NSNumber value | Description |
---|---|---|---|---|---|
General Numbers | @n | NSNumberFormatterNoStyle | 1234, -4321 | 1234.0, -4321.0 | Numbers without a decimal point. |
Decimal Numbers | @.n | NSNumberFormatterDecimalStyle | 1234.5, -4321.9 | 1234.5, -4321.9 | Numbers with a decimal point. |
Currency | @$n | NSNumberFormatterCurrencyStyle | $1,234.56, -$4321.00 | 1234.56, -4321.0 | Uses the current locale settings for currency symbols and number separation style. |
Percentages | @%n | NSNumberFormatterPercentStyle | 100%, 99.99% | 1.0, 0.9999 | Percentage values. For example, 100% is converted to 1.0, 23% becomes 0.23. |
Scientific Notation | @sn | NSNumberFormatterScientificStyle | 9.342124e+06 | 9342124.0 | Numbers specified with an exponent. Equivalent to 9.342124 * 10^6. |
Spelled out words | @wn | NSNumberFormatterSpellOutStyle | two hundred forty-three, fifty-seven point nine five | 243.0, 57.95 | Numbers that are spelled out with words. |
NSCalendarDate object conversions are specified with @d and the conversions are performed with the NSDate class method dateWithNaturalLanguageString:. Since the conversions are performed by a class method, conversions are serialized with a lock to ensure correct multithreading behavior.
Conversion | Syntax | Converted Type | Example String Forms | Description |
---|---|---|---|---|
Dates and Times | @d | NSCalendarDate | '07/20/2007', '6:44 PM', 'Feb 5th', '6/20/2007, 11:34PM EST' | Parses a wide range of date and time formats. |
A regular expression can be specified as either an RKRegex object or an NSString containing the text of a regular expression. When it is specified as an NSString, as determined by sending isKindOfClass:, the receiver converts the string to an RKRegex object via regexWithRegexString:options:.
Objects are sent isMatchedByRegex: to determine whether or not they are matched by the specified regular expression.
If aRegex matches the receiver within range, the supplied pointer to a pointer arguments are updated with the match results and YES is returned. If aRegex matches the receiver multiple times, only the first match within range is used.
If aRegex does not match the receiver within range, none of the supplied pointer to a pointer arguments are altered and NO is returned.
Following the regular expression aRegex, a variable length list of capture subpattern reference / pointer to a pointer type conversion specification pairs is given, terminated with a nil. The calling sequence is similar to NSDictionary dictionaryWithObjectsAndKeys: except that the reference precedes the pointer to a pointer, whereas in the NSDictionary pair ordering the object pointer precedes the reference.
Neither the order in which the capture subpattern reference arguments appear nor the number of times that a capture subpattern reference appears matters.
See Capture Subpattern Reference and Type Conversion Syntax for information on how to specify capture subpatterns and the different types of conversions that can be performed on the matched text. If the optional type conversion is not specified, the default conversion is used: an NSString containing the text of the requested capture subpattern is returned via the pointer to a pointer.
Examples
The same example, demonstrating that an RKRegex object and an NSString of a regular expression may be used interchangeably. Regular expressions passed as an NSString are automatically converted to RKRegex objects before use.
An example demonstrating a hex string converted to the equivalent unsigned int value.
If aRegex matches the receiver, the supplied pointer to a pointer arguments are updated with the match results and YES is returned. If aRegex matches the receiver multiple times, only the first match is used.
If aRegex does not match the receiver, none of the supplied pointer to a pointer arguments are altered and NO is returned.