This document contains a summary of the changes and important information for each version of RegexKit that developers should be aware of.
PCRE 7.6 includes an important security-related buffer overflow fix. From the PCRE 7.6 Change Log:
A problem with linking against the framework on Xcode versions prior to 3.0 or under Mac OS X 10.4 has come to light. The problem lies with the DTrace functionality introduced in RegexKit 0.5 and the fact that Apple uses a new linker section type to hold the information required for DTrace support. This is only a problem with the build step that performs the final link of .o files into a single executable image, not with the run time linking that takes place at the start of program execution. This problem is harmless in the sense that it does not cause any problems with the execution of applications linked under Xcode 3.0 / Mac OS X 10.5 and run on a Mac OS X 10.4 system. It is, however, a major inconvenience if you are attempting to use the older toolchain for development.
I have opened a bug with Apple, ID # 5698078, regarding this issue. I can't say if this will result in a fix from Apple, but a bug report has at least been filed. Update: The first bug has been closed as Behaves correctly. Resubmitted bug (ID # 5708443) because the justification given for closing the bug suggests the engineer did not actually read the bug report.
Unfortunately, I can neither find nor think of an elegant workaround for this issue. A proper fix from Apple would involve a trivial modification to the ld linker so that it simply ignores the DTrace section instead of throwing an error. None of the common tools, such as ld, nmedit, or strip, appear to have the ability to remove the offending section. strip does support a 'stub library' mode that strips a shared library of all of its executable code and data but leaves all the dynamic library symbol information required to perform the final linking step. When this feature was tested, however, the offending DTrace Object Format section remained in the stripped stub library.
A very kludgey workaround, if you absolutely must link using the older tools, is to rebuild the framework with DTrace support disabled. You can disable DTrace by following these steps:
When doing some quick tests of this procedure, the cleaning step seems to be important. I'm not sure what's lingering that causes the final linked product to have a DTrace Object Format section, but cleaning seems to do the trick. While I did not need to remove RegexKitProbes.d from the RegexKit Framework compiled sources, it is another potential source of 'DTrace contamination.' If you need to, you can stop RegexKitProbes.d from being 'compiled' (it only creates a .h header, not actual code) by removing it from Targets > RegexKit Framework > Compile Sources.
Depending on your needs, there are a few ways to use the rebuilt framework. The simplest is to use it as-is by copying the rebuilt framework over the copy you are currently linking against. This will also have the effect of causing the rebuilt framework, without DTrace support, to be copied into your application's bundle and used during execution, assuming your application links to RegexKit as outlined in Adding the RegexKit.framework to your Project.
The other option requires a bit more work, but retains the DTrace functionality. Since Mac OS X 10.4 can properly execute the framework version with DTrace support, the rebuilt framework (without DTrace support) is really only needed as a temporary stand-in to allow the final linking step to complete. This would involve altering your application's build settings so that the linking step uses the rebuilt framework, but the Copy Files build phase that copies the RegexKit framework into your application's bundle copies the fully functional version.
Another possibility would be to use the simpler first method for the majority of your work, but manually replace the RegexKit shared library file inside your application's .app bundle on an as-needed basis. From the shell, this would probably be something like (as a single line, in case your browser was forced to split the line for rendering purposes):
This release brings a number of forward-looking changes to the framework's API. Two changes that you should be aware of are the addition of a library: parameter to the RKRegex class and the addition of an error: parameter to many methods. The version of PCRE has been upgraded to the latest available, 7.6.
The version of PCRE used by RegexKit has been upgraded to version 7.6. Users are encouraged to read the PCRE Change Log for information regarding the 7.6 release. In summary, this release is largely a bug fix release and introduces no new features or major improvements.
The purpose of the library: parameter is to make it possible to use additional regular expression libraries in the future. As of this release the only supported library remains the PCRE library, which is specified using the RKRegexPCRELibrary constant. Nearly all the functionality provided by RegexKit is independent of the underlying pattern matching library, and most regular expression pattern matching libraries provide similar APIs for compiling regular expressions, performing matches, and extracting the results of a match. Ideally, a generic pattern matching library interface can be created, hiding the details of implementing support for individual regular expression libraries behind a common API.
Adding the error: parameter to methods allows users of the framework to use the NSError paradigm for catching and reporting errors. Initially, RegexKit would throw an exception for error conditions such as a regular expression syntax error. Now, when using a method with an error: parameter, these types of error conditions no longer result in an exception being thrown. Instead, information about the cause of the error is returned via an NSError object. These NSError objects can be handled using the same infrastructure used in handling other NSError error conditions, such as displaying errors using the NSResponder presentError: method or the NSAlert alertWithError: method.
One of the guidelines used to determine whether an error condition should throw an exception or create an NSError object is whether the error condition could be the result of user input. As an example, passing nil as an argument for a required parameter is almost certainly due to programmer error and will throw an NSInvalidArgumentException exception. However, a syntax error in a regular expression might just be a mistake on the user's part and will create an NSError object with information regarding the error, along with an NSLocalizedDescriptionKey and NSLocalizedFailureReasonErrorKey that are suitable for displaying to the user. The goal is to simplify the process of dealing with user-generated error conditions, ideally making it possible to hand any NSError objects directly over to the standard NSError display machinery. Returning error conditions through the common NSError error: convention avoids having to bracket calls to the RegexKit framework within @try / @catch blocks whenever unpredictable user input could otherwise result in an exception being thrown.
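The guideline above can be illustrated with a small, language-neutral sketch. This is a Python analogue rather than RegexKit code: `compile_pattern` and the dictionary keys are hypothetical names chosen for illustration, and Python's `re` module stands in for the pattern matching library. Programmer errors raise an exception, while errors that may come from user input are returned as a value the caller can display.

```python
import re

def compile_pattern(pattern):
    """Return (compiled_regex, error). Hypothetical helper, not a RegexKit API."""
    if pattern is None:
        # A missing required argument is a programmer error: raise.
        raise TypeError("pattern must not be None")
    try:
        return re.compile(pattern), None
    except re.error as exc:
        # A syntax error may simply be a user's typo: report it as data
        # that can be handed to whatever displays errors to the user.
        error = {
            "description": "The regular expression could not be compiled.",
            "reason": str(exc),
        }
        return None, error

regex, error = compile_pattern("(unclosed")
if error is not None:
    print(error["description"], error["reason"])
```

The caller checks the returned error instead of wrapping every call in an exception handler, which is the convenience the error: parameter is meant to provide.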
To support the creation of error reporting strings that are displayed to the user, the ability to localize the strings used by the RegexKit framework was started. In addition to this, a number of alternative substitutions for the error strings returned by the PCRE library were created in order to provide a consistent Cocoa user experience. For example, an error string provided by the PCRE library:
And the localizable alternative text provided and used by RegexKit:
This release begins the process of adding the ability to localize the strings used by RegexKit for different locales. Much of the internal infrastructure required to localize strings has been put in place; however, any additional localizations will have to be contributed by users, as the author of RegexKit only speaks English. See the Change Log for 0.6.0 Beta for details.
The Safari plug-in, Safari AdBlock, uses RegexKit to perform regular expression pattern matches. This involves checking each URL against a list of regular expressions to check for a match. This need to determine if any of the regular expressions in a collection matched a common string was the inspiration for this feature. There are three techniques that RegexKit uses to accelerate this particular task:
For the parallel, multithreaded evaluation of regular expressions, the framework creates a thread pool with a number of threads equal to the number of CPUs available. If there is only a single CPU, no thread pool is created and evaluation of regular expressions is done sequentially. When running under Mac OS X 10.5, the new thread affinity feature is used to bind each thread to a separate CPU.
When searching an NSArray for the first matching regular expression in the array using firstMatchingRegexInArray:, an additional performance improvement is available. If a regular expression in the NSArray matches, only regular expressions with a lower index value (that is, regular expressions that are before the matching regular expression) are checked from that point on, skipping any regular expressions that may have bubbled to the top from many successful matches, but occur at a later point in the array than the current lowest match.
Five new DTrace probe points, RegexKit:::BeginSortedRegexMatch, RegexKit:::EndSortedRegexMatch, RegexKit:::BeginSortedRegexSort, RegexKit:::EndSortedRegexSort, and RegexKit:::SortedRegexCache, were added to assist in evaluating the performance aspect of using this new functionality. In addition to the DTrace probe points, two new Instruments.app instruments were added, Collection Cache and Collection Timing, to provide easy access to the information from the new DTrace probe points.
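As a rough illustration of the pool sizing rule described above, here is a Python sketch. It is not RegexKit's implementation: `any_regex_matches` and its `cpu_count` parameter are hypothetical, thread affinity is not shown, and Python's `re` module will not actually run matches in parallel. Only the structure (one worker per CPU, sequential fallback on a single CPU) is the point.

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor

def any_regex_matches(patterns, subject, cpu_count=None):
    """Return True if any pattern matches subject. Hypothetical helper."""
    cpus = cpu_count or os.cpu_count() or 1
    compiled = [re.compile(p) for p in patterns]
    if cpus == 1:
        # Single CPU: no pool is created, evaluate sequentially.
        return any(r.search(subject) is not None for r in compiled)
    # Otherwise use one worker thread per available CPU.
    with ThreadPoolExecutor(max_workers=cpus) as pool:
        return any(pool.map(lambda r: r.search(subject) is not None, compiled))

print(any_regex_matches([r"/ads/", r"banner"], "http://example.com/banner.gif"))
```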
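The pruning rule for finding the first matching regular expression can be sketched as follows. This is a Python approximation under stated assumptions, not the framework's code: `first_matching_index` and the `hit_counts` list are hypothetical, and the visit order is simply "most frequently matched first" to mimic regular expressions bubbling to the top.

```python
import re

def first_matching_index(regexes, hit_counts, subject):
    """Return the lowest array index whose regex matches, or None."""
    # Visit frequently matched regexes first (the "sorted" order).
    order = sorted(range(len(regexes)), key=lambda i: -hit_counts[i])
    lowest = None
    for i in order:
        if lowest is not None and i > lowest:
            # A later array position cannot beat the current lowest match.
            continue
        if regexes[i].search(subject):
            hit_counts[i] += 1
            lowest = i
    return lowest

regexes = [re.compile(p) for p in ("foo", "bar", "ba")]
hits = [0, 0, 5]
print(first_matching_index(regexes, hits, "bar"))
```

Because every index below the current lowest match is still visited, the result is the same as a plain front-to-back scan; the sorted order only changes how quickly an upper bound is found.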
These enhancements are added as extensions to the NSObject class and documentation for the methods can be found here.
This release contains enhanced DTrace support via RegexKit-specific DTrace user-defined static probes. Additional information can be found in DTrace Probe Points in RegexKit.
A collection of instruments for Instruments.app are now included that use the new DTrace probe points to assist you in your debugging efforts.
Added indexSetOfObjectsMatchingRegex: and indexSetOfObjectsMatchingRegex:inRange: to the NSArray category additions.
This release brings regular expression matching functionality to the NSData class. See NSData RegexKit Additions Reference for additional information.
Previously, the RegexKit NSString additions and RKEnumerator class expected and returned all NSRange results as UTF8 character indexes. This behavior was not explicitly documented. From an API perspective, the Foundation NSString class treats all strings as if they were UTF16 encoded for the purposes of calculating character indexes, regardless of the string's actual encoding. This is a problem when moving from one character index domain to the other without first converting the results for Unicode strings that contain non-ASCII characters. As of version 0.4.0, the API for the NSString additions and the RKEnumerator class calculates all character indexes exactly the same way as the Foundation NSString class does. Because of this, RegexKit NSRange values can be used interchangeably with Foundation NSRange values and produce the expected results. Previous versions of RegexKit may not have produced the expected results, depending on the specific interaction of the regular expression and the underlying strings being matched.
It is important to note that the API of the RKRegex class continues to use PCRE's native UTF8 format for the purposes of calculating character indexes. This class is not used directly by most users, and those that make use of it require the more exposed access to the PCRE library that it provides. See Important NSRange Differences for more information.
In addition to this change, the NSString additions and RKEnumerator objects now require all RKRegex objects to have the RKCompileUTF8 option set. Normally, the bytes that PCRE performs matches against are treated as raw 8-bit data. For Unicode strings, a byte with its most significant bit set (i.e., >= 128 / 0x80) has special meaning that must be interpreted to extract the additional Unicode information. Since the NSString additions and RKEnumerator class can only operate on NSString Unicode strings, RKCompileUTF8 is now set for any NSString regex arguments (e.g., [searchString rangeOfRegex:@"(.*)"]), and any RKRegex object arguments (e.g., [searchString rangeOfRegex:regexObject]) are recompiled with the option set if it is not enabled in the supplied RKRegex object. See Foundation Additions RKCompileOption Requirements for more information.
Due in large part to the changes described above, Unicode support is significantly enhanced. In addition to enabling the interpretation of Unicode information for the bytes that are being searched, RKCompileUTF8 can subtly alter the behavior of regular expressions. The reason for this is that without RKCompileUTF8, PCRE treats the search buffer as raw 8-bit data. For example, the definition of the regular expression metacharacter . (dot) is 'match any character except newline'. Without RKCompileUTF8, a character is defined as a single byte; RKCompileUTF8 changes this to a single Unicode character, which can be anywhere from one to six bytes. Normally these changes are invisible and alter the definitions in ways you would intuitively expect. However, these changes can alter a regular expression in subtle, but critical, ways. See UTF8 Support for more information. It should be noted that RegexKit always enables the UTF8 and Unicode properties features in the PCRE library.
While these features and capabilities existed in prior versions of RegexKit, the RKCompileUTF8 option was not enabled by default for regular expressions passed to the NSString additions and RKEnumerator class, whether as an NSString or as an RKRegex object. Certainly most users would expect this option and the behavior it enables to be enabled when matching NSString objects, which are always Unicode strings. As mentioned previously, the NSString additions and RKEnumerator class now require the RKCompileUTF8 option, and will forcibly enable it if not present.
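To make the difference between the two index domains concrete, the following Python sketch converts a range expressed as byte offsets into the UTF-8 encoded buffer (which is how PCRE reports match positions) into the equivalent range of UTF-16 code units. The helper name `utf8_range_to_utf16` is hypothetical and is not part of RegexKit; the sketch also assumes the range falls on character boundaries.

```python
def utf8_range_to_utf16(text, location, length):
    """Convert a (location, length) range in UTF-8 bytes to UTF-16 code units."""
    data = text.encode("utf-8")

    def utf16_units(utf8_bytes):
        # Each UTF-16 code unit occupies two bytes in the UTF-16-LE encoding.
        return len(utf8_bytes.decode("utf-8").encode("utf-16-le")) // 2

    prefix = data[:location]
    body = data[location:location + length]
    return utf16_units(prefix), utf16_units(body)

# "café" starts at byte 7 and is 5 bytes long in UTF-8,
# but starts at index 6 and is 4 units long in UTF-16.
print(utf8_range_to_utf16("naïve café", 7, 5))
```

For pure ASCII text the two ranges are identical, which is why the mismatch only shows up with non-ASCII characters.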
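The effect on the . (dot) metacharacter can be demonstrated with an analogy in Python, whose `re` module matches either raw bytes or Unicode text depending on the pattern type. This is only an analogue of PCRE's behavior with and without UTF8 mode, not PCRE itself.

```python
import re

text = "é"                      # one Unicode character
raw = text.encode("utf-8")      # two bytes: 0xC3 0xA9

byte_match = re.match(rb".", raw)   # bytes pattern: dot matches one byte
char_match = re.match(r".", text)   # text pattern: dot matches one character

print(len(byte_match.group()))      # number of bytes consumed by the dot
print(len(char_match.group().encode("utf-8")))
```

In byte mode the dot consumes only the first half of the character, so a pattern such as `^.$` would fail against this input even though it contains a single character.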
RegexKit now supports the Perl style \u, \l, \U, \L, and \E case conversion escape sequences along with the \digit capture subpattern syntax for replacement reference strings. Case conversion is performed with the NSString uppercaseString and lowercaseString methods, which follow Unicode case conversion rules.
See Case Conversion Syntax for additional information.
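For readers unfamiliar with the Perl style escapes, here is a minimal Python sketch of how such a replacement template might be expanded. It is an illustration under assumptions, not RegexKit's implementation: `expand_replacement` is a hypothetical name, only single digit \digit references are handled, and the exact precedence rules in RegexKit should be taken from Case Conversion Syntax.

```python
import re

def expand_replacement(template, match):
    """Expand \\1-style references with \\u \\l \\U \\L \\E case escapes."""
    out = []
    span = None      # "U" or "L" while inside \U...\E or \L...\E
    one_shot = None  # "u" or "l": affects only the next emitted character

    def emit(text):
        nonlocal one_shot
        if not text:
            return
        if span == "U":
            text = text.upper()
        elif span == "L":
            text = text.lower()
        if one_shot == "u":
            text = text[0].upper() + text[1:]
        elif one_shot == "l":
            text = text[0].lower() + text[1:]
        one_shot = None
        out.append(text)

    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            nxt = template[i + 1]
            i += 2
            if nxt in "0123456789":
                emit(match.group(int(nxt)) or "")
            elif nxt in "ul":
                one_shot = nxt
            elif nxt in "UL":
                span = nxt
            elif nxt == "E":
                span = None
            else:
                emit(nxt)   # any other escaped character is taken literally
        else:
            emit(ch)
            i += 1
    return "".join(out)

m = re.search(r"(\w+) (\w+)", "hello WORLD")
print(expand_replacement(r"\u\1 \L\2\E!", m))
```

Here \u capitalizes only the first character of the first capture, while \L lowercases everything up to the matching \E.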
You can now 'Subscribe' to the RegexKit documentation. Xcode will then periodically check to see if there's a newer version of the documentation available and allow you to automatically update to the newer documentation.
The documentation is now made available in the new Xcode 3.0 DocSet format. This allows you to access the RegexKit documentation from within Xcode. If you open the Research Assistant ( Help > Show Research Assistant ), and then place the insertion point over a RegexKit method, the Research Assistant will display the relevant section of documentation. Full text searching of the documentation is also available from the documentation viewer. Although the Subscribe button is present, there is no subscription available at this time (this seems to be a bug in Xcode: no URL is present to even subscribe to). Automatic documentation updating will be added in a later release.
RegexKit now includes support for Mac OS X 10.5's Garbage Collection feature. Cocoa's Garbage Collection system requires that all linked files have Garbage Collection enabled for it to be active. RegexKit supports both the new Garbage Collection system and the older retain / release system. At load time, RegexKit automatically switches to the GC system if Cocoa has enabled it; otherwise, it falls back to the older retain / release way of managing memory.
RegexKit now supports 64-bit cleanly. For Mac OS X users, the framework binary now includes the following architectures:
Architecture | ABI | Minimum Mac OS X |
---|---|---|
ppc | 32 bit | Mac OS X 10.4 |
i386 | 32 bit | Mac OS X 10.4 |
ppc64 | 64 bit | Mac OS X 10.5 / Leopard |
x86_64 | 64 bit | Mac OS X 10.5 / Leopard |
RegexKit 0.3.0 marks the introduction of a Mac OS X Installer based distribution. It bundles together the Mac OS X binary executable RegexKit.framework, the documentation, and the source code into one convenient package.
Began to enable parallel building where possible. For now, the PCRE target and part of the Distribution target are converted. The global build configuration flag PARALLEL_BUILD controls this feature. The PCRE target can be selectively disabled with PCRE_PARALLEL_BUILD = NO, and the Distribution target can be selectively disabled with DISTRIBUTION_PARALLEL_BUILD = NO.
By default it will spawn as many jobs as there are CPUs.
RegexKit 0.2.0 had a bug in the Info.plist file. The settings for the keys CFBundleShortVersionString and CFBundleVersion were set to the variable ${REGEXKIT_CURRENT_VERSION} when they should have been set to ${PROJECT_CURRENT_VERSION}.
Fixed a very minor memory leak. Forgot to free() the per-thread local storage struct when a thread exited.
The version of PCRE used was updated to the latest available: 7.4.
Nearly all changes have been to the build system. Very few changes, with the exception of upgrading to PCRE 7.4, resulted in any changes to the framework proper.
A priority for the project is to get a usable version of the framework in to the hands of developers. This means that the framework code, API documentation, and end user distribution packaging and layout are the top priorities. Other issues, such as documentation on the implementation, will be allowed to fall behind in order to meet the other objectives in a timely manner.
This release adds the first version numbers internal to the framework. Some of these changes technically have a substantial impact on those linking to the framework, but pragmatically are a complete non-issue because the framework is embedded and copied into end users' application bundles. See ChangeLog for additional information.
The framework additions to these two classes are likely to be the most volatile going forward, as I expect that the vast majority of users need to perform regex operations on strings.
Candidly, it is extremely difficult to craft a usable, clean, and intuitive API. I often lack the proper perspective of what's required by end users since I tend to view things from the perspective of what's easiest to implement, rather than what's easiest to use. FEEDBACK REGARDING THE API IS EXTREMELY APPRECIATED.
I am still uncertain as to how well supported Unicode is in practice. I have no Unicode experience to speak of, so it is extraordinarily difficult for me to conceive of test vectors to verify proper operation. Being a native English speaker who cut his teeth exclusively on ASCII, I am certain there are things that would seem almost comically obvious to non-English readers and writers but I am completely ignorant of. I strongly desire to have complete, error free Unicode support in the framework so any help you can provide is greatly appreciated!
You can help in the following ways:
The implementation documentation (RegexKitImplementation.html) has not been updated in any meaningful way to track the latest changes. Since the audience for this documentation is expected to be a small fraction of the RegexKit user base, I'm letting it slip in the interest of getting out a product that is used by most people.
This release did not include any testing of the GNUstep configuration. It is likely that there are a few issues that would prevent an 'out of the box' clean build under GNUstep at this time, but the amount of work should be minimal. It has worked flawlessly in the past, so it should be a matter of bringing things up to date. I certainly try to code things with GNUstep compatibility in mind, but 99.99% of the work happens under Mac OS X Cocoa, so things slip through.
The first public release of the RegexKit framework.