RegexKit

An Objective-C Framework for Regular Expressions using the PCRE Library

Introduction

This document demonstrates how to use regular expressions by incorporating the RegexKit.framework in to your project.

The RegexKit.framework is an Objective-C wrapper for the PCRE (Perl Compatible Regular Expression) library available at www.pcre.org. The RegexKit.framework acts as a bridge between the Objective-C Foundation framework and the PCRE library by providing regular expression pattern matching extensions to NSArray, NSDictionary, NSSet, and NSString, and their mutable variants.

Highlights
  • Multithreading safe.
  • Automatically caches compiled regular expressions.
  • For Mac OS X, the framework is built as a Universal Binary.
  • Uses Core Foundation on Mac OS X for greater speed.
  • PCRE library built in, no need to build or install separately.
  • GNUstep support.
Prerequisites
  • An Objective-C development environment.
  • Cocoa Foundation framework, or compatible.
  • For Mac OS X, 10.4 or greater is required.
  • Some experience with regular expressions.

PCRE Version and Feature Support

PCRE Versions Supported

The version of PCRE used for development was 7.0 18-Dec-2006, which was the latest stable at the time of development. No version prior to 7.0 was tested for compatibility.

Features Supported
  • Named subpattern captures.

    Named subpattern captures are fully supported. You can find the capture index of a capture name with captureIndexForCaptureName:. In general, the convenience methods will automatically convert a capture name to a capture index when the index request is not numeric.

  • Unicode support.

    Provided the underlying PCRE library was built with Unicode support, the default, the framework should support Unicode as well. Currently only the default locale table created when the PCRE library was built is supported.

    Important:
    The author is a native English speaker and has dealt almost exclusively with ASCII strings, therefore the Unicode support probably contains bugs. Bug reports and especially unit tests that exercise the Unicode portions are welcome.
Features Not Supported

Multithreading Safety

Warning:
Multithreaded programming is extremely difficult and error prone. While an effort has been made to make this framework multithreading safe there are a large number of unproven assumptions made, notably that the libraries and frameworks that RegexKit.framework depends on are multithreading safe.

The RegexKit.framework has been written with multithreading safety in mind. The major points on which multithreading safety is predicated on are:

  • The underlying PCRE library is multithreading safe, which it typically is. Consult the PCRE library documentation for details. Version 7.0 of the PCRE library contained the following provision:

    The PCRE functions can be used in multithreading applications, with the proviso that the memory management functions pointed to by pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the callout function pointed to by pcre_callout, are shared by all threads.

    The compiled form of a regular expression is not altered during matching, so the same compiled pattern can safely be used by several threads at once.

  • The implementation of NSMapTable used to store cached RKRegex objects is multiple reader / single writer multithreading safe, which is usually true. Refer to the RKCache class documentation for additional information.
  • The underlying Foundation implementation of NSAutoreleasePool follows the semantics of Cocoa's implementation, which is to say that there is a NSAutoreleasePool per thread. Access to the cache is protected with a multiple reader, single writer lock. If the cache system finds a hit, it performs a retain on the matched object while it still has the cache locked. This protects against premature deallocation if another thread immediately clears the cache, which can only be done under the write lock. Convenience methods that subsequently autorelease the cached object will place it in to that threads NSAutoreleasePool. This ensures that any objects that are returned from the cache remain valid until at least the end of the threads NSAutoreleasePool context that obtained the cached result regardless of any intervening requests to clear the cache by other threads.

RKRegex objects are safe to share between threads. No special precautions need to be taken except for those that are to be expected, such as making sure the object is properly retained so it is not accidentally deallocated during the time between threads.

Thread Local Data

The capture conversion routines can create a NSNumber object by using a NSNumberFormatter to perform the conversion. Since the NSNumberFormatter is likely to be reused again and again, the framework creates and retains a per thread instantiation the first time that thread performs a NSNumber conversion. When the thread exits, the NSNumberFormatter is automatically released. This is required because the documentation for NSNumberFormatter says that a single NSNumberFormatter is not safe to share across threads.

The thread local data structures are managed with the pthread library pthread_getspecific and pthread_setspecific functions.

Atomic Primitives
Required atomic primitives
Primitive Description
RKThreadYield Force the calling thread to yield the CPU to a different thread.
RKIsMainThread Returns YES if the calling thread is the main thread, NO otherwise.
RKAtomicMemoryBarrier Force all pending memory loads and stores to complete before continuing.
RKAtomicIncrementInt Atomically increment a C int (as defined by the platform / compiler) by one.
RKAtomicDecrementInt Atomically decrement a C int (as defined by the platform / compiler) by one.
RKAtomicCompareAndSwapInt Compare a C int (as defined by the platform / compiler) value (referred to as old) with a value (referred to as new) at a location in memory (referred to as ptr) if and only if the value at ptr contains the value of old and can be replaced with the value of new in a single atomic operation.
RKAtomicCompareAndSwapPtr Compare a C void * (as defined by the platform / compiler) value (referred to as old) with a value (referred to as new) at a location in memory (referred to as ptr) if and only if the value at ptr contains the value of old and can be replaced with the value of new in a single atomic operation.

In order to ensure correct multithreading operation, the framework requires a handful of atomic primitives. These are very common primitives, save for the fact that there's no common API to use them.

If RegexKitPrivate.h does not have atomic operations defined for the build platform and GCC 4.1 or greater is used, then the GCC 4.1+ atomic intrinsics will be used if possible.

Locks are handled by two private, internal classes: RKLock and RKReadWriteLock. These use pthread mutex locks as their locking primitives and should compile on any platform with pthread support. The lock objects also provide some debugging capabilities and spurious error handling ability.

Single threaded vs. Multithreaded

The private lock objects optimize their locking strategy so that when running in single threaded mode they do not actually obtain the lock, but simply return as though they had. As soon as multithreading is detected, as determined by [NSThread isMultiThreaded], they automatically switch to multithreading safe full lock acquisition. This can result in an approximate 10% performance increase for the single threaded case. No special precautions or steps are required when switching from single to multithreading.

Cocoa Specifics

The RegexKit.framework was primarily designed and tested with the Mac OS X 10.4 Cocoa environment. Multithreading has been extensively tested and should be safe to use.

GNUstep Specifics

While the RegexKit.framework has had extensive multithreaded testing under the Cocoa environment, there has been much less testing under GNUstep. Testing on FreeBSD 5.4 amd64 with ports GCC 4.1.3 20070402 (and its included ObjC runtime) along with OCUnit v27 using ports GNUstep-base 1.13.1 revealed a number of multithreading bugs, but none (as near as could be determined) with the RegexKit.framework. Notably it would appear the ObjC runtime in GCC 4.1.3 "leaks" the __objc_runtime_mutex lock under unknown conditions. Since the lock is thread re-entrant, this is not really a problem during single threaded use. However, if another thread attempts to acquire the lock, it will obviously deadlock because not all of the lock/unlock calls are balanced. This was observed during certain situations, such as first time selector lookup from a thread other than the main thread (which held/leaked the lock). It was possible to "pre-lookup" the selectors in question before becoming multithreaded and complete most multithreaded tests, but this is not very encouraging. Some debugging revealed that this lock had been leaked before a single line of test code had been executed meaning the condition was caused outside the RegexKit.framework and unit testing code base during initialization. Also, unit tests that exercised exception (ie, NSException) conditions had to be completely disabled as this would 'seem' to cause the ObjC runtime to deadlock but was unable to be debugged because of bugs in the debugger.

In short, single threaded use should be fine. Multithreading should be used cautiously, but the framework itself seems to be multithreading safe. After pre-populating certain ObjC runtime structures to prevent deadlock, the multithreading tests completed without incident except for previously mentioned NSException tests which were disabled. Caveat emptor.

Building the RegexKit.framework with Xcode

The Xcode project included with the framework distribution comes with a number of targets. In addition to the targets, a number of additional project settings have been added to assist in the building of the PCRE library. You can view these by choosing the Project > Edit Project Settings and then clicking on the Build tab.

Included Xcode targets
TargetDescription
Unit TestsTests that exercise the functionality of the framework.
PCREDownloads, configures, and builds the PCRE library.
RegexKit FrameworkBuilds the framework so that it can be used as an embedded private framework.
DocumentationBuilds the documentation for the framework from the comments in the header files.
DistributionBuilds the pcre, Embedded Framework, and Documentation targets and then packages them in to the distribution groups RegexKit and RegexKit_source. The finished distribution files are in the build/Distribution directory.

When building the framework, it may be useful to select the menu item Build > Build Results to open the build results window. A number of build targets will output status messages prefixed with debug: that will allow you to monitor the targets build progress. Additional information is available via the Build Transcript icon (resembles a text document) on the middle pane splitter. This allows you to view the targets build output as you would see it if you were to perform the build from the shell.

Unit Tests

Included with the distribution are a set of unit tests. They are divided in to three categories:

  • Functionality
  • Multithreading
  • Timing

The multithreading and timing tests both duplicate the majority of the code in the functionality tests while adding additional code to either exercise the tests under multithreading conditions or to time repeated executions of the tests.

The timing tests record the amount of CPU time that has elapsed rather than the amount of "wall time". This is because the amount of CPU time required to execute a test remains fairly consistent regardless of any other activities the machine executing the tests is performing, whereas the amount of "wall time" a test takes can vary considerably if the tests are executing concurrently with other processes that are consuming a significant amount of CPU time.

The multithreading tests add some additional tests to enable, disable, and clear both the entire cache and individual entries. A complete series of all the tests is a round. The multithreading test prepares a number of threads, each of which executes a configurable number of rounds while delaying a random amount of time in-between rounds. The reasoning behind a random amount of time between rounds is to maximize the overlap of different test cases executing simultaneously. While non-deterministic, the tests at least provide some confidence that multithreading behavior is somewhat reliable.

Important:
The multithreading tests can trigger a bug in the malloc() memory allocation library on Mac OS X up to at least version 10.4.10. The Apple Problem ID is 5083704. It is only triggered when MallocStackLogging or MallocStackLoggingNoCompact environment variables have been set.

In short, the stack logging code has a lock to synchronize updates to the stack log. Occasionally when the stack log grows, the entire stack recording allocation must be moved to a new block in memory in order to accommodate the request (ie, realloc()). A race condition exists when this happens because the stack logging code keeps the lock within that allocated memory. By chance, another thread may need to record a stack allocation while a different thread is moving the stack recording allocation, and begin waiting for the lock at the original allocation address. However, once the thread that has the lock moves the stack recording allocation, the blocked thread will be waiting on a lock which will no longer be the valid because it has been moved with the reallocation.

The program will core dump at this point because the underlying memory for the old block has been unmapped from the process space. The top most frame in the stack trace will be a call to spin_lock(), which was called by stack_logging_log_stack().

These crashes are not due to a bug in the RegexKit.framework code.

Building the PCRE library

The PCRE target automatically downloads and builds the PCRE library. There are a number of project settings that help control the PCRE library build process.

Noteworthy PCRE related project settings
Name Default Description
PCRE_CONF_CFLAGS -isysroot ${SDKROOT} CFLAGS passed to the PCRE configure script.
PCRE_CONF_LDFLAGS -Wl,-syslibroot,${SDKROOT} LDFLAGS passed to the PCRE configure script.
PCRE_CONF_OPTIONS --enable-utf8 --enable-unicode-properties --disable-cpp --disable-shared Options to enable or disable that is passed to the PCRE configure script. The default configures Unicode support and disables C++ support. Only the static link library is required, not the shared dynamic link library.
PCRE_TARBALL_FILE ${PCRE_NAME_VERSION}.tar.bz2 The file name for the tarred and compressed PCRE distribution file.
PCRE_URL_ROOT ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre The base URL where the PCRE distribution can be found.
PCRE_VERSION 7.0 The version number of the PCRE distribution.
PCRE_NAME_VERSION pcre-${PCRE_VERSION} The base name of the PCRE distribution.

These settings, along with the make file Makefile.pcre, work together to build the PCRE library. The make file automatically detects and adds the appropriate configuration options based on a limited set of Xcode settings such as the architecture(s) to build for if creating a Universal Binary and the appropriate debugging and optimization flags.

The end result is an automated process configured by the project settings for performing the typical cycle of:

  • Download the distributions tarball.
  • Decompress and untar the distribution.
  • Run the included ./configure script with the desired options.
  • Build the distribution.
  • Install the distribution.

The scripts configure the PCRE library to install in the projects ${BUILD_DIR} directory. Different configuration styles are built and installed in different locations, just as your regular code would be. Since the target is driven by a Makefile, the library is only built as required. Since it is not as tightly integrated in to the Xcode build process, it is a good idea to clean the target if you make any non-obvious changes that directly affect the PCRE library. You should review Makefile.pcre for additional information on the PCRE library build process.

It should be possible to change the PCRE_VERSION setting and rebuild the framework to switch to a different version of PCRE. Although things naturally change from release to release, the pcre target is hopefully robust enough to accommodate most simple changes. Any changes required to successfully build a new distribution are likely to be limited to the Makefile.pcre file.

The RegexKit.framework statically links to the PCRE library. This was chosen because the 'right thing' for using dynamic libraries is neither obvious nor easy. While a copy of the PCRE shared library could be included, what is the correct behavior if the user has the PCRE library installed as well? And what if, in the unlikely event, your regular expressions are not compatible with users PCRE library? Troubleshooting such an issue would be very difficult because there would be no way to easily determine if a different PCRE library version was causing the problems. For this, and other reasons, it was chosen to statically link the PCRE library to the framework. It forms an always known base, and various steps are taken so that none of the PCRE library symbols are exposed outside of the framework. This prevents inadvertent tampering with the PCRE internals, such as the pcre_callout feature, and prevents a linking conflict if a different PCRE library is linked to for whatever reason.

GNUstep

Needs to be updated and checked.

The RegexKit.framework has been tested with GNUstep, though not nearly as extensively as testing done with Mac OS X Cocoa. In general, the framework should work under GNUstep, but the shear number of variations of operating systems, compilers, and revision combinations makes testing difficult. Also, GNUstep is not the primary development target of the framework. Certainly the foundation for GNUstep compatibility has been laid, and any changes required for a specific platform are hopefully minimal. The various atomic primitives, determined and defined in REPrivate.h, will have to be added if they are not automatically configured.

The header markup used is Apples HeaderDoc which, unfortunately, is not compatible with GNUstep GSDoc format. Therefore there is no native GNUstep style GSDoc documentation provided.

You should review the multithreading notes in this document and the GNUstep/README.GNUstep file in the distribution for additional information.

Implementation Details

The following macros are used to control various compiler specific features:

Compiler feature macros
Macro Compiler Description
RKREGEX_STATIC_INLINE GCC >= 4 Portable equivalent of FOUNDATION_STATIC_INLINE which causes a function to be defined as static and always inlined.
RK_EXPECTED GCC >= 4 Used to provide a hint to the compiler to aid instruction scheduling when the outcome of a conditional evaluation is known with a high degree of certainty.
RK_ATTRIBUTES GCC >= 4 Defines GCC specific function attributes which can help the compiler in optimizations, sentry checks, and diagnostic messages.

The following C Preprocessor defines are used to control various compile time options:

Compile time configuration flags
Flag Default Description
USE_MACRO_EXCEPTIONS Enabled if not Mac OS X Controls whether or not NS_DURING / NS_HANDLER / NS_ENDHANDLER macros are used to catch exceptions. Enabled if not running under Cocoa or GCC < 3.3. Otherwise, on Mac OS X >= 10.3, the newer style -fobjc-exceptions @try / @catch compiler exception syntax is used.
USE_AUTORELEASED_MALLOC Enabled Controls whether the framework makes use of a special object for temporary memory allocations that are returned to callers for certain types of results. Currently this only used by the rangesForCharacters:length:inRange:options: family of functions to return an array (in the C sense, not NSArray) of NSRange results from a match. In short, the memory allocation that is returned is automatically free()d when the current NSAutoreleasePool context pops. Memory allocations are % 16 aligned. When disabled, a NSMutableData of sufficient capacity is substituted instead. Only the mutableBytes pointer is returned, the object itself is autoreleased.
USE_PLACEHOLDER Enabled Controls whether or not the framework takes the additional step of substituting a placeholder object when allocWithZone: is called with a special singleton object. The singleton object then receives the initWithRegexString:options: call and checks the cache first. Only when the regular expression is not already in the cache is a new object allocated to fill the request. Without this modification a new object is allocated only to be immediately released if a match is found in the cache.
USE_CORE_FOUNDATION Enabled on Mac OS X See OpenStep Foundation vs. Core Foundation for details.
_USE_DEFINES Enabled Controls whether or not a number of utility functions are replaced with macro equivalents or if static inline functions are used. For example, NSMakeRange is redefined as NSMakeRange(x, y) (NSRange){x, y}.
Object Hash Value and Equality

The hash value for a RKRegex object is computed by taking the hash value of the string representation of the regular expression and performing a bitwise exclusive-or with its RKCompileOption value, (ie, stringHash ^ compileOptions). Collisions are therefore strongly dependent on the string implementations hash function. Identical regular expression strings with different RKCompileOption will obviously be unique.

Cache membership

A NSMapTable is used to store the cached objects. The key used to store and retrieve items from the cache is the RKRegex computed hash value. If the cache is asked to add a RKRegex and there is already a RKRegex with the same hash value, the existing object is kept in the cache and the new object is not added. Assuming that equal hashes represent true equality and not a collision (that is, two different RKRegex have identical hash values), this policy keeps the RKRegex that is already in use 'hot' in the CPU caches.

Exported symbols

The file Source/Build/export_list contains the symbols that are visible to programs that link to the framework. This acts as a filter to ensure only the symbols required to use the framework are exported. The remaining symbols are invisible and unavailable to programs that link with the framework. While many items are declared static to prevent external visibility, sometimes a symbol needs to be available outside of it's compile unit, but still private to the framework. New functionality that requires user visibility of the symbol(s) will have to be added here. New Objective-C methods and the corresponding @selector() signatures are not added here, they are added dynamically at load time by the Objective-C runtime system.

Use of Global Variables

In general what few global variables that do exist are defined static so they are not exported outside of the compile unit. Use of global variables is multithreading safe by either being constant (ie, the major version of the PCRE library) or protected by locks to control access. Since the regular expression cache is accessed with class methods it requires global variables to manage the cache state, but those variables are not exported and all access to the shared state is controlled by locks. Initialization of any global variables is done in functions marked with the constructor attribute or the classes initialize method. In either case, atomic barriers are used to prevent the highly unlikely event that the initialization routines are invoked multiple times, or simultaneously from multiple threads.

Memory Allocation and the use of the Stack

The RegexKit.framework tries, whenever possible, to use the stack for all of its memory allocation needs. Stack based allocation has essentially no overhead associated with it and usually maps to a handful of instructions executed on most architectures. Since the allocation is associated with the stack, there is no resource to free as it is 'released' when the stack frame pops. Internally, the framework uses alloca() for these allocations. Extreme care must be taken when dealing with functions that make use of alloca(), however. The reasons for this are beyond the scope of this text, but safe to say that with all the numerous advantages comes an equally numerous amount of caveats and responsibilities. If you have no idea why this warning is here, you have no business altering code that makes use of alloca(). You have been warned.

OpenStep Foundation vs. Core Foundation

The USE_CORE_FOUNDATION flag controls whether or not the framework uses Apples Core Foundation C API or the OpenStep Foundation Objective-C API to perform most of the low level object creation and manipulation. Using Core Foundation can increase overall performance since Core Foundation is an Objective-C aware library that uses a C based API. This eliminates the Objective-C message dispatch overhead and results in significantly less object creation / autorelease overhead. The functionality provided by RegexKit.framework is identical in either case, essentially substituting Core Foundation calls for the equivalent OpenStep Foundation methods.

Using Core Foundation also causes the framework to optimize some retain / autorelease calls. Since Core Foundation allows you to specify custom collection callbacks for retain and release, the framework makes use of this by eliminating the retain callback when it can safely do so. For example, if the framework is requested to return an NSArray of results, which is equivalent to CFArray, the framework instantiates all the objects to be contained in the CFArray and then creates the CFArray using a CFArrayCallBacks with a NULL retain callback. When the instantiated objects are added to the CFArray, their retain count remains the same, effectively transferring ownership from the creating RKRegex to the CFArray. In this case it eliminates the typical autorelease / retain / release calling sequence for every object in the collection. The framework uses this same technique in certain convenience methods, such as getCapturesWithRegexAndReferences:, where it groups all of the instantiated objects to be returned in to a single CFArray. Only the CFArray with all the objects is added to the NSAutoreleasePool, again eliminating a number of extraneous retain / release calls per object.

Cocoa vs. GNUstep

The primary development target is the Mac OS X Cocoa environment. However, an effort has been made to support the GNUstep environment as well. The two pre-processor defines, __MACOSX_RUNTIME__ and __GNUSTEP_RUNTIME__ (defined in RegexKitDefines.h), are used to determine which platform is being targeted. Use of the GNUstep runtime automatically disables USE_CORE_FOUNDATION and enables USE_MACRO_EXCEPTIONS, which represents the largest difference between the Mac OS X Cocoa and GNUstep targets.

Multithreading locks

Two private classes, RKLock and REReadWriteLock, are used internally to hide the host implementation details of the locking mechanism used. Currently pthread mutex and rwlock primitives are used to provide the locking functionality. A collection of debugging statistics can be toggled at run time. Debugging statistics kept are generally of the number of times the lock was busy on the first attempt to acquire it and the number of times the lock spun before it was finally acquired. Since the statistics are for debugging only, no effort is made to ensure that counter increments are atomic across threads due to the inherit performance penalty of such atomic operations.

A number of C function calls are privately exported to make use of instantiated lock objects. They accept the C API equivalent of an Objective-C method invocation, specifically func(id self, SEL _cmd, ...). This is done to bypass the normal Objective-C message dispatch overhead for maximum speed.

The locking code tries to be robust in the event of spurious errors or wakeups. RKLOCK_MAX_SPURIOUS_ERROR_ATTEMPTS (defined in RKLock.h) controls the number of spurious errors the lock attempts to overcome before finally giving up in failure. The default value is 2. When a spurious error occurs the lock objects always atomically increment their spurious error counter.

Warning:
Finding legitimate test conditions to exercise code that is not on the common path is very difficult. Consequently overall behavior under unusual lock conditions is not well tested.

Framework Dependencies

The following lists the major external dependencies that the RegexKit.framework is dependent on.

Framework Dependencies
NameURLDescription
PCREhttp://www.pcre.org/Provides the regular expression matching engine.

License Information

The code for this framework is licensed under what is commonly known as the revised, 3-clause BSD-Style license.

License

Copyright © 2007-2008, John Engelhart

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the Zang Industries nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

 
RegexKit project hosted by: