RegExp - has anyone seen this library before?


sohfeyr

I'm not looking to step on anyone's toes here. I know how annoying it is to spend a lot of time developing something and then have someone say an alternative is already out there. I just put this forward for your consideration. Whoever is working on the regular expressions functions these days might get some ideas or insight or whatever.

The PCRE library is free online at ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/, and the posted builds range from January 2002 to July 2005.

According to section 3.2.1.8 of Jeffrey E. F. Friedl's Mastering Regular Expressions, 3rd Edition (O'Reilly: August 2006, ISBN 0-596-52812-4):

Philip Hazel developed PCRE, his library for Perl Compatible Regular Expressions, a high-quality regular-expression engine that faithfully mimics the syntax and semantics of Perl regular expressions. Other developers could then integrate PCRE into their own tools and languages, thereby easily providing a rich and expressive (and well-known) regex functionality to their users. PCRE is now used in popular software such as PHP, Apache Version 2, Exim, Postfix, and Nmap.

(@ Anyone out there looking for tips on regex in any language: check out that book, it is fantastic!)

Just throwing that out there for consideration. I haven't looked closely, but I did notice there are some C++ wrappers posted on that FTP site. Might be worth a look.

What size does it compile to? The last time I tried a similar package it compiled to 1-2MB which was no good for us.

The file sizes of the binaries are as follows... (the zip at the top is what contained those files)

I think that a plugin might be useful for this rather than direct inclusion in the source. What do you think Jon?

JS

Edited by JSThePatriot

AutoIt Links

File-String Hash Plugin Updated! 04-02-2008 Plugins have been discontinued. I just found out.

ComputerGetInfo UDF's Updated! 11-23-2006

External Links

Vortex Revolutions Engineer / Inventor (Web, Desktop, and Mobile Applications, Hardware Gizmos, Consulting, and more)

A plugin is definitely possible, yeah. But built-in is still something good to have (for use in window titles and such, not just StringRegExp commands).
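
To make the window-title idea concrete, here is a rough sketch of what matching titles against a regular expression could look like. It is written in Python with its built-in `re` module, not in AutoIt and not with PCRE, and the list of titles and the `find_windows` helper are made up purely for illustration.

```python
import re

def find_windows(titles, pattern):
    """Return the titles that match `pattern` (hypothetical helper)."""
    regex = re.compile(pattern)
    return [title for title in titles if regex.search(title)]

# Hypothetical window titles, for illustration only.
titles = [
    "Untitled - Notepad",
    "Error 1603 - Setup",
    "Inbox - Mail",
    "Error 5 - Setup",
]

# One pattern covers every "Error <number> - Setup" dialog,
# which a plain substring match on the title cannot express.
errors = find_windows(titles, r"^Error \d+ - Setup$")
```

A substring match on "Error" would also catch unrelated windows; the anchored pattern only accepts titles of that exact shape.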

I see. Well, that would be a call for you and the devs to look further into.

Please post here with the final decision. I am trying not to get too far ahead of myself. I have a lot of projects, but I see this one as another worthwhile one that I would take on. I do need to create a public projects list so people know what I am currently working on.

JS

More than a year ago I wrote an interface between PCRE and AutoHotkey. The complete DLL (i.e. Phil Hazel's code plus my own: mainly a replace function, since PCRE only does the matching, and a few wrappers) was ~23 KB upx'ed; the whole thing was called via DllCall(). See this thread.
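
As a rough illustration of how a replace function can be layered on top of an engine that only matches, here is a minimal sketch. It is in Python and uses the built-in `re` module as a stand-in for the matching engine; it is not the DLL code mentioned above, and `replace_all` is a name invented for this example.

```python
import re

def replace_all(pattern, template, text):
    # Build a global replace out of repeated "find the next match" calls.
    # `template` may refer to captured groups as \1, \2, and so on.
    regex = re.compile(pattern)
    pieces = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)          # the only engine call: match from pos
        if m is None:
            break
        pieces.append(text[pos:m.start()])   # copy the unmatched text through
        pieces.append(m.expand(template))    # fill in the group references
        if m.end() == m.start():
            # Empty match: copy one character and step past it,
            # otherwise the loop would never advance.
            pieces.append(text[m.start():m.start() + 1])
            pos = m.end() + 1
        else:
            pos = m.end()
    pieces.append(text[pos:])                # copy whatever follows the last match
    return "".join(pieces)

swapped = replace_all(r"(\w+)@(\w+)", r"\2 at \1", "bob@example, amy@test")
```

The engine only ever answers "where is the next match and what did the groups capture"; stitching the output string together is the wrapper's job.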

PCRE is a lightweight, stable and well-tested library, used by apps like Apache or PHP. It's also well-documented and under constant development. Its license is a BSD license which means that including it in binary form should pose no problems.

So it might be a good replacement for the RE stuff that's currently in AU3. (I was always wondering why you guys were trying to re-invent the wheel here.) I certainly wouldn't do the interface exactly as I did it back then in '05, but my code is at least a proof of concept.

Funny coincidence that I am currently testing a few RE functions of my own to replace the AU3 built-ins... the backreferencing especially is just too buggy. I will release something later but it won't be for general use.

EDIT: checked size, the DLL (all included) is 24,064 bytes (as I said, upx'ed).

Edited by thomasl

I think you more or less hit it by accident as to why the wheel was re-invented. First, as Jon said, size. Second, license issues. AutoIt isn't fully open-source so using a library and compiling it into AutoIt would cause license issues. That leaves using a DLL but that isn't really an option either because it adds a dependency people have to lug around in addition to their script.

I think you more or less hit it by accident as to why the wheel was re-invented. First, as Jon said, size. Second, license issues. AutoIt isn't fully open-source so using a library and compiling it into AutoIt would cause license issues.

Hit it by accident? I don't think so, matey.

Anyway... size may or may not be an argument here. I have no idea how much code Nutster's regex stuff generates. Then again, even if the PCRE stuff (with its ~23 kb) is 10, 12 k bigger, this isn't an overly high price to pay for a *working* regex implementation. IMO, of course.

As to licensing issues... it seems you have not read (or perhaps not understood) the PCRE license. Read it before you post such an unedifying remark. FYI:

PCRE LICENCE
------------
PCRE is a library of functions to support regular expressions whose syntax
and semantics are as close as possible to those of the Perl 5 language.

Release 6 of PCRE is distributed under the terms of the "BSD" licence, as
specified below. The documentation for PCRE, supplied in the "doc"
directory, is distributed under the same terms as the software itself.

The basic library functions are written in C and are freestanding. Also
included in the distribution is a set of C++ wrapper functions.

THE BASIC LIBRARY FUNCTIONS
---------------------------
Written by:    Philip Hazel
Email local part: ph10
Email domain:    cam.ac.uk

University of Cambridge Computing Service,
Cambridge, England. Phone: +44 1223 334714.

Copyright (c) 1997-2006 University of Cambridge
All rights reserved.


THE C++ WRAPPER FUNCTIONS
-------------------------
Contributed by:   Google Inc.

Copyright (c) 2006, Google Inc.
All rights reserved.


THE "BSD" LICENCE
-----------------
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
      this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    * Neither the name of the University of Cambridge nor the name of Google
      Inc. nor the names of their contributors may be used to endorse or
      promote products derived from this software without specific prior
      written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

End

Hit it by accident? I don't think so, matey.

I'm not sure what this remark is supposed to mean. You didn't state anything that conveyed you knew all the reasons. In fact, you stated that you wondered why we were re-inventing the wheel. To me, that reads like somebody who doesn't know our reasons. Given that our reasons have never appeared outside a private forum, I think it's pretty safe to say you did not know our reasons. You may have had conjectures, which I do not doubt were pretty accurate. Nevertheless, I say accident and I do not change it. There may be more reasons that we do not change than you think. I merely confirmed the "big ones".

As to licensing issues... it seems you have not read (or perhaps not understood) the PCRE license. Read it before you post such an unedifying remark. FYI:

I did not read the license. I also don't recall specifying any library in particular with my statement. If a library has a compatible license and is small enough, I think it would be considered if it's decided to go down the route of using a 3rd-party library. If PCRE is that library, so be it. If not, oh well. My statement about licenses was not directed at anything in particular, so posting the BSD license was pointless from the standpoint of proving a statement of mine wrong. I never qualified my statements by saying "PCRE is incompatible due to license issues".

  • Administrators

Hit it by accident? I don't think so, matey.

Anyway... size may or may not be an argument here. I have no idea how much code Nutster's regex stuff generates. Then again, even if the PCRE stuff (with its ~23 kb) is 10, 12 k bigger, this isn't an overly high price to pay for a *working* regex implementation. IMO, of course.

The pcre3.dll posted above was ~200 KB. Is that not the right one?

The pcre3.dll posted above was ~200 KB. Is that not the right one?

Maybe it can be streamlined. I just got a pre-compiled binary.

JS

I got the dll down to 69KB which should be around the same size as would be added to AutoIt if I merged it in. The license looks OK so it's worth a go I think.

You can get the size even further down by getting rid of several unneeded functions ("unneeded" in the context of AU3). I never did that for my DLL because I thought the few kb were not worth the trouble. But if this goes into the AU3 exe... that's a different matter.

There's also full Unicode support built into PCRE: as long as AU3 is not fully Unicode enabled, at least parts of this could go as well. Might save another 20 kb or so, for various tables and support functions.

A competent programmer would probably need a couple of days to sort all this out and produce a working prototype but I think it would be worthwhile. I have used this library for many years now and it always was reliable and fast (though perhaps not as fast as the RE implementation in Perl).

  • Administrators

I've got a test exe compiled and the size looks good. I'm "attempting" to add it to AutoIt to test (we already have a couple of regexp implementations in there that we can switch on and off for testing). I really don't understand regexps though, so it will be hard going.

Excellent timing: I'm in the middle of parsing html-files with a billion StringInStr, StringLeft, StringMid, etc.

So if you have an executable that I can work with, please let me know. I would really like to replace all the searching with a few regular expressions.
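
To show the kind of saving involved, here is a small side-by-side sketch. It is in Python rather than AutoIt, uses Python's built-in `re` module rather than PCRE, and the HTML snippet and both function names are invented for this example. A regular expression is not a full HTML parser, so treat this only as an illustration of replacing manual index arithmetic.

```python
import re

# Hypothetical input, for illustration only.
html = ('<p><a href="http://example.com/a">A</a> and '
        '<a href="http://example.com/b">B</a></p>')

def links_with_string_functions(text):
    """Scan with find() and slicing, roughly how StringInStr/StringMid code ends up."""
    found = []
    marker = 'href="'
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        start += len(marker)            # skip past the marker itself
        end = text.find('"', start)     # closing quote of the attribute
        if end == -1:
            break
        found.append(text[start:end])
        pos = end + 1
    return found

def links_with_regex(text):
    """The same extraction as a single pattern with one capture group."""
    return re.findall(r'href="([^"]*)"', text)
```

Both functions return the same list for this input; the regex version has no loop and no index bookkeeping to get wrong.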

Edited by martijn

I'm "attempting" to add it to autoit to test [.....] I really don't understand regexps though so it will be hard going.

I have more REs here than I can count, so if you have something up and running (even if it's still rough at the edges), I'll be glad to give it a thorough thrashing :)

Jon, I am there with you. I know what RegExp does, but the syntax and everything... I just haven't taken the time to study it.

JS

Jon, I am there with you. I know what RegExp does, but the syntax and everything... I just haven't taken the time to study it.

JS

I had a link to a book in the first post. It is on Safari if anyone has a subscription to that, and Safari also has a free trial period. It really is a FANTASTIC book. My own regexps are getting better every day, and it isn't just for "mastering" them, it is good for learning them too.

Personally, I'd love to see AutoIt become Unicode-compatible, but that's probably such a large change I'll be surprised if anyone will even consider it until v4. I also really like the idea of being able to use regexps in window functions - that would be fantastic for intercepting error messages and filtering by text! I too have been writing parsers, and doing it with just string functions is a real pain to debug or maintain.

The feature I want to see most on this is supporting backreferences so that text can be analyzed and rearranged for reports, or even for code translations.
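
For anyone who has not used backreferences, here is a short sketch of the two usual forms: referring to captured groups in the replacement text (to rearrange a line) and referring to a group inside the pattern itself (to find a repeated word). It uses Python's built-in `re` module as a stand-in; the sample log line is invented, and the AutoIt syntax may well end up different.

```python
import re

# Hypothetical report input, for illustration only.
log_line = "2006-11-23 ERROR Disk full"

# Groups: 1 = year, 2 = month, 3 = day, 4 = level, 5 = message.
pattern = r"(\d{4})-(\d{2})-(\d{2}) (\w+) (.*)"

# Backreferences in the replacement rearrange the captured pieces.
report_line = re.sub(pattern, r"[\4] \3/\2/\1: \5", log_line)

# A backreference inside the pattern: \1 must repeat whatever group 1 matched,
# so this finds a doubled word and keeps a single copy.
deduped = re.sub(r"\b(\w+) \1\b", r"\1", "the the cat sat")
```

Here `report_line` becomes `[ERROR] 23/11/2006: Disk full` and `deduped` becomes `the cat sat`.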

Really, I'm just so thrilled just to have been able to contribute at all to your process, especially as a "resource person" (that's my favorite role in life and the reason I hang out in the Support forum :) ) Is there any way I can be of further help on this?

Excellent timing: I'm in the middle of parsing html-files with a billion StringInStr, StringLeft, StringMid, etc.

So if you have an executable that I can work with, please let me know. I would really like to replace all the searching with a few regular expressions.

I really don't recommend that until we actually decide we want to go this route. Testing is okay but don't write mission-critical applications with any test executables because nothing is final yet.

Personally, I'd love to see AutoIt become Unicode-compatible, but that's probably such a large change I'll be surprised if anyone will even consider it until v4.

It would be very difficult. It basically requires either dropping some supported operating systems or, if we do want to support them, writing a ton of code that Windows already implements for us. Windows 9x doesn't have most of the Unicode functions, so it requires a special download from Microsoft (MSLU, the Microsoft Layer for Unicode). That's fine, and AutoIt would work on Unicode-enabled systems. It wouldn't work on non-Unicode systems, though, because they wouldn't have MSLU installed and so almost every Windows API function would fail. The only way to make AutoIt Unicode and still have it work on versions of Windows without Unicode support is to write wrapper code around nearly every Windows API function we call, to detect whether we can use the Unicode version and, if not, fall back to the ANSI version. That alone would take a lot of effort. Then there is porting the existing code to use WCHAR instead of CHAR. That is probably about as much effort as writing all the wrappers.
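
The wrapper idea described above can be sketched roughly as follows. This is Python, not the C++ that AutoIt is written in, and nothing here calls a real Windows API: `fake_set_title_w` and `fake_set_title_a` are hypothetical stand-ins for the wide and ANSI variants of one function, and `cp1252` is just an assumed ANSI code page.

```python
def make_wrapper(wide_impl, ansi_impl, codepage="cp1252"):
    """Return a callable that prefers the Unicode variant and falls back to ANSI.

    `wide_impl` is None on systems where the Unicode variant is unavailable.
    """
    def call(text):
        if wide_impl is not None:
            return wide_impl(text)       # Unicode text passes straight through
        # Fallback: convert to the ANSI code page; unmappable characters become "?".
        narrow = text.encode(codepage, errors="replace")
        return ansi_impl(narrow)
    return call

# Hypothetical stand-ins for the two variants of one API function.
def fake_set_title_w(text):
    return ("W", text)

def fake_set_title_a(data):
    return ("A", data)

# A system with Unicode support, and one without it.
set_title_unicode = make_wrapper(fake_set_title_w, fake_set_title_a)
set_title_legacy = make_wrapper(None, fake_set_title_a)
```

One such wrapper would be needed for nearly every API function that takes or returns text, which is where the effort estimate above comes from.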