Temil2005

need help with StringRegExp

15 posts in this topic

First, I want to say that I'm a noob when it comes to scripting, so please, be gentle =)

I figure what I'm trying to do is probably really simple; I'm just missing some basic code to do it.

$asResult = StringRegExp("41 [Thread-437] DEBUG", '(?:Thread-)([0-9]{1,6})(?:])', 1)
If @error = 0 Then
    MsgBox(0, "Thread Number", $asResult[0])
EndIf

Basically, I'm trying to pull only the Thread number. Now, this all seems to work, but is there any way to exclude the values held in a variable as well?

Let's say I have the following variable already set: $VAR = "223,220,442" (I can change the separator to something other than commas, such as |, or whatever helps with the code).

Is there any way to exempt all the numbers in that VAR from the StringRegExp? I know I can do an If statement afterwards, but when I'm scanning through 20-30 MB files it's a very long process, because it's still finding ALL the threads and has to compare them. I just want it to skip all the threads that have already been gotten and find the next one in the text file I'm working with. Is this possible?


#2 ·  Posted (edited)

One way you can exclude certain numbers, after grabbing the Thread # as you have already, is to just do an 'If StringInStr' test. The trick is to surround the numbers in both the exclude-string variable and the Thread # itself with something that isn't a number. Simple enough:

Local $aPCREResult,$bMatch,$sExcludeNums="|14|1567|24|4372|"        ; exclude 14, 1567, 24, and 4372
Local $sTestPattern="41 [Thread-437] DEBUG"
$aPCREResult=StringRegExp($sTestPattern,'Thread-(\d{1,6})\]',1)
; Note how I surround the Thread # with '|' -> this prevents '67' from matching '1567' etc.
If @error Or StringInStr($sExcludeNums,'|'&$aPCREResult[0]&'|') Then
    $bMatch=False
Else
    $bMatch=True
EndIf
ConsoleWrite("Match result:"&$bMatch&@CRLF)

*edit: My apologies for not giving you a StringRegExp answer, but I think in this case a 2nd test is better than a convoluted regular expression (with something like '(#|##|###)')
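
For comparison, here is a rough sketch of what a regex-only exclusion could look like, using a negative lookahead built from the exclusion list. This is not from the original posts: the variable names are made up for illustration, and it assumes the list holds only digits separated by '|'.

```autoit
; Sketch only: fold the exclusion list into the pattern with a negative lookahead.
; $sExclude and $sLine are hypothetical names used for this illustration.
Local $sExclude = "223|220|442" ; thread numbers to skip, '|'-separated, digits only
Local $sLine = "41 [Thread-437] DEBUG"

; (?!(?:223|220|442)\]) rejects the match when one of the excluded numbers is
; followed directly by ']', so near-misses such as 22 or 4420 still get through.
Local $sPattern = 'Thread-(?!(?:' & $sExclude & ')\])(\d{1,6})\]'

Local $aResult = StringRegExp($sLine, $sPattern, 1)
If @error Then
    ConsoleWrite("No non-excluded thread found" & @CRLF)
Else
    ConsoleWrite("Thread number: " & $aResult[0] & @CRLF)
EndIf
```

The pattern has to be rebuilt whenever the list changes, and a long list makes a long alternation, so whether this beats the StringInStr test on large files would need to be measured.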

Edited by Ascend4nt

Thank you for the help, this does what I want. While I was waiting for a reply and trying to figure it out on my own, I found 2 other things I have questions on, if you have time to answer =)

1. I know that you use [] as ORs for the StringRegExp. In the script you provided, I tried to add in one more search (example: search for [MAIN] and for the next thread number, and report back which comes first). I tried this and it doesn't seem to be working. Am I doing this wrong?

$aPCREResult=StringRegExp($sTestPattern,'[{\[main]}(\[Thread-(\d{1,5})\]',1))]

2. I assume that appending to an existing variable is as simple as $sExcludeNums=$sExcludeNums$aPCREResult[0]? I'm looking through the help file to find out. I'm more familiar with AutoHotkey, but I'm tired of some of the limitations of that program and want to learn more about AutoIt, which seems more robust.

thanks again.

#4 ·  Posted (edited)

1. I know that you use [] as ORs for the StringRegExp. In the script you provided, I tried to add in one more search (example: search for [MAIN] and for the next thread number, and report back which comes first). I tried this and it doesn't seem to be working. Am I doing this wrong?

$aPCREResult=StringRegExp($sTestPattern,'[{\[main]}(\[Thread-(\d{1,5})\]',1))]
OR (alternation) in regular expressions is written '|', and parentheses group the alternatives, as in '(main|Thread)'. That's really got nothing to do with my choice of '|' as a separator though; that was just a random choice, and it could easily be ';' or whatever you like.

As for '[' and ']', those are reserved for character classes like the one you used yourself ("[0-9]"). A class matches a single character from the set, not a whole word. Brackets almost always need to be escaped with a '\' when you mean a literal bracket, and it is safest to do the same for '{' and '}' (you would put '\{' and '\}'). So you'll need to escape any brackets that are actual literal characters within your expression, which is one reason it fails. But without seeing an example of the output, it's hard to lead you in the right direction with that one.
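
To make the difference concrete, here is a small sketch that runs a character class and an alternation against the same sample line. The sample string and variable names are assumptions made for this illustration.

```autoit
; Sketch: a character class matches ONE character, alternation matches whole words.
Local $sLine = "12:34:28,057 [Thread-437] DEBUG"

; [(main)(Thread-)] is just a set of single characters: ( ) m a i n T h r e d -
; With flag 1 and no capture groups, element [0] holds the whole match,
; which here can only ever be one character long.
Local $aClass = StringRegExp($sLine, '[(main)(Thread-)]', 1)
If Not @error Then ConsoleWrite("Class matched: " & $aClass[0] & @CRLF)

; Escaped literal brackets around a group of alternatives: main OR Thread-<digits>.
Local $aAlt = StringRegExp($sLine, '\[(main|Thread-\d+)\]', 1)
If Not @error Then ConsoleWrite("Alternation matched: " & $aAlt[0] & @CRLF)
```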

2. I assume that appending to an existing variable is as simple as $sExcludeNums=$sExcludeNums$aPCREResult[0]? I'm looking through the help file to find out. I'm more familiar with AutoHotkey, but I'm tired of some of the limitations of that program and want to learn more about AutoIt, which seems more robust.

Appending strings is done using '&' or '&='. However, if you are adding to the exclusion list I provided, you'll want to also add the separator character:

$sExcludeNums&=$aPCREResult[0]&'|'

*edit: for #1 I meant I would need to see the data *not* the output of the function call.

Edited by Ascend4nt

#5 ·  Posted (edited)

umm, kind of lost ya a little on the first one =P .. but here is an example of the data ...

2010-08-01 21:15:26,187 [main] DEBUG com.rita.JCGatewayInstance - JCGatewayInstance.getInstance

2010-09-02 12:34:28,057 [Thread-437] DEBUG com.rita.JCMessage - Building message...

So, the final outcome I'm after is to read that data, look for [main] or the specified thread number (which is pulled using the script already, thanks for that), and return whichever comes first. This seems to work great now with the code you provided for all lines with a thread, but I'm trying to figure out how to make the StringRegExp look for main OR thread and do different things based on the outcome. I'm trying to do it all within one StringRegExp if possible, because there are over 200,000 lines, each with either thread or main in them. I'm trying to find only the first occurrence of the NEXT marker, as searching every single line with thread in it takes forever, lol. Then I'm pulling all the data between the 2 positions and dropping that into a file, if that makes any sense?

On a previous note, is it possible for me to use the following within the StringRegExp to exclude the variables, to speed it up as well?

[^ ... ]

I know you can use that to exclude text, classes, etc., but is it possible to use that for excluding variable ranges? Just an idea.

Edited by Temil2005

Well, it's not letting me edit my last message, so here is a newer one.

After some testing, I understand what you are saying about the \'s in there, to allow for searching for the brackets. The part I don't understand is why something like this doesn't work.

$aPCREResult=StringRegExp($sTestPattern,'[(main)(Thread-)]',1)

I mean, to my understanding, since they are within the [ and ], it should be looking for either MAIN or THREAD-, right? I'm figuring I'm wrong, and that's why it's not working, lol. But that's basically what I'm trying to do: find either main or thread. Of course, it gets more complicated when I add in looking for the thread ID number and so on, but even simplified, I can't get the basics to work =/

#7 ·  Posted (edited)

I'm sorta having trouble grasping what it is you want to do, but maybe (possibly?) the below may help. Note a few things:

-since there's a sub-capture group for thread #, *if* the first element returned is NOT 'main' then what is captured is 'Thread-####', followed by a 2nd element at [1] which is the number. You can then test that against the exclusion list as shown above (replacing [0] with [1]).

-$iLoc will store the next location in the string to start a search at. If you are searching multiple cases of 'main\Thread', you can use this as the start location ('offset' parameter in StringRegExp). However, be warned that if you capture 'main', and then do another search with the below pattern, you will then capture 'Thread' on the next iteration unless you specifically search for that string and then bypass it. (If that makes sense)

Anyway, see if it helps. I'm still confused lol ;)

Local $iLoc,$aPCREResult
Local $sTestPattern="2010-08-01 21:15:26,187 [main] DEBUG com.rita.JCGatewayInstance - JCGatewayInstance.getInstance"&@CRLF& _
    "2010-09-02 12:34:28,057 [Thread-437] DEBUG com.rita.JCMessage - Building message..."
$iLoc=1
$aPCREResult=StringRegExp($sTestPattern,'\[(main|Thread-(\d+))\]',1,$iLoc)
If @error=0 Then
    $iLoc=@extended
    If $aPCREResult[0]='main' Then
        ConsoleWrite("main found"&@CRLF)
    Else
        ConsoleWrite("Thread found first, number is: "&$aPCREResult[1]&@CRLF)
    EndIf
    ConsoleWrite("Next search position:"&$iLoc&@CRLF)
EndIf
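
If every marker in the data needs to be visited, the single search above can be wrapped in a loop that feeds @extended back in as the offset. The following is only a sketch with made-up sample data, not a tested solution for the real log file.

```autoit
; Sketch: walk every [main] / [Thread-N] marker by reusing @extended as the offset.
; $sData is hypothetical sample data for this illustration.
Local $sData = "21:15:26,187 [main] DEBUG first" & @CRLF & _
        "12:34:28,057 [Thread-437] DEBUG second" & @CRLF & _
        "12:34:29,001 [Thread-12] DEBUG third"
Local $iLoc = 1, $aHit

While 1
    $aHit = StringRegExp($sData, '\[(main|Thread-(\d+))\]', 1, $iLoc)
    If @error Then ExitLoop ; no further markers in the string
    $iLoc = @extended       ; position just past this match, used as the next offset
    If $aHit[0] = 'main' Then
        ConsoleWrite("main" & @CRLF)
    Else
        ConsoleWrite("thread " & $aHit[1] & @CRLF)
    EndIf
WEnd
```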

*edit: P.S. - I use Szhlopp's String Regular Expression Tester for experimentation. I recommend using this or one of the other ones available on the forum. Some are quite helpful to newbs.

Edited by Ascend4nt

#8 ·  Posted (edited)

Well, welcome to my confusion, my job is complete on confusing you! muwhahaha! =)

But seriously, I see what you are saying about this process. It makes sense, how it's scanning for one, then for another, etc. I just have a feeling there is a simpler way of doing what I am trying to do. Here is the full detail; please let me know if this is the best path to go.

1. Pulling data from a file and recording it to a variable. (Here is a small part of the file; we are talking about over 300,000 lines.)

12:34:28,041 [Thread-1] DEBUG com.JCMessage1

12:35:28,041 [Thread-1] DEBUG com.JCMessage2

12:36:28,041 [Thread-6] DEBUG com.JCMessage3

12:37:28,041 [Thread-5] DEBUG com.JCMessage4

12:38:28,041 [Thread-5] DEBUG com.JCMessage5

12:39:28,041 [Thread-main] DEBUG com.JCMessage6

12:40:28,041 [Thread-main] DEBUG com.JCMessage7

12:41:28,041 [Thread-6] DEBUG com.JCMessage8

A couple of things to note:

a - The threads are not always in sequential order

b - The Thread #s and main will be mixed together

c - Each line of data is specifically for that one thread # or main

2. So, taking that data, I would want to take all of the Thread-1 lines and put them together into a single variable. All the Thread-6 lines the same, all the main lines the same, etc. I was able to get this somewhat working by reading line by line, but that takes forever (10-20 minutes per scan) because there are so many of them.

So, is there maybe a better way to handle these? I was going to take the script you helped me with and use that to find the first thread, main, etc., then add code to have it find the next "different" thread or main. Once it finds it, have it pull all the data, or lines, from the first line to the line with the next thread (excluding the next thread's line), record that to a variable, and then remove that data from the current scanned data, so that it makes a new start point to scan from. Then have it repeat this process.

Does that clear up the confusion at all on what I'm trying to do? Like I said, I think most of the confusion is the fact that I think there might be a better way of doing this, and I'm just not familiar enough with AutoIt to fully know my options. I have thought of having AutoIt read through the data, find all the threads, then pull each thread's data one by one. I also thought it might be best to read the data and parse it out as it is recorded. I'm not sure what my options are, and I'm mostly looking for speed, speed, speed. I'm trying to get this process under 30 seconds. I can generate the entire set of data into an array and display that within about 10-15 seconds, if that helps at all.

(Edit: I also wanted to add that the part I was trying to do before, you listed and explained perfectly. I wasn't including | in the search strings; that was one of the issues I wasn't understanding. After looking at what you posted, it all makes sense. Now I'm working on what I listed above, along with trying to add an "AND (&)" into the string search, lol.)

Edited by Temil2005

Temil2005, sorry.. I still can't grasp the entire concept of what you're trying to do (perhaps cuz I'm working on 3 hours sleep). But I do have to say this.. a simple request has now grown into something of a project. And I'm not really going to put that much of my time and effort into something like this (there's what.. 'rentacoder.com' or somesuch for that?).

I'm willing to help on small issues, but I really don't have the time for bigger things - and I'm pretty sure I'd need to know much more to get a feel for what really needs to be done. Sorry... Maybe someone else may better understand you, though. who knows..

That would get returned as an array. So far you have shown a few lines that you need to work with but you have not given any examples of what you want returned from each of those lines.


George

thanks for the new post

#12 ·  Posted (edited)

I didn't get what was going on in this post, but here's my thought on how you're trying to pull only the Thread number

I use this site to check out my regexp = http://www.gethifi.com/tools/regex

You should first off:

1. Return if you have a match, StringRegExp($text, "Thread-[0-9]*", 0) : Returns 1 (matched) or 0 (no match) see help file

2. If matched, return into array, StringRegExp($text, "Thread-[0-9]*", 3) see help file (flag 3 returns every match; flag 1 stops after the first one)

3. Loop in the array of results to get thread numbers

This should be few lines.
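
As a sketch of those three steps (the sample text and variable names are assumptions, and a capture group is added so that only the digits come back):

```autoit
; Sketch of the three steps: test for a match, collect all matches, loop over them.
Local $sText = "12:34:28,041 [Thread-1] DEBUG a" & @CRLF & _
        "12:36:28,041 [Thread-6] DEBUG b"

If StringRegExp($sText, 'Thread-\d+') Then ; flag 0 (the default) returns 1 or 0
    ; Flag 3 returns an array holding the captured group from every match.
    Local $aNums = StringRegExp($sText, 'Thread-(\d+)', 3)
    For $i = 0 To UBound($aNums) - 1
        ConsoleWrite("Thread number: " & $aNums[$i] & @CRLF)
    Next
EndIf
```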

Edited by M a k a v e l !

Ascend4nt

I appreciate the help you provided; it got me started on how to get things going, thank you for that. Explaining everything in detail was just my way of trying to give you the full picture of what I'm trying to do in the end, so as to get some ideas of different paths I can take to accomplish it. I'm not really familiar with AutoIt. I have done what I'm trying to do in AutoHotkey, but it takes nearly 10-20 minutes to run because it goes line by line, so I figured I would give AutoIt a try and see what options there are to speed all this up =) Please don't think I was trying to get you to code any of it for me. I guess I was trying to get some direction on the options so I know where to start, that's all.

GEOSoft, Makavel!

OK, after typing this, I yet again included a lot of data. I broke it down to try to explain the entire process of what I'm doing with the data that gets located. If you don't want to read it all, please skip to the bottom to read my conclusion on what I'm asking is possible to accomplish. Thank you.

To try to answer your posts, I'll try to explain the pattern and see if that clears it up at all.

1. The data I'm working with is pulled from a file. Each line falls into one of 2 categories (category 1 = thread number or main; category 2 = neither). When I say neither, I mean that the line doesn't have a thread or a main, in which case it belongs to the previous thread or main that was recognized.

2. This is the total pattern I'm trying to do.

a. read the file into a variable (or read from the file directly)

b. locate the first line's thread number or MAIN, whichever comes first, and use the beginning of that thread's line as the starting point

c. locate the next "different" thread or MAIN (whichever comes next), and use the beginning of that line as the previous thread's ending point and also the new thread's starting point (saving the data to a variable, or appending to it if it already has data in it)

d. go back to step b and repeat all the way down the file, so that all of the data ends up in variables like the following example

.... %threadnumber%DATA, so the data for thread 5 = 5data, the data for thread 114 = 114data, etc.

I figured out how to display the data and how to parse it after it's pulled into the different variables; I'm just running into a lot of issues with speed. I'm trying to find the fastest way to pull this data into a format that stores each thread's data separately. I know how to do it by scanning and parsing each and every line, but I'm trying to figure out another way, as that takes too long. My main issue right now is scanning for the next "different" thread number. The code that Ascend4nt provided helped me out a lot, but because it's scanning every single thread, it seems like it's taking a lot of time: it finds the thread on every single line, then has to stop, pull the thread number, compare it using the If statement, and move to the next one to check.

I guess what I'm trying to say is: if I could find a way to exclude from the search the thread number that is currently marked as the current thread (the number at the scan starting point), that would allow it to skip all of that thread's lines and locate the next different one, without scanning all of the same thread in between. I guess that is what I'm asking: is it possible, and if not, are there any other features that might work better? Please understand, I'm not asking for a full program to be written. I appreciate the help, or any ideas that someone might be able to come up with. Thank you again.
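
One possible shape for the grouping described in steps a-d is sketched below. It is not from the thread and makes no speed claims: it still visits every line, it keeps one bucket per marker in a Scripting.Dictionary COM object (Windows only), and the sample data and names are assumptions for illustration.

```autoit
; Sketch: one pass over the lines, appending each line to a per-thread bucket.
; $sData is hypothetical sample data; real data would come from FileRead().
Local $sData = "12:34:28,041 [Thread-1] DEBUG com.JCMessage1" & @CRLF & _
        "   continuation line without a marker" & @CRLF & _
        "12:36:28,041 [Thread-6] DEBUG com.JCMessage3" & @CRLF & _
        "12:39:28,041 [Thread-main] DEBUG com.JCMessage6" & @CRLF & _
        "12:41:28,041 [Thread-6] DEBUG com.JCMessage8"

Local $oBuckets = ObjCreate("Scripting.Dictionary")
Local $aLines = StringSplit(StringStripCR($sData), @LF) ; element [0] holds the count
Local $sCurrent = "", $aHit

For $i = 1 To $aLines[0]
    $aHit = StringRegExp($aLines[$i], '\[(main|Thread-\w+)\]', 1)
    ; A line without a marker keeps the key of the previous marked line.
    If Not @error Then $sCurrent = $aHit[0]
    If $sCurrent = "" Then ContinueLoop ; nothing to attach this line to yet
    If $oBuckets.Exists($sCurrent) Then
        $oBuckets.Item($sCurrent) = $oBuckets.Item($sCurrent) & $aLines[$i] & @CRLF
    Else
        $oBuckets.Add($sCurrent, $aLines[$i] & @CRLF)
    EndIf
Next

; Print each bucket under a heading named after its marker.
For $sKey In $oBuckets.Keys()
    ConsoleWrite("=== " & $sKey & @CRLF & $oBuckets.Item($sKey))
Next
```

Because each line is looked at exactly once and nothing is removed from the scanned data, there is no need to track a "current" thread to exclude; whether that is fast enough for 300,000 lines would have to be timed on the real file.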

OK, so I used the regex tool you linked (after I figured it out; liking it a lot, btw! =) Very helpful.

here is what i came up with so far ...

\[Thread-([\d]*[^$var])\]|\[([main]*)\]

That does what I want it to, finding and pulling the thread numbers, etc. that are needed. I just have 2 issues currently that maybe you can help with, or at least point me down the right path to find the best way to do this.

1. The exclusion part [^$var] is excluding way too much, meaning if var is 1 then it will exclude anything with a 1 in it (1, 41, 413, etc.). Is there any way that you know of to make it exclude only when it's EXACTLY the same? Or would that only work in an "If" statement afterwards?

2. I assume that putting a \n at the end of the pattern will cause it to find the next return character?

#15 ·  Posted (edited)

hmmm Temil, couple questions:

1. What will you do with your data once it's formatted, are you gonna scroll thru it... ?

2. Is this the final format you want ?

===Starting point Thread-1

12:34:28,041 [Thread-1] DEBUG com.JCMessage1

12:35:28,041 [Thread-1] DEBUG com.JCMessage2

===Ending point Thread-1

===Starting point Thread-6

12:36:28,041 [Thread-6] DEBUG com.JCMessage3

===Ending point Thread-6

===Starting point Thread-5

12:37:28,041 [Thread-5] DEBUG com.JCMessage4

12:38:28,041 [Thread-5] DEBUG com.JCMessage5

===Ending point Thread-5

===Starting point Thread-main

12:39:28,041 [Thread-main] DEBUG com.JCMessage6

12:40:28,041 [Thread-main] DEBUG com.JCMessage7

===Ending point Thread-main

===Starting point Thread-6

12:41:28,041 [Thread-6] DEBUG com.JCMessage8

Depending on your answers, you may wanna rethink your project and format your file in a way you can dump it in mysql @localhost (.csv to mysql tools). With this kind of structure :

logtime| threadsname| field1| field2| comments|

12:34:28,041| [Thread-1]| DEBUG| com.JCMessage1|

12:35:28,041| [Thread-1]| DEBUG| com.JCMessage2|

12:36:28,041| [Thread-6]| DEBUG| com.JCMessage3|

12:37:28,041| [Thread-5]| DEBUG| com.JCMessage4|

12:38:28,041| [Thread-5]| DEBUG| com.JCMessage5|

12:39:28,041| [Thread-main]| DEBUG| com.JCMessage6|

12:40:28,041| [Thread-main]| DEBUG| com.JCMessage7|

12:41:28,041| [Thread-6]| DEBUG| com.JCMessage8|

If you have enough RAM to open your file and split it into 3 to 4 files, you could use StringReplace($text, " ", "|") to swap the spaces for "|".

With that in your database, you can now just send SQL queries to get what you want. Guessing you have knowledge with SQL queries...

3. What's the real goal of having your file formatted with StringRegExp()? What are you intending to do with the data?
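
For the pipe-delimited layout shown above, a regex replace that only splits off the first three fields may be safer than replacing every space, since the message text itself can contain spaces. A minimal sketch, assuming each line has the shape "time [marker] LEVEL message" (names are hypothetical):

```autoit
; Sketch: turn one log line into a '|'-delimited row without touching spaces
; inside the trailing message text.
Local $sLine = "12:34:28,041 [Thread-1] DEBUG com.JCMessage1"

; $1 = time, $2 = bracketed marker, $3 = level, $4 = rest of the line.
Local $sRow = StringRegExpReplace($sLine, '^(\S+) (\[[^\]]+\]) (\S+) (.*)$', '$1|$2|$3|$4|')
ConsoleWrite($sRow & @CRLF)
```

Lines that do not fit that shape are returned unchanged by StringRegExpReplace, so they would need separate handling before an import.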

Edited by M a k a v e l !
