zfisherdrums

Reg Exp to capture specific section of document

#1 ·  Posted (edited)

Goal:

Using StringRegExp, obtain the contents of a specific section in a larger document.

Notes:

> The sections are denoted by a section header and ended with two newline sequences ( see examples below ).

> For brevity, I am only trying to capture the Differences section here, but I will use a similar pattern to obtain the contents of "Missing Baseline Files", "Inside Tolerance", etc.

> The C# regex engine returned the section successfully with this pattern: Differences\n-+\n.*?^\n

Problem:

The problem is that I cannot replicate the equivalent of the C# pattern, specifically how to tell PCRE to match the two newline sequences at the end of the section. I have been able to translate up to here: Differences\R-+\R.*?, but that fails to return the contents of the section. What do I fill in the blank to make this behave as expected?

Differences\R-+\R.*?_____

Context:

Here is the document :

CODE
Missing Baseline Files
----------------------
05\Annual1.txt
06\Annual1.txt

Differences
------
01\Annual1.txt
02\Annual1.txt
03\Annual1.txt
04\Annual1.txt

Inside Tolerance Only
---------------------
07\Annual1.txt
08\Annual1.txt

This is what I want to extract:

Differences
------
01\Annual1.txt
02\Annual1.txt
03\Annual1.txt
04\Annual1.txt

This is what I'm getting using Differences\R-+\R.*?

Differences
------
Edited by zfisherdrums
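
For reference, the likely reason the last pattern stops at the dashes: a lazy .*? at the very end of a pattern has nothing after it that it must reach, so it settles for zero characters. Giving it a terminator (a blank line, or the end of the text) makes it grow to the end of the section. Below is a minimal sketch in Python rather than AutoIt. Python's re module has no \R, so \r?\n stands in for it (LF and CRLF only), and the sample text assumes a single blank line between sections. In PCRE the analogous pattern would presumably be (?s)Differences\R-+\R.*?(?=\R\R|\z), which is untested here.

```python
import re

# Sample report; assumes exactly one blank line separates the sections.
DOC = r"""Missing Baseline Files
----------------------
05\Annual1.txt
06\Annual1.txt

Differences
------
01\Annual1.txt
02\Annual1.txt
03\Annual1.txt
04\Annual1.txt

Inside Tolerance Only
---------------------
07\Annual1.txt
08\Annual1.txt
"""

NL = r"\r?\n"  # stand-in for PCRE's \R (covers LF and CRLF only)

# Trailing lazy .*? with nothing after it: matches zero characters.
header_only = re.search(rf"Differences{NL}-+{NL}.*?", DOC, re.DOTALL)

# Same pattern, but the lazy part must now reach a blank line or the end of text.
whole = re.search(
    rf"Differences{NL}-+{NL}.*?(?={NL}{NL}|(?:{NL})?\Z)", DOC, re.DOTALL
)

print(repr(header_only.group(0)))  # header and dashes only
print(whole.group(0))              # header, dashes, and the four file lines
```

The lookahead keeps the blank line out of the match, so the result ends right after the last file name.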


Here ya go:

(Differences\s?\s?(?:\n.*)+?)\s?\n?\s?\n?Inside

I've got a nice SRE tester in my sig if you want to use it :P


Thanks for the tool link in your sig; I'm trying it out as I type (?)!

The part of the problem I failed to mention is that I want to capture the contents without specifying the name of the following section. Put another way, how would one obtain the contents without having to mention "Inside"? I'm looking for a generic approach, as I'll be using the pattern to obtain contents from several other sections. Make sense?
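
One way to avoid naming the next section is to end the match on a blank line (or the end of the text) and pass the header in as a parameter. A rough sketch of that idea in Python, not AutoIt: extract_section is a hypothetical helper, \r?\n stands in for PCRE's \R, and the sample text assumes sections are separated by one blank line.

```python
import re

NL = r"\r?\n"  # stand-in for PCRE's \R (covers LF and CRLF only)

def extract_section(text, header):
    """Return the lines listed under `header`, or None if the header is absent."""
    pattern = (
        rf"^{re.escape(header)}{NL}"      # the header line itself
        rf"-+{NL}"                        # the dashed underline
        rf"(.*?)"                         # section body, matched lazily
        rf"(?={NL}{NL}|(?:{NL})?\Z)"      # stop before a blank line or end of text
    )
    m = re.search(pattern, text, re.DOTALL | re.MULTILINE)
    return m.group(1).splitlines() if m else None

# Hypothetical sample; assumes one blank line between sections.
DOC = r"""Differences
------
01\Annual1.txt
02\Annual1.txt

Inside Tolerance Only
---------------------
07\Annual1.txt
08\Annual1.txt
"""

print(extract_section(DOC, "Differences"))
print(extract_section(DOC, "Inside Tolerance Only"))
```

Because the terminator is a lookahead, the same helper also works for the last section, where no further header follows.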


Yes it does. It seems like it might be easier to just get an array of the items?

\d\d\\[[:alnum:]\.\-\_]*

Put () around the part you want to capture. Right now it grabs "01\Text.extension".

Ex:

\d\d\\([[:alnum:]\.\-\_]*)

returns "Text.extension"

SRE with Flag 3. Let me know if this solves it for you.

Szh
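
To illustrate the difference the parentheses make, here is a small Python sketch. It is only an analogy: re.findall plays the role of a global-match call that returns an array, and since Python's re does not support POSIX classes such as [[:alnum:]], the character class is spelled out as [A-Za-z0-9._-]. The sample lines are taken from the example document.

```python
import re

# Two item lines from the example document.
lines = r"""01\Annual1.txt
02\Annual1.txt"""

# Without a group, each result is the whole match (folder and file name).
whole_items = re.findall(r"\d\d\\[A-Za-z0-9._-]*", lines)

# With a group, each result is only the captured part (the file name).
names_only = re.findall(r"\d\d\\([A-Za-z0-9._-]*)", lines)

print(whole_items)
print(names_only)
```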


#5 ·  Posted (edited)

\d\d\\[[:alnum:]\.\-\_]*
Ok...cool...I'm close. Using your example with Flag set to 4, I'm able to procure the elements in the array ( including the parent folder in the string ).

I modified the example document in my original posting to be more representative of what happens in our domain. You see, we can potentially have lines that match this pattern in other sections as well. So how does one grab just the section they want and no more?

I realize now that I've made a few assumptions in my description of the problem. For that, I apologize and thank you for any/all the time you've spent helping me with this.

Zach...

PS: If it means what I think it does, I'm a JF too.

Edited by zfisherdrums
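
One possible way to keep item matches from leaking across sections is to split the document into blocks first and only then apply the item pattern inside each block. A hedged sketch in Python, not AutoIt: sections_by_header is a hypothetical helper, it assumes sections are separated by at least one blank line and that every header is underlined with dashes, and the character class [A-Za-z0-9._-] replaces [[:alnum:]\.\-\_] because Python's re lacks POSIX classes.

```python
import re

ITEM = r"\d\d\\[A-Za-z0-9._-]*"  # e.g. 01\Annual1.txt

def sections_by_header(text):
    """Map each section header to the item lines found directly beneath it."""
    # Normalise CRLF so a blank line is always two consecutive "\n".
    text = text.replace("\r\n", "\n").strip()
    result = {}
    for block in re.split(r"\n{2,}", text):
        lines = block.split("\n")
        # A section is a header line followed by a line made only of dashes.
        if len(lines) >= 2 and set(lines[1]) == {"-"}:
            result[lines[0]] = [ln for ln in lines[2:] if re.fullmatch(ITEM, ln)]
    return result

# Sample from the thread; assumes one blank line between sections.
DOC = r"""Missing Baseline Files
----------------------
05\Annual1.txt
06\Annual1.txt

Differences
------
01\Annual1.txt
02\Annual1.txt
03\Annual1.txt
04\Annual1.txt

Inside Tolerance Only
---------------------
07\Annual1.txt
08\Annual1.txt
"""

sections = sections_by_header(DOC)
print(sections["Differences"])
```

Lines that match the item pattern in "Missing Baseline Files" or "Inside Tolerance Only" end up under their own headers, so looking up "Differences" returns only its four entries.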
