
Regular Expression to select text over multi-lines



Text in a file, read into var with fileread:

<>
<>
<>
<>
<
J please look
>
<>
<>
<>

Hi, 

I want a RegExp that selects the text around 'please', back to the previous < and forward to the next >. I can select the single line of text. When I add (?s), it selects the whole text. I think I need to make it ungreedy with (?U); that seems to make it ungreedy after 'please', but it still selects all the previous lines.

$sPattern = "(?s)<.*please.*>"            ; 1
$sPattern = "(?s)<(?U).*please.*>"        ; 2
$sPattern = "(?s)<(?U).*please(?U).*>"    ; 3
$sAry = StringRegExp($sHTML, $sPattern, 3)

 


Mikell, thanks for that. I've had a quick play with it and must finish now for today (UK 23:30). It raises some questions:

1 - the .* at each end, outside the <> - what are they doing? I don't want anything outside the <>.

2 - I'm using your pattern in StringRegExp with flag 3, and the selected text doesn't include the immediate < and > (it has all the text up to those; I added a few characters to make sure). Why aren't the <> selected if they are in the pattern? (This agrees with what gets replaced using your StringRegExpReplace.)

Richard.

 

 

 


This example allows for "please" being in the first line or the last line. It returns the line containing "please" together with all of the previous line and all of the next line, if they exist.

Note: In my example and in Mikell's, the regular expression pattern matches all of the text in the "test" parameter of StringRegExpReplace(). So the only text returned is what the "replace" parameter specifies, which is "$1". This back-reference refers to the first capture group: the text matched between the first opening parenthesis, counting from left to right, and its matching closing parenthesis.

#cs
<>
<>
<>
<>
<
J please look
>
<>
<>
<>
#ce

;$str = FileRead("1.txt")
$str = StringRegExpReplace(FileRead(@ScriptFullPath), "^(?s).*#cs\s+(.+)\s+#ce.*$", "$1") ; Extract test string from this script.
;ConsoleWrite($str  & @CRLF)

$sFind = "please"
$res = StringRegExpReplace($str, '(?s).*?((\V*\v+)?\V*\Q' & $sFind & '\E\V*(\v+\V*)?).*', "$1")

ConsoleWrite($res & @CRLF)
MsgBox(0, "Results", $res)
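
The back-reference behaviour described in the note can be seen in isolation with a much smaller string. This is only a minimal sketch: the sample text and the variable names are made up for illustration and are not from the thread.

```autoit
; Hypothetical sample text, not taken from the thread.
Local $sText = "before <keep this> after"

; The pattern consumes the whole string, so only the content of
; the parenthesised group survives the replacement.
Local $sInner = StringRegExpReplace($sText, "(?s).*<(.*?)>.*", "$1")
ConsoleWrite($sInner & @CRLF) ; keep this

; Moving the parentheses outwards puts the angle brackets inside the group.
Local $sOuter = StringRegExpReplace($sText, "(?s).*(<.*?>).*", "$1")
ConsoleWrite($sOuter & @CRLF) ; <keep this>
```

Whatever the pattern matches outside the group is simply dropped, which is why the leading and trailing .* are needed when StringRegExpReplace is used as an extractor.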

 

9 hours ago, RichardL said:

1 - the .* at each end, outside the <> - what are they doing? I don't want anything outside the <>.

The pattern represents the whole string, and the part to grab is put inside brackets (a capturing group). So the whole text is replaced by the content of this group, which is back-referenced as "$1", as Malkey explained.
 

9 hours ago, RichardL said:

2 -....  Why aren't the <> selected if they are in the pattern? 

Because they are outside the brackets. Just move the brackets to include < and > in the group and they will be grabbed too.

$str = FileRead("1.txt")
; get the wanted part
$res = StringRegExpReplace($str, '(?s).*(<.*?please.*?>).*', "$1")
; remove included newlines
$res = StringRegExpReplace($res, '\R', "")
Msgbox(0,"", $res)

Using StringRegExp with flag 3 is a little different. You must then specify that the characters grabbed around 'please' must not be < before it or > after it, by using a negated character class.

$str = FileRead("1.txt")
; using StringRegExp w/ flag 3
$res = StringRegExp($str, '(?s)<[^<]*please[^>]*>', 3)
; remove newlines
$res[0] = StringRegExpReplace($res[0], '\R', "")
Msgbox(0,"", $res[0])
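
To see why the negated class is needed with flag 3, it may help to compare it with a lazy pattern on the same input. A regex match always begins at the leftmost position where a match is possible; making a quantifier lazy (with ? or (?U)) only changes how far the match extends to the right. The sketch below assumes sample text shaped like the example at the top of the thread; the variable names are illustrative only.

```autoit
; Sample text modelled on the thread's example (assumed layout).
Local $sText = "<a>" & @CRLF & "<b>" & @CRLF & "<" & @CRLF & _
        "J please look" & @CRLF & ">" & @CRLF & "<c>"

; Lazy quantifiers: the match still starts at the FIRST "<" in the text.
Local $aLazy = StringRegExp($sText, "(?s)<.*?please.*?>", 3)

; Negated classes: no further "<" may appear before "please", no ">" before the closing one.
Local $aTight = StringRegExp($sText, "(?s)<[^<]*please[^>]*>", 3)

; Strip the line breaks before printing, as in the post above.
If IsArray($aLazy) Then ConsoleWrite("lazy:  " & StringRegExpReplace($aLazy[0], "\R", "") & @CRLF)
If IsArray($aTight) Then ConsoleWrite("tight: " & StringRegExpReplace($aTight[0], "\R", "") & @CRLF)
; lazy:  <a><b><J please look>
; tight: <J please look>
```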

Edit
Please note that there are several ways to skin this cat  :)

Edited by mikell
  • 2 weeks later...

It took me a few days to get back to this. Your patterns worked well on the example text. When I looked at the actual text again, the exclusion that stops the match from starting at the first <P needed to be a string, <P, not just a single character as in [^<]. I did some Googling and it looked hard. Then I realised I could limit the selection to only the immediately surrounding tags by using .{1,90} instead of .* . Not a very precise way to skin the cat, but it's working. I've learned a few things, thanks.
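
For anyone who finds this later: one common way to exclude a whole string rather than a single character is a negative lookahead checked before every character, (?:(?!<P).)* . The actual text was never posted in this thread, so the sample below is purely hypothetical, and the pattern is a sketch of the technique rather than a drop-in replacement for the .{1,90} approach.

```autoit
; Hypothetical sample: other "<" characters sit between the wanted <P and "please",
; so [^<]* cannot be used to reach it.
Local $sText = "<P>first</P>" & @CRLF & _
        "<P>second <B>please</B> look</P>" & @CRLF & _
        "<P>third</P>"

; (?:(?!<P).)*? consumes one character at a time, but never steps onto
; a position where another "<P" begins, so the match cannot span two <P tags.
Local $sPattern = '(?s)<P(?:(?!<P).)*?please.*?>'
Local $aHit = StringRegExp($sText, $sPattern, 3)

If IsArray($aHit) Then ConsoleWrite($aHit[0] & @CRLF) ; <P>second <B>please</B>
```

Unlike a fixed length limit, this does not depend on how far 'please' is from the tag, but it is slower on large inputs because the lookahead runs at every character.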
