leuce

Can you please have a look at my regex?

14 posts in this topic

G'day everyone

I'm having trouble writing a regex for StringRegExpReplace. I'm hoping that someone may be able to help me.

The pattern is an HTML table (my source file has hundreds of these table rows, in a single table):

<tr>

<td>asdf asdf asdf</td>

<td>qwer qwer qwer</td>

<td>zxcv zxcv zxcv</td>

<td>poiu poiu poiu</td>

</tr>

The replacement should remove the contents of the 4th cell in each row and put a copy of the 3rd cell there instead, as follows:

<tr>

<td>asdf asdf asdf</td>

<td>qwer qwer qwer</td>

<td>zxcv zxcv zxcv</td>

<td>zxcv zxcv zxcv</td>

</tr>

The following regexp works as expected:

StringRegExpReplace ($fileread, "(<tr(?s).+?<td(?s).+?</td>(?s).+?<td(?s).+?</td>(?s).+?)(<td(?s).+?</td>)((?s).+?)(<td(?s).+?</td>)((?s).+?</tr>)", "\1\2\3\2\5")

However, I need the ability to make changes to the third cell's contents, and for that, I need to specify the contents independently. The following regexp should work (as far as I can see) but it doesn't:

StringRegExpReplace ($fileread, "(<tr(?s).+?<td(?s).+?</td>(?s).+?<td(?s).+?</td>(?s).+?)(<td(?s).+?>)((?s).+?)(</td>)((?s).+?)(<td(?s).+?>)((?s).+?)(</td>)((?s).+?</tr>)", "\1\2\3\4\5\6\3\8\9")

The following should also work, as far as I can see, but it doesn't either:

StringRegExpReplace ($fileread, "(<tr(?s).+?)(<td.+?>)((?s).+?)(</td>)((?s).+?)(<td.+?>)((?s).+?)(</td>)((?s).+?)(<td.+?>)((?s).+?)(</td>)((?s).+?)(<td.+?>)((?s).+?)(</td>)((?s).+?)((?s).+?</tr>)", "\1\2\3\4\5\6\7\8\9\10\11\12\13\14\11\16\17\18")

Can anyone please tell me what is wrong with the lower two regexps? I'd prefer to use the last one because it would allow me to do stuff with every column of the table, instead of only the third column. But right now my need is to do stuff with the third column.

It is possible that there are line breaks within table cells, but I'm assuming that there are no line breaks within the HTML tags themselves. It is also possible that some table cells are empty (i.e. <td></td>) but for the moment I would also be happy with a solution that requires the table cells to have content.
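One likely reason the second and third patterns fail on this sample: `.+?` must consume at least one character, so `<td.+?>` cannot match a bare `<td>`. It swallows the `>` and runs on to the next `>` in the text. The sketch below is an assumption about the cause, not a confirmed diagnosis. It swaps in `<td[^>]*>` for opening tags and `.*?` for cell content, so that attribute-free tags and empty cells both match. The names `$sRow`, `$aLazy`, `$aClass`, `$sPattern` and `$sOut` are illustrative only.

```autoit
; Sample row shaped like the one in the question.
Local $sRow = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>poiu poiu poiu</td>" & @CRLF & _
        "</tr>"

; "<td.+?>" needs at least one character before ">", so on a bare <td>
; it runs past the tag to the next ">" (the end of </td>).
Local $aLazy = StringRegExp($sRow, "(?s)<td.+?>", 1)
ConsoleWrite("lazy dot : " & $aLazy[0] & @CRLF)

; "<td[^>]*>" stops at the tag's own ">" with or without attributes.
Local $aClass = StringRegExp($sRow, "<td[^>]*>", 1)
ConsoleWrite("neg class: " & $aClass[0] & @CRLF)

; Groups: 1 = everything up to and including the 3rd opening tag,
;         2 = 3rd cell content,
;         3 = 3rd closing tag up to and including the 4th opening tag,
;         4 = 4th closing tag through </tr>.
; The old 4th-cell content is matched but not captured.
Local $sPattern = "(?s)(<tr[^>]*>.*?<td[^>]*>.*?</td>.*?<td[^>]*>.*?</td>.*?<td[^>]*>)" & _
        "(.*?)(</td>.*?<td[^>]*>).*?(</td>.*?</tr>)"
Local $sOut = StringRegExpReplace($sRow, $sPattern, "$1$2$3$2$4")
ConsoleWrite($sOut & @CRLF)
```

Because the content is its own group, the 3rd and 4th cells keep their own opening tags. One caveat: the lazy `.*?` between cells can still run into the following row if a row has fewer than four cells.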

Any ideas would be appreciated.

Thanks

Samuel




#2 ·  Posted (edited)

Someone else might take a look at your RegExp. But I suggest you don't try to do the whole thing with a single RegExp. Split things up and it will be a lot easier.

For example, if those tables always have 4 rows: isolate them into an array with a single table per entry, loop over the array entries, changing a single table at a time, and then merge the array back into a single string. (The last two steps can be done inside the loop.)
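A minimal sketch of that split-and-loop idea, reading it as one table row per array entry (an assumption, since the question's file is a single table with many rows). The sample data and the names `$sHTML`, `$aRows`, `$sCellFix` and `$sNewRow` are hypothetical. The merge step uses `StringReplace` on the first remaining copy of each old row, which is only one of several ways to put the pieces back.

```autoit
Local $sHTML = "<table>" & @CRLF & _
        "<tr><td>1</td><td>2</td><td>one two</td><td>x</td></tr>" & @CRLF & _
        "<tr><td>3</td><td>4</td><td>three four</td><td>y</td></tr>" & @CRLF & _
        "</table>"

; Step 1: isolate each <tr>...</tr> block (flag 3 = array of all matches).
Local $aRows = StringRegExp($sHTML, "(?is)<tr[^>]*>.*?</tr>", 3)
If @error Then
    ConsoleWrite("No rows found" & @CRLF)
    Exit
EndIf

; Applied to ONE row at a time, so the pattern cannot run into the next row.
; Groups: 1 = up to the 3rd opening tag, 2 = 3rd content,
;         3 = "</td>...<td>" between cells 3 and 4, 4 = rest of the row.
Local $sCellFix = "(?is)^((?:.*?<td[^>]*>.*?</td>){2}.*?<td[^>]*>)(.*?)(</td>.*?<td[^>]*>).*?(</td>.*)$"

Local $sNewRow
For $i = 0 To UBound($aRows) - 1
    ; Step 2: change a single row.
    $sNewRow = StringRegExpReplace($aRows[$i], $sCellFix, "$1$2$3$2$4")
    ; Step 3: merge it back by swapping the first remaining copy of the old row.
    $sHTML = StringReplace($sHTML, $aRows[$i], $sNewRow, 1)
Next
ConsoleWrite($sHTML & @CRLF)
```

Working on one row at a time keeps each pattern short, and a malformed row only affects itself instead of shifting the match for every row after it.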

---

Plus a general RegExp tip: if you add "(?x)" to the start of your regular expression, you can add blanks (spaces or tabs) to the pattern to make it more readable.

Like:

"(<tr(?s).+?<td(?s).+?</td>(?s) ..."

"(?x) ( <tr(?s) .+? <td(?s) .+? </td> (?s) ..."

Edited by MvGulik
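To illustrate the "(?x)" tip, here is a small sketch with the same pattern written compactly and with extra spacing. The sample cell and the variable names are made up for the example. One thing to watch for in extended mode: an unescaped space in the pattern is ignored, so a space you actually want to match has to be written as `\ ` or `[ ]`, and an unescaped `#` starts a comment.

```autoit
Local $sCell = "<td>zxcv zxcv zxcv</td>"

; Compact and extended (?x) spellings of the same pattern.
Local $sCompact = "(?s)(<td[^>]*>)(.*?)(</td>)"
Local $sSpaced = "(?xs) (<td[^>]*>)  (.*?)  (</td>)"

; Both calls wrap the cell content in square brackets.
ConsoleWrite(StringRegExpReplace($sCell, $sCompact, "$1[$2]$3") & @CRLF)
ConsoleWrite(StringRegExpReplace($sCell, $sSpaced, "$1[$2]$3") & @CRLF)

; Under (?x) the first pattern is really "zxcvzxcv", so it does not match
; the subject; the second keeps a literal space by putting it in a class.
ConsoleWrite(StringRegExp("zxcv zxcv", "(?x) zxcv zxcv") & @CRLF)
ConsoleWrite(StringRegExp("zxcv zxcv", "(?x) zxcv [ ] zxcv") & @CRLF)
```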

"Straight_and_Crooked_Thinking" : A "classic guide to ferreting out untruths, half-truths, and other distortions of facts in political and social discussions."
"The Secrets of Quantum Physics" : New and excellent 2 part documentary on Quantum Physics by Jim Al-Khalili. (Dec 2014)

"Believing what you know ain't so" ...

Knock Knock ...
 


Does this replacement of the 3rd cell's contents happen before or after you copy the contents of the 3rd cell to the 4th?

You may be looking at nested SRERs here and they can be treacherous at the best of times.


George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


#4 ·  Posted (edited)

Is this what you want?

$sHTML = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>poiu poiu poiu</td>" & @CRLF & _
        "</tr>" & @CRLF & _
        "<tr>" & @CRLF & _
        "<td>yxcv yxcv yxcv</td>" & @CRLF & _
        "<td>lkjh lkjh lkjh</td>" & @CRLF & _
        "<td>1qsc 2wdv 3efb</td>" & @CRLF & _
        "<td>aaaa bbbb cccc</td>" & @CRLF & _
        "</tr>"

$newHTML = StringRegExpReplace($sHTML, "(?i)<tr>\s*(<td>.*</td>\s*)(<td>.*</td>\s*)(<td>.*</td>\s*)(<td>.*</td>\s*)</tr>", "<tr>$1$2$3$3</tr>")

MsgBox(0, "Test", $newHTML)

Br,

UEZ

Edited by UEZ



Here are some possibilities.

Local $fileread = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>poiu poiu poiu</td>" & @CRLF & _
        "</tr>"
;$fileread = FileRead("HtmlTable.htm")

Local $iCopyColumnNumber = 3 ; Copy 3rd column paste to next column (4th column)
Local $sRes = StringRegExpReplace($fileread, "(?s)" & _
        "((?:<td.*?</td.+?){" & ($iCopyColumnNumber - 1) & "})" & _
        "(<td.+?)(.+?)(</td.+?" & _
        "<td.+?)(.+?)(</td.+?)(.*?</tr>)", _
        "\1\2\3\4\3\6\7")
ConsoleWrite($sRes & @CRLF & @CRLF)
#cs
$file = FileOpen("HtmlTableA.htm", 2)
FileWrite($file, $sRes)
FileClose($file)
#ce




Local $Replace = "\6 Extra " ; Copy the 3rd column's content into the 4th column with " Extra " appended.
; "\2" copies the 1st column's content;
; "\4" copies the 2nd column's content.
Local $sRes = StringRegExpReplace($fileread, "(?s)" & _
        "(<td.+?)(.+?)(</td.+?" & _
        "<td.+?)(.+?)(</td.+?" & _
        "<td.+?)(.+?)(</td.+?" & _
        "<td.+?)(.+?)(</td.+?)(.*?</tr>)", _
        "\1\2\3\4\5\6\7" & $Replace & "\9\10")
ConsoleWrite($sRes & @CRLF & @CRLF)
#cs
$file = FileOpen("HtmlTableA.htm", 1)
FileWrite($file, $sRes)
FileClose($file)
ShellExecute("HtmlTableA.htm")
#ce


#6 ·  Posted (edited)

Just an addition to what is already there.. with the ability to edit every single <td>.

Local $fileread = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>poiu poiu poiu</td>" & @CRLF & _
        "</tr>"

Local $new_fileread_1 = StringRegExpReplace($fileread,"(<tr>)(\R)(<td>)(\V*)(</td>)\2\3(\V*)\5\2\3(\V*)\5\2\3(\V*)\5\2(</tr>)","\1\2\3\4\5\2\3\6\5\2\3\7\5\2\3\7\5\2\9")

Local $new_fileread_2 = StringRegExpReplace($fileread,"(<tr>)(\R)(<td>)(\V*)(</td>)\2\3(\V*)\5\2\3(\V*)\5\2\3(\V*)\5\2(</tr>)","\1\2\3\4\5\2\3\6\5\2\3\7\5\2\3New Element Here\5\2\9")

;~ \1 = <tr>
;~ \2 = \R - new line. If this doesn't work, try "(\v+|$)". Credit to GEOSoft.
;~ \3 = <td>
;~ \4 = 1st column's content
;~ \5 = </td>
;~ \6 = 2nd column's content
;~ \7 = 3rd column's content
;~ \8 = 4th column's content
;~ \9 = </tr>

MsgBox(0,"",$fileread)
MsgBox(0,"",$new_fileread_1)
MsgBox(0,"",$new_fileread_2)

Edit: Added some remarks.

Edited by Mison

Hi ;)


@Mison

I'm not sure how you managed to test that because \R should not be matching a new line or anything except uppercase R. \r will match a carriage return, or \n will match a linefeed. Metacharacters are case sensitive.


George


#8 ·  Posted (edited)

@GEOSoft

I have tested this pattern and it works (strangely, if what you've said is true). I thought \R would match any newline sequence, i.e. \r\n, \r or \n.

Taken from PCRE 7.8 help file - "Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence."

If it doesn't behave the way I think it does, then replace "\R" with "[\r\n]+"
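A quick way to settle the question on a given AutoIt version is to run the candidates side by side on a string that mixes line endings. This is only a sketch: whether `\R` is supported depends on the PCRE build bundled with your AutoIt, which this thread leaves open, and `$sText` is a made-up sample.

```autoit
; One string containing all three common line endings.
Local $sText = "a" & @CRLF & "b" & @LF & "c" & @CR & "d"

; Each call replaces what the pattern treats as a line break with "|".
ConsoleWrite(StringRegExpReplace($sText, "\R", "|") & @CRLF)         ; any newline sequence
ConsoleWrite(StringRegExpReplace($sText, "\r\n|\r|\n", "|") & @CRLF) ; explicit alternatives
ConsoleWrite(StringRegExpReplace($sText, "[\r\n]+", "|") & @CRLF)    ; runs of CR and/or LF
ConsoleWrite(StringRegExpReplace($sText, "\v+", "|") & @CRLF)        ; runs of vertical whitespace
```

Two design points to keep in mind. `[\r\n]+` and `\v+` match a whole run of line-break characters, so consecutive blank lines collapse into one match, whereas `\R` matches one newline sequence at a time. Also, a pattern that captures the first line break and then reuses it as a back-reference (as `\2` does in the pattern above) requires every later line break in the row to be the same characters as the first.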

Edited by Mison

Hi ;)


There are probably some changes since the last version used for the help file, although Unicode support in the AutoIt version of PCRE is normally poor. I'll play with it. I find that (?:\v+|$) actually works best for me, where \v+ matches any vertical whitespace (such as Chr(10), Chr(11) or Chr(13)) 1 or more times.


George


Okay, I have added remarks to my earlier post. I also changed [^\v] to \V since they are basically the same.


Hi ;)


I suggest you don't try to do the whole thing with a single RegExp. Split things up and it will be a lot easier. For example, if those tables always have 4 rows: isolate them into an array with a single table per entry...

Thanks, yes, that's what I would normally do, but last night I wanted a quick solution to a problem and I thought that the right regex would do the job. Thanks for the spacing tip.


Does this replacement of the 3rd cell's contents happen before or after you copy the contents of the 3rd cell to the 4th?

I don't think it matters. The main point is that when the replacement is finished, the 4th cell's content is gone and the 3rd cell's content exists in both the 3rd and 4th cell.

My regex was very simple initially -- it simply deleted the 4th cell (including <td> tags) and then duplicated the entire 3rd cell (including <td> tags), but then it dawned on me that the 3rd and the 4th cell might have different attributes, so ideally only the content of the cells must be removed and duplicated. Also, a <td> tag can have attributes (e.g. <td width=5>) whereas a </td> tag will always be just </td>.

Another reason why I wanted to be able to specify the content of the third cell as a unit in the regex is so that I can do stuff with it (e.g. if I want to replace all spaces with ###, as in:

<td>asdf asdf asdf</td>

<td>qwer qwer qwer</td>

<td>zxcv zxcv zxcv</td>

<td>1111 2222 3333</td>

to

<td>asdf asdf asdf</td>

<td>qwer qwer qwer</td>

<td>zxcv###zxcv###zxcv</td>

<td>zxcv zxcv zxcv</td>

...and I would be able to do that if the content of each cell is specified separately in the regex). But a loop script is probably better than trying to do everything in the regex.
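The spaces-to-### example can be sketched without packing everything into the replacement string: capture the pieces of a row with `StringRegExp`, transform the 3rd cell's content in ordinary code, and then rebuild the row. The pattern and the names `$sRow`, `$sPattern`, `$aPart` and `$sMarked` are assumptions for illustration, and the sketch handles a single row with at least four cells.

```autoit
Local $sRow = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>1111 2222 3333</td>" & @CRLF & _
        "</tr>"

; Four captured pieces; the old 4th-cell content is matched but not captured.
;   [0] = start of row up to and including the 3rd opening tag
;   [1] = 3rd cell content
;   [2] = 3rd closing tag up to and including the 4th opening tag
;   [3] = 4th closing tag to the end of the row
Local $sPattern = "(?s)^((?:.*?<td[^>]*>.*?</td>){2}.*?<td[^>]*>)(.*?)(</td>.*?<td[^>]*>).*?(</td>.*)$"
Local $aPart = StringRegExp($sRow, $sPattern, 1)

If Not @error Then
    ; Mark the copy that stays in cell 3; the untouched copy goes into cell 4.
    Local $sMarked = StringReplace($aPart[1], " ", "###")
    $sRow = $aPart[0] & $sMarked & $aPart[2] & $aPart[1] & $aPart[3]
EndIf
ConsoleWrite($sRow & @CRLF)
```

Doing the transformation in script code rather than in the replacement string means any string function can be applied to the cell content, not just what a back-reference can express.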

==

You're probably wondering why I want all of this.

The input file is a file from a language translation client in which the 1st and 2nd cell contain reference numbers, the 3rd cell contains source text (to be translated), and the 4th cell is presently empty or has a code (that can be deleted), but should contain the translation in the end.

Translation tools work by overwriting the source text with a translation, so in order to have the translation in the 4th cell, the user has to first copy the source text (from cell 3) into it. The next step would be to ensure that the content of the 3rd and 4th cells are not the same, so that would mean adding temporary junk characters to the 3rd cell so that its content is not translated by the translation tool when it translates the 4th cell.

The deadline for the job was last night, so in the end I just commented out the entire 3rd cell :-) so that the translation tool would ignore it, but for future work I'd like to see if I can refine the script to be more useful in multiple similar circumstances.


Is this what you want?

...

$newHTML = StringRegExpReplace($sHTML, "(?i)<tr>\s*(<td>.*</td>\s*)(<td>.*</td>\s*)(<td>.*</td>\s*)(<td>.*</td>\s*)</tr>", "<tr>$1$2$3$3</tr>")

That code would have worked for the specific files that I worked on last night, yes, because the table cells had no attributes, but in the future the table cells may also have attributes (e.g. <td width=50%>).

Using \s would match 99% of cases, so I'll certainly consider using it (the 1% of cases it won't match are cases in which the HTML author added comments e.g. <!-- cell 1 --> before every table cell... I've seen this happen).
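For the remaining cases, one option is to let the gap between tags be any mix of whitespace and HTML comments, and to allow attributes on every opening tag. The sketch below assumes comments only appear between cells, not inside them. The helper names `$sGap`, `$sOpen` and `$sCell` are hypothetical building blocks, not anything from the posts above.

```autoit
Local $sRow = "<tr>" & @CRLF & _
        "<!-- cell 1 --><td>asdf</td>" & @CRLF & _
        "<!-- cell 2 --><td width=50%>qwer</td>" & @CRLF & _
        "<!-- cell 3 --><td>zxcv zxcv</td>" & @CRLF & _
        "<!-- cell 4 --><td>poiu</td>" & @CRLF & _
        "</tr>"

; Building blocks (hypothetical helper names).
Local $sGap = "(?:\s|<!--.*?-->)*"   ; whitespace and/or comments between tags
Local $sOpen = "<td[^>]*>"           ; opening tag, attributes optional
Local $sCell = $sOpen & ".*?</td>"   ; one whole cell

; Groups: 1 = row start through the 3rd opening tag, 2 = 3rd content,
;         3 = "</td>" + gap + 4th opening tag, 4 = "</td>" + gap + "</tr>".
Local $sPattern = "(?is)(<tr[^>]*>" & $sGap & $sCell & $sGap & $sCell & $sGap & $sOpen & ")" & _
        "(.*?)(</td>" & $sGap & $sOpen & ").*?(</td>" & $sGap & "</tr>)"

; The comments and the width attribute are carried over unchanged;
; only the 4th cell's content is replaced by the 3rd cell's content.
ConsoleWrite(StringRegExpReplace($sRow, $sPattern, "$1$2$3$2$4") & @CRLF)
```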

I'll also experiment with fewer back-references and more literals in the replace field, as you have done.


Here are some possibilities.

I tinkered with your second solution and it works, thanks. I just had to add closing brackets, so that cell attributes are retained and not duplicated:

Local $fileread = "<tr>" & @CRLF & _
        "<td foo=1>asdf asdf asdf</td>" & @CRLF & _
        "<td foo=2>qwer qwer qwer</td>" & @CRLF & _
        "<td foo=3>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td foo=4>poiu poiu poiu</td>" & @CRLF & _
        "</tr>"

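; The first "<td.+?>" is lazy like the others: with (?s), a greedy "<td.+>"
; would span every row of a multi-row file and change only the last one.
Local $sRes = StringRegExpReplace($fileread, "(?s)" & _
        "(<td.+?>)(.+?)(</td.+?" & _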
        "<td.+?>)(.+?)(</td.+?" & _
        "<td.+?>)(.+?)(</td.+?" & _
        "<td.+?>)(.+?)(</td.+?)(.*?</tr>)", _
        "\1\2\3\4\5\6\7\6\9\10")

MsgBox (0, "", $sRes & @CRLF & @CRLF, 0)

Thanks again.
