Can you please have a look at my regex?

leuce · October 27, 2010

G'day everyone

I'm having trouble writing a regex for StringRegExpReplace. I'm hoping that someone may be able to help me.

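The pattern is an HTML table row (my source file has hundreds of these table rows, in a single table):

<tr>
<td>asdf asdf asdf</td>
<td>qwer qwer qwer</td>
<td>zxcv zxcv zxcv</td>
<td>poiu poiu poiu</td>
</tr>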

The replacement should remove the content of the 4th cell of each row and put a copy of the 3rd cell in its place, as follows:

<tr>
<td>asdf asdf asdf</td>
<td>qwer qwer qwer</td>
<td>zxcv zxcv zxcv</td>
<td>zxcv zxcv zxcv</td>
</tr>

The following regexp works as expected:

StringRegExpReplace ($fileread, "(<tr(?s).+?<td(?s).+?</td>(?s).+?<td(?s).+?</td>(?s).+?)(<td(?s).+?</td>)((?s).+?)(<td(?s).+?</td>)((?s).+?</tr>)", "\1\2\3\2\5")

However, I need the ability to make changes to the third cell's contents, and for that, I need to specify the contents independently. The following regexp should work (as far as I can see) but it doesn't:

StringRegExpReplace ($fileread, "(<tr(?s).+?<td(?s).+?</td>(?s).+?<td(?s).+?</td>(?s).+?)(<td(?s).+?>)((?s).+?)(</td>)((?s).+?)(<td(?s).+?>)((?s).+?)(</td>)((?s).+?</tr>)", "\1\2\3\4\5\6\3\8\9")

The following should also work, as far as I can see, but it doesn't either:

StringRegExpReplace ($fileread, "(<tr(?s).+?)(<td.+?>)((?s).+?)(</td>)((?s).+?)(<td.+?>)((?s).+?)(</td>)((?s).+?)(<td.+?>)((?s).+?)(</td>)((?s).+?)(<td.+?>)((?s).+?)(</td>)((?s).+?)((?s).+?</tr>)", "\1\2\3\4\5\6\7\8\9\10\11\12\13\14\11\16\17\18")

Can anyone please tell me what is wrong with the lower two regexps? I'd prefer to use the last one because it would allow me to do stuff with every column of the table, instead of only the third column. But right now my need is to do stuff with the third column.

It is possible that there are line breaks within table cells, but I'm assuming that there are no line breaks within the HTML tags themselves. It is also possible that some table cells are empty (i.e. <td></td>) but for the moment I would also be happy with a solution that requires the table cells to have content.

Any ideas would be appreciated.

Thanks

Samuel

MvGulik · October 27, 2010

Someone else might take a look at your RegExp. But I suggest you don't try to do the whole thing with a single RegExp. Split things up and it will be a lot easier.

Like: if those tables always have 4 rows, isolate them -> array with a single table per entry. Loop over the array entries, changing a single table at a time, and then merge the array back into a single string. (The last two steps can be done inside the loop.)

---

Plus a general RegExp tip: if you add "(?x)" at the start of your regular expression, you can add blanks (spaces or tabs) in the pattern to make it more readable.

Like:

"(<tr(?s).+?<td(?s).+?</td>(?s) ..."

"(?x) ( <tr(?s) .+? <td(?s) .+? </td> (?s) ..."

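To make the tip concrete, here is a minimal sketch of an extended-mode pattern in a replace call. The variable names ($sRow, $sPattern, $sOut) and the bracket-wrapping replacement are only illustrative and not part of the original question. Note that under (?x) a literal space in the pattern has to be written as "\x20" or "[ ]", and "#" starts a comment inside the pattern.

```autoit
; Sample data in the same shape as the rows discussed in this thread.
Local $sRow = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "</tr>"

; (?x) = ignore blanks in the pattern, (?s) = let "." match line breaks too.
; The three groups are: opening tag, cell content, closing tag.
Local $sPattern = "(?xs) ( <td> ) ( .+? ) ( </td> )"

; Wrap the content of every cell in square brackets, just to show the groups.
Local $sOut = StringRegExpReplace($sRow, $sPattern, "\1[\2]\3")
ConsoleWrite($sOut & @CRLF)
```

The blanks between the groups are purely cosmetic here; removing them gives the same pattern.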
Edited October 28, 2010 by MvGulik

GEOSoft · October 27, 2010

Does this replacement of the 3rd cell's contents happen before or after you copy the contents of the 3rd cell to the 4th?

You may be looking at nested SRERs here and they can be treacherous at the best of times.

UEZ · October 27, 2010

Is this what you want?

$sHTML = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>poiu poiu poiu</td>" & @CRLF & _
        "</tr>" & @CRLF & _
        "<tr>" & @CRLF & _
        "<td>yxcv yxcv yxcv</td>" & @CRLF & _
        "<td>lkjh lkjh lkjh</td>" & @CRLF & _
        "<td>1qsc 2wdv 3efb</td>" & @CRLF & _
        "<td>aaaa bbbb cccc</td>" & @CRLF & _
        "</tr>"

$newHTML = StringRegExpReplace($sHTML, "(?i)<tr>\s*(<td>.*</td>\s*)(<td>.*</td>\s*)(<td>.*</td>\s*)(<td>.*</td>\s*)</tr>", "<tr>$1$2$3$3</tr>")

MsgBox(0, "Test", $newHTML)

Br,

UEZ

Edited October 27, 2010 by UEZ

Malkey · October 27, 2010

Here are some possibilities.

Local $fileread = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>poiu poiu poiu</td>" & @CRLF & _
        "</tr>"
;$fileread = FileRead("HtmlTable.htm")

Local $iCopyColumnNumber = 3 ; Copy 3rd column paste to next column (4th column)
Local $sRes = StringRegExpReplace($fileread, "(?s)" & _
        "((?:<td.*?</td.+?){" & ($iCopyColumnNumber - 1) & "})" & _
        "(<td.+?)(.+?)(</td.+?" & _
        "<td.+?)(.+?)(</td.+?)(.*?</tr>)", _
        "\1\2\3\4\3\6\7")
ConsoleWrite($sRes & @CRLF & @CRLF)
#cs
$file = FileOpen("HtmlTableA.htm", 2)
FileWrite($file, $sRes)
FileClose($file)
#ce




Local $Replace = "\6 Extra " ; Copy 3rd column's content into the 4th column with " Extra " added.
; "\2" copies 1st column;
; "\4" copies 2nd column.
Local $sRes = StringRegExpReplace($fileread, "(?s)" & _
        "(<td.+?)(.+?)(</td.+?" & _
        "<td.+?)(.+?)(</td.+?" & _
        "<td.+?)(.+?)(</td.+?" & _
        "<td.+?)(.+?)(</td.+?)(.*?</tr>)", _
        "\1\2\3\4\5\6\7" & $Replace & "\9\10")
ConsoleWrite($sRes & @CRLF & @CRLF)
#cs
$file = FileOpen("HtmlTableA.htm", 1)
FileWrite($file, $sRes)
FileClose($file)
ShellExecute("HtmlTableA.htm")
#ce

Mison · October 28, 2010

Just an addition to what is already there.. with the ability to edit every single <td>.

Local $fileread = "<tr>" & @CRLF & _
        "<td>asdf asdf asdf</td>" & @CRLF & _
        "<td>qwer qwer qwer</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>poiu poiu poiu</td>" & @CRLF & _
        "</tr>"

Local $new_fileread_1 = StringRegExpReplace($fileread,"(<tr>)(\R)(<td>)(\V*)(</td>)\2\3(\V*)\5\2\3(\V*)\5\2\3(\V*)\5\2(</tr>)","\1\2\3\4\5\2\3\6\5\2\3\7\5\2\3\7\5\2\9")

Local $new_fileread_2 = StringRegExpReplace($fileread,"(<tr>)(\R)(<td>)(\V*)(</td>)\2\3(\V*)\5\2\3(\V*)\5\2\3(\V*)\5\2(</tr>)","\1\2\3\4\5\2\3\6\5\2\3\7\5\2\3New Element Here\5\2\9")

;~ \1 = <tr>
;~ \2 = \R - new line. If this doesn't work, try "(\v+|$)". Credit to GEOSoft.
;~ \3 = <td>
;~ \4 = 1st column's content
;~ \5 = </td>
;~ \6 = 2nd column's content
;~ \7 = 3rd column's content
;~ \8 = 4th column's content
;~ \9 = </tr>

MsgBox(0,"",$fileread)
MsgBox(0,"",$new_fileread_1)
MsgBox(0,"",$new_fileread_2)

Edit: Added some remarks.

Edited October 28, 2010 by Mison

GEOSoft · October 28, 2010

@Mison

I'm not sure how you managed to test that because \R should not be matching a new line or anything except uppercase R. \r will match a carriage return, or \n will match a linefeed. Metacharacters are case sensitive.

Mison · October 28, 2010

@GEOSoft

I have tested this pattern and it works (strangely, if what you've said is true). I thought \R would match any newline sequence, i.e. \r\n, \r or \n.

Taken from PCRE 7.8 help file - "Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence."

If it doesn't behave the way I think it does, then replace "\R" with "[\r\n]+"
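One way to settle this on a particular AutoIt build is to test the candidates directly. The following is only a sketch; the array names are made up for the illustration, and "\v+" is the alternative mentioned in the code comments of the earlier post. It prints 1 where a pattern matches the given line-break style and 0 where it does not.

```autoit
; Line-break styles to test, with labels for the output.
Local $aBreaks[3] = [@CRLF, @LF, @CR]
Local $aNames[3] = ["CRLF", "LF", "CR"]
; Candidate patterns discussed in this thread.
Local $aPatterns[3] = ["\R", "[\r\n]+", "\v+"]
Local $iHit

For $i = 0 To UBound($aBreaks) - 1
    For $j = 0 To UBound($aPatterns) - 1
        ; With the default flag, StringRegExp returns 1 for a match and 0 for no match.
        $iHit = StringRegExp("a" & $aBreaks[$i] & "b", "^a" & $aPatterns[$j] & "b$")
        ConsoleWrite($aNames[$i] & " vs " & $aPatterns[$j] & " -> " & $iHit & @CRLF)
    Next
Next
```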

Edited October 28, 2010 by Mison

GEOSoft · October 28, 2010

There have probably been some changes since the last version used for the help file, although Unicode support in the AutoIt version of PCRE is normally poor. I'll play with it. I find that (?:\v+|$) actually works best for me, where \v+ matches any vertical whitespace (Chr(10), Chr(11), Chr(12) or Chr(13)) 1 or more times.

Mison · October 28, 2010

Okay, I have added remarks to my earlier post. I also changed [^\v] to \V since they are basically the same.

leuce · October 28, 2010

I suggest you don't try to do the whole thing with a single RegExp. Split things up and it will be a lot easier. Like: if those tables always have 4 rows, isolate them -> array with a single table per entry...

Thanks, yes, that's what I would normally do, but last night I wanted a quick solution to a problem and I thought that the right regex would do the job. Thanks for the spacing tip.

leuce · October 28, 2010

Does this replacement of the 3rd cell's contents happen before or after you copy the contents of the 3rd cell to the 4th?

I don't think it matters. The main point is that when the replacement is finished, the 4th cell's content is gone and the 3rd cell's content exists in both the 3rd and 4th cell.

My regex was very simple initially -- it simply deleted the 4th cell (including <td> tags) and then duplicated the entire 3rd cell (including <td> tags), but then it dawned on me that the 3rd and the 4th cell might have different attributes, so ideally only the content of the cells must be removed and duplicated. Also, a <td> tag can have attributes (e.g. <td width=5>) whereas a </td> tag will always be just </td>.
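A minimal sketch of that idea, assuming no attribute value contains a ">" character: "<td[^>]*>" matches an opening tag with or without attributes, and only the cell contents are captured and copied. The variable names and the width values are invented for the illustration.

```autoit
; Sample row with a different (made-up) attribute on every cell.
Local $sRow = "<tr>" & @CRLF & _
        "<td width=5>asdf asdf asdf</td>" & @CRLF & _
        "<td width=10>qwer qwer qwer</td>" & @CRLF & _
        "<td width=15>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td width=20>poiu poiu poiu</td>" & @CRLF & _
        "</tr>"

; \1 = everything up to and including the 3rd opening tag
; \2 = the 3rd cell's content
; \3 = from the 3rd </td> through the 4th opening tag
; The 4th cell's old content is matched but not captured, so it is dropped.
Local $sPattern = "(?s)(<tr[^>]*>.*?(?:<td[^>]*>.*?</td>.*?){2}<td[^>]*>)" & _
        "(.*?)(</td>.*?<td[^>]*>).*?(?=</td>)"

; The 4th cell keeps its own opening tag and receives the 3rd cell's content.
Local $sOut = StringRegExpReplace($sRow, $sPattern, "\1\2\3\2")
ConsoleWrite($sOut & @CRLF)
```

Because every quantifier is lazy and "." matches line breaks, a row with fewer than four cells would let the match run on into the following row, so this only suits tables where every row has at least four cells.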

Another reason why I wanted to be able to specify the content of the third cell as a unit in the regex is so that I can do stuff with it, e.g. replace all spaces with ###. I would be able to do that if the content of each cell is specified separately in the regex. But a loop script is probably better than trying to do everything in the regex.
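A loop version could look roughly like the sketch below. It is one possible reading of the requirement, not a tested solution for the real files: it assumes lowercase "</tr>" tags (StringSplit delimiters are case sensitive), at least four cells per row, and no ">" inside attribute values. All variable names are made up. Each row gets its 3rd cell marked with ### and the untouched 3rd-cell text copied into the 4th cell.

```autoit
Local $sHtml = "<table>" & @CRLF & _
        "<tr>" & @CRLF & _
        "<td>1</td>" & @CRLF & _
        "<td>a</td>" & @CRLF & _
        "<td>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td>poiu</td>" & @CRLF & _
        "</tr>" & @CRLF & _
        "</table>"

; Groups: [0] up to the 3rd opening tag, [1] 3rd content, [2] up to the 4th
; opening tag, [3] 4th content (discarded), [4] rest of the piece.
Local $sPattern = "(?s)^(.*?<tr[^>]*>.*?(?:<td[^>]*>.*?</td>.*?){2}<td[^>]*>)" & _
        "(.*?)(</td>.*?<td[^>]*>)(.*?)(</td>.*)$"

; Flag 1 = treat "</tr>" as one delimiter string. Element 0 holds the count;
; every piece except the last one is the text leading up to a "</tr>".
Local $aParts = StringSplit($sHtml, "</tr>", 1)
Local $sOut = "", $aM

For $i = 1 To $aParts[0]
    If $i < $aParts[0] Then
        ; Flag 1 returns the captured groups of the first match as an array.
        $aM = StringRegExp($aParts[$i], $sPattern, 1)
        If Not @error Then
            $aParts[$i] = $aM[0] & StringReplace($aM[1], " ", "###") & $aM[2] & $aM[1] & $aM[4]
        EndIf
        ; Put back the delimiter that StringSplit removed.
        $sOut &= $aParts[$i] & "</tr>"
    Else
        $sOut &= $aParts[$i]
    EndIf
Next
ConsoleWrite($sOut & @CRLF)
```

Pieces that do not match the pattern are written back unchanged, so text before, between and after the rows survives the round trip.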

==

You're probably wondering why I want all of this.

The input file is a file from a language translation client in which the 1st and 2nd cell contain reference numbers, the 3rd cell contains source text (to be translated), and the 4th cell is presently empty or has a code (that can be deleted), but should contain the translation in the end.

Translation tools work by overwriting the source text with a translation, so in order to have the translation in the 4th cell, the user has to first copy the source text (from cell 3) into it. The next step would be to ensure that the contents of the 3rd and 4th cells are not the same, so that would mean adding temporary junk characters to the 3rd cell so that its content is not translated by the translation tool when it translates the 4th cell.

The deadline for the job was last night, so in the end I just commented out the entire 3rd cell :-) so that the translation tool would ignore it, but for future work I'd like to see if I can refine the script to be more useful in multiple similar circumstances.

leuce · October 28, 2010

Is this what you want?
...
$newHTML = StringRegExpReplace($sHTML, "(?i)<tr>\s*(<td>.*</td>\s*)(<td>.*</td>\s*)(<td>.*</td>\s*)(<td>.*</td>\s*)</tr>", "<tr>$1$2$3$3</tr>")

That code would have worked for the specific files that I worked on last night, yes, because the table cells had no attributes, but in the future the table cells may also have attributes (e.g. <td width=50%>).

Using \s would match 99% of cases, so I'll certainly consider using it (the 1% of cases it won't match are cases in which the HTML author added comments before every table cell... I've seen this happen).

I'll also experiment with fewer back-references and more literals in the replace field, as you have done.

leuce · October 28, 2010

Here are some possibilities.

I tinkered with your second solution and it works, thanks. I just had to add closing brackets, so that cell attributes are retained and not duplicated:

Local $fileread = "<tr>" & @CRLF & _
        "<td foo=1>asdf asdf asdf</td>" & @CRLF & _
        "<td foo=2>qwer qwer qwer</td>" & @CRLF & _
        "<td foo=3>zxcv zxcv zxcv</td>" & @CRLF & _
        "<td foo=4>poiu poiu poiu</td>" & @CRLF & _
        "</tr>"

Local $sRes = StringRegExpReplace($fileread, "(?s)" & _
        "(<td.+?>)(.+?)(</td.+?" & _
        "<td.+?>)(.+?)(</td.+?" & _
        "<td.+?>)(.+?)(</td.+?" & _
        "<td.+?>)(.+?)(</td.+?)(.*?</tr>)", _
        "\1\2\3\4\5\6\7\6\9\10")

MsgBox (0, "", $sRes & @CRLF & @CRLF, 0)

Thanks again.
