RegExp --- cannot figure this out...


orange

<td nowrap class="crn" onclick="show_title(2);">12345</td>

That line is repeated about 30 times in an HTML file that I have to parse.

I can't for the life of me get this regexp to work.

I need an array of all the strings between the > and the </td>. Each one is either a 5-digit number or a letter followed by 4 digits: 12345 or A1234.

For whatever reason I can't do this.

Is there anyone who knows this better than I do who can write this line for me?

By the way, the entire line needs to be part of the search pattern, and the number in show_title(2) ranges from 0 to 30.

Can anyone help me with this?

Here is a rough example:

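#include <File.au3>
#include <Array.au3>

Local $aLines
; Read the file into a 1-based array; $aLines[0] holds the line count.
If Not _FileReadToArray("myhtml.txt", $aLines) Then Exit MsgBox(0, "Error", "Could not read myhtml.txt")

Local $aResults[$aLines[0] + 1] = [$aLines[0]]
For $i = 1 To $aLines[0]
    $aResults[$i] = _ParseLine($aLines[$i])
Next
_ArrayDisplay($aResults, "")

; Return the text between the first '>' and the next '<', or "" if the line has no '>'.
Func _ParseLine($sLine)
    Local $aSplit1 = StringSplit($sLine, '>')
    If $aSplit1[0] < 2 Then Return ""
    Local $aSplit2 = StringSplit($aSplit1[2], '<')
    Return $aSplit2[1]
EndFunc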

and the myhtml.txt file:

<td nowrap class="crn" onclick="show_title(2);">1234A</td>
<td nowrap class="crn" onclick="show_title(2);">1234B</td>
<td nowrap class="crn" onclick="show_title(2);">1234C</td>
<td nowrap class="crn" onclick="show_title(2);">1234D</td>
<td nowrap class="crn" onclick="show_title(2);">1234E</td>
<td nowrap class="crn" onclick="show_title(2);">1234F</td>
Edited by CHRIS95219

This pattern may help

$string = '<td nowrap class="crn" onclick="show_title(2);">12345</td>' & @CRLF & _
        '<td nowrap class="crn" onclick="show_title(3);">123456</td>' & @CRLF & _
        '<td nowrap class="crn" onclick="show_title(4);">A12345</td>'
$pattern = '<td nowrap class="crn" onclick=".*;">(.*)</td>'
$result = StringRegExp($string, $pattern, 3)
If Not @error Then
    For $i = 0 To UBound($result) -1
        MsgBox(0, '', $result[$i])
    Next
EndIf
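
As a tighter variant, here is a minimal sketch. The pattern above accepts any cell text, so this one assumes the cells hold only the two formats named in the original question (five digits, or one letter followed by four digits) and that the handler is always show_title(N). The sample string and the variable names are made up for illustration.

```autoit
; Sample input (hypothetical); the third cell deliberately fits neither format.
Local $sHtml = '<td nowrap class="crn" onclick="show_title(0);">12345</td>' & @CRLF & _
        '<td nowrap class="crn" onclick="show_title(1);">A1234</td>' & @CRLF & _
        '<td nowrap class="crn" onclick="show_title(2);">hello</td>'

; \(\d+\) matches the literal parentheses and the changing index.
; The capture group accepts five digits, or one letter followed by four digits.
Local $sPattern = '<td nowrap class="crn" onclick="show_title\(\d+\);">(\d{5}|[A-Za-z]\d{4})</td>'

; Flag 3 returns an array holding the captured group of every match.
Local $aCodes = StringRegExp($sHtml, $sPattern, 3)
If @error Then
    MsgBox(0, 'No match', 'The pattern did not match anything.')
Else
    For $i = 0 To UBound($aCodes) - 1
        ConsoleWrite($aCodes[$i] & @CRLF)
    Next
EndIf
```

With this input the loop should write 12345 and A1234 and skip the "hello" cell. Escaping the parentheses matters: an unescaped ( would start a new capture group instead of matching the character.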

I'm not in a place to test this right now, but the RegExp is what I was going for. Thanks for the quick response.

[The poster's own code followed here; it was corrupted by the forum's encoding and cannot be recovered.]

$result is always 1.

Any ideas?

StringStripWS(..., 8) has stripped all whitespace out of the string, so no wonder it fails: the pattern expects three spaces within the string in order to match.
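
To see the difference between the flags, here is a small sketch. The sample line is made up; the flag values are the documented ones (1 strips leading whitespace, 2 strips trailing whitespace, 8 strips all whitespace).

```autoit
; Hypothetical sample line with padding on both sides.
Local $sLine = '  <td nowrap class="crn" onclick="show_title(2);">12345</td>  '

; Flag 8 removes every whitespace character, including the spaces
; between the attributes, so a pattern containing spaces can no longer match.
ConsoleWrite(StringStripWS($sLine, 8) & @CRLF)

; Flag 3 (1 + 2) removes only leading and trailing whitespace
; and leaves the spaces inside the tag alone.
ConsoleWrite(StringStripWS($sLine, 3) & @CRLF)
```

The first line written should read <tdnowrapclass="crn"onclick="show_title(2);">12345</td>, which is why the pattern with spaces finds nothing.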

Well, that was a problem. It's fixed now, but still nothing. The return value is 1.

Thanks for sending me "Copy_of_class.html" via PM.

Your source remains intact if you use FileRead(), but I assume this HTML page may be on the net, so using _IEBodyReadHTML() is perhaps preferred. _IEBodyReadHTML() strips quotes and rearranges the HTML tags in the string it returns, so the regex quite correctly fails to match.

This is the modified pattern. I also noticed that the case of the characters was different, so I added case insensitivity by using "(?i)".

#include <IE.au3>

$ie_object = _IECreate(@ScriptDir & "\Copy_of_class.html", 0, 0)
$string = StringStripWS(_IEBodyReadHTML($ie_object), 3)

; Actual Line to catch: "<TD class=crn onclick=show_course(0); noWrap>47738</TD>"
$pattern = '(?i)<td class=crn onclick=.*; noWrap>(.*)</td>'
$result = StringRegExp($string, $pattern, 3)
If Not @error Then
    For $i = 0 To UBound($result) - 1
        MsgBox(0x40000, $i, $result[$i])
    Next
EndIf
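
If the same script has to cope with both the FileRead() form and the _IEBodyReadHTML() form, one option is a pattern that does not depend on quoting, case, or attribute order. This is only a sketch under assumptions: it treats any td whose class starts with crn as a hit, and it assumes no attribute value contains a > character. The two sample lines are copied from this thread; the variable names are made up.

```autoit
; One line as it appears in the file, one as _IEBodyReadHTML() returned it.
Local $sHtml = '<td nowrap class="crn" onclick="show_title(2);">12345</td>' & @CRLF & _
        '<TD class=crn onclick=show_course(0); noWrap>47738</TD>'

; (?i)       case-insensitive matching
; [^>]*      any attributes before and after class, in any order
; "?crn"?    class value with or without quotes
; ([^<]*)    cell text up to the closing tag
Local $sPattern = '(?i)<td\b[^>]*\bclass="?crn"?[^>]*>([^<]*)</td>'

Local $aCells = StringRegExp($sHtml, $sPattern, 3)
If Not @error Then
    For $i = 0 To UBound($aCells) - 1
        ConsoleWrite($aCells[$i] & @CRLF)
    Next
EndIf
```

Using [^>]* and [^<]* instead of .* also keeps each match inside a single tag, so several cells on one line are captured separately.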

Thanks very much, it looks like everything is in order!
