
Parsing text from html



Is there a UDF or some special technique for parsing an HTML file in AutoIt? I have an HTML document and want to read a line from it. If regular expressions aren't supported, what do I do? Something like

<p>You have * new messages</p>

and have it grab the *? Or am I missing the point entirely?

I looked around and haven't found anything in the help file or on the forum. This is a common scripting task, so there are bound to be many solutions. Would anyone care to share one with me?

Here is another (very) simple example:

<html>
<head>
<title>Testing page</title>
</head>
<p>
The number is 347. 
</p>
</html>

Example:

$var=HTMLParse("test.htm","The number is [0-9\s][0-9\s][0-9\s].")
$var=StringReplace($var,"The number is ","")
$var=StringReplace($var," ","")

Is there an HTMLParse function that I don't know about? Or some parameters for StringReplace?

EDIT: I only looked through the stable release's online documentation and forgot about the beta. The regular expression functions are there, so ignore the part about RegExp. The question about parsing HTML still stands.

Edited by Andrew Sparkes

---Sparkes.


I hope this helps:

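#include <INet.au3>
$var = _INetGetSource("test.htm")
$var = _StringBetween($var, "The number is ", ".")
MsgBox(0, "", "You have " & $var & " messages")

; Returns the text between the first $from and the next $to that follows it
Func _StringBetween($s, $from, $to)
    $x = StringInStr($s, $from) + StringLen($from) ; position of the first character after $from
    ; StringTrimLeft drops one character more than the prefix, so $y is the number of characters before $to
    $y = StringInStr(StringTrimLeft($s, $x), $to)
    Return StringMid($s, $x, $y)
EndFunc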

Perhaps it would help more if you gave us a better context of usage.

Do you have the specified .htm file directly on your file system?

Are you trying to check webmail somewhere?

Are you receiving an email notification of some sort that contains this .htm information?

------

Off the top of my head, if this is a matter of straight parsing information from a file, then you might use FileOpen() and StringInStr().

If you are doing webmail, there may be a utility they provide that gives you a pop-up or tray icon that you can manipulate to get the information.

There are usually a lot of ways to skin a cat... you just need specifics on the cat in question to do it "right".
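
For the plain-file case, the FileOpen() and StringInStr() route could look roughly like the sketch below. It assumes the test.htm sample from the first post sits next to the script; the variable names and the choice of "." as the end marker are only illustrative.

```autoit
; Sketch only: "test.htm" and both marker strings come from the example in the first post
Local $sFile = "test.htm"
Local $hFile = FileOpen($sFile, 0) ; 0 = open for reading
If $hFile = -1 Then
    ConsoleWrite("Could not open " & $sFile & @CRLF)
    Exit 1
EndIf
Local $sHtml = FileRead($hFile) ; no count given, so the whole file is read
FileClose($hFile)

Local $sFrom = "The number is "
Local $iStart = StringInStr($sHtml, $sFrom)
If $iStart = 0 Then
    ConsoleWrite("Start marker not found" & @CRLF)
    Exit 1
EndIf
$iStart += StringLen($sFrom) ; first character after the start marker

; Search for the end marker beginning at $iStart (5th parameter = start position)
Local $iEnd = StringInStr($sHtml, ".", 0, 1, $iStart)
If $iEnd = 0 Then
    ConsoleWrite("End marker not found" & @CRLF)
    Exit 1
EndIf

ConsoleWrite(StringMid($sHtml, $iStart, $iEnd - $iStart) & @CRLF)
```

Checking both StringInStr() results for 0 keeps a missing marker from turning into a garbage substring.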



Func Parsehtml()
    $htmlexist = InetGet("http://www.thesite.com/your.html", "C:\your dir\yourfile.html", 1) ; get and save the file
    If $htmlexist Then ; checkpoint: a non-zero return means the download succeeded
        $file = FileOpen("C:\your dir\yourfile.html", 0) ; open the saved file for reading
        $line1 = FileReadLine($file, 89) ; read the line you want by its number
        $line11 = StringReplace($line1, 'junk', "") ; replace junk with nothing. Note that I used ' and not " in case you have " in the junk
        FileClose($file)
    EndIf
EndFunc

Within the function, keep applying StringReplace until your result is clean. Usually two passes will do it.

Cheers

(Edit: I forgot the EndIf at first; it is included above.)


The function that poisonkiller posted worked beautifully and I'll probably just use that, but I'm still curious about this regexp problem:

I have this:

<blah blah blah>The number is 10.</blah blah blah>

If I run StringRegExp on the string above with the pattern "The number is [0-9]*\.", it returns false, which I assume is because it tests the whole string against the pattern. Is there a function that checks whether a string contains a match for a regexp pattern and returns the matching substring, so I can work from there?

---Sparkes.
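
On the regexp question: in current AutoIt releases, StringRegExp() looks for the pattern anywhere in the string, and passing 1 as the third parameter returns an array of the captured groups instead of a true/false result. The sketch below assumes that behaviour (the beta mentioned in this thread may differ); the variable names are illustrative.

```autoit
; Sketch assuming a current AutoIt release; the beta discussed in this thread may behave differently
Local $sHtml = "<blah blah blah>The number is 10.</blah blah blah>"

; Flag 1 = return an array of the captured groups; the parentheses capture the digits
Local $aMatch = StringRegExp($sHtml, "The number is ([0-9]+)\.", 1)

If @error Then
    ConsoleWrite("No match" & @CRLF)
Else
    ConsoleWrite($aMatch[0] & @CRLF) ; first captured group
EndIf
```

Capturing the digits directly avoids the follow-up StringReplace calls used earlier in the thread.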


Use my string parsing function:

Func _StringParse($sz_str, $sz_before, $sz_after, $i_occurance = 0)
    ; Split on the opening delimiter; piece n + 1 starts right after the n-th occurrence
    Local $sz_sp1 = StringSplit($sz_str, $sz_before, 1)
    ; Valid occurrences are 1 to $sz_sp1[0] - 1; 0 means "the last one"
    If $i_occurance < 0 Or $i_occurance >= $sz_sp1[0] Then
        SetError(1)
        Return ""
    EndIf
    Local $sz_sp2 = _Test($i_occurance = 0, StringSplit($sz_sp1[$sz_sp1[0]], $sz_after, 1), StringSplit($sz_sp1[$i_occurance + 1], $sz_after, 1))
    Return $sz_sp2[1]
EndFunc

; Returns $v_True if $b_Test is true, otherwise $v_False
Func _Test($b_Test, $v_True, $v_False)
    If $b_Test Then Return $v_True
    Return $v_False
EndFunc

Example:
<tag>first occurrence</tag>
<tag>second occurrence</tag>
<tag>third and last occurrence</tag>

_StringParse($str, "<tag>", "</tag>") would give you "third and last occurrence"
_StringParse($str, "<tag>", "</tag>", 1) would give you "first occurrence"
_StringParse($str, "<tag>", "</tag>", 2) would give you "second occurrence"
_StringParse($str, "<tag>", "</tag>", 3) would give you "third and last occurrence"
