UnKnOwned

New to Autoit Syntax - String Parsing

Good Day,

I have a tag inside an HTML file that needs parsing.

TAG

<div class="W3 Schools Link WHITESPACES " title="Hello World    "></div>

CURRENT METHOD

$Delim1 = StringSplit($tagsParameters, ' ') ; ...I did not account for this reaction initially =/
For $k = 1 To $Delim1[0]
    If $Delim1[$k] = '' Then
        ; catch empty strings
    Else
        $Delim2 = StringSplit($Delim1[$k], '=')
        For $l = 1 To $Delim2[0]
            $KeyValue = $Delim2[1] ; *** KEY VALUE NAME ***
            $KeyParameter = $Delim2[2] ; *** KEY PARAMETER VALUE ***
        Next
        $KeyParameter = StringReplace($KeyParameter, '"', "")
    EndIf
Next

I'm a moron of course, not compensating for all those " SPACES"

FUTILE ATTEMPTS

$DelimX = StringSplit($tagsparameters, '="')
For $l = 1 To $DelimX[0]
    $KeyValue = $DelimX[1] ; *** KEY VALUE NAME ***
    $KeyParameter = $DelimX[3] ; *** KEY PARAMETER VALUE ***
Next
MsgBox(0, '', 'Key Value Name: ' & $KeyValue & @LF & 'Key Parameter: ' & $KeyParameter)

This of course does not work because 'title="Hello World "' exists in array[4]
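
A note on why the pieces land where they do: by default StringSplit treats every character of the delimiter string as its own delimiter, so '="' splits at each = and at each double quote. The sketch below compares that with flag 1, which treats =" as one delimiter. The variable names are only for illustration, and neither variant solves the parsing problem on its own.

```autoit
Local $sTag = '<div class="W3 Schools Link WHITESPACES " title="Hello World "></div>'

; Default flag 0: "=" and the double quote are two separate delimiters,
; so empty pieces appear between each = and the quote that follows it.
Local $aChars = StringSplit($sTag, '="')
ConsoleWrite('Per-character split: ' & $aChars[0] & ' pieces' & @LF)

; Flag 1: the two-character sequence =" is treated as a single delimiter.
Local $aWhole = StringSplit($sTag, '="', 1)
For $i = 1 To $aWhole[0]
    ConsoleWrite('[' & $i & '] ' & $aWhole[$i] & @LF)
Next
```

With flag 1 each piece still carries the closing quote and the next attribute name, so some further trimming would be needed.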

I can't seem to remember how to get For $x = 1 to $arr[0][0] to cooperate properly.

I'm overlooking something I already know the answer to, but too many hours staring at this screen have left me unable to see it.

The function was written absent-mindedly, without considering that there would be spaces in quoted strings.

Too much of THIS_AndThatAndThisAndThat again has made me lazy.

Regex solutions are fine but I would prefer to keep it as is.

Thank You & Regards,

UnKnOwned

#3 ·  Posted (edited)

JohnOne,

Thank you for the prompt response.

My apologies for not being more specific in the original post.

The quoted string needs to remain intact as it is passed along to another function.

When the string is passed along it throws the "Array has incorrect subscripts" error because of how the string was parsed initially.

The parameters come back wrong: "Hello World" is returned as the truncated value "Hello, whereas "HelloWorld" is returned correctly.

Edited by UnKnOwned

#5 ·  Posted (edited)

$tagsparameters = '<div class="W3 Schools Link WHITESPACES " title="Hello World "></div>'

Loop through these parameters. This is nothing more than a giant loop that runs through a list of functions, processing the parameters as it goes.

So...

$KeyValue = class
$KeyParameter = "W3 Schools Link WHITESPACES"

...and

$KeyValue = title
$KeyParameter = "Hello World "

cont...

If the value is formatted as "HelloWorld" it works fine. Formatted as "Hello World" it does not, because the first example code splits the string at every whitespace. The split should happen between attributes: class="W3 Schools Link WHITESPACES" (right here is the 1st parse) title="Hello World" someOtherParam="whatever" etc...

Instead, the first example code produces class="W3(1st parse) Schools Link WHITESPACES", which is wrong.

The second example code works better but passes over all the other names & params: class= would be picked up but not title=.

I'm terribly sorry, I hope this somewhat makes sense... :x

Again I'm being overly complicated in my explanations, sorry.

"HTML PARSER"

Simply get the parameters and values inside the tag.

Edited by UnKnOwned
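
For readers who want to stay with plain string functions, here is a minimal sketch of one way to keep quoted values intact: scan for each =" and cut at the closing quote instead of splitting on spaces. The helper name _PrintAttributes is hypothetical and not from this thread. The sketch assumes every value is wrapped in double quotes and contains no embedded double quote.

```autoit
; Hypothetical helper, not from the thread: prints each name="value" pair of a tag.
Func _PrintAttributes($sTag)
    Local $iPos = 1, $iEq, $iClose, $iSpace, $sChunk, $sName, $sValue
    While 1
        ; Find the next =" that opens a quoted value
        $iEq = StringInStr($sTag, '="', 0, 1, $iPos)
        If $iEq = 0 Then ExitLoop
        ; Find the quote that closes that value
        $iClose = StringInStr($sTag, '"', 0, 1, $iEq + 2)
        If $iClose = 0 Then ExitLoop
        ; The name is the last space-delimited word before the "="
        $sChunk = StringMid($sTag, $iPos, $iEq - $iPos)
        $iSpace = StringInStr($sChunk, ' ', 0, -1)
        $sName = StringMid($sChunk, $iSpace + 1)
        ; The value is everything between the two quotes, spaces included
        $sValue = StringMid($sTag, $iEq + 2, $iClose - $iEq - 2)
        ConsoleWrite($sName & ' = "' & $sValue & '"' & @LF)
        ; Continue scanning after the closing quote
        $iPos = $iClose + 1
    WEnd
EndFunc

_PrintAttributes('<div class="W3 Schools Link WHITESPACES " title="Hello World "></div>')
```

Because the cut points are the quotes rather than the spaces, trailing spaces inside a value such as "Hello World " survive unchanged.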

UnKnOwned,

This works for the string posted. I use regexp because it's the easiest (best) way to do this.

Local $tagsparameters = '<div class="W3 Schools Link WHITESPACES " title="Hello World "></div>'
; Flag 3 returns every name="value" match as an array (values limited to word characters and spaces)
Local $aRet = StringRegExp($tagsparameters, '(\w+="[ \w]+")', 3), $aTemp

For $i = 0 To UBound($aRet) - 1
    ; Separate name from value; assumes the value itself contains no "="
    $aTemp = StringSplit($aRet[$i], '=')
    If $aTemp[0] <> 2 Then ConsoleWrite('Error - ' & $aRet[$i] & @LF)
    ConsoleWrite(StringFormat('Key #' & $i & ' = %-20s value = %-20s', $aTemp[1], $aTemp[2]) & @LF)
Next
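
A possible variation, offered as a sketch and not as the code above: capture the name and the value in two separate groups and allow any character except a double quote inside the value. That should also cover values with punctuation, such as title="Rocks !!!". The pattern and variable names here are assumptions for illustration.

```autoit
Local $sTag = '<div class="Autoit Forums" title="Rocks !!!">Some More Text</div>'

; Group 1 = attribute name, group 2 = everything up to the closing quote.
; Flag 3 returns all matches in one flat array: name, value, name, value, ...
Local $aPairs = StringRegExp($sTag, '(\w+)="([^"]*)"', 3)
If @error Then
    ConsoleWrite('No attributes found' & @LF)
Else
    For $i = 0 To UBound($aPairs) - 1 Step 2
        ConsoleWrite($aPairs[$i] & ' = "' & $aPairs[$i + 1] & '"' & @LF)
    Next
EndIf
```

With separate groups there is no second StringSplit on "=", so an "=" inside a value would not break the pair apart.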

kylomas


"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

JohnOne & kylomas,

Thank you both very much for your assistance.

JohnOne, I am curious about this approach, as it provides the data needed in a couple of other areas. Could you possibly give an example that grabs not just the inner text but the values and parameters as well?

I am having a bit of trouble getting...

#include <IE.au3>

Local $oIE = _IEDocReadHTML("Test.html")
Local $oElements = _IETagNameAllGetCollection($oIE)
For $oElement In $oElements
MsgBox(0, "Element Info", "Tagname: " & $oElement.tagname & @CR & "innerText: " & $oElement.innerText)
Next

So, read the locally stored file get the tag names, values, properties, and inner text.

kylomas I do agree with you in being the easiest way, and perhaps even the best, but correct me if I am wrong. Is it the fastest way?

If I am wrong I will more than likely switch over my approach.

#9 ·  Posted (edited)

UnKnOwned,

kylomas I do agree with you in being the easiest way, and perhaps even the best, but correct me if I am wrong. Is it the fastest way?

If I am wrong I will more than likely switch over my approach.

No, this is the worst way to do it, for a variety of reasons. Read the IE doc and you'll see why that is the best approach. I only offered the string parser solution because you were already on that road.

kylomas

Edited by kylomas


#10 ·  Posted (edited)

kylomas,

Thank you for clarifying, I thought perhaps there was something I was missing. :>

So, IE UDF it is. I searched local help file & here with little to no luck.

How do I go about reading from a locally stored HTML & using _IETagNameAllGetCollection($oIE)?

<div value1="param 1" value2="param 2">INNER TEXT</div>

I think I am missing something again, :mad2: I can not find how to retrieve the params or formatting properly :>

Edited by UnKnOwned

UnKnOwned,

Here is an example of something I use to process downloaded HTML. Maybe you can adapt it.

#include <IE.au3>

local $fln = 'k:\sd\sd0100\nba\boxes\400440940'         ; this is a text file downloaded with inetget

_get_links( fileread($fln) )
ConsoleWrite(@error & @LF)

func _get_links($html)

    Local $o_htmlfile = ObjCreate('HTMLFILE'), $str
    If Not IsObj($o_htmlfile) Then Return SetError(-1)

    $o_htmlfile.open()
    $o_htmlfile.write($html)
    $o_htmlfile.close()

    Local $ocol = _IETagnameGetCollection($o_htmlfile, 'a')
    if not isobj($ocol) then return seterror(-2)

    for $o in $ocol
        ConsoleWrite('innertext = ' & $o.innertext & @LF)
        ConsoleWrite('href      = ' & $o.href & @LF)
        ConsoleWrite('-----------------------' & @LF)
    next

endfunc

kylomas



#12 ·  Posted (edited)

kylomas,

Playing around with your solution allows me to store predefined tags into an array retrieving .tagname & .innerText in this manner if any of the tags are found in the document. This is working for me at the moment as a workaround for the method used in the Autoit help documentation which I can not seem to make work. :>

Autoit Document Example

#include <IE.au3>

Local $oIE = _IE_Example("basic")
Local $oElements = _IETagNameAllGetCollection($oIE)
For $oElement In $oElements
MsgBox(0, "Element Info", "Tagname: " & $oElement.tagname & @CR & "innerText: " & $oElement.innerText)
Next

I've tried combining your method with the doc example, again with no luck. :think:

Local $html = ( fileread('TEST.html') )
Local $o_htmlfile = ObjCreate('HTMLFILE'), $str

$o_htmlfile.open()
$o_htmlfile.write($html)
$o_htmlfile.close()

Local $oElements = _IETagNameAllGetCollection($o_htmlfile)
For $oElement In $oElements
MsgBox(0, "Element Info", "Tagname: " & $oElement.tagname & @CR & "innerText: " & $oElement.innerText)
Next

I finally got the below to work minus the ability to retrieve the code attributes. :puke:

Workaround

$val = 2

Global $tags[$val] = ["td", "div"]

$html = ( fileread('TEST.html') )

Local $o_htmlfile = ObjCreate('HTMLFILE'), $str

$o_htmlfile.open()
$o_htmlfile.write($html)
$o_htmlfile.close()

For $j = 0 To $val - 1
    ; Keep the collection and the loop variable in separate variables
    Local $oCol = _IETagnameGetCollection($o_htmlfile, $tags[$j])
    For $oTag In $oCol
        MsgBox(0, '', 'TagName = ' & $oTag.tagname & @LF & 'InnerText = ' & $oTag.innertext)
    Next
Next

This will allow me to loop through the array. I still can't figure out how to retrieve value="params" etc...

Looping through in this manner is sloppy and unnecessary. I was able to get the doc help example to work with local files, but it ended up outputting HTML, HEAD, BODY, the html comment & a whole lot of other unwanted data.

How can I fix this?

Using Dale's IE functions how can I catch the tags and separate the major parts such as "tag name" "attributes" & "elements"/"innerText"?

Example

<div class="W3 SchoolsLink" title="Hello World">Some Text</div><div class="Autoit Forums" title="Rocks !!!">Some More Text</div>

I would like to end up with the following when output to the console or a msgbox:

TagName = "<div"
Class = "W3 SchoolsLink"
Title = "Hello World"
Element = "Some Text"

TagName = "<div"
Class = "Autoit Forums"
Title = "Rocks !!!"
Element = "Some More Text"
Edited by UnKnOwned
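
One way to get close to that output, sketched under assumptions and not taken from a reply in this thread: reuse the HTMLFILE object and _IETagNameGetCollection from the posts above, ask only for div elements so that HTML, HEAD and BODY never show up, and read the standard DOM properties className and title from each element. The variable names are hypothetical, and the sample markup is the one from the example above.

```autoit
#include <IE.au3>

Local $sHtml = '<div class="W3 SchoolsLink" title="Hello World">Some Text</div>' & _
        '<div class="Autoit Forums" title="Rocks !!!">Some More Text</div>'

; Load the markup into an in-memory document, as in the earlier posts
Local $oDoc = ObjCreate('HTMLFILE')
If Not IsObj($oDoc) Then Exit 1
$oDoc.open()
$oDoc.write($sHtml)
$oDoc.close()

; Request only the tag of interest instead of every element in the document
Local $oDivs = _IETagNameGetCollection($oDoc, 'div')
If Not IsObj($oDivs) Then Exit 2

For $oDiv In $oDivs
    ConsoleWrite('TagName = ' & $oDiv.tagName & @LF)
    ConsoleWrite('Class   = ' & $oDiv.className & @LF) ; the class attribute is exposed as className
    ConsoleWrite('Title   = ' & $oDiv.title & @LF)
    ConsoleWrite('Element = ' & $oDiv.innerText & @LF & @LF)
Next
```

For attributes without a matching DOM property, such as value1 in the earlier example, $oDiv.getAttribute('value1') is the usual DOM call. How it behaves inside the HTMLFILE object has not been verified here.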
