Need logic improvement on string search

IAMK

I am using the following code

Local $pageSource = _IEDocReadHTML($ie) ;THIS THING IS HUUUUUUUUUUUUUUUUGE (~300k chars)

From the above, I need to search for about 50 occurrences of

<h3 class='heading_title'>
<a href="GYBERISH_ALPHANUMERIC_HERE"></a>

When I find <h3 class='heading_title'>, I want to split the next line on the double-quote character ("). That gives an array of size 3, with the 2nd element being what I need.

However, upon doing the above, I wish to execute other code, then CONTINUE from the place I found the previous <h3 class='heading_title'>, until I find the next <h3 class='heading_title'>.

The way I do it is split $pageSource by <h3 class='heading_title'>, but doing that on the entire source feels extremely bad.

$sourceParts = StringSplit($pageSource, "<h3 class='heading_title'>", 1)

Also, if the CONTINUE thing I mentioned above can be done nicely, I could then make the script faster by removing Local $pageSource =  and just feeding in the source directly from _IEDocReadHTML().

Note: $pageSource will be created and parsed 100+ times, so this script takes QUITE a long time for me to execute.

 

Question: What is the fastest way to get the GYBERISH_ALPHANUMERIC_HERE string, execute code, then continue to find the next GYBERISH_ALPHANUMERIC_HERE in the same string?
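
One way to approach this, as a minimal sketch rather than a definitive answer: let StringRegExp collect every href that follows the heading in a single pass, then loop over the results and run the per-link code inside the loop. The sample string, the variable names and the exact pattern are assumptions for illustration; the real page source may need a looser pattern.

```autoit
#include <StringConstants.au3>

; Hypothetical sample standing in for the real ~300k-character page source.
Local $sPageSource = "<h3 class='heading_title'>" & @CRLF & _
        '<a href="abc123"></a>' & @CRLF & _
        "<p>filler</p>" & @CRLF & _
        "<h3 class='heading_title'>" & @CRLF & _
        '<a href="def456"></a>'

; Capture the href value that follows each heading; \s* spans the line break.
; Inside a double-quoted AutoIt string, "" stands for one literal quote.
Local $sPattern = "<h3 class='heading_title'>\s*<a href=""([^""]*)"""
Local $aLinks = StringRegExp($sPageSource, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)

If @error Then
    ConsoleWrite("No matches found" & @CRLF)
Else
    For $i = 0 To UBound($aLinks) - 1
        ; Run the per-link code here; the loop then moves on to the next match.
        ConsoleWrite($aLinks[$i] & @CRLF)
    Next
EndIf
```

Because all matches are gathered up front, there is no need to track a resume position by hand, and the page source is scanned only once per page.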

FrancescoDiMuro

@IAMK
I don't know if this is a workable solution, but you could read ONLY the <a> elements from that web page and loop through them.
To do so, use _IETagNameGetCollection().
I think that could make your comparison faster :)

Subz

Didn't really understand the OP but you might be able to use something like this:

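#include <Array.au3>
#include <IE.au3>

Local $aHREF[1] ; element 0 is reserved for the link count
; _IECreate() with no argument opens about:blank; pass the page URL here.
Local $oIE = _IECreate()

; Walk every <h3> and keep only those with the wanted class.
Local $oH3Tags = _IETagNameGetCollection($oIE, "H3")
For $oH3Tag In $oH3Tags
    If $oH3Tag.ClassName = "heading_title" Then
        ; Collect the href of each <a> nested inside this heading.
        Local $oLinks = _IETagNameGetCollection($oH3Tag, "a")
        For $oLink In $oLinks
            _ArrayAdd($aHREF, $oLink.href)
        Next
    EndIf
Next
$aHREF[0] = UBound($aHREF) - 1 ; store the number of links found
_ArrayDisplay($aHREF)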

 

IAMK

@FrancescoDiMuro Hmm, I thought that would also take some time, but @Subz's solution seems fast enough (for now). I will need to play around with it some more. To be honest, from the look of the code, I expected it to be much slower.
Thank you.

Also, @FrancescoDiMuro, there are TOO many <a> tags. I think searching for the h3 first and then for the a inside it works better.

Is there a built-in way to say I want to get the 1st, skip the 2nd, and skip the 3rd "a" tag in a For ... In ... loop? Or should I just set up a counter variable + If statement?

FrancescoDiMuro
31 minutes ago, IAMK said:

Or should I just set up a counter variable + if statement?

Check the @extended value that is set on success, as described in the Return Value section of the _IETagNameGetCollection help page:

Success:    an object variable containing the specified Tag collection, @extended = specified Tag count.
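
As a rough sketch of how that might look in practice: @extended is a macro read right after the call, so the function itself does not need to be modified. The URL and variable names below are placeholders, not taken from this thread. The optional third argument of _IETagNameGetCollection is a zero-based index that returns a single element instead of the whole collection.

```autoit
#include <IE.au3>

; Placeholder URL (an assumption); use the page being scraped.
Local $oIE = _IECreate("http://www.example.com")

; Read @extended immediately; the next function call overwrites it.
Local $oLinks = _IETagNameGetCollection($oIE, "a")
Local $iCount = @extended
ConsoleWrite("Found " & $iCount & " <a> tags" & @CRLF)

; Passing a zero-based index as the third argument returns one element.
If $iCount > 0 Then
    Local $oFirst = _IETagNameGetCollection($oIE, "a", 0)
    ConsoleWrite("First href: " & $oFirst.href & @CRLF)
EndIf
```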

 


JLogan3o13
44 minutes ago, IAMK said:

Is there an inbuilt way to say I want to get the 1st, skip 2nd, skip 3rd "a" tag in the For ... In ... feature? Or should I just set up a counter variable + if statement?

Look in the help file for the For loop and its Step parameter. For...In itself takes no Step, so this means walking the collection by index with a For...To...Step loop instead.

IAMK

@JLogan3o13 Step was the first thing I tried, but I couldn't get it to work with For...In. How do you modify it?

@FrancescoDiMuro I tried looking up how to use @extended, but I couldn't find examples. I don't have to modify the _IETagNameGetCollection function itself, do I?

IAMK

@Subz   That is For...To. I am trying to use For...In (_IETagNameGetCollection). For now, I have just For...In'd the entire collection then I do For...To...Step.

I will try playing with @extended. Thanks

IAMK

I have another question. This is about _IEGetProperty().

I have the following source:

<div id='body_show_ori'>
僕は今日、仕事に行かなくて、一日中休みました。体は、まだ少し痛いですが、昨日より耐え得ります。4時に、家からスカイプで会議に参加しました。明日、仕事に戻ります。この動画をもう投稿したことがありますが、とても好きから、また投稿します。浅紫の浴衣は、綺麗過ぎます!また、水色の浴衣は、中国っぽいと思います。<br/><object width="560" height="315">
<param name="movie" value="https://www.youtube.com/v/Q2E7TLotcko"></param>
<embed src="https://www.youtube.com/v/Q2E7TLotcko" type="application/x-shockwave-flash" width="560" height="315"></embed>
</object>
<br/>また、その浅紫の浴衣の子は、あるゲームの大好きなキャラに似ています。<br/><a href="https://imgur.com/u1WzDCL" target="_blank">https://imgur.com/u1WzDCL</a>
</div>

If I get innertext, I get only the text that sits outside the nested tags, and if I get innerhtml, I get the text plus the HTML for the YouTube video and the Imgur link.

However, what I want is:

僕は今日、仕事に行かなくて、一日中休みました。体は、まだ少し痛いですが、昨日より耐え得ります。4時に、家からスカイプで会議に参加しました。明日、仕事に戻ります。この動画をもう投稿したことがありますが、とても好きから、また投稿します。浅紫の浴衣は、綺麗過ぎます!また、水色の浴衣は、中国っぽいと思います。
https://www.youtube.com/v/Q2E7TLotcko
また、その浅紫の浴衣の子は、あるゲームの大好きなキャラに似ています。
https://imgur.com/u1WzDCL

I can get the text with innertext, and the links by reading the value and href attributes, but I can't get them in a way that keeps the original ordering shown in the snippet above. How would I go about doing that?

Note: There can be multiple videos and images in the source, in any order.
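
One possible approach, sketched under assumptions: take the innerhtml string and rewrite it in place, so the order is preserved automatically. Each <object> block is replaced by the URL in its first <param> value, each <a> by its href, each <br> by a line break, and any remaining tags are dropped. The function name _FlattenBody and the shortened sample are hypothetical, and the markup IE actually returns from innerhtml may differ from the page source (tag case, attribute quoting), so the patterns may need adjusting.

```autoit
; Hypothetical helper: flattens the markup to text and URLs in source order.
Func _FlattenBody($sHtml)
    ; Replace each <object>...</object> with the URL from its first <param> value.
    $sHtml = StringRegExpReplace($sHtml, '(?is)<object\b.*?<param\b[^>]*\bvalue="([^"]*)".*?</object>', @CRLF & '$1' & @CRLF)
    ; Replace each <a href="...">...</a> with its href.
    $sHtml = StringRegExpReplace($sHtml, '(?is)<a\b[^>]*\bhref="([^"]*)"[^>]*>.*?</a>', '$1')
    ; Turn <br>, <br/> and <br /> into line breaks.
    $sHtml = StringRegExpReplace($sHtml, '(?i)<br\s*/?>', @CRLF)
    ; Drop any tags that are left.
    $sHtml = StringRegExpReplace($sHtml, '<[^>]+>', '')
    ; Collapse runs of line breaks and blank space into a single CRLF.
    $sHtml = StringRegExpReplace($sHtml, '(\r?\n\s*)+', @CRLF)
    ; Flag 3 strips leading and trailing whitespace.
    Return StringStripWS($sHtml, 3)
EndFunc   ;==>_FlattenBody

; Shortened stand-in for the div contents shown above.
Local $sSample = 'First text.<br/><object width="560" height="315">' & @LF & _
        '<param name="movie" value="https://www.youtube.com/v/Q2E7TLotcko"></param>' & @LF & _
        '<embed src="https://www.youtube.com/v/Q2E7TLotcko" type="application/x-shockwave-flash"></embed>' & @LF & _
        '</object>' & @LF & _
        '<br/>Second text.<br/><a href="https://imgur.com/u1WzDCL" target="_blank">https://imgur.com/u1WzDCL</a>'

ConsoleWrite(_FlattenBody($sSample) & @CRLF)
```

Since every replacement happens where the tag sits in the string, multiple videos and images in any order should come out in the order they appear. An <object> without a <param> value is not handled by this sketch.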

JLogan3o13

@IAMK try something like this:

For $a = 0 To Ubound($aArray) - 1 Step 3
    ...
Next
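
Applied to a tag collection instead of an array, the same idea might look like the sketch below. It assumes the collection object exposes the usual DOM item() method, and the URL is a placeholder rather than anything from this thread.

```autoit
#include <IE.au3>

; Placeholder URL (an assumption); use the page being scraped.
Local $oIE = _IECreate("http://www.example.com")
Local $oLinks = _IETagNameGetCollection($oIE, "a")
Local $iCount = @extended ; number of <a> tags in the collection

; Visit items 0, 3, 6, ... by indexing the collection directly.
For $i = 0 To $iCount - 1 Step 3
    Local $oLink = $oLinks.item($i)
    ConsoleWrite($oLink.href & @CRLF)
Next
```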

 


IAMK

@JLogan3o13  I have absolutely no idea why, but it turns out I don't need to step.

Simply having a while loop 1/3rd of the list does the job...

E.g.
While($linkArray[0] < 20)
    Local $oH3Tags = _IETagNameGetCollection($ie, "H3")

    For $oH3Tag In $oH3Tags
        If($oH3Tag.ClassName = "heading_title") Then
            $oLinks = _IETagNameGetCollection($oH3Tag, "a")
            For $oLink In $oLinks
                _ArrayAdd($linkArray, $oLink.href)
            Next
        EndIf
    Next

    $linkArray[0] = UBound($linkArray) - 1
WEnd

Without the outside loop, the 20 elements in the array become separated with 2 elements between each of them. It's black magic.
