HTML Page Sorting based on embedded values


Okay, where to begin....

I have x number of pages. Each one looks like this:

blah blah blah Username blah blah blah level blah

(Note: these are all html files.)

They are called page1.htm, page2.htm and so on. They are organized so that the highest value in level is on page 1 and the lowest on the last page. If I want to find the pages that contain level 50s, for example, how would I go about this quickly? I have written a script that does what I want, but it takes far too much time... Here is the sorting code I have so far:

The code below *should* work, but I haven't had the patience to wait for it and find out...

#include <IE.au3>
#include <INet.au3>

Dim $myarray[100][100]; declaring array for the write to table func
; begin code for finding start page NEEDS TESTING
$min=30; minimum level
$max=40; maximum level
$page=1; page to start on
While 1
    $myie=_IECreate()
    _IELoadWait($myie)
    Sleep(1000)
    $src=_INetGetSource('http://www.blahblah.com/page'&$page&'.htm')
    _IEBodyWriteHTML($myie,$src)
    $mytable=_IETableGetObjByIndex($myie,13)
    $myarray=_IETableWriteToArray($mytable)
    $level=$myarray[4][3]
    _IEQuit($myie)
    Traytip("",'page is '&$page&' and the level is: '&$level,10000)
    If $level < $min Then
        $page=$page-100
        ContinueLoop
    EndIf
    If $level > $max Then
        $page=$page+100
        ContinueLoop
    EndIf
    If $level >= $min and $level <= $max Then
        ExitLoop
    EndIf
WEnd
While 1 
    $myie=_IECreate()
    _IELoadWait($myie)
    Sleep(1000)
    $src=_INetGetSource('http://www.blahblah.com/page'&$page&'.htm')
    _IEBodyWriteHTML($myie,$src)
    $mytable=_IETableGetObjByIndex($myie,13)
    $myarray=_IETableWriteToArray($mytable)
    $level=$myarray[4][3]
    _IEQuit($myie)
    Traytip("",'page is '&$page&' and the level is: '&$level,10000)
    If $level >= $min Then
        $page=$page+10
        ContinueLoop
    EndIf
    ExitLoop
WEnd
While 1
    $myie=_IECreate()
    _IELoadWait($myie)
    Sleep(1000)
    $src=_INetGetSource('http://www.blahblah.com/page'&$page&'.htm')
    _IEBodyWriteHTML($myie,$src)
    $mytable=_IETableGetObjByIndex($myie,13)
    $myarray=_IETableWriteToArray($mytable)
    $level=$myarray[4][3]
    _IEQuit($myie)
    Traytip("",'page is '&$page&' and the level is: '&$level,10000)
    If $level<$min Then
        $page = $page-1
        ContinueLoop
    EndIf
    ExitLoop
WEnd
MsgBox(0,'should be page with the first lvl a 30',$page)

(Note: Traytips are for debugging...)

Also, on a side note, I'd like to declare Dale Hohm my personal hero for creating the IE Automation Library.

---Sparkes.


Also, on a side note, I'd like to declare Dale Hohm my personal hero for creating the IE Automation Library.

Very glad you find it useful.

There are several things you are doing in the script that slow it down significantly.

First, creating, destroying and then recreating a browser instance each time adds a tremendous amount of time to each step. You can reuse the original browser and then destroy it just once when you are all done.

Second, you can probably speed up general HTML access by using _INetGetSource instead of IE to get and parse the HTML, but if you are going to create a browser instance anyway and ask the browser layout manager to do its job, there is no sense in using both. I suggest you replace the _IELoadWait, the Sleep, the _INetGetSource, and the _IEBodyWriteHTML with a simple _IENavigate (which also does an _IELoadWait for you).

You should find the following code executes A LOT faster. There may be some other optimization that can be performed, but this should make a big difference.

#include <IE.au3>

Dim $myarray[100][100]; declaring array for the write to table func
; begin code for finding start page NEEDS TESTING
$min=30; minimum level
$max=40; maximum level
$page=1; page to start on

$myie=_IECreate()

While 1
    _IENavigate($myie, 'http://www.blahblah.com/page'&$page&'.htm')
    $mytable=_IETableGetObjByIndex($myie,13)
    $myarray=_IETableWriteToArray($mytable)
    $level=$myarray[4][3]
    Traytip("",'page is '&$page&' and the level is: '&$level,10000)
    If $level < $min Then
        $page=$page-100
        ContinueLoop
    EndIf
    If $level > $max Then
        $page=$page+100
        ContinueLoop
    EndIf
    If $level >= $min and $level <= $max Then
        ExitLoop
    EndIf
WEnd
While 1
    _IENavigate($myie, 'http://www.blahblah.com/page'&$page&'.htm')
    $mytable=_IETableGetObjByIndex($myie,13)
    $myarray=_IETableWriteToArray($mytable)
    $level=$myarray[4][3]
    Traytip("",'page is '&$page&' and the level is: '&$level,10000)
    If $level >= $min Then
        $page=$page+10
        ContinueLoop
    EndIf
    ExitLoop
WEnd
While 1
    _IENavigate($myie, 'http://www.blahblah.com/page'&$page&'.htm')
    $mytable=_IETableGetObjByIndex($myie,13)
    $myarray=_IETableWriteToArray($mytable)
    $level=$myarray[4][3]
    Traytip("",'page is '&$page&' and the level is: '&$level,10000)
    If $level<$min Then
        $page = $page-1
        ContinueLoop
    EndIf
    ExitLoop
WEnd
MsgBox(0,'should be page with the first lvl a 30',$page)

_IEQuit($myie)
$myie = 0
Exit

Also, a side note about writing the contents of _INetGetSource to the page with _IEBodyWriteHTML: _INetGetSource returns everything inside the <HTML></HTML> tags on a page (i.e. EVERYTHING), while _IEBodyWriteHTML only replaces what is inside the <BODY></BODY> tags. What you are doing therefore ends up creating a page with a duplicated header section (the part of the page between the <HTML> and <BODY> tags, usually where JavaScript is stored). This may not be a problem, but it might be, and it could add a lot of extra bulk to the page. The next release of IE.au3 has an _IEDocWriteHTML and an _IEDocReadHTML that will allow you to rewrite the full page.

Dale

Edit: Fixed typos


Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

Doesn't work needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble


With your suggestions it sped up quite a bit, but there has to be a better pattern for finding something like this. I reprogrammed the page-hunting part and it cut the time by about 75%.

Is there any algorithm master out there with information to contribute?

There are about 5000 pages, and it takes approximately 50 or 60 passes to find some pages.

A question directed towards DaleHohm:

Second, you can probably speed up general HTML access by using _INetGetSource instead of IE to get and parse the HTML, but if you are going to create a browser instance anyway and ask the browser layout manager to do its job, there is no sense in using both.

You said that I should use _INetGet, but will the IE table write to array function still work on raw html? That is the main source of this code's life. Could you perhaps give an example of using _INetGet for this?

Here is the updated page finding code:

    While 1
        _IENavigate ($myie, 'http://www.mysite.com/page.php&page=' & $page)
        $mytable = _IETableGetObjByIndex ($myie, 13)
        $myarray = _IETableWriteToArray ($mytable)
        $level = $myarray[4][3]
        If Int($level) >= Int($min) Then
            $page = $page + 250
            ContinueLoop
        EndIf
        TrayTip('', 'done with level 1', 5000)
        ExitLoop
    WEnd
    While 1
        _IENavigate ($myie, 'http://www.mysite.com/page.php&page=' & $page)
        $mytable = _IETableGetObjByIndex ($myie, 13)
        $myarray = _IETableWriteToArray ($mytable)
        $level = $myarray[4][3]
        If Int($level) <= Int($min) Then
            $page = $page - 100
            ContinueLoop
        EndIf
        TrayTip('', 'done with level 2', 5000)
        ExitLoop
    WEnd
    While 1
        _IENavigate ($myie, 'http://www.mysite.com/page.php&page=' & $page)
        $mytable = _IETableGetObjByIndex ($myie, 13)
        $myarray = _IETableWriteToArray ($mytable)
        $level = $myarray[4][3]
        If Int($level) >= Int($min) Then
            $page = $page + 20
            ContinueLoop
        EndIf
        TrayTip('', 'done with level 3', 5000)
        ExitLoop
    WEnd
    While 1
        _IENavigate ($myie, 'http://www.mysite.com/page.php&page=' & $page)
        $mytable = _IETableGetObjByIndex ($myie, 13)
        $myarray = _IETableWriteToArray ($mytable)
        $level = $myarray[4][3]
        If Int($level) <= Int($min) Then
            $page = $page - 5
            ContinueLoop
        EndIf
        TrayTip('', 'done with level 4', 5000)
        ExitLoop
    WEnd
    While 1
        _IENavigate ($myie, 'http://www.mysite.com/page.php&page=' & $page)
        $mytable = _IETableGetObjByIndex ($myie, 13)
        $myarray = _IETableWriteToArray ($mytable)
        $level = $myarray[4][3]
        If Int($level) >= Int($min) Then
            $page = $page + 1
            ContinueLoop
        EndIf
        TrayTip('', 'done with level 5', 5000)
        ExitLoop
    WEnd

It could probably be optimized by not passing the start page, but I am a bit unsure of what to change to do this....
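Since the pages are already sorted by level, one option that avoids overshooting is a binary search: look at the middle page, discard the half that cannot contain the boundary, and repeat. The sketch below is only an illustration under stated assumptions. _GetLevel and _FindFirstPageBelow are hypothetical names, and _GetLevel fakes 5000 pages of data so the search can be run without a browser. In the real script its body would be the _IENavigate, _IETableGetObjByIndex and _IETableWriteToArray calls shown above.

```autoit
; Hypothetical stand-in for the IE calls: returns the level shown on a page.
; It fakes 5000 pages whose levels fall from 100 down to 1, 50 pages per level.
Func _GetLevel($iPage)
    Return 100 - Int(($iPage - 1) / 50)
EndFunc

; Returns the first page whose level is below $iMin,
; or $iLastPage + 1 if every page is at or above $iMin.
Func _FindFirstPageBelow($iMin, $iLastPage)
    Local $iLow = 1, $iHigh = $iLastPage + 1, $iMid
    While $iLow < $iHigh
        $iMid = Int(($iLow + $iHigh) / 2)
        If Int(_GetLevel($iMid)) >= Int($iMin) Then
            $iLow = $iMid + 1 ; the boundary lies after this page
        Else
            $iHigh = $iMid ; this page or an earlier one is the boundary
        EndIf
    WEnd
    Return $iLow
EndFunc

; With the fake data above, level 30 ends on page 3550.
ConsoleWrite(_FindFirstPageBelow(30, 5000) & @CRLF) ; prints 3551
```

With about 5000 pages the loop halves the range each time, so it needs at most 13 page loads instead of 50 or 60. The same function called with $max + 1 gives the first page of the wanted range.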

Another side question. When I was testing the code, something went kaput and I needed to throw in some Int()s when checking the variables. It works now, but nothing on the site changed, so what was the problem? It had worked before... Perhaps it is better not to question it...
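One possible explanation, offered as a guess since the script that failed is not shown: the table cells come back as text, and when both operands of a comparison are strings AutoIt compares them as text rather than as numbers. That would bite as soon as $min also held text, for example after being read from an InputBox or a GUI control (that part is an assumption). The snippet below only illustrates the difference; the values are made up.

```autoit
Local $level = "9"  ; table cells are returned as text
Local $min = "30"   ; assumption: $min arrived as text, e.g. from InputBox

; Both operands are strings, so this compares character by character ("9" vs "3").
ConsoleWrite("as text:    " & ($level >= $min) & @CRLF)

; Int() converts both sides first, so this compares 9 with 30.
ConsoleWrite("as numbers: " & (Int($level) >= Int($min)) & @CRLF)
```

Wrapping both sides in Int(), as the updated code does, removes the ambiguity regardless of where the values came from.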

---Sparkes.


You said that I should use _INetGet, but will the IE table write to array function still work on raw html? That is the main source of this code's life. Could you perhaps give an example of using _INetGet for this?

Actually, I didn't say you should use _INetGetSource(). What I said was that you could speed up the retrieval of the HTML, but that if you were going to need the layout manager to do its job anyway, there was no point in it.

To use the DOM and the table-to-array function you need to use IE and have it render the page, and this is surely where most of your time is being spent. _INetGetSource will give you raw HTML, and you will need to parse it as plain text and create your own parsing rules.

It looks to me as though you have done the easy optimizations in this code at this point. To speed it up further and significantly, you will need to invest a lot more time in more code and more manual manipulation. You need to decide if it is worth that effort and investment of time. It would have little to do with IE at that point and everything to do with brute-force string manipulation and parsing.
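To give a rough idea of what such plain-text parsing could look like, here is a minimal sketch that pulls the text of every table cell out of an HTML string with StringRegExp. It is not a replacement for _IETableWriteToArray: _GetCells is a hypothetical helper, the sample HTML is made up, the result is a flat list with no row or table structure, and nested tables or unclosed cells would break it. In the real script the string would come from _INetGetSource (with #include <INet.au3>).

```autoit
; Hypothetical helper: returns the text of every <td> cell in $sHtml as a flat array.
; Sets @error to 1 and returns 0 if no cell is found.
Func _GetCells($sHtml)
    Local $aCells = StringRegExp($sHtml, '(?is)<td[^>]*>(.*?)</td>', 3)
    If @error Then Return SetError(1, 0, 0)
    For $i = 0 To UBound($aCells) - 1
        ; drop tags nested inside the cell, then trim leading and trailing whitespace
        $aCells[$i] = StringStripWS(StringRegExpReplace($aCells[$i], '<[^>]+>', ''), 3)
    Next
    Return $aCells
EndFunc

; Made-up sample; the real source would be _INetGetSource($sUrl).
Local $sHtml = '<table><tr><td>Username</td><td> <b>50</b> </td></tr></table>'
Local $aCells = _GetCells($sHtml)
If Not @error Then
    For $i = 0 To UBound($aCells) - 1
        ConsoleWrite($i & ': ' & $aCells[$i] & @CRLF)
    Next
EndIf
```

Which index holds the level depends entirely on the page layout, so that would have to be worked out by hand for the real pages.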

Dale


I think the only thing I am going to do with it is something like this:

Take out _IENavigate and replace it with:

$mysrc=_INetGetSource('http://www.mysite.com/blahblah.htm')
$mysrc=StringReplace($mysrc,'<img','<br')
_IEBodyWriteHTML($myie,$mysrc)

All that would do is take out all the opening img tags and turn them into line breaks, so the images wouldn't show up, right? Is this a good way to take the images out of an HTML page? Can it be done another way?

I am highly anticipating your release of the next version of IE.au3. If it is just a bit better than the last one, I'll get on my knees and cry... It is a wonderful tool for AutoIt....

Thanks for all your help.

Sparkes.

---Sparkes.


Your logic is essentially sound, but you will have a problem because _INetGetSource returns the entire webpage (including the HEAD section) and _IEBodyWriteHTML only replaces the BODY section, so you will create a page with a duplicated HEAD section.

I've PM'd you a preview of the _IEDocReadHTML and _IEDocWriteHTML functions from the next release that you could use instead.

You can also turn off the display of images (Show pictures) in IE under Tools, Internet Options, Advanced tab.
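Until those functions are available, one stopgap for the duplicated HEAD section would be to cut the source down to the contents of the BODY element before writing it. This is only a sketch: _ExtractBody is a hypothetical helper and the sample page is made up.

```autoit
; Hypothetical helper: returns only what sits between <body ...> and </body>,
; or the input unchanged if no body element is found.
Func _ExtractBody($sHtml)
    Local $aMatch = StringRegExp($sHtml, '(?is)<body[^>]*>(.*)</body>', 1)
    If @error Then Return $sHtml
    Return $aMatch[0]
EndFunc

; Made-up sample page to show the effect.
Local $sPage = '<html><head><title>t</title></head><body><p>hello</p></body></html>'
ConsoleWrite(_ExtractBody($sPage) & @CRLF) ; prints <p>hello</p>

; In the script from this thread it would be used roughly like this:
;   $mysrc = _ExtractBody(_INetGetSource('http://www.mysite.com/blahblah.htm'))
;   _IEBodyWriteHTML($myie, $mysrc)
```

The StringReplace on '<img' could still be applied to the extracted string before it is written.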

Dale
