HTTP Scraping / String manipulation


Hi, I need some help with string manipulation. I am trying to scrape some data from a web page into a 2-dimensional array, but I am having trouble getting the correct data into the array. The data is spread over several lines, and the fields all vary in length. Here is an example of a page that I want to scrape: http://www.tv.com/heroes/show/17552/episod...tag=nav_bar;all

I want to put the episode name into the first dimension of the array and the air date into the second dimension. I then want to iterate through the data until there are no more episode listings. The other data on each line is trash.

I am sure that the solution is StringRegExp(), but I have not been able to create a working pattern to use with the function. I am using _IEBodyReadText to scrape the page, because the HTML code was far too large to work with. Any help would be greatly appreciated.


I'd suggest your solution is _IETableWriteToArray.

Dale
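
For anyone following along, here is a minimal sketch of what that suggestion might look like. It is not from the original post: the URL, the table index, and the column positions are all placeholders chosen for illustration, and the real page will need its own values. As far as I recall, _IETableWriteToArray returns the array as [column][row] by default, so the sketch passes True to transpose it to [row][column].

```autoit
#include <IE.au3>

; Hypothetical URL: substitute the page you are scraping.
Local $oIE = _IECreate("http://www.example.com/episodes.html")

; Hypothetical index: _IETableGetCollection uses a 0-based table index.
Local $oTable = _IETableGetCollection($oIE, 0)
If @error Then
    ConsoleWrite("Table not found" & @CRLF)
    Exit
EndIf

; True transposes the result so that it is indexed [row][column].
Local $aTable = _IETableWriteToArray($oTable, True)

; Assumed layout: column 0 = episode name, column 1 = air date.
; Skip the printout if the table has fewer than two columns.
If UBound($aTable, 2) >= 2 Then
    For $iRow = 0 To UBound($aTable, 1) - 1
        ConsoleWrite($aTable[$iRow][0] & " | " & $aTable[$iRow][1] & @CRLF)
    Next
EndIf
```

Reading the table cell by cell avoids having to write a StringRegExp pattern against the flattened body text at all.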

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble


Tables are often nested. You need to request data from the correct one... there are 7 tables on that page.

Dale
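
One way to find the correct table is to dump each one in turn and inspect it by eye. The sketch below is only an illustration under assumptions: the URL is a placeholder, and it relies on _IETableGetCollection returning the whole collection when no index is given, with the table count in @extended.

```autoit
#include <IE.au3>
#include <Array.au3>

; Hypothetical URL: substitute the page you are scraping.
Local $oIE = _IECreate("http://www.example.com/episodes.html")

; With no index, the whole collection is returned and @extended holds the count.
Local $oTables = _IETableGetCollection($oIE)
Local $iTables = @extended
ConsoleWrite("Tables found: " & $iTables & @CRLF)

; Show every table so the right 0-based index can be picked by inspection.
For $i = 0 To $iTables - 1
    Local $oTable = _IETableGetCollection($oIE, $i)
    Local $aData = _IETableWriteToArray($oTable, True)
    _ArrayDisplay($aData, "Table index " & $i)
Next
```

Because the index is 0-based, the table shown as "Table index 3" is the fourth table on the page.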

Edited by DaleHohm


Cool, I didn't think about that. Turns out it was the 4th table: $array = _IETableGetCollection ($iePage, 4)

I only have one other question now. I have data such as "7/29/2008" or "10/2/2004" (no quotes), and I need to extract the individual numbers so that I can process them as a date, e.g. 2008 29 7. Any ideas on how to do this? In C++ this was a snap, but I am having trouble coming to terms with the lack of explicit data types like chars. Thanks again.
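
A possible approach to the date question, offered as a sketch and not as the thread's answer: StringSplit breaks the string on "/" and Number converts each piece. The variable names are made up for illustration, and the input is assumed to be in month/day/year order as in the examples above.

```autoit
; Split a date such as "7/29/2008" (assumed month/day/year) into numbers.
Local $sDate = "7/29/2008"
Local $aParts = StringSplit($sDate, "/")

; $aParts[0] holds the number of pieces; the pieces themselves start at index 1.
If $aParts[0] = 3 Then
    Local $iMonth = Number($aParts[1])
    Local $iDay = Number($aParts[2])
    Local $iYear = Number($aParts[3])

    ; Year, day, month, as in the example in the question.
    ConsoleWrite($iYear & " " & $iDay & " " & $iMonth & @CRLF)

    ; Zero-padded YYYY/MM/DD string, which sorts correctly as text.
    ConsoleWrite(StringFormat("%04d/%02d/%02d", $iYear, $iMonth, $iDay) & @CRLF)
Else
    ConsoleWrite("Unexpected date format: " & $sDate & @CRLF)
EndIf
```

AutoIt variables are variants, so there is no char type to index into; splitting on the delimiter and converting each piece with Number takes the place of the character-level handling you would write in C++.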
