Pull information from Wikipedia

XeroFx · April 18, 2012

Ok, so I'm working on a school project about companies and their headquarters locations. The goal is to create either a heat map or a cluster map showing where companies are headquartered, founded, etc.

I have used Wikipedia to get the company names; what I'm looking to do is pull the headquarters location out of each company's Wikipedia page.

Now, I know that I'm going to need to use something like what somdcomputerguy posted:

#include <INet.au3>
$aPasses = StringRegExp(_INetGetSource('http://www.generate-password.com'),"value=........", 3)
MsgBox(0, "Generated Passwords", @TAB & StringReplace($aPasses[0],'value="', "") & ' : ' & StringReplace($aPasses[1],'value="', ""))

However, I will modify the script to take what's in my clipboard and use it as the "website" in the source field (when I press a hotkey).

That part I've got down. The issue I'm having is that I don't fully understand how StringRegExp works.

Here is an example Wikipedia page that I would like to pull information out of: Wikipedia:AGCO

I took a gander at the page structure, and there is no specific name for the headquarters field other than "Headquarters". However, the string that follows it can differ a lot from page to page, in both letters and punctuation.

Information:

<tr class="">
<th scope="row" style="text-align: left;">Headquarters</th>
<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>
</tr>

The information that I would need is: "Duluth, Georgia, USA"

If someone could point me in a direction that would help me understand this better, possibly with some examples, or even write up a hint on what I need to do to get the headquarters working, I am pretty confident that I can get the others working.

Thank you, if you need more information please let me know!

Edited April 18, 2012 by XeroFx

Realm · April 18, 2012

Hello XeroFx,

I am no expert with SRE either, and there is probably a better way to gather your information. However, after testing this SRE on the full Wiki page that you provided, this example worked as expected:

#include <Array.au3> ; needed for _ArrayDisplay

; Sample infobox row, one HTML line per @CRLF-terminated line
$text = '<tr class="">' & @CRLF _
    & '<th scope="row" style="text-align: left;">Headquarters</th>' & @CRLF _
    & '<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>' & @CRLF _
    & '</tr>'

; \r\n matches the @CRLF line breaks; each (.*?) captures the text of one link
$sre = StringRegExp($text, '<tr class="">\r\n<th (?:.*?)>Headquarters</th>\r\n<td (?:.*?)" title="(?:.*?)">(.*?)</a>, <a href="(?:.*?)" title="(?:.*?)">(.*?)</a>, <a href="(?:.*?)" title="(?:.*?)" class="(?:.*?)">(.*?)</a></td>\r\n</tr>', 3)
If @error Then ConsoleWrite( '- Error: ' & @error &', Extended: ' & @extended & @LF )

_ArrayDisplay($sre)
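
A possible variation on the pattern above, offered only as a sketch: since the number of links after "Headquarters" can differ from page to page, one option is to capture the whole Headquarters cell first and then strip the tags from it. This assumes the infobox row looks like the snippet XeroFx posted (a <th> containing Headquarters followed by a single <td>). The function name _GetHeadquarters is made up for this example and is not part of any UDF.

```autoit
; Hypothetical helper: returns the plain-text Headquarters value, or "" if the row is not found.
Func _GetHeadquarters($sHtml)
    ; (?s) lets . match line breaks, so the cell may span several lines.
    ; Group 1 captures everything inside the <td> that follows the Headquarters header.
    Local $aCell = StringRegExp($sHtml, '(?s)<th[^>]*>Headquarters</th>\s*<td[^>]*>(.*?)</td>', 1)
    If @error Then Return SetError(1, 0, "")
    ; Remove every remaining tag, such as <a ...> and </a>.
    Return StringRegExpReplace($aCell[0], '<[^>]+>', '')
EndFunc

Local $sRow = '<tr class="">' & @CRLF _
        & '<th scope="row" style="text-align: left;">Headquarters</th>' & @CRLF _
        & '<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>' & @CRLF _
        & '</tr>'

ConsoleWrite(_GetHeadquarters($sRow) & @CRLF) ; Duluth, Georgia, USA
```

Because the tags are stripped after the match, the same call should work whether the cell holds one link or five, as long as the row keeps this general shape.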

XeroFx · April 18, 2012

Thank you for such a quick response, I'll test this as soon as I get home tomorrow!

So, when you are not sure of the length of an item, you use:

(?:.*?)

correct?

Realm · April 18, 2012

Not exactly.

?: at the start of a group tells SRE not to include what that group matches in your results.

. will match any single character except a newline.

* tells it to repeat the previous criteria zero or more times, in this case more single characters.

? when placed after a repeating character such as *, will find the smallest match.
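
To make the effect of ?: concrete, here is a minimal sketch that runs the same kind of pattern twice, once with two capturing groups and once with the first group made non-capturing. The sample string and the variable names are made up for illustration.

```autoit
Local $sLink = 'title="Duluth, Georgia">Duluth</a>'

; Both groups capture, so the array holds the title text and the link text.
Local $aBoth = StringRegExp($sLink, 'title="(.*?)">(.*?)</a>', 1)
ConsoleWrite(UBound($aBoth) & ': ' & $aBoth[0] & ' | ' & $aBoth[1] & @CRLF) ; 2: Duluth, Georgia | Duluth

; The first group is non-capturing, so only the link text is returned.
Local $aOne = StringRegExp($sLink, 'title="(?:.*?)">(.*?)</a>', 1)
ConsoleWrite(UBound($aOne) & ': ' & $aOne[0] & @CRLF) ; 1: Duluth
```

In both cases the title text still has to be matched; ?: only changes what ends up in the returned array.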

Edit: Extra info:

If we didn't include the ending '?', it would have given us the largest possible match and, in your case, unexpected results.

For example:

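#include <Array.au3> ; needed for _ArrayDisplay

$text = 'Test Text<Need This Text>and <Do not Need this text>'

; Greedy: (.*) runs to the last '>' in the string
$SRE = StringRegExp($text, 'Test Text<(.*)>', 1)
_ArrayDisplay($SRE)

; Lazy: (.*?) stops at the first '>' it can
$SRE = StringRegExp($text, 'Test Text<(.*?)>', 1)
_ArrayDisplay($SRE)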

The first example returns: Need This Text>and <Do not Need this text

When we instruct it to return the shortest match by adding the '?' after the repeating character '*',

we get: Need This Text
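
As a side note that is not part of the example above: a negated character class is another common way to stop at the first '>'. The sketch below assumes the wanted text never contains a '>' itself; the variable names are made up for illustration.

```autoit
Local $sSample = 'Test Text<Need This Text>and <Do not Need this text>'

; [^>]* matches any run of characters that are not '>', so it cannot run past the first closing bracket.
Local $aShort = StringRegExp($sSample, 'Test Text<([^>]*)>', 1)
ConsoleWrite($aShort[0] & @CRLF) ; Need This Text
```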

Edited April 18, 2012 by Realm
