Pull information from Wikipedia


OK, so I'm working on a school project about companies and where their headquarters are located. The goal is to create either a heat map or a cluster map showing where companies are headquartered, where they were created, etc.

I have used Wikipedia to get the company names; what I'm looking to do now is pull the headquarters location out of each company's Wikipedia page.

Now, I know that I'm going to need to use something like what somdcomputerguy posted:

#include <INet.au3>
$aPasses = StringRegExp(_INetGetSource('http://www.generate-password.com'),"value=........", 3)
MsgBox(0, "Generated Passwords", @TAB & StringReplace($aPasses[0],'value="', "") & ' : ' & StringReplace($aPasses[1],'value="', ""))

However, I will modify the script to take what's in my clipboard and use it as the "website" in the source field (when I press a hotkey).

That part I've got down. The issue I'm having is that I don't fully understand how StringRegExp works.

Here is an example Wikipedia page that I would like to pull information out of: Wikipedia:AGCO

I took a gander at the page structure, and there is no specific name for the headquarters row other than the "Headquarters" header text itself. The string after it, however, can differ by MANY different letters and marks.

Information:

<tr class="">
<th scope="row" style="text-align: left;">Headquarters</th>
<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>
</tr>

The information that I would need is: "Duluth, Georgia, USA"

If someone could point me in a direction that would help me understand this better, possibly with some examples, or even write up a hint about what I need to do to get the headquarters working, I am pretty confident that I can get the other fields working.

Thank you. If you need more information, please let me know!

Edited by XeroFx

Hello XeroFx,

I am no expert with SRE either, and there is probably a better way to gather your information. However, after testing this SRE on the full Wikipedia page that you provided, this example worked as expected:

#include <Array.au3> ; needed for _ArrayDisplay

$text = '<tr class="">' & @CRLF _
    & '<th scope="row" style="text-align: left;">Headquarters</th>' & @CRLF _
    & '<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>' & @CRLF _
    & '</tr>'

; \r\n matches the @CRLF between the rows. The (?:.*?) groups are matched but not returned,
; so only the three (.*?) link texts end up in the result array.
$sre = StringRegExp($text, '<tr class="">\r\n<th (?:.*?)>Headquarters</th>\r\n<td (?:.*?)" title="(?:.*?)">(.*?)</a>, <a href="(?:.*?)" title="(?:.*?)">(.*?)</a>, <a href="(?:.*?)" title="(?:.*?)" class="(?:.*?)">(.*?)</a></td>\r\n</tr>', 3)
If @error Then ConsoleWrite( '- Error: ' & @error &', Extended: ' & @extended & @LF )

_ArrayDisplay($sre)
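Since the text after "Headquarters" can differ a lot from page to page, a looser approach may hold up better than matching every link one by one. The following is only a sketch, assuming the row always has the `<th>Headquarters</th>` / `<td>...</td>` shape shown above; the variable names ($sHtml, $aCell, $sLocation) are made up for illustration. It first grabs everything inside the cell, then strips whatever tags are left:

```autoit
; Same sample HTML as in the question (variable names here are hypothetical).
Local $sHtml = '<tr class="">' & @CRLF _
    & '<th scope="row" style="text-align: left;">Headquarters</th>' & @CRLF _
    & '<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>' & @CRLF _
    & '</tr>'

; Step 1: capture everything inside the <td> that follows the Headquarters <th>.
; (?s) lets "." match line breaks, \s* skips the whitespace between the two tags,
; and [^>]* skips whatever attributes the <td> tag carries.
Local $aCell = StringRegExp($sHtml, '(?s)>Headquarters</th>\s*<td[^>]*>(.*?)</td>', 1)
If @error Then
    ConsoleWrite('- No Headquarters row found, @error = ' & @error & @LF)
Else
    ; Step 2: remove every remaining tag, leaving only the visible text.
    Local $sLocation = StringRegExpReplace($aCell[0], '<[^>]*>', '')
    ConsoleWrite($sLocation & @LF) ; for the sample above: Duluth, Georgia, USA
EndIf
```

Because the pattern no longer counts the links, it should not matter whether a page lists two places or four, as long as the cell itself is laid out the same way.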

My Contributions: Unix Timestamp: Calculate Unix time, or seconds since Epoch, accounting for your local timezone and daylight savings time. RegEdit Jumper: A Small & Simple interface based on Yashied's Reg Jumper Function, for searching Hives in your registry. 


Not exactly

(?: ... ) makes a non-capturing group: SRE still has to match what is inside the parentheses, but does not include it in your results.

. will match any single character except a newline character.

* tells it to repeat the previous criteria zero or more times, in this case more single characters.

? when placed after a repeating character such as *, makes it find the smallest possible match.
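To see what the ?: changes, here is a small sketch (the sample string and variable names are made up for illustration) that runs the same pattern with and without it:

```autoit
#include <Array.au3> ; needed for _ArrayDisplay

Local $sText = 'title="Duluth, Georgia">Duluth</a>'

; Both groups capture: the array holds the title AND the link text.
Local $aBoth = StringRegExp($sText, 'title="(.*?)">(.*?)</a>', 3)
_ArrayDisplay($aBoth, 'Two capturing groups')

; The first group is non-capturing: only the link text is returned.
Local $aOne = StringRegExp($sText, 'title="(?:.*?)">(.*?)</a>', 3)
_ArrayDisplay($aOne, 'One capturing group')
```

The first array should contain two elements ("Duluth, Georgia" and "Duluth"), the second only "Duluth".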

Edit: Extra info:

If we didn't include the ending '?' it would have given us the largest possible match and in your case unexpected results.

For Example:

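#include <Array.au3> ; needed for _ArrayDisplay

$text = 'Test Text<Need This Text>and <Do not Need this text>'

; Greedy: (.*) runs on to the LAST '>' in the string
$SRE = StringRegExp($text, 'Test Text<(.*)>', 1)
_ArrayDisplay($SRE)

; Lazy: (.*?) stops at the FIRST '>' it can
$SRE = StringRegExp($text, 'Test Text<(.*?)>', 1)
_ArrayDisplay($SRE)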

The first example returns: Need This Text>and <Do not Need this text

When we instruct it to return the shortest match by adding the '?' after the repeating character '*',

we get: Need This Text
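Another common way to get the same short match, offered here only as an alternative sketch and not as part of the pattern above, is a negated character class: [^>]* means "any run of characters that are not '>'", so it cannot run past the first closing bracket even without the lazy '?'. The variable names are made up for illustration.

```autoit
Local $sText = 'Test Text<Need This Text>and <Do not Need this text>'

; [^>]* is greedy, but it can never consume a '>', so it stops at the first one.
Local $aMatch = StringRegExp($sText, 'Test Text<([^>]*)>', 1)
If Not @error Then ConsoleWrite($aMatch[0] & @LF) ; prints: Need This Text
```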

Edited by Realm

