Posted (edited)

OK, so I'm working on a school project about companies and their headquarters locations. The goal is to create either a heat map or a cluster map showing where companies are headquartered, where they were created, etc.

I have used Wikipedia to get the company names; what I'm looking to do now is pull the headquarters location out of each company's Wikipedia page.

Now, I know that I'm going to need to use something like what somdcomputerguy posted:

#include <INet.au3>
$aPasses = StringRegExp(_INetGetSource('http://www.generate-password.com'),"value=........", 3)
MsgBox(0, "Generated Passwords", @TAB & StringReplace($aPasses[0],'value="', "") & ' : ' & StringReplace($aPasses[1],'value="', ""))

However, I will modify the script to take what's in my clipboard and use it as the URL passed to _INetGetSource (when I press a hotkey).

That part I have down; the issue I'm having is that I don't fully understand how StringRegExp works.

Here is an example Wikipedia page that I would like to pull information out of: Wikipedia:AGCO

I took a gander at the page structure, and the headquarters row has no specific name or id other than the "Headquarters" header text. The string after it, however, can differ by many letters and marks from page to page.

Information:

<tr class="">
<th scope="row" style="text-align: left;">Headquarters</th>
<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>
</tr>

The information that I would need is: "Duluth, Georgia, USA"

If someone could point me in a direction that would help me understand this better, possibly with some examples, or even write up a hint about what I need to do to get the headquarters working, I am pretty confident that I can get the other fields working.

Thank you, if you need more information please let me know!

Edited by XeroFx
Posted

Hello XeroFx,

I am no expert with SRE (StringRegExp) either, and there is probably a better way to gather your information. However, after testing this SRE on the full Wiki page that you provided, this example worked as expected:

#include <Array.au3> ; needed for _ArrayDisplay

$text = '<tr class="">' & @CRLF _
    & '<th scope="row" style="text-align: left;">Headquarters</th>' & @CRLF _
    & '<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>' & @CRLF _
    & '</tr>'

; \r\n matches the @CRLF line breaks in $text; flag 3 returns an array of all captured groups
$sre = StringRegExp($text, '<tr class="">\r\n<th (?:.*?)>Headquarters</th>\r\n<td (?:.*?)" title="(?:.*?)">(.*?)</a>, <a href="(?:.*?)" title="(?:.*?)">(.*?)</a>, <a href="(?:.*?)" title="(?:.*?)" class="(?:.*?)">(.*?)</a></td>\r\n</tr>', 3)
If @error Then ConsoleWrite( '- Error: ' & @error &', Extended: ' & @extended & @LF )

_ArrayDisplay($sre)
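
The pattern above expects exactly three links and @CRLF line endings, so it may not match pages where the headquarters cell is laid out differently. A less strict variant, offered only as a sketch, is to capture the whole Headquarters cell first and then strip the tags from it. The variable names ($sHtml, $aCell, $sHQ) are made up for this example, and the sample string stands in for what _INetGetSource would return; real Wikipedia markup may differ from it.

```autoit
; Sample markup copied from the question; in practice this would come from _INetGetSource().
Local $sHtml = '<tr class="">' & @CRLF _
    & '<th scope="row" style="text-align: left;">Headquarters</th>' & @CRLF _
    & '<td class="label" style=""><a href="/wiki/Duluth,_Georgia" title="Duluth, Georgia">Duluth</a>, <a href="/wiki/Georgia_%28U.S._state%29" title="Georgia (U.S. state)">Georgia</a>, <a href="/wiki/USA" title="USA" class="mw-redirect">USA</a></td>' & @CRLF _
    & '</tr>'

Local $sHQ = ''

; (?s) lets . match line breaks; \s* absorbs whatever whitespace sits between </th> and <td>.
; Flag 1 returns an array with the captured group of the first match.
Local $aCell = StringRegExp($sHtml, '(?s)>Headquarters</th>\s*<td[^>]*>(.*?)</td>', 1)
If @error Then
    ConsoleWrite('Headquarters row not found' & @CRLF)
Else
    ; Remove every <...> tag so only the visible text of the cell is left.
    $sHQ = StringRegExpReplace($aCell[0], '<[^>]*>', '')
    ConsoleWrite($sHQ & @CRLF) ; prints: Duluth, Georgia, USA
EndIf
```

Because the tags are removed after the cell is captured, the same two steps should work whether the cell holds one link, three, or plain text.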

My Contributions: Unix Timestamp: Calculate Unix time, or seconds since Epoch, accounting for your local timezone and daylight savings time. RegEdit Jumper: A Small & Simple interface based on Yashied's Reg Jumper Function, for searching Hives in your registry. 

Posted

Thank you for such a quick response, I'll test this as soon as I get home tomorrow!

So, when you are not sure of the length of an item, you use:

(?:.*?)

correct?

Posted (edited)

Not exactly

?: Placed right after an opening parenthesis, it makes the group non-capturing: SRE still has to match what is inside the group, but does not include it in your results.

. Matches any single character except a newline (@LF).

* Tells it to repeat the previous criteria zero or more times, in this case more single characters.

? When placed after a repeating character such as *, makes it find the smallest possible match.
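
To see what the ?: changes in practice, here is a small sketch that runs the same pattern twice, once with a normal group and once with a non-capturing one. The sample string and the variable names ($sLine, $aBoth, $aOne) are invented for the illustration.

```autoit
Local $sLine = 'title="Duluth, Georgia">Duluth</a>'

; Both groups capture, so both pieces come back in the array.
Local $aBoth = StringRegExp($sLine, 'title="(.*?)">(.*?)</a>', 1)
ConsoleWrite(UBound($aBoth) & ': ' & $aBoth[0] & ' | ' & $aBoth[1] & @CRLF) ; 2: Duluth, Georgia | Duluth

; The first group is non-capturing: it still has to match, but only the link text is returned.
Local $aOne = StringRegExp($sLine, 'title="(?:.*?)">(.*?)</a>', 1)
ConsoleWrite(UBound($aOne) & ': ' & $aOne[0] & @CRLF) ; 1: Duluth
```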

Edit: Extra info:

If we didn't include the ending '?' it would have given us the largest possible match and in your case unexpected results.

For Example:

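#include <Array.au3> ; needed for _ArrayDisplay

$text = 'Test Text<Need This Text>and <Do not Need this text>'

; Greedy: .* grabs as much as it can before the last '>'
$SRE = StringRegExp($text, 'Test Text<(.*)>', 1)
_ArrayDisplay($SRE)

; Lazy: .*? stops at the first '>'
$SRE = StringRegExp($text, 'Test Text<(.*?)>', 1)
_ArrayDisplay($SRE)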

The first example returns: Need This Text>and <Do not Need this text

When we instruct it to return the shortest match by adding the '?' after the repeating character '*', we get: Need This Text
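
Another way to get the short match in this example, shown only as an alternative sketch, is to replace the dot with a negated character class. [^>] means "any character that is not >", so the repetition cannot run past the first closing bracket even without the trailing '?'. The variable names ($sText, $aMatch) are made up for the example.

```autoit
Local $sText = 'Test Text<Need This Text>and <Do not Need this text>'

; [^>]* matches a run of characters that are not '>', so it stops at the first '>'.
Local $aMatch = StringRegExp($sText, 'Test Text<([^>]*)>', 1)
If Not @error Then ConsoleWrite($aMatch[0] & @CRLF) ; prints: Need This Text
```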

Edited by Realm
