
StringRegExp



I know there's already a good dictionary out there, but I wanted to make my own. I am fairly new at StringRegExp and I am having trouble with it. I used Expresso and it seemed to turn out ok there, but it doesn't seem to work in my script. I am trying to keep all lines with:

1. a single digit number followed by a period

2. a two digit number followed by a period

3. a letter followed by a period (for subdefinitions)

4. the first two characters are "--" (for the part of speech)

Please help me point out the problem here.

#include <Array.au3>
$Word = "test"
$IE = ObjCreate("InternetExplorer.Application")
If Not IsObj($IE) Then
    MsgBox(0, "ERROR", "Could not create the InternetExplorer object.")
    Exit
EndIf
$IE.navigate("http://dictionary.reference.com/browse/" & $Word)
Do
    Sleep(500)
Until $IE.document.readyState = "complete"
$text = $IE.document.body.innertext
$text = StringTrimLeft($text, StringInStr($text, "Show IPA") + 7)
$text = StringTrimRight($text, StringLen($text) - StringInStr($text, "Dictionary.com Unabridged") + 1)
$Array = StringSplit($text, @CR)
$x = 2

While 1
    If $x = UBound($Array) Then ExitLoop
    $Temp = StringStripWS($Array[$x], 8)
    If Not StringRegExp($Temp, "^(--|\d\.|\d\d\.|[a-zA-Z])") Then
        _ArrayDelete($Array, $x)
    Else
        $x += 1
    EndIf
WEnd

_ArrayDisplay($Array)

MsgBox(0,"",StringRegExp("1.",'[0-9A-Za-z][0-9.]\.?|^--'))
MsgBox(0,"",StringRegExp("12.",'[0-9A-Za-z][0-9.]\.?|^--'))
MsgBox(0,"",StringRegExp("A.",'[0-9A-Za-z][0-9.]\.?|^--'))
MsgBox(0,"",StringRegExp("--",'[0-9A-Za-z][0-9.]\.?|^--'))

This matches all your cases, but I expect it is actually a little sloppy. It covers exactly what you asked for, but it would also match things you didn't ask for.

The -- has to be at the start of the input; that's what the ^ before it denotes. As for the rest of them, you didn't say anything about them being at the start of the line, so they can match anywhere in it. If the regex doesn't make sense, let me know and I would be happy to help break it down. If it is too sloppy, you will have to give us some better example cases with more specific rules.
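If the four rules are meant to apply only at the start of each line, a tighter pattern might look like the sketch below. The sample lines are made up for illustration and are not taken from the dictionary page. Note that in the pattern from the first post the letter branch `[a-zA-Z]` has no `\.` after it, so any line beginning with a letter passes, which may be why nothing seemed to get filtered.

```autoit
; Hypothetical sample lines, not real page output.
Local $aSamples[6] = ["1. a trial", "12. an exam", "a. a subsense", "--noun", "testing 1.", "about"]

; Anchored at the start: "--", one or two digits plus a period,
; or a single letter plus a period.
Local $sPattern = "^(?:--|\d{1,2}\.|[A-Za-z]\.)"

For $i = 0 To UBound($aSamples) - 1
    ; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
    ConsoleWrite($aSamples[$i] & " -> " & StringRegExp($aSamples[$i], $sPattern) & @CRLF)
Next
```

The first four samples should report 1 and the last two 0, because "testing 1." has its number at the end and "about" has no period after the first letter.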

AutoIt changed my life.


"(\d{1,2}|\--)\.\s.*\r"
This would require the . even after the --. I really have no idea what he is after without examples, but "--." was not among the four rules he gave. Also, the \r requires a carriage return after every line, so the pattern would not match a line that is the last one on the page. $ is the end-of-line anchor and might be more appropriate for web parsing.

But again, I don't know...
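To illustrate the trailing `\r` point, here is a small sketch on a made-up two-line string. The second pattern accepts either a carriage return or the end of the text (`\z`); that is just one possible way to handle the last line, not necessarily the best one.

```autoit
; Hypothetical input: two numbered lines, no carriage return after the last one.
Local $sText = "1. first meaning" & @CR & "2. last meaning"

; Requires a literal carriage return after every line.
Local $aStrict = StringRegExp($sText, "\d{1,2}\.\s[^\r\n]*\r", 3)
; Accepts a carriage return or the end of the string.
Local $aLoose = StringRegExp($sText, "\d{1,2}\.\s[^\r\n]*(?:\r|\z)", 3)

; Flag 3 returns an array of all matches.
ConsoleWrite("strict: " & UBound($aStrict) & @CRLF) ; 1, the last line is missed
ConsoleWrite("loose:  " & UBound($aLoose) & @CRLF)  ; 2
```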




1. The thing you think is a hyphen before the part of speech is actually character 150 (an en dash in the Windows-1252 code page, not plain ASCII), and some of the "periods" are character 183 (a middle dot).

2. I never had more than one char 150, but made an exception in the code below.

You could shorten everything quite a bit I think:

#include <Array.au3>
#include <IE.au3>

Global $s_word = "test"
Global $o_ie = _IECreate("http://dictionary.reference.com/browse/" & $s_word, 0, 0)
Global $s_text = StringRegExpReplace(_IEBodyReadText($o_ie), _
                    "(?i)(?s)(.*?Show IPA)(.*?)(Dictionary\.com Unabridged.*?)\z", "\2")
_IEQuit($o_ie)
Global $a_result = StringRegExp($s_text, "(?:\A|\v)((?:(?:–|-)+\w|\d+(?:\xB7|\.)|[a-zA-Z](?:\xB7|\.)).+?)\v", 3)
_ArrayDisplay($a_result)

Edit:

BTW, for some odd reason, I couldn't get \x96 to work for decimal 150!
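A possible explanation, offered as a guess rather than a confirmed diagnosis: AutoIt strings are Unicode, where the en dash is U+2013 (decimal 8211), and 150 is only its code in the Windows-1252 code page, which is what Asc() would report on such a system. The sketch below inspects the character both ways and builds the character class from ChrW() instead of typing the dash into the pattern.

```autoit
; Build an en dash from its Unicode code point.
Local $sDash = ChrW(0x2013)

ConsoleWrite("Asc:  " & Asc($sDash) & @CRLF)  ; ANSI code, depends on the system code page
ConsoleWrite("AscW: " & AscW($sDash) & @CRLF) ; Unicode code point, 8211

; Hypothetical part-of-speech line starting with an en dash.
Local $sLine = $sDash & "noun"

; Character class containing the en dash and a plain hyphen
; (the hyphen is last in the class, so it is taken literally).
Local $sPattern = "^[" & $sDash & "-]+\w"
ConsoleWrite(StringRegExp($sLine, $sPattern) & @CRLF)
```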

Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.
