
Robjong

Active Members
  • Posts

    315
  • Joined

  • Last visited

  • Days Won

    1

Robjong last won the day on January 31 2013

Robjong had the most liked content!

Profile Information

  • Location
    The Netherlands

Recent Profile Visitors

499 profile views

Robjong's Achievements

Universalist

Universalist (7/7)

13

Reputation

  1. Hi Robdog1955, What is the exact problem you encounter? Do you get an error message or just unexpected results? What have you tried so far? Can you share your code? Ideally you would provide us a link to the webpage in question. If that is not an option, post the (redacted) source or at least see if you can identify the editor. Regards, Rob
  2. Hi, I would suggest also looking into the _IE* functions to retrieve data. If you are going the InetGet route and feel adventurous maybe take a look at StringRegExp. What kind of games are you talking about? Tabletop/PC/Console? Old/New?
  3. @kylomas Any particular reason you went with 'Amount Due=((?:\d+)?\.\d+)' instead of something like this 'Amount\h+Due=(\d*\.\d+)' ? While we are at it we may as well include negative amounts and match case insensitive. '(?i)Amount\h*Due=(-?\d*\.\d+)'
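     A quick way to see what the last pattern accepts is to run it against a few inputs. This is only a sketch: the sample strings are made up for illustration and do not come from the original thread; only the pattern is taken from the post above.

     ```autoit
     ; Hypothetical sample inputs; only the pattern comes from the post above.
     Local $aSamples[4] = ['Amount Due=12.50', 'amount due=.75', 'Amount  Due=-3.10', 'AmountDue=8']
     Local $aMatch

     For $i = 0 To UBound($aSamples) - 1
         ; Flag 1 returns an array with the captured group and sets @error when nothing matches.
         $aMatch = StringRegExp($aSamples[$i], '(?i)Amount\h*Due=(-?\d*\.\d+)', 1)
         If @error Then
             ConsoleWrite($aSamples[$i] & ' -> no match' & @CRLF)
         Else
             ConsoleWrite($aSamples[$i] & ' -> ' & $aMatch[0] & @CRLF)
         EndIf
     Next
     ```

     The last sample has no decimal part, so it should not match: the pattern still requires a dot followed by at least one digit.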
  4. Exactly. Added an explanation of the pattern to my post, as I said I would.
  5. It's a pretty basic pattern, and I just wrote it, no special tools. I will explain the pattern later today when I have more time.

     [            opens a character class
     [:alpha:]    matches uppercase and lowercase letters (POSIX character class)
     .,: $        matches dot, comma, colon, space or dollar sign
     ]            closes the character class
     (            opens a capturing group
     [\d.]+       matches digits or dots, 1 or more times
     )            closes the capturing group

     Look in the help file for StringRegExp, see flag 4 for return values: it returns an array of arrays, not a 2D array.
  6. As ugly as the sample text, but it should work (for the sample at least):

         $s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF
         $s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF
         $s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF
         $s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF
         $s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF
         $s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF
         $s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF
         $s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF
         $s &= 'Invoiced: $32.00 Paid: $30.00' & @CRLF
         $s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF

         ;~ $a = StringRegExp($s, '(?im)^.*\bin[voiced]{0,6}(?:\h+[a-z]+\b)?\W*?\K(?:\$\h*)?\d+(?:[,.]\d{1,2})?(?=.*\bp[aid]{1,3}\W*?(\$?\h*\d+(?:[,.]\d{1,2})?))', 4) ; strict as per the sample text
         $a = StringRegExp($s, '(?im)^.*\bin[voiced]{0,6}(?:\h+[a-z]+\b)?\W*?\$?\h*\K\d+(?:[,.]\d{1,2})?(?=.*\bp[aid]{1,3}\W*?\$?\h*(\d+(?:[,.]\d{1,2})?))', 4) ; suggested pattern, does not capture $

         For $i = 0 To UBound($a) - 1 Step 1
             $b = $a[$i]
             ConsoleWrite(StringFormat('Invoiced: %.2f Paid: %.2f\n', $b[0], $b[1]))
         Next

     To understand this pattern we break it down.

     (?im)
         Options: i = case-insensitive matching, m = ^ and $ also match the start of a line and the end of a line respectively.
     ^.*
         ^ matches the start of the line. .* matches any character except line feed 0 or more times. This anchors our match to a line.
     \b
         \b matches a word boundary, which is a point between a non-word and a word character. This helps to avoid 'in' being part of another word.
     in[voiced]{0,6}
         'in' matches the start of the invoice-like word we want to precede the amount. [voiced]{0,6} matches v, o, i, c, e or d, 0 to 6 times.
     (?:
         ( opens a group, ?: makes it non-capturing. This group allows one word after 'invoiced'.
     \h+[a-z]+
         \h+ matches 1 or more horizontal spaces. [a-z]+ matches any letter, 1 or more times.
     \b
         \b matches a word boundary. This makes sure that, if this word is present, one of the following optional parts (\W*?, \$? or \h*) matches.
     )?
         ) closes the group. ? makes it optional, match 0 or 1 times.
     \W*?\$?\h*
         \W* matches 0 or more non-word characters, ? makes the match lazy, as small as possible. \$? optionally matches a dollar sign: \ escapes the dollar sign and, again, the ? makes it optional. \h* matches 0 or more horizontal spaces. This allows for the spaces, dots and colons seen in the sample text, as well as the dollar sign and space.
     \K
         This is an interesting one, and it's not in the AutoIt help file (yet). It resets our global match: everything matched so far in the global match is discarded. Because we only match the first amount in the global match from here on, only the first amount will be in the global match ($b[0]). We capture the second amount in a lookahead, which does not end up in the global match. In short, \K basically turns the part of the pattern before it into a less restricted lookbehind.
     \d+(?:[,.]\d{1,2})?
         \d+ matches 1 or more digits. (?: opens a non-capturing group. [,.] matches comma or dot. \d{1,2} matches 1 or 2 digits. ) closes the group. ? makes it optional. This matches the first amount, e.g. 1, 23, 45.67 or 78,90.
     (?=
         ( opens a group, ?= makes it a positive lookahead, meaning the subpattern must match the subject ahead of this point.
     .*\b
         .* matches any character except line feed 0 or more times. \b matches a word boundary to help avoid p (see below) being part of another word.
     p[aid]{1,3}
         p matches the letter p. [aid]{1,3} matches a, i or d, 1 to 3 times.
     \W*?\$?\h*
         Covered this already.
     (
         ( opens a capturing group, which will capture the second amount ($b[1]).
     \d+(?:[,.]\d{1,2})?
         Covered this already.
     )
         Close capturing group.
     )
         Close lookahead.
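     The effect of \K is easier to see in isolation. Here is a minimal sketch, assuming a made-up one-line subject and a cut-down version of the pattern (neither is from the original thread). It runs the same pattern with and without \K and prints the overall match of each.

     ```autoit
     ; Made-up subject for illustration.
     Local $sLine = 'Invoice $ 35.20 Paid $ 31.12'

     ; Without \K the overall match includes the leading text.
     Local $aPlain = StringRegExp($sLine, '(?i)invoice\W*?\$?\h*\d+\.\d+', 1)
     ; With \K everything matched before it is dropped from the overall match.
     Local $aReset = StringRegExp($sLine, '(?i)invoice\W*?\$?\h*\K\d+\.\d+', 1)

     ; With no capturing groups, flag 1 returns the overall match in element 0.
     If IsArray($aPlain) Then ConsoleWrite('Without \K: ' & $aPlain[0] & @CRLF)
     If IsArray($aReset) Then ConsoleWrite('With \K:    ' & $aReset[0] & @CRLF)
     ```

     The first call should report the word, dollar sign and amount together, the second only the amount.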
  7. Here you will find a decent explanation of these patterns: http://tinyurl.com/regexp-primes Edit: the original URL was not converted correctly so I had to make a tiny one
  8. Not at all, even with the full editor I had to put in indentation manually, when pasting it all disappears.
  9. I know, it's a habit I picked up not long ago, to write the Step as well. Edit: And is it me or did the code indentation issue get worse?
  10. @guinness According to that function 1 is a prime... Here is an SRE version, just for giggles.

          ConsoleWrite(_IsPrime(1) & @CRLF)
          ConsoleWrite(_IsPrime(3) & @CRLF)
          ConsoleWrite(_IsPrime(9) & @CRLF)
          ConsoleWrite(_IsPrime(29) & @CRLF)

          Func _IsPrime($iNum)
              Local $sNum = ''
              For $i = 1 To $iNum Step 1
                  $sNum &= '1'
              Next
              Return Not StringRegExp($sNum, '^(1?|(11+?)\2+)$')
          EndFunc
  11. Maybe this will help you get started...

          ConsoleWrite(_INetGetText('http://autoitscript.com') & @LF)

          Func _INetGetText($sURL)
              Local $bStr = InetRead($sURL, 19)
              If @error Then
                  Return SetError(1, 0, 0) ; download failed
              EndIf
              Local $oHTML = ObjCreate("HTMLFILE")
              If @error Then Return SetError(2, 0, 0) ; could not create the HTML document object
              $oHTML.Open()
              $oHTML.Write(BinaryToString($bStr))
              ; $oHTML....
              Return SetError(0, 0, $oHTML.Body.InnerText)
          EndFunc   ;==>_INetGetText
  12. Hi, I'm late to the party I see, but here is some additional information. The circumflex/caret has 3 functions in a regular expression, depending on the mode and position in the pattern.

      (1) By default it matches the beginning of the subject, same as \A. ($ will match the end of the subject, or just before a newline that ends the subject, same as \Z.)
      (2) If multiline mode is enabled, by setting the m flag, it matches the beginning of the subject and the start of a new line ($ will match the end of subject and end of line).
      (3) If it is used in a character class it only has a special meaning if it is the first character in the class. It will then negate the class, i.e. the class will match any character not in it.

      If you want to match a literal circumflex in a pattern it must be escaped (with a backslash, \^). In a character class it only needs to be escaped if it is the first character, because it has no special meaning at any other position. As an anchor it is also a zero-width expression, meaning it will match the position of a character (start of subject or newline) but not the character itself, i.e. it will not include/return the character(s) in the match.

          ; normal mode, beginning of subject anchor
          #cs - Match a string of only word characters (word characters are A-Z, a-z, 0-9 and _ (underscore). Equivalent to class [A-Za-z0-9_])
              ^       the start of the subject
              \w+     one or more word characters
              $       the end of the subject
          #ce
          If StringRegExp("ABCD", '^\w+$') Then
              ConsoleWrite("The string consists only of ""word"" characters." & @LF)
          Else
              ConsoleWrite("The string does NOT consist of only ""word"" characters." & @LF)
          EndIf

          ; multiline mode, beginning of subject or newline
          #cs - Match full line comments
              (?m)            m flag, enables multiline mode
              ^               the beginning of the subject or the beginning of a newline
              \h*;            0 or more horizontal whitespaces followed by a semicolon
              (?:             open non-capturing group
              [[:punct:]]\h*  non-alphanumeric printing character followed by 0 or more horizontal whitespaces
              )?              close non-capturing group, match group 0 or 1 times (? makes the group optional)
              (               open capturing group
              [^\r\n]*        match any character except for \r (CR) or \n (LF) 0 or more times (^ negates the class)
              )               close capturing group
          #ce
          $aMatches = StringRegExp(FileRead(@ScriptFullPath), '(?m)^\h*;(?:[[:punct:]]\h*)?([^\r\n]*)', 3)
          For $i = 0 To UBound($aMatches) - 1 Step 1
              ConsoleWrite("COMMENT: " & $aMatches[$i] & @LF)
          Next

      To answer your question more directly... The beginning of a subject/string is always at position 0 (before the first character). The beginning of a line is either the beginning of the subject or directly after a newline (CRLF/CR/LF). Example:

          #include <Array.au3>

          ; subject
          $sString = "This is an example subject." & @LF & "Made up of two lines."

          ; ^ singleline mode, match the first 4 characters of the subject
          #cs
              ^       start of the subject
              ....    followed by 4 characters
          #ce
          $aMatches = StringRegExp($sString, "^....", 3)
          _ArrayDisplay($aMatches, "Singleline")

          ; ^ multiline mode, match the first 4 characters of a line
          #cs
              (?m)    m flag, enables multiline mode
              ^       start of the subject or line
              ....    followed by 4 characters
          #ce
          $aMatches = StringRegExp($sString, "(?m)^....", 3)
          _ArrayDisplay($aMatches, "Multiline")

          ; singleline mode reproducer, match the first 4 characters of a subject
          #cs
              (....)      capture 4 characters that are not newline characters
              [\s\S]*     match any character (space and non-space) 0 or more times
          #ce
          $aMatches = StringRegExp($sString, "(....)[\s\S]*", 3)
          _ArrayDisplay($aMatches, "Singleline Reproducer")

          ; multiline mode reproducer, match the first 4 characters of a line
          #cs
              (?:                 open non-capturing group
              \A|\r\n|\r|\n       match the start of the subject or newline characters
              )                   close non-capturing group
              (....)              capture 4 characters that are not newline characters
          #ce
          $aMatches = StringRegExp($sString, "(?:\A|\r\n|\r|\n)(....)", 3)
          _ArrayDisplay($aMatches, "Multiline Reproducer")

          ; advanced multiline mode reproducer, match the first 4 characters of a line
          #cs
              (?<=                open positive lookbehind
              \A|\r\n|\r|\n       match the start of the subject or newline characters
              )                   close positive lookbehind
              ....                match (and capture as global match) 4 characters that are not newline characters
          #ce
          $aMatches = StringRegExp($sString, "(?<=\A|\r\n|\r|\n)....", 3)
          _ArrayDisplay($aMatches, "Advanced Multiline Reproducer")

      Edit: Fixed spacing of comments
      Edit2: added "word" character description
      Edit3: added more direct answer
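      The examples above cover the caret as an anchor. Its behaviour inside a character class, and the escaped literal form, can be sketched like this. The three-character subject is made up for illustration, and the results are written to the console with _ArrayToString instead of being shown in a dialog.

      ```autoit
      #include <Array.au3>

      ; Made-up subject for illustration.
      Local $sSubject = 'a^b'

      ; [^a] : the caret is first in the class, so it negates it (any character that is not "a").
      Local $aNegated = StringRegExp($sSubject, '[^a]', 3)
      ; [a^] : the caret is not first, so it is a literal ("a" or "^").
      Local $aLiteral = StringRegExp($sSubject, '[a^]', 3)
      ; \^   : outside a class the caret is escaped to match it literally.
      Local $aEscaped = StringRegExp($sSubject, '\^', 3)

      ConsoleWrite('[^a] matches: ' & _ArrayToString($aNegated, ' ') & @CRLF)
      ConsoleWrite('[a^] matches: ' & _ArrayToString($aLiteral, ' ') & @CRLF)
      ConsoleWrite('\^   matches: ' & _ArrayToString($aEscaped, ' ') & @CRLF)
      ```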
  13. No. .* will match 0 or more occurrences of any character (except for newline, unless single-line mode is enabled with the s flag), and .+ will match 1 or more occurrences of any character. Both consume the largest possible match; this is called greedy. In .*? the lazy operator (?) tells the engine to take the smallest possible match instead, which can be nothing at all. Because the pattern ".*?" starts and ends with a quote, the engine will look for 0 or more characters between 2 quotes, so it matches "" as well as "1". The pattern ".+?" would fail against the subject "", because it needs at least 1 character between the quotes, but it would match "1". Example:

          #include <Array.au3>

          $sSubject = 'A string with a "quoted part", and a separate " floating in there.'

          ; Greedy
          $aResult = StringRegExp($sSubject, '".*"', 3) ; matches: "quoted part", and a separate "
          _ArrayDisplay($aResult, "Greedy")

          ; Lazy
          $aResult = StringRegExp($sSubject, '".*?"', 3) ; matches: "quoted part"
          _ArrayDisplay($aResult, "Lazy")
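      The difference between ".*?" and ".+?" on an empty pair of quotes can be checked the same way. This is a small sketch using two made-up subjects; with the default flag (0) StringRegExp simply returns 1 for a match and 0 for no match.

      ```autoit
      ; Two made-up subjects: an empty pair of quotes and a pair holding one character.
      Local $aSubjects[2] = ['""', '"1"']

      For $i = 0 To UBound($aSubjects) - 1
          ; ".*?" allows zero characters between the quotes, ".+?" needs at least one.
          ConsoleWrite($aSubjects[$i] & ' | ".*?" matches: ' & StringRegExp($aSubjects[$i], '".*?"') & _
                  ' | ".+?" matches: ' & StringRegExp($aSubjects[$i], '".+?"') & @CRLF)
      Next
      ```

      Only the ".+?" test against the empty pair of quotes should report 0.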
  14. Try to avoid using a lazy dot ".*?", instead use a negated character class "[^"]*" whenever possible. This avoids unnecessary backtracking and thus increases performance.

          #include <Array.au3>

          Global $sString = '<META CONTENT="1; url=http://roundtopstatebank.com" HTTP-EQUIV=refresh>' & @LF & _
                  '<META HTTP-EQUIV=refresh CONTENT="1;url=http://roundtopstatebank.com">' & @LF

          $aResult = StringRegExp($sString, '(?i)<meta[^>]+content\s*=\s*"([^"]+)"[^>]*>', 3) ; content
          $aResult = StringRegExp($sString, '(?i)<meta[^>]+content\s*=\s*"\d+;\s*url=([^"]+)"[^>]*>', 3) ; URL only
          _ArrayDisplay($aResult)