litlmike

Counting Words, Unique Strings

9 posts in this topic

#1 ·  Posted (edited)

Any help you can provide would be appreciated. I would like to open a web page and create a "ranking" of the words that appear on it. For instance, if we went to www.CNN.com, the script would build a list of the unique words on the page, and every time a word appears again, its score would go up by 1.

So if the word "George" appeared on the page 5 times, and the word "Bush" appeared 10 times, the script would output:

"Bush 10"

"George 5"

Followed by the other words that appeared on the page and their scores.

Can you tell me how to start working towards a solution? I am not sure how to make the words unique, nor how to give them +1 per occurrence. Thanks in advance.

#include <IE.au3>

$s_Url = "http://www.CNN.com/"
$oIE = _IECreate ($s_Url, 1)
$oText = _IEBodyReadText ($oIE)


$aText = StringSplit ( $oText, " ")

For $iCC = 1 To UBound ($aText) -1
    MsgBox (0, "", $aText[$iCC],1)
Next

P.S. I can't get the au3 forum code tags to appear; I can only get "[ code ] [/ code ]" to work.

Edited by litlmike


This is a bit difficult to do. In particular, I noticed that StringSplit won't always give you each individual word. However, once you figure that out, you should be able to count and add unique strings by doing something like this:

#include <Array.au3>
#include <IE.au3>

$s_Url = "http://www.CNN.com/"
$oIE = _IECreate ($s_Url, 1)
$oText = _IEBodyReadText ($oIE)

Dim $output[1][2]

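$k = 0 ; number of unique words stored so far
$aText = StringSplit ( $oText, " ")
; StringSplit stores the element count in $aText[0], so the words start at index 1
For $i = 1 To $aText[0]
    $match = False ; reset once per word, not once per comparison
    For $j = 0 To $k - 1
        ; "=" compares strings case-insensitively; use "==" for a case-sensitive match
        If $aText[$i] = $output[$j][0] Then
            $output[$j][1] += 1
            $match = True
            ExitLoop ; each word is stored only once, so stop at the first hit
        EndIf
    Next
    If $match = False Then
        ReDim $output[$k + 1][2]
        $output[$k][0] = $aText[$i]
        $output[$k][1] = 1
        $k += 1
    EndIf
Next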

_ArrayDisplay($output)
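
A possible alternative to the nested loop, offered as a sketch rather than the method used in this thread: a Scripting.Dictionary COM object (available on standard Windows installs) maps each word to its count, so no inner search loop is needed. The sample text and the variable names are invented for illustration.

```autoit
; Sketch: count words with a Scripting.Dictionary instead of a nested loop.
; $sText, $aWords and $oCounts are hypothetical names used only in this example.
Local $sText = "the cat and the dog and the bird"
Local $aWords = StringSplit($sText, " ")

Local $oCounts = ObjCreate("Scripting.Dictionary")
$oCounts.CompareMode = 1 ; 1 = text compare, so "The" and "the" share one key

For $i = 1 To $aWords[0] ; $aWords[0] holds the element count
    If StringLen($aWords[$i]) = 0 Then ContinueLoop ; skip blanks left by repeated spaces
    If $oCounts.Exists($aWords[$i]) Then
        $oCounts.Item($aWords[$i]) = $oCounts.Item($aWords[$i]) + 1
    Else
        $oCounts.Add($aWords[$i], 1)
    EndIf
Next

; Print every word followed by its count
For $sWord In $oCounts.Keys()
    ConsoleWrite($sWord & " " & $oCounts.Item($sWord) & @CRLF)
Next
```

The dictionary does the uniqueness check itself, so the work per word no longer grows with the number of words already seen.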

PS. Are you using [ autoit ] and [ /autoit ], without the spaces?


IE Dev ToolbarMSDN: InternetExplorer ObjectMSDN: HTML/DHTML Reference Guide[quote]It is surprising what a man can do when he has to, and how little most men will do when they don't have to. - Walter Linn[/quote]--------------------[font="Franklin Gothic Medium"]Post a reproducer with less than 100 lines of code.[/font]


Thanks for that, it really helps. _ArrayDisplay would not work because we are using a 2D array, so I added the 2D version made by big_daddy (I think). I was able to get all the words individually; I just had to replace the @CRLF. Below is the updated code. Can anyone think of a way to produce the display in descending order, from the highest count to the lowest? I am open to using Excel instead of an array display, and it may be better in the long run.

#include <IE.au3>

$s_Url = "http://www.CNN.com/"
$oIE = _IECreate ($s_Url, 1)
$oText = _IEBodyReadText ($oIE)

Global $sTitle
Global $iBase
Global $sToConsole

Dim $output[1][2]

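$k = 0 ; number of unique words stored so far
$oString = StringReplace ($oText, @CRLF, " ")
$aText = StringSplit ( $oString, " ")

; StringSplit stores the element count in $aText[0], so the words start at index 1
For $i = 1 To $aText[0]
    $match = False ; reset once per word, not once per comparison
    For $j = 0 To $k - 1
        If $aText[$i] = $output[$j][0] Then
            $output[$j][1] += 1
            $match = True
            ExitLoop ; each word is stored only once, so stop at the first hit
        EndIf
    Next
    If $match = False Then
        ReDim $output[$k + 1][2]
        $output[$k][0] = $aText[$i]
        $output[$k][1] = 1
        $k += 1
    EndIf
Next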

_ArrayDisplay2D($output); base at 0 to get the [0][0]

Func _ArrayDisplay2D($aArray, $sTitle = 'Array Display 2Dim', $iBase = 0, $sToConsole = 1); base at 0 to get the [0][0]
    ;If $aArray is not an array then 'Return' and Set error... Wish I knew that IsArray was a function about 3 weeks ago!
    ;Where does $aArray come from?  Where is it previously declared?
    If Not IsArray($aArray) Then Return SetError(1, 0, 0)
    
    Local $sHold = 'Dimension 1 Has:  ' & UBound($aArray, 1) - 1 & ' Element(s)' & @LF & _
            'Dimension 2 Has:  ' & UBound($aArray, 2) - 1 & ' Element(s)' & @LF & @LF
    ;Loop through the First Dimension of $aArray
    For $iCC = $iBase To UBound($aArray, 1) - 1
        ;Loop through the 2nd Dimension of $aArray (up to the Ubound of the 1st dimension - 1)
        For $xCC = 0 To UBound($aArray, 2) - 1
            ;I think the $iCC and $xCC coorelate to the keys, and $aArray must be $aValues?
            $sHold &= '[' & $iCC & '][' & $xCC & ']  = ' & $aArray[$iCC][$xCC] & @LF
        Next
    Next
    
    If $sToConsole Then Return ConsoleWrite(@LF & $sHold)
    ;Display Results.  
    Return MsgBox(262144, $sTitle, StringTrimRight($sHold, 1))
EndFunc   ;==>_ArrayDisplay2D



You should be able to get it to work by doing something like this:

#include <IE.au3>
#include <Array.au3>

$s_Url = "http://www.CNN.com/"
$oIE = _IE_Example()
$oText = _IEBodyReadText ($oIE)

Global $sTitle
Global $iBase
Global $sToConsole

Dim $output[1][2]

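$k = 0 ; number of unique words stored so far
$oString = StringReplace ($oText, @CRLF, " ")
$aText = StringSplit ( $oString, " ")

; StringSplit stores the element count in $aText[0], so the words start at index 1
For $i = 1 To $aText[0]
    $match = False ; reset once per word, not once per comparison
    For $j = 0 To $k - 1
        If $aText[$i] = $output[$j][0] Then
            $output[$j][1] += 1
            $match = True
            ExitLoop ; each word is stored only once, so stop at the first hit
        EndIf
    Next
    If $match = False Then
        ReDim $output[$k + 1][2]
        $output[$k][0] = $aText[$i]
        $output[$k][1] = 1
        $k += 1
    EndIf
Next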

_ArrayDisplay($output)

Dim $newArray[1][2]
$max = ""
$j = 0
Do
    For $i = 0 To UBound($output) - 1
        If $output[$i][1] > $max Then
            $max = $output[$i][1]
            $index = $i
        EndIf
    Next
    ReDim $newArray[$j + 1][2]
    $newArray[$j][0] = $output[$index][0]
    $newArray[$j][1] = $output[$index][1]
    $output[$index][0] = ""
    $output[$index][1] = ""
    $j += 1
    $max = ""
Until $j = UBound($output)

_ArrayDisplay($newArray)
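
For comparison, a shorter sketch of the same descending sort. It assumes a recent AutoIt release whose Array.au3 provides `_ArraySort` with a sub-item parameter for 2D arrays (older releases used a different parameter layout). The sample data is invented and only mirrors the [word][count] layout of $output.

```autoit
#include <Array.au3>

; Hypothetical sample data in the same [word][count] layout as $output
Local $aCounts[3][2] = [["George", 5], ["Bush", 10], ["news", 2]]

; Arguments: array, descending = 1, start = 0, end = 0 (whole array), sub-item = 1 (the count column)
_ArraySort($aCounts, 1, 0, 0, 1)

; Print the rows in their new order
For $i = 0 To UBound($aCounts) - 1
    ConsoleWrite($aCounts[$i][0] & " " & $aCounts[$i][1] & @CRLF)
Next
```

Unlike the selection loop above, this sorts the array in place and leaves the original entries intact instead of blanking them.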



I have made some more updates to the script and it is working nicely. However, the first several hundred elements are blank, and because I am still trying to fully understand how your array code does what it does, I am finding it hard to modify. If you run the script exactly as I have posted it here, approximately the first 300 elements will be null/blank. Can you figure out where those null elements come from and how to eliminate them?

thanks
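
One likely source of blank elements (an assumption, since the modified script was not posted): StringSplit with a single space as the delimiter returns an empty element for every extra space in a run, and page text usually contains many such runs, plus tabs and line breaks. The sketch below uses made-up sample text to compare that with matching runs of non-whitespace.

```autoit
; Hypothetical sample text with a double space and a line break, as page text often has
Local $sText = "Bush  George" & @CRLF & "Bush"

; Splitting on a single space keeps an empty element between the two spaces
; and leaves "George" glued to the line break and the word after it
Local $aSplit = StringSplit($sText, " ")
For $i = 1 To $aSplit[0]
    ConsoleWrite("split: [" & $aSplit[$i] & "]" & @CRLF)
Next

; Matching runs of non-whitespace never produces empty elements
Local $aWords = StringRegExp($sText, "\S+", 3)
If IsArray($aWords) Then
    For $i = 0 To UBound($aWords) - 1
        ConsoleWrite("regexp: [" & $aWords[$i] & "]" & @CRLF)
    Next
EndIf
```

Skipping zero-length tokens before counting them would be another way to get the same effect.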


#6 ·  Posted (edited)

If I've read this all right, this seems to be what you are looking for. (You might have to work on the punctuation part of the regexps; I didn't add every character.)

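#include <array.au3>
$sString = "How many words do think a person can count or think of?"
$aArray = _StringCountInstances($sString)

; Element 0 holds the number of unique words; the "word count" strings start at index 1
If IsArray($aArray) Then MsgBox(0, 'Word Count', $aArray[0])
_ArrayDisplay($aArray)

; $iCase: 0 (default) = case-insensitive, 1 = case-sensitive
; Note: if your Array.au3 already defines _ArrayUnique, rename the helper below to avoid a duplicate-function error
Func _StringCountInstances($sString, $iCase = 0)
    Local $aArray = StringRegExp($sString, "[\s\.:;,\?\!]*([a-zA-Z0-9-_]+)[\s\.:;,\?\!]*", 3)
    If Not IsArray($aArray) Then Return SetError(1, 0, 0)
    _ArrayUnique($aArray, '', 0, $iCase) ; $aArray becomes 1-based with the count in [0]
    Local $aReturn[UBound($aArray)]
    $aReturn[0] = $aArray[0]
    Local $sFlag = '(?i)' ; regex option for a case-insensitive match
    If $iCase Then $sFlag = ''
    For $iCC = 1 To UBound($aArray) - 1
        ; The lookarounds require a non-word character (or the string edge) on both sides without
        ; consuming it, so back-to-back repeats such as "the the" are both counted
        StringRegExpReplace($sString, $sFlag & '(?<![a-zA-Z0-9_-])' & $aArray[$iCC] & '(?![a-zA-Z0-9_-])', '')
        $aReturn[$iCC] = $aArray[$iCC] & ' ' & @extended
    Next
    Return $aReturn
EndFunc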

Func _ArrayUnique(ByRef $aArray, $vDelim = -1, $iBase = 1, $iCase = '')
    If Not IsArray($aArray) Then Return SetError(1, 0, 0)
    If $vDelim = '' Then $vDelim = Chr(01)
    Local $sHold
    For $iCC = $iBase To UBound($aArray) - 1
        If Not StringInStr($vDelim & $sHold, $vDelim & $aArray[$iCC] & $vDelim, $iCase) Then _
            $sHold &= $aArray[$iCC] & $vDelim
    Next
    If $sHold Then
        $aArray = StringSplit(StringTrimRight($sHold, StringLen($vDelim)), $vDelim)
        Return SetError(0, 0, 0)
    EndIf
    Return SetError(2, 0, 0)
EndFunc
Edit:

Had to fix the $iCase handling. If you don't want a case-sensitive search, just leave the param blank.

Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


Can you post the code you modified, litlmike? The example I posted works well for me.




Lol... man, this makes me fully realize how poor a coder I am. Not only is it condensed, it finishes so much faster!

I plan to use this long term, but until I fully understand how your script accomplishes the task, I will have to finish mine and then return to yours. To modify your script into the final format I want, I have to understand how you are doing the same thing in so much less code! Haha! Very well done.

Can someone explain the following line of code? I don't grasp its relevance yet.

Local $aArray = StringRegExp($sString, "[\s\.:;,\?\!]*([a-zA-Z0-9-_]+)[\s\.:;,\?\!]*", 3)


Local $aArray = StringRegExp($sString, "[\s\.:;,\?\!]*([a-zA-Z0-9-_]+)[\s\.:;,\?\!]*", 3)

Because words are not separated only by spaces (they may have punctuation before or after them), most of the examples you were given would never actually return accurate results.

[\s\.:;,\?\!]*

This says to match zero or more whitespace characters, periods, colons, semicolons, commas, question marks, or exclamation marks before the start of:

([a-zA-Z0-9-_]+)

This tells it to match one or more characters from A to Z (upper or lower case), the digits 0 through 9 (I didn't know if you wanted numbers too), hyphens, and underscores (some consider them legal word characters). Because it's surrounded by parentheses, whatever is matched here will be part of the return.

[\s\.:;,\?\!]*

This says that the word you just found may be followed by zero or more of these: whitespace, period, colon, semicolon, comma, question mark, or exclamation mark.

The ",3" says to return an array of all the matches found.
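
To see the pattern at work, here is a small sketch that runs it on a made-up sentence and prints each captured word in brackets. The sample string and variable names are only for illustration.

```autoit
; Sample sentence invented for this example
Local $sSample = "Hello, world! Is this well-formed? Yes: snake_case too."
Local $aWords = StringRegExp($sSample, "[\s\.:;,\?\!]*([a-zA-Z0-9-_]+)[\s\.:;,\?\!]*", 3)

If IsArray($aWords) Then
    ; Flag 3 returns a 0-based array holding only the captured group of each match
    For $i = 0 To UBound($aWords) - 1
        ConsoleWrite("[" & $aWords[$i] & "]" & @CRLF)
    Next
EndIf
```

Because both punctuation groups use `*`, they are optional, so the capture group alone decides what counts as a word.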

Hope you understand now.


