Dictionary, Trimming down HTML


Hello, I am working on a dictionary program as a small coding project. At first, my application would literally load a web page and use a mouse macro to paste the definition. This worked but was really quirky and often broke.

Now I am using InetGet to grab the HTML from the page. I can download and view the HTML fine, but I need some way to specify which portion of the page is the definition and which part is HTML that isn't needed.

(The code above this is GUI stuff; here's where the meat / issue is.)

$mystr="http://dictionary.reference.com/browse/" & $str;
InetGet($mystr, "C:\results.txt", 1, 0) 

$str=FileRead("C:\results.txt")
$str=StringRegExpReplace($str,"""","")
MsgBox(64,"Definition",$str)

Looking for pointers, I don't have much regex experience.

Edited by floodge

Hmm...? What sort of information are you trying to retrieve? I mean, anything between <> should be the HTML part you're not interested in, right? And which part are you interested in? ;]

I'm trying to retrieve the portion with the definitions in it; I'm not sure how to pick that single part out of the file.


Something like this?:

#include <INet.au3>

Dim $sSource = _INetGetSource('http://www.autoitscript.com/')
$sSource = StringRegExpReplace($sSource, '<[^>]++>', '')
$sSource = StringRegExpReplace($sSource, '(\r\n){2,}', @CRLF)
$sSource = StringRegExpReplace($sSource, '(?>[[:blank:]]+)\r\n', '')
$sSource = StringStripWS($sSource, 3)
ConsoleWrite($sSource & @LF)

$hFile = FileOpen(@ScriptDir & '\TempHTML.txt', 2)
    If $hFile = -1 Then Exit

FileWrite($hFile, $sSource)
FileClose($hFile)

I'm not exactly sure whether you want all the definitions returned. This returns the full definition as shown on that page; you can parse the portion you want from the result.

$Str = StringRegExp($Str, "(?i)<td width=.* class=\x22?dnindex\x22?>(1\..*)</table>", 1)
If Not @Error Then
    $str = $Str[0]
Else
    MsgBox(0, "Oooops!", "Houston, we have a problem")
EndIf

If you don't need the whole page for other reasons, why not use _InetGetSource() instead of InetGet() as Authenticity shows?

Edited by GEOSoft

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


$mystr="http://dictionary.reference.com/browse/" & $str;
$str = _InetGetSource($mystr) 

$Str = StringRegExp($Str, "(?i)<td width=.* class=\x22?dnindex\x22?>(1\..*)</table>", 1)
;StringRegExpReplace($str, "</td>", "")

msgbox(48, "Definition", $str)

My problem is at the commented-out line.

Excuse my noobiness, but that should wipe out all of the "</td>" in the HTML, right?

I am having trouble.

You were close

$Str = StringRegExpReplace($str, "</td>", "")
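Two details behind that fix can be sketched as follows. This is a minimal illustration, not the original poster's script: the sample HTML string and the simplified pattern are assumptions made up for the example. StringRegExp with flag 1 returns an array of captures rather than a string, and StringRegExpReplace returns the modified string instead of changing its argument in place.

```autoit
; Hypothetical sample text standing in for the downloaded page
Local $sHtml = '<td>1.</td> <td>a tool for driving nails</td>'

; Flag 1 returns an array of captured groups, so read element 0 to get a string
Local $aMatch = StringRegExp($sHtml, '(?i)<td>(1\..*)', 1)
If @error Then
    MsgBox(16, "Definition", "Pattern not found")
    Exit
EndIf
Local $sDef = $aMatch[0]

; StringRegExpReplace does not modify its input, so assign the return value
$sDef = StringRegExpReplace($sDef, "</td>", "")
MsgBox(64, "Definition", $sDef)
```

Calling StringRegExpReplace without assigning the result, as in the commented-out line, discards the cleaned string.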


Local $a 
$a = StringRegExp($str, "(.*)", 3)
_ArrayDisplay($a, "Definition of - "& $str2)

Trimmed down all the html (Woot!)

Now figuring out how to line up all of the definitions into an array.

The definitions go as 1. blah blah 2. blah blah etc., so are there certain arguments that I can place here

$a = StringRegExp($str, "(.*)", 3)

that will place each definition in a separate column?

EDIT: I am making progress, almost have it

Edited by floodge
It might be easier to filter out the bit you want first; then, once you have an array of the lines, go through each line to remove the unwanted bits.

#include <array.au3>
#include <string.au3>
#include <INet.au3>


$mystr="http://dictionary.reference.com/browse/search"; & $str;
$str = _InetGetSource($mystr)

$str = _stringbetween($str,'<td width="35" class="dnindex">1.</td> <td>',"Synonyms:")
$lines = StringSplit($str[0],'</span></td> </tr> </table> <table class="luna-Ent"> <tr> <td width="35" class="dnindex">',1)
_ArrayDisplay($lines)
Serial port communications UDF Includes functions for binary transmission and reception.printing UDF Useful for graphs, forms, labels, reports etc.Add User Call Tips to SciTE for functions in UDFs not included with AutoIt and for your own scripts.Functions with parameters in OnEvent mode and for Hot Keys One function replaces GuiSetOnEvent, GuiCtrlSetOnEvent and HotKeySet.UDF IsConnected2 for notification of status of connected state of many urls or IPs, without slowing the script.

That code filters well, but I can't do a

$Str = StringRegExpReplace($str, "</td>", "")

because it either freezes the app or says it is an undefined array variable

Using the other code I have it filtered this far.

Attached is a picture of what that spits out right now: post-47346-1237496283_thumb.jpg. I am working on merging your code and removing the lines, etc.

EDIT: Making some progress. The problem (dare I ask this many questions) is that I am having trouble filtering from this point:

post-47346-1237501963_thumb.jpg

Using the code:

$mystr="http://dictionary.reference.com/browse/" & $str2;
$str = _InetGetSource($mystr) 

$str = _stringbetween($str,'<td width="35" class="dnindex">1.</td> <td>',"Synonyms:")
$lines = StringSplit($str[0],'</span></td> </tr> </table> <table class="luna-Ent"> <tr> <td width="35" class="dnindex">',1)

_ArrayDisplay($lines, "Definition of "& $str2)

I am soooo close!!

Edited by floodge
Just use StringRegExpReplace($lines[$i], "<.*?>", "") on each element.
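As a rough sketch of what that looks like in a loop (the sample input and the "|" delimiter are assumptions for illustration; in the thread $lines comes from StringSplit on the page source):

```autoit
#include <Array.au3>

; Hypothetical sample data; StringSplit puts the element count in $lines[0]
Local $lines = StringSplit('<td>a tool</td>|<span>to strike</span>', '|')

; Apply the replace to each element, starting at index 1 to skip the count
For $i = 1 To $lines[0]
    $lines[$i] = StringRegExpReplace($lines[$i], "<.*?>", "")
Next

_ArrayDisplay($lines, "Stripped")
```

The replace has to be applied element by element; passing the whole array to StringRegExpReplace will not work.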

Doesn't work. I have tried using this code

$mystr="http://dictionary.reference.com/browse/" & $str2;
$str = _InetGetSource($mystr) 

$str = _stringbetween($str,'<td width="35" class="dnindex">1.</td> <td>',"Synonyms:")
$lines = StringSplit($str[0],'</span></td> </tr> </table> <table class="luna-Ent"> <tr> <td width="35" class="dnindex">',1)

#Region HTML Filter
$lines = StringRegExpReplace($lines, "</td>", "")
$lines = StringRegExpReplace($lines, "<td>", "")
$lines = StringRegExpReplace($lines, "<tr>", "")
$lines = StringRegExpReplace($lines, "</tr>", "")
$lines = StringRegExpReplace($lines, "<class=>", "")
$lines = StringRegExpReplace($lines, "</table>", "")
$lines = StringRegExpReplace($lines, "<span>", "")
$lines = StringRegExpReplace($lines, "</span>", "")
$lines = StringRegExpReplace($lines, "<table class=""luna-Ent"">", "")
$lines = StringRegExpReplace($lines, "</div>", "")

ETC ETC ETC ETC
#EndRegion

_ArrayDisplay($lines, "Definition of "& $str2)

Nothing comes up when I press the button, window just stays there.


That will never work because you have declared $lines as an array and didn't reference the elements.

This will be close but it's untested.

$mystr="http://dictionary.reference.com/browse/" & $str2;
$str = _InetGetSource($mystr) 

$str = _stringbetween($str,'<td width="35" class="dnindex">1.</td> <td>',"Synonyms:")
If IsArray($str) Then
    MsgBox(0, "Results", _StripHTML($str[0]))
EndIf

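Func _StripHTML($sStr)
    ; Turn line-breaking tags into newlines before the tags are removed
    $sStr = StringReplace($sStr, "<br />", @CRLF)
    $sStr = StringReplace($sStr, "<p>", @CRLF & @CRLF)
    ; Strip the remaining tags first, so decoded "<" and ">" are not mistaken for markup
    $sStr = StringRegExpReplace($sStr, "(?i)(?s)<.+?>", "")
    ; Decode numeric entities such as &#39;
    Local $aStr = StringRegExp($sStr, "&#(\d+);", 3)
    If Not @error Then
        For $i = 0 To UBound($aStr) - 1
            $sStr = StringReplace($sStr, "&#" & $aStr[$i] & ";", Chr($aStr[$i]))
        Next
    EndIf
    ; Decode common named entities; &amp; goes last to avoid double decoding
    $sStr = StringReplace($sStr, "&lt;", "<")
    $sStr = StringReplace($sStr, "&gt;", ">")
    $sStr = StringReplace($sStr, "&nbsp;", " ")
    $sStr = StringReplace($sStr, "&amp;", "&")
    Return $sStr
EndFunc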


Yeah, it works similarly to mine in a MsgBox,

but it would be neat to have had it in an array.

EDIT: I don't need an array; I am focusing on another method.

Will post results. Thank you for all of your help.

Edited by floodge
Try this

#include <array.au3>
#include <string.au3>
#include <INet.au3>

$tofind = "hammer"
$mystr = "http://dictionary.reference.com/browse/" & $tofind
$str =  _INetGetSource($mystr)
$str = stringtrimleft($str,StringInStr($str,'<td width="35" class="dnindex">1.</td> <td>')-1)

$str = StringReplace($str,'<div class="ety"> <b>Origin:','<span class="sectionLabel">Synonyms:')

ConsoleWrite(@extended & @CRLF)
$str =  _StringBetween($str, '<td width="35" class="dnindex">1.</td> <td>','<span class="sectionLabel">Synonyms:')
$lines = StringSplit($str[0], '<td width="35" class="dnindex">', 1)
$lines[1] = "1. " & $Lines[1]
_ArrayDisplay($lines)
For $n = 1 To $lines[0]
    $lines[$n] = StringRegExpReplace($lines[$n], "(<.*?>)", "")
Next
_ArrayDisplay($lines);<--now gives 14 results

;version II
$lines = "1. " & StringReplace($str[0], '<td width="35" class="dnindex">', @CRLF)

$lines = StringRegExpReplace($lines, "(<.*?>)", "")
MsgBox(262144, "result ", $lines)

EDIT: changed because not all words searched have Synonyms.
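Because the end marker may be missing for some words, it can help to check @error after _StringBetween before indexing the result. A minimal sketch, assuming a made-up helper name (_ExtractBlock) and a made-up sample string; neither is part of the code posted above:

```autoit
#include <String.au3>

; Hypothetical helper: returns the text between two markers,
; or "" with @error set if either marker is missing
Func _ExtractBlock($sSource, $sStart, $sEnd)
    Local $aBetween = _StringBetween($sSource, $sStart, $sEnd)
    If @error Then Return SetError(1, 0, "")
    Return $aBetween[0]
EndFunc

; Sample input invented for the example
Local $sPage = 'x<b>START</b>definition text<b>END</b>y'
Local $sBlock = _ExtractBlock($sPage, '<b>START</b>', '<b>END</b>')
If @error Then
    ConsoleWrite("Markers not found" & @CRLF)
Else
    ConsoleWrite($sBlock & @CRLF)
EndIf
```

Without such a check, $str[0] fails with an array error whenever the page layout does not contain the expected markers.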

Edited by martin
It works!!!

Added to the program, which is almost done


Alright, curious if there is a way to automatically resize the array window.

$str2 = IniRead("dictionary.ini", "words", "word2", "NotFound")
    if $str2 = "" Then
        exit
    EndIf
    $mystr="http://dictionary.reference.com/browse/" & $str2;
    $str = _InetGetSource($mystr) 
    $str = _stringbetween($str,'<td width="35" class="dnindex">1.</td> <td>',"</td> ")
    _process($str[0])
    
    $str2 = IniRead("dictionary.ini", "words", "word3", "NotFound")
    if $str2 = "" Then
        exit
    EndIf
    $mystr="http://dictionary.reference.com/browse/" & $str2;
    $str = _InetGetSource($mystr) 
    $str = _stringbetween($str,'<td width="35" class="dnindex">1.</td> <td>',"</td> ")
    _process($str[0])

I am pulling from an ini now (10 iterations of this), and I am looking for a way to continue, or somehow skip the code, when no word is entered in the ini. Right now I just use If $str2 = "" Then Exit, which sort of works.

Do you mean the _ArrayDisplay window?

If you have a default of "NotFound" then shouldn't you have

If $str2 = "NotFound" Then

?
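One way to skip missing words rather than exit is to read the keys in a loop and use ContinueLoop. This is only a sketch: the key names word1 to word10 are inferred from the snippet above, an empty default is used instead of "NotFound" so that a missing key and a blank key are handled the same way, and _LookupWord() is a hypothetical stand-in for the download-and-parse code.

```autoit
; Read word1..word10 and skip entries that are missing or left blank
Local $sWord
For $i = 1 To 10
    $sWord = IniRead("dictionary.ini", "words", "word" & $i, "")
    If $sWord = "" Then ContinueLoop ; nothing to look up for this key
    _LookupWord($sWord)
Next

; Hypothetical placeholder for the _INetGetSource / _StringBetween / _process steps
Func _LookupWord($sWord)
    ConsoleWrite("Would look up: " & $sWord & @CRLF)
EndFunc
```

This also replaces the ten copied blocks with a single loop, so a change to the parsing code only has to be made once.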
