Posted (edited)

Hi

 

Let's say I have a URL to some web page,

and I want to get only the text from that page.

 

I can get the page's content via InetRead(),

but how do I then get only the text, without all the tags, attributes, etc.?

 

Thank you

Edited by Zohar
Posted

Search for "HTML Strip"

UDF List:

 
_AdapterConnections()_AlwaysRun()_AppMon()_AppMonEx()_ArrayFilter/_ArrayReduce_BinaryBin()_CheckMsgBox()_CmdLineRaw()_ContextMenu()_ConvertLHWebColor()/_ConvertSHWebColor()_DesktopDimensions()_DisplayPassword()_DotNet_Load()/_DotNet_Unload()_Fibonacci()_FileCompare()_FileCompareContents()_FileNameByHandle()_FilePrefix/SRE()_FindInFile()_GetBackgroundColor()/_SetBackgroundColor()_GetConrolID()_GetCtrlClass()_GetDirectoryFormat()_GetDriveMediaType()_GetFilename()/_GetFilenameExt()_GetHardwareID()_GetIP()_GetIP_Country()_GetOSLanguage()_GetSavedSource()_GetStringSize()_GetSystemPaths()_GetURLImage()_GIFImage()_GoogleWeather()_GUICtrlCreateGroup()_GUICtrlListBox_CreateArray()_GUICtrlListView_CreateArray()_GUICtrlListView_SaveCSV()_GUICtrlListView_SaveHTML()_GUICtrlListView_SaveTxt()_GUICtrlListView_SaveXML()_GUICtrlMenu_Recent()_GUICtrlMenu_SetItemImage()_GUICtrlTreeView_CreateArray()_GUIDisable()_GUIImageList_SetIconFromHandle()_GUIRegisterMsg()_GUISetIcon()_Icon_Clear()/_Icon_Set()_IdleTime()_InetGet()_InetGetGUI()_InetGetProgress()_IPDetails()_IsFileOlder()_IsGUID()_IsHex()_IsPalindrome()_IsRegKey()_IsStringRegExp()_IsSystemDrive()_IsUPX()_IsValidType()_IsWebColor()_Language()_Log()_MicrosoftInternetConnectivity()_MSDNDataType()_PathFull/GetRelative/Split()_PathSplitEx()_PrintFromArray()_ProgressSetMarquee()_ReDim()_RockPaperScissors()/_RockPaperScissorsLizardSpock()_ScrollingCredits_SelfDelete()_SelfRename()_SelfUpdate()_SendTo()_ShellAll()_ShellFile()_ShellFolder()_SingletonHWID()_SingletonPID()_Startup()_StringCompact()_StringIsValid()_StringRegExpMetaCharacters()_StringReplaceWholeWord()_StringStripChars()_Temperature()_TrialPeriod()_UKToUSDate()/_USToUKDate()_WinAPI_Create_CTL_CODE()_WinAPI_CreateGUID()_WMIDateStringToDate()/_DateToWMIDateString()Au3 script parsingAutoIt SearchAutoIt3 PortableAutoIt3WrapperToPragmaAutoItWinGetTitle()/AutoItWinSetTitle()CodingDirToHTML5FileInstallrFileReadLastChars()GeoIP databaseGUI - Only Close ButtonGUI 
ExamplesGUICtrlDeleteImage()GUICtrlGetBkColor()GUICtrlGetStyle()GUIEventsGUIGetBkColor()Int_Parse() & Int_TryParse()IsISBN()LockFile()Mapping CtrlIDsOOP in AutoItParseHeadersToSciTE()PasswordValidPasteBinPosts Per DayPreExpandProtect GlobalsQueue()Resource UpdateResourcesExSciTE JumpSettings INISHELLHOOKShunting-YardSignature CreatorStack()Stopwatch()StringAddLF()/StringStripLF()StringEOLToCRLF()VSCROLLWM_COPYDATAMore Examples...

Updated: 22/04/2018

Posted (edited)

Hi Zohar

Take a look in the help file at the _IEBodyReadText() function of the IE.au3 UDF.

This is the example from the help file:

; *******************************************************
; Example 1 - Open a browser with the basic example, read the body Text
; (the content with all HTML tags removed) and display it in a MsgBox
; *******************************************************

#include <IE.au3>

Local $oIE = _IE_Example("basic")
Local $sText = _IEBodyReadText($oIE)
MsgBox(0, "Body Text", $sText)

bye

Edited by PincoPanco

 

Chimp

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Posted (edited)

Hi guinness and PincoPanco

 

PincoPanco:

_IEBodyReadText() is usually great, but this time I don't have an IE window open.

(I am using InetRead() and working with the returned string.)

So I want to do this in code, without opening an IE window.

Is it possible?

I looked at _IEBodyReadText()'s source code, and it contains:

$o_object.document.body.innerText

Can I somehow create a Document object without creating an IE window?

That way I could supply it with the HTML and then get the text out of it via .body.innerText

 

guinness:

I will search now for what you suggested.

(But I am still curious about my previous question on whether it is possible to work with a Document object without creating an IE window, so if anyone knows, please tell.)

 

Thank you

Edited by Zohar
Posted (edited)

Hi somdcomputerguy

Yes I know, but I am hoping to avoid creating an IE Window at all..

(even an invisible one)

 

If anyone knows whether I can instantiate a Document object somehow, it'll be useful.

Maybe there is a way by using ObjCreate()?

Edited by Zohar
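
On the ObjCreate() question above: one approach sometimes used for this is the MSHTML "HTMLFile" COM object, which gives you a Document that you can feed an HTML string, with no browser window or GUI. The sketch below is only an illustration under assumptions: the "HTMLFile" ProgID is not mentioned anywhere in this thread and is assumed to be registered on the machine, _HtmlToText is a hypothetical helper name, and embedded scripts or external resources may behave differently than in a real browser.

```autoit
; Sketch: turn an HTML string into plain text without opening an IE window.
; Assumption: the MSHTML "HTMLFile" COM object is available on this system.

Func _HtmlToText($sHtml)
    Local $oDoc = ObjCreate("HTMLFile")
    If Not IsObj($oDoc) Then Return SetError(1, 0, "") ; COM object not available

    ; Load the markup into the in-memory document
    $oDoc.open()
    $oDoc.write($sHtml)
    $oDoc.close()

    ; Same property that _IEBodyReadText() reads from a live IE window
    Return $oDoc.body.innerText
EndFunc   ;==>_HtmlToText

; Usage with a literal string (no network access needed)
Local $sSample = '<html><body><p>Hello <b>world</b></p></body></html>'
ConsoleWrite(_HtmlToText($sSample) & @CRLF)

; With a downloaded page it would look something like this
; (flag 4 assumes the page is UTF-8 encoded):
; ConsoleWrite(_HtmlToText(BinaryToString(InetRead("http://www.autoitscript.com"), 4)) & @CRLF)
```

If ObjCreate() fails, the function returns an empty string and sets @error to 1, so the caller can fall back to a regex-based approach.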
Posted (edited)

#include <Array.au3>

; Download the page; InetRead() returns binary data, so convert it to a string
$text = BinaryToString(InetRead("http://www.autoitscript.com/forum/topic/155354-how-to-convert-a-webpagehtml-to-text"))

; Save the raw HTML for comparison (2 = open for writing, erase previous contents)
$file = FileOpen(@ScriptDir & '\Name.htm', 2)
FileWrite($file, $text)
FileClose($file)

; Remove whole blocks whose content is not visible text
$text = StringRegExpReplace($text, '(?si)<head\b[^>]*>.*?</head>', '')
$text = StringRegExpReplace($text, '(?si)<script\b[^>]*>.*?</script>', '')
; Collapse whitespace between adjacent tags
$text = StringRegExpReplace($text, '>\s+<', '> <')
; $text = StringRegExpReplace($text, '[\v]', '')
; Turn <br> and <p> (any case, with or without attributes) into line breaks
$text = StringRegExpReplace($text, '(?i)<(br|p)\b[^>]*>', @CRLF)
; Strip all remaining tags
$text = StringRegExpReplace($text, '<[/!]*?[^<>]*?>', '')
; $text = StringRegExpReplace($text, '<[\/\!]*?[^<>]*?>', @CRLF)

; Decode a few common named entities
$text = StringReplace($text, '&quot;', '"')
$text = StringReplace($text, '&lt;', '<')
$text = StringReplace($text, '&gt;', '>')
$text = StringReplace($text, '&nbsp;', ' ')
; These are mapped to numeric entities, which the loop below converts to characters
$text = StringReplace($text, '&iexcl;', '&#161;')
$text = StringReplace($text, '&cent;', '&#162;')
$text = StringReplace($text, '&pound;', '&#163;')
$text = StringReplace($text, '&copy;', '&#169;')

; &#444 -> character
$a = StringRegExp($text, '&#(\d+);', 3)
If Not @error Then
    ; $log &= UBound($a) & '   &#(\d+);' & @CRLF
    $a = _ArrayUnique($a) ; $a[0] now holds the number of unique codes
    For $i = 1 To $a[0]
        $a[$i] = Number($a[$i])
    Next
    _ArraySort($a, 1, 1)
    ; _ArrayDisplay($a) ; uncomment to inspect the codes that were found
    For $i = 1 To $a[0]
        $text = StringReplace($text, '&#' & $a[$i] & ';', ChrW($a[$i]))
    Next
EndIf

; Decode &amp; last, so that text such as &amp;lt; is not decoded twice
$text = StringReplace($text, '&amp;', '&')

; Collapse blank lines and leading whitespace
$text = StringRegExpReplace($text, '([\r\n])[\s]+', @CRLF)
MsgBox(0, ';)', $text)

; Save the extracted text
$file = FileOpen(@ScriptDir & '\Name.txt', 2)
FileWrite($file, $text)
FileClose($file)

Edited by AZJIO
Posted (edited)

Also, instead of parsing the HTML page yourself, you could use the browser's own .document.body.innerText property:

; Create the embedded browser control (no separate IE window is opened)
$oIE = ObjCreate("Shell.Explorer.2")

; Create a simple invisible dummy GUI needed to host the object
GUICreate("Dummy browser", 0, 0, 10, 10)
$GUIActiveX = GUICtrlCreateObj($oIE, 0, 0, 10, 10)

GUISetState(@SW_HIDE) ; hide dummy browser

$oIE.navigate("http://www.autoitscript.com") ; <-- your url here

; Wait until the page has finished loading (readyState 4 = complete)
Do
    Sleep(100)
Until Not $oIE.Busy And $oIE.readyState = 4

ConsoleWrite($oIE.document.body.innerText & @CRLF)
MsgBox(0, "", $oIE.document.body.innerText)

edit:

added MsgBox(0, "", $oIE.document.body.innerText) in listing

Edited by PincoPanco

 


Posted (edited)

AZJIO:

Really nice - full implementation for HTML stripping, thank you.

 

guinness and PincoPanco:

Great - that's what I was hoping to find,

thank you :)

Edited by Zohar
Posted

Hi Jury

Thank you

The problem is that removing tags alone will not be enough, since some elements have an opening tag and a closing tag, and everything between them has to go as well,

like:

<SCRIPT>

....

....

</SCRIPT>
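
One way to deal with paired tags like this is to remove such elements together with their content first, and only then strip the remaining tags. The following is a minimal sketch of that order of operations, not a complete HTML stripper: _StripHtml is a hypothetical helper name, the choice of script and style as the elements to drop is an assumption, and entity decoding is left out.

```autoit
; Sketch: remove paired elements with their content before stripping tags.
; Assumption: <script> and <style> are the only elements whose content must go.

Func _StripHtml($sHtml)
    ; 1. Drop whole <script>...</script> and <style>...</style> blocks
    ;    (?s) lets . match newlines, (?i) ignores case, \1 repeats the tag name
    $sHtml = StringRegExpReplace($sHtml, '(?si)<(script|style)\b[^>]*>.*?</\1\s*>', '')

    ; 2. Drop HTML comments, which may also contain markup
    $sHtml = StringRegExpReplace($sHtml, '(?s)<!--.*?-->', '')

    ; 3. Only now remove the remaining tags, keeping the text between them
    $sHtml = StringRegExpReplace($sHtml, '<[^>]+>', '')

    Return $sHtml
EndFunc   ;==>_StripHtml

; Usage: the script body is removed, the visible text is kept
Local $sSample = '<p>Hi</p><SCRIPT>var a = 1;</SCRIPT><b>there</b>'
ConsoleWrite(_StripHtml($sSample) & @CRLF)
```

Doing step 3 first would leave "var a = 1;" in the output, which is exactly the problem described above.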
