
How to Convert a Webpage(HTML) to Text?


Zohar

Hi

 

Let's say I have a URL to some Webpage,

and I want to get only the Text from that Webpage.

 

I can get the Webpage's content via InetRead(),

but how do I then get only the Text, and not all the Tags, Attributes, etc?

 

Thank you

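For reference, the download step described in the question might look roughly like this. It is only a sketch: the URL is a placeholder, and treating the page as UTF-8 is an assumption that may not hold for every site.

```autoit
; Sketch: download a page and turn the returned binary data into a string.
; The URL is a placeholder; UTF-8 is an assumption about the page's encoding.
Local $sUrl = "http://www.autoitscript.com"
Local $dData = InetRead($sUrl)
If @error Then
    MsgBox(16, "InetRead", "Download failed for " & $sUrl)
    Exit 1
EndIf
Local $sHtml = BinaryToString($dData, 4) ; 4 = interpret the bytes as UTF-8
ConsoleWrite(StringLeft($sHtml, 200) & @CRLF) ; still raw HTML at this point
```

At this point $sHtml still contains all tags and attributes, which is exactly the problem the rest of the thread deals with.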

Search for "HTML Strip"
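The scripts such a search turns up are typically built around a regular expression that deletes everything between angle brackets. A minimal sketch of that idea follows; _StripTags is a hypothetical name chosen for illustration, not a standard UDF, and the function does not decode entities such as &amp;.

```autoit
; Minimal tag stripper. _StripTags is a hypothetical name, not a standard UDF.
Func _StripTags($sHtml)
    ; Drop script and style blocks together with their contents
    $sHtml = StringRegExpReplace($sHtml, '(?si)<(script|style)\b[^>]*>.*?</\1>', '')
    ; Drop every remaining tag; entities are left untouched
    Return StringRegExpReplace($sHtml, '(?s)<[^>]*>', '')
EndFunc

ConsoleWrite(_StripTags('<p>Hello <b>world</b></p>') & @CRLF)
```

A regex like this is fine for simple pages, but it is not a real HTML parser and can be confused by a ">" inside an attribute value.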


Hi Zohar

Take a look at the _IEBodyReadText() function of the IE.au3 UDF in the help file.

This is the example from the help file:

; *******************************************************
; Example 1 - Open a browser with the basic example, read the body Text
; (the content with all HTML tags removed) and display it in a MsgBox
; *******************************************************

#include <IE.au3>

Local $oIE = _IE_Example("basic")
Local $sText = _IEBodyReadText($oIE)
MsgBox(0, "Body Text", $sText)

bye

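The help-file example uses a built-in demo page. For a real URL, the same function could be combined with a hidden IE instance, roughly as sketched below. The URL is a placeholder, and this still starts an Internet Explorer process in the background.

```autoit
#include <IE.au3>

; Sketch: load a page in a hidden IE instance and read its text.
Local $sUrl = "http://www.autoitscript.com" ; placeholder URL
; Parameters: URL, try-attach = 0, visible = 0.
; The default wait flag blocks until the page has finished loading.
Local $oIE = _IECreate($sUrl, 0, 0)
If @error Then Exit 1
Local $sText = _IEBodyReadText($oIE)
_IEQuit($oIE) ; close the hidden browser so it does not linger
MsgBox(0, "Body Text", $sText)
```

Calling _IEQuit() matters here, because a hidden window cannot be closed by hand afterwards.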

Hi guinness and PincoPanco

 

PincoPanco:

_IEBodyReadText() is usually great, but this time I don't have an IE window open.

(I am using InetRead() and working with the returned string.)

So I want to do this in code, without opening an IE window.

Is it possible?

I looked at the source code of _IEBodyReadText(), and it contains:

$o_object.document.body.innerText

Can I somehow create a Document object without creating an IE window?

That way I could supply it with the HTML and then get the text out of it via .body.innerText

 

guinness:

I will search now for what you wrote.

(But I am still curious about my previous question, whether it is possible to work with a Document object without creating an IE window. If anyone knows, please tell me.)

 

Thank you


Hi somdcomputerguy

Yes, I know, but I am hoping to avoid creating an IE window at all

(even an invisible one).

 

If anyone knows whether I can instantiate a Document object somehow, that would be useful.

Maybe there is a way using ObjCreate()?

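One possible answer to the ObjCreate() question is the "htmlfile" COM object, which exposes an MSHTML document without any browser window. The sketch below assumes that object is registered on the machine (it normally is on Windows systems with Internet Explorer components installed); _HtmlToTextViaDom is a hypothetical helper name, not part of any UDF.

```autoit
; Sketch: parse an HTML string with the MSHTML "htmlfile" COM object,
; without opening a browser window.
; _HtmlToTextViaDom is a hypothetical helper name.
Func _HtmlToTextViaDom($sHtml)
    Local $oDoc = ObjCreate("htmlfile")
    If Not IsObj($oDoc) Then Return SetError(1, 0, "")
    $oDoc.open()
    $oDoc.write($sHtml) ; feed the HTML string into the document
    $oDoc.close()
    If Not IsObj($oDoc.body) Then Return SetError(2, 0, "")
    Return $oDoc.body.innerText
EndFunc

Local $sSample = "<html><body><p>Hello <b>world</b></p></body></html>"
ConsoleWrite(_HtmlToTextViaDom($sSample) & @CRLF)
```

In practice $sHtml would be the string obtained from InetRead(). Because the document is parsed by the same engine IE uses, the result should be close to what _IEBodyReadText() returns, but that has not been verified here for complex pages.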


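#include <Array.au3>

; Download the page; flag 4 decodes the bytes as UTF-8 (assumes the page is UTF-8)
$text = BinaryToString(InetRead("http://www.autoitscript.com/forum/topic/155354-how-to-convert-a-webpagehtml-to-text"), 4)

; Keep a copy of the raw HTML for comparison
$file = FileOpen(@ScriptDir & '\Name.htm', 2 + 128) ; 2 = overwrite, 128 = UTF-8
FileWrite($file, $text)
FileClose($file)

; Remove the head and script blocks together with their contents
$text = StringRegExpReplace($text, '(?si)<head\b[^>]*>.*?</head>', '')
$text = StringRegExpReplace($text, '(?si)<script[^>]*?>.*?</script>', '')
; Collapse whitespace between adjacent tags
$text = StringRegExpReplace($text, '>\s+<', '> <')
; $text = StringRegExpReplace($text, '[\v]', '')
; Turn <br> and <p> (any case, with or without attributes) into line breaks
$text = StringRegExpReplace($text, '(?i)<(br|p)\b[^>]*>', @CRLF)
; Strip all remaining tags
$text = StringRegExpReplace($text, '<[/!]*?[^<>]*?>', '')
; $text = StringRegExpReplace($text, '<[\/\!]*?[^<>]*?>', @CRLF)

; Decode common named entities (&amp; is handled last, see below)
$text = StringReplace($text, '&quot;', '"')
$text = StringReplace($text, '&lt;', '<')
$text = StringReplace($text, '&gt;', '>')
$text = StringReplace($text, '&nbsp;', ' ')
; These are first mapped to numeric entities, which the loop below converts
$text = StringReplace($text, '&iexcl;', '&#161;')
$text = StringReplace($text, '&cent;', '&#162;')
$text = StringReplace($text, '&pound;', '&#163;')
$text = StringReplace($text, '&copy;', '&#169;')

; &#444; -> character
$a = StringRegExp($text, '&#(\d+);', 3)
If Not @error Then
    ; $log &= UBound($a) & '   &#(\d+);' & @CRLF
    $a = _ArrayUnique($a) ; $a[0] now holds the number of unique codes
    For $i = 1 To $a[0]
        $a[$i] = Number($a[$i])
    Next
    _ArraySort($a, 1, 1)
    ; _ArrayDisplay($a) ; uncomment to inspect the entity codes that were found
    For $i = 1 To $a[0]
        $text = StringReplace($text, '&#' & $a[$i] & ';', ChrW($a[$i]))
    Next
EndIf
; &amp; is decoded last so that text such as &amp;lt; is not decoded twice
$text = StringReplace($text, '&amp;', '&')
; Collapse runs of blank lines and leading whitespace into a single line break
$text = StringRegExpReplace($text, '([\r\n])[\s]+', @CRLF)
MsgBox(0, ';)', $text)

; Save the extracted text
$file = FileOpen(@ScriptDir & '\Name.txt', 2 + 128) ; 2 = overwrite, 128 = UTF-8
FileWrite($file, $text)
FileClose($file)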


Also, instead of parsing the HTML page yourself, you could use the browser's own .document.body.innerText property:

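$oIE = ObjCreate("Shell.Explorer.2")
If Not IsObj($oIE) Then Exit 1 ; the embedded browser control could not be created

; Create a simple invisible dummy GUI needed to host the object
$hGUI = GUICreate("Dummy browser", 0, 0, 10, 10)
$GUIActiveX = GUICtrlCreateObj($oIE, 0, 0, 10, 10)

GUISetState(@SW_HIDE) ; hide dummy browser

$oIE.navigate("http://www.autoitscript.com") ; <-- your url here
; Wait until the control reports that it has finished loading
Do
    Sleep(100)
Until Not $oIE.Busy

ConsoleWrite($oIE.document.body.innerText)
MsgBox(0, "", $oIE.document.body.innerText)
GUIDelete($hGUI) ; release the dummy GUI and the embedded control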

edit:

added MsgBox(0, "", $oIE.document.body.innerText) in listing
