Zohar

How to Convert a Webpage (HTML) to Text?

16 posts in this topic

#1 ·  Posted (edited)

Hi

 

Let's say I have a URL to some Webpage,

and I want to get only the Text from that Webpage.

 

I can get the Webpage's content via InetRead(),

but how do I then get only the Text, and not all the Tags, Attributes, etc?

 

Thank you

Edited by Zohar




Search for "HTML Strip"



#3 ·  Posted (edited)

Hi Zohar

Take a look in the help file at the _IEBodyReadText() function of the IE.au3 UDF.

This is the example from the help file:

; *******************************************************
; Example 1 - Open a browser with the basic example, read the body Text
; (the content with all HTML tags removed) and display it in a MsgBox
; *******************************************************

#include <IE.au3>

Local $oIE = _IE_Example("basic")
Local $sText = _IEBodyReadText($oIE)
MsgBox(0, "Body Text", $sText)

bye

Edited by PincoPanco

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....


#4 ·  Posted (edited)

Hi guinness and PincoPanco

 

PincoPanco:

_IEBodyReadText() is usually great, but this time I don't have an IE Window open..

(I am using InetRead(), and working with the returned string..)

So I want to do this in code, without opening an IE Window..

Is it possible?

I viewed _IEBodyReadText()'s source code, and in it, it has:

$o_object.document.body.innerText

Can I somehow create a Document object, without creating an IE Window?

That way I can supply it with the HTML, and then get the Text out of it, via .body.innerText

 

guinness:

I will search now for what you wrote..

(but still I am curious regarding my previous question about the possibility to work with a Document object, without creating an IE Window.. so If anyone knows, please tell..)

 

Thank you

Edited by Zohar


An alternative would be to use the _IECreate function with its $f_visible parameter set to 0, to make the window hidden.


- Bruce /*somdcomputerguy */  If you change the way you look at things, the things you look at change.
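The hidden-window suggestion above could look roughly like this. It is a minimal sketch, assuming the IE.au3 UDF shipped with AutoIt and a machine where Internet Explorer is still available; the URL is only a placeholder, and the third argument of _IECreate() is the visibility flag mentioned in the post.

```autoit
#include <IE.au3>

; Create a hidden IE instance: 2nd argument 0 = do not attach to an existing window,
; 3rd argument 0 = invisible. _IECreate() waits for the page to load by default.
Local $oIE = _IECreate("http://www.autoitscript.com", 0, 0)
If @error Then Exit 1

; Read the body text (tags removed), then close the hidden browser again
Local $sText = _IEBodyReadText($oIE)
_IEQuit($oIE)

ConsoleWrite($sText & @CRLF)
```

Calling _IEQuit() matters here: because the window is hidden, a forgotten instance would otherwise keep running unnoticed.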


#6 ·  Posted (edited)

Hi somdcomputerguy

Yes I know, but I am hoping to avoid creating an IE Window at all..

(even an invisible one)

 

If anyone knows whether I can instantiate a Document object somehow, it'll be useful.

Maybe there is a way by using ObjCreate()?

Edited by Zohar
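One way the ObjCreate() idea is sometimes sketched is with the "htmlfile" COM object, which gives a document object without any browser window. This is an assumption-laden sketch, not something confirmed in this thread: it assumes the "htmlfile" ProgID (MSHTML) is registered on the machine, and _HtmlToTextSketch is a hypothetical helper name.

```autoit
; Hypothetical helper: feed an HTML string to an in-memory document and read its text
Func _HtmlToTextSketch($sHTML)
    Local $oDoc = ObjCreate("htmlfile") ; assumes MSHTML is available
    If Not IsObj($oDoc) Then Return SetError(1, 0, "")
    $oDoc.open()
    $oDoc.write($sHTML)
    $oDoc.close()
    Return $oDoc.body.innerText
EndFunc   ;==>_HtmlToTextSketch

; Usage: combine with InetRead(), as in the first post (URL is a placeholder)
Local $sHTML = BinaryToString(InetRead("http://www.autoitscript.com"))
ConsoleWrite(_HtmlToTextSketch($sHTML) & @CRLF)
```

The text comes from the same .body.innerText property that _IEBodyReadText() uses, so the result should be similar to the IE-based approaches, without a window being created.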



#8 ·  Posted (edited)

#include <Array.au3>

; Download the page and convert the binary data to a string
$text = BinaryToString(InetRead("http://www.autoitscript.com/forum/topic/155354-how-to-convert-a-webpagehtml-to-text"))

; Save the raw HTML for comparison
$file = FileOpen(@ScriptDir & '\Name.htm', 2)
FileWrite($file, $text)
FileClose($file)

; Remove the head, script and style blocks together with their content
$text = StringRegExpReplace($text, '(?si)<head\b[^>]*>.*?</head>', '')
$text = StringRegExpReplace($text, '(?si)<(script|style)\b[^>]*>.*?</\1>', '')
$text = StringRegExpReplace($text, '>\s+<', '> <')
; $text = StringRegExpReplace($text, '[\v]', '')
; Turn <br> and <p> (any case, with or without attributes) into line breaks
$text = StringRegExpReplace($text, '(?i)<(br|p)\b[^>]*>', @CRLF)
; Strip all remaining tags
$text = StringRegExpReplace($text, '<[/!]*?[^<>]*?>', '')
; $text = StringRegExpReplace($text, '<[\/\!]*?[^<>]*?>', @CRLF)

; Decode named entities (&amp; is decoded last, so "&amp;lt;" is not decoded twice)
$text = StringReplace($text, '&quot;', '"')
$text = StringReplace($text, '&lt;', '<')
$text = StringReplace($text, '&gt;', '>')
$text = StringReplace($text, '&nbsp;', ' ')
; These are mapped to numeric entities and decoded in the loop below
$text = StringReplace($text, '&iexcl;', '&#161;')
$text = StringReplace($text, '&cent;', '&#162;')
$text = StringReplace($text, '&pound;', '&#163;')
$text = StringReplace($text, '&copy;', '&#169;')

; &#444; -> character
$a = StringRegExp($text, '&#(\d+);', 3)
If Not @error Then
    ; $log &= UBound($a) & '   &#(\d+);' & @CRLF
    $a = _ArrayUnique($a) ; $a[0] holds the number of unique codes
    For $i = 1 To $a[0]
        $a[$i] = Number($a[$i])
    Next
    _ArraySort($a, 1, 1)
    ; _ArrayDisplay($a) ; uncomment to inspect the codes that were found
    For $i = 1 To $a[0]
        $text = StringReplace($text, '&#' & $a[$i] & ';', ChrW($a[$i]))
    Next
EndIf
$text = StringReplace($text, '&amp;', '&')

; Collapse line breaks followed by whitespace into a single line break
$text = StringRegExpReplace($text, '([\r\n])[\s]+', @CRLF)
MsgBox(0, ';)', $text)

; Save the extracted text
$file = FileOpen(@ScriptDir & '\Name.txt', 2)
FileWrite($file, $text)
FileClose($file)

Edited by AZJIO


#9 ·  Posted (edited)

Also, instead of parsing the HTML page yourself, you could use the browser's own .document.body.innerText property:

$oIE = ObjCreate("Shell.Explorer.2")

; Create a simple invisible dummy GUI needed to create the obj
GUICreate("Dummy browser", 0, 0, 10, 10)
$GUIActiveX = GUICtrlCreateObj($oIE, 0, 0, 10, 10)

GUISetState(@SW_HIDE); hide dummy browser

$oIE.navigate("http://www.autoitscript.com") ; <-- your url here
Do
    Sleep(100)
Until Not $oIE.Busy

ConsoleWrite($oIE.document.body.innerText)
MsgBox(0, "", $oIE.document.body.innerText)

edit:

added MsgBox(0, "", $oIE.document.body.innerText) in listing

Edited by PincoPanco


#10 ·  Posted (edited)

AZJIO:

Really nice - full implementation for HTML stripping, thank you.

 

guinness and PincoPanco:

Great - that's what I was hoping to find,

thank you :)

Edited by Zohar


#11 ·  Posted (edited)

Or I'm thinking:

$sInput = StringRegExpReplace($sInput, '(?i)(?s)<([A-Z][A-Z0-9 \r?\n]*)\b[^>]*>(.*?)</([A-Z][A-Z0-9]*)>', '\2')
$sOutput = StringRegExpReplace($sInput, '(?i)(?-s)<.*?>', '')
Edited by Jury


Hi Jury

Thank you

The problem is that removing the tags alone will not be enough, since some elements have an opening tag and a closing tag whose enclosed content is not page text and has to be removed as well,

like:

<SCRIPT>

....

....

</SCRIPT>
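A minimal sketch of that point, assuming a regex-only approach is acceptable for the pages in question: first drop script and style elements together with their content, then strip the remaining tags. The name _StripTagsSketch is hypothetical, and the patterns are illustrative rather than a complete HTML parser.

```autoit
; Hypothetical helper: remove non-text elements first, then the remaining tags
Func _StripTagsSketch($sHTML)
    ; (?s) lets . span line breaks, (?i) ignores case; \1 matches the same tag name again
    $sHTML = StringRegExpReplace($sHTML, '(?si)<(script|style)\b[^>]*>.*?</\1\s*>', '')
    ; Whatever tags are left can now be dropped without leaking script code into the text
    Return StringRegExpReplace($sHTML, '(?s)<[^>]+>', '')
EndFunc   ;==>_StripTagsSketch

ConsoleWrite(_StripTagsSketch('<p>Hello</p><SCRIPT>var x = 1;</SCRIPT><b>World</b>') & @CRLF) ; prints "HelloWorld"
```

The order matters: if the tags were stripped first, "var x = 1;" would be left behind as if it were page text.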


#13 ·  Posted (edited)

my idiot way

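Opt("WinTitleMatchMode", 2) ; match any substring of the window title
WinActivate("Google Chrome", "")
WinWaitActive("Google Chrome", "", 5)
Send("^a") ; Ctrl+A: select all
Send("^c") ; Ctrl+C: copy
WinActivate("Notepad", "")
WinWaitActive("Notepad", "", 5)
Send("^v") ; Ctrl+V: paste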
Edited by Alexxander


#15 ·  Posted (edited)

 

(quoting Alexxander's post #13, "my idiot way")

 

I agree with you ..... :P

Edited by PincoPanco
