Zohar Posted October 13, 2013 (edited)

Hi,

Let's say I have a URL to some webpage, and I want to get only the text from that webpage. I can get the webpage's content via InetRead(), but how do I then get only the text, and not all the tags, attributes, etc.?

Thank you

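For reference, the download step mentioned in the question might look roughly like this. It is only a sketch: the URL is a placeholder, and decoding the bytes as UTF-8 (flag 4) is an assumption about the page's encoding.

```autoit
; Placeholder URL, not taken from the thread.
Local $sUrl = "http://www.autoitscript.com"

; InetRead() returns binary data; option 1 forces a reload from the remote site.
Local $dData = InetRead($sUrl, 1)
If @error Then
    MsgBox(16, "InetRead", "Download failed for " & $sUrl)
    Exit 1
EndIf

; Convert the bytes to a string. Flag 4 = UTF-8 (assumed encoding).
Local $sHtml = BinaryToString($dData, 4)

; Show the first 200 characters of the raw HTML, tags included.
ConsoleWrite(StringLeft($sHtml, 200) & @CRLF)
```

At this point $sHtml still contains the full markup; removing the tags is the open question.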
guinness Posted October 13, 2013

Search for "HTML Strip"

Gianni Posted October 13, 2013 (edited)

Hi Zohar,

Take a look in the help file at the _IEBodyReadText() function of the IE.au3 UDF. This is the example from the help file:

; *******************************************************
; Example 1 - Open a browser with the basic example, read the body Text
; (the content with all HTML tags removed) and display it in a MsgBox
; *******************************************************
#include <IE.au3>

Local $oIE = _IE_Example("basic")
Local $sText = _IEBodyReadText($oIE)
MsgBox(0, "Body Text", $sText)

Bye

Zohar (Author) Posted October 13, 2013 (edited)

Hi guinness and PincoPanco,

PincoPanco: _IEBodyReadText() is usually great, but this time I don't have an IE window open. (I am using InetRead() and working with the returned string.) So I want to do this in code, without opening an IE window. Is it possible?

I viewed the source code of _IEBodyReadText(), and it contains:

$o_object.document.body.innerText

Can I somehow create a Document object without creating an IE window? That way I could supply it with the HTML and then get the text out of it via .body.innerText.

guinness: I will search now for what you wrote. (But I am still curious about my previous question, the possibility of working with a Document object without creating an IE window, so if anyone knows, please tell.)

Thank you

somdcomputerguy Posted October 13, 2013

An alternative would be to use the _IECreate() function with its $f_visible parameter set to 0, to make the window hidden.

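A minimal sketch of the hidden-window approach, assuming the IE.au3 UDF that ships with AutoIt. The URL is a placeholder, and an Internet Explorer process is still started in the background, so this hides the window rather than avoiding it.

```autoit
#include <IE.au3>

; Placeholder URL, not taken from the thread.
Local $sUrl = "http://www.autoitscript.com"

; _IECreate(url, try-attach, visible, wait):
; visible = 0 keeps the window hidden, wait = 1 blocks until the page has loaded.
Local $oIE = _IECreate($sUrl, 0, 0, 1)
If @error Then Exit 1

; Read the body text with all HTML tags removed.
Local $sText = _IEBodyReadText($oIE)

; Close the hidden browser; otherwise it keeps running in the background.
_IEQuit($oIE)

ConsoleWrite($sText & @CRLF)
```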
Zohar (Author) Posted October 13, 2013 (edited)

Hi somdcomputerguy,

Yes, I know, but I am hoping to avoid creating an IE window at all (even an invisible one). If anyone knows whether I can instantiate a Document object somehow, that would be useful. Maybe there is a way using ObjCreate()?

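One possibility along these lines, offered as an untested sketch and not as something proposed in the thread: Windows registers an MSHTML document object under the ProgID "htmlfile", which can be created with ObjCreate() and fed a string. The sample markup below is made up, and how this object handles embedded scripts or very large pages should be checked before relying on it.

```autoit
; Made-up sample markup standing in for the string returned by InetRead().
Local $sHtml = "<html><body><p>Hello <b>world</b></p></body></html>"

; "htmlfile" is the ProgID of the MSHTML document object
; (assumed to be registered, as on a standard Windows installation).
Local $oDoc = ObjCreate("htmlfile")
If Not IsObj($oDoc) Then
    MsgBox(16, "htmlfile", "Could not create the document object")
    Exit 1
EndIf

; Feed the markup to the document and finish the stream.
$oDoc.write($sHtml)
$oDoc.close()

; Same property that _IEBodyReadText() reads, but without a browser window.
ConsoleWrite($oDoc.body.innerText & @CRLF)
```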
guinness Posted October 13, 2013

http://www.autoitscript.com/wiki/Snippets_(_Internet_)#HTML_StripTags

AZJIO Posted October 13, 2013 (edited)

#include <Array.au3>

; Download the page and keep a copy of the raw HTML next to the script
$text = BinaryToString(InetRead("http://www.autoitscript.com/forum/topic/155354-how-to-convert-a-webpagehtml-to-text"))
$file = FileOpen(@ScriptDir & '\Name.htm', 2)
FileWrite($file, $text)
FileClose($file)

; Remove the head and script blocks together with their contents
$text = StringRegExpReplace($text, '(?si)<head>.*?</head>', '')
$text = StringRegExpReplace($text, '(?si)<script[^>]*?>.*?</script>', '')
$text = StringRegExpReplace($text, '>\s+<', '> <')
; $text = StringRegExpReplace($text, '[\v]', '')

; Turn <br> and <p> into line breaks, then remove all remaining tags
$text = StringRegExpReplace($text, '<(br|p)>', @CRLF)
$text = StringRegExpReplace($text, '<[/!]*?[^<>]*?>', '')
; $text = StringRegExpReplace($text, '<[\/\!]*?[^<>]*?>', @CRLF)

; Decode common named entities
$text = StringReplace($text, '&quot;', '"')
$text = StringReplace($text, '&lt;', '<')
$text = StringReplace($text, '&gt;', '>')
$text = StringReplace($text, '&nbsp;', ' ')
$text = StringReplace($text, '&iexcl;', '¡')
$text = StringReplace($text, '&cent;', '¢')
$text = StringReplace($text, '&pound;', '£')
$text = StringReplace($text, '&copy;', '©')

; &#444; -> character
$a = StringRegExp($text, '&#(\d+);', 3)
If Not @error Then
    ; $log &= UBound($a) & ' &#(\d+);' & @CRLF
    $a = _ArrayUnique($a)
    For $i = 1 To $a[0]
        $a[$i] = Number($a[$i])
    Next
    _ArraySort($a, 1, 1)
    _ArrayDisplay($a) ; debug: shows the numeric entities that were found
    For $i = 1 To $a[0]
        $text = StringReplace($text, '&#' & $a[$i] & ';', ChrW($a[$i]))
    Next
EndIf

; Decode &amp; last, so that text such as &amp;lt; is not decoded twice
$text = StringReplace($text, '&amp;', '&')

; Collapse blank lines and leading whitespace, then show and save the result
$text = StringRegExpReplace($text, '([\r\n])[\s]+', @CRLF)
MsgBox(0, ';)', $text)
$file = FileOpen(@ScriptDir & '\Name.txt', 2)
FileWrite($file, $text)
FileClose($file)

Gianni Posted October 13, 2013 (edited)

Also, instead of parsing the HTML page yourself, you could use the browser's internal method .document.body.innerText:

$oIE = ObjCreate("Shell.Explorer.2")
; Create a simple invisible dummy GUI needed to create the obj
GUICreate("Dummy browser", 0, 0, 10, 10)
$GUIActiveX = GUICtrlCreateObj($oIE, 0, 0, 10, 10)
GUISetState(@SW_HIDE) ; hide dummy browser
$oIE.navigate("http://www.autoitscript.com") ; <-- your url here
Do
    Sleep(100)
Until Not $oIE.Busy
ConsoleWrite($oIE.document.body.innerText)
MsgBox(0, "", $oIE.document.body.innerText)

Edit: added MsgBox(0, "", $oIE.document.body.innerText) to the listing.

Zohar (Author) Posted October 13, 2013 (edited)

AZJIO: Really nice, a full implementation of HTML stripping. Thank you.

guinness and PincoPanco: Great, that's what I was hoping to find. Thank you.

Jury Posted October 13, 2013 (edited)

Or I'm thinking:

$sInput = StringRegExpReplace($sInput, '(?i)(?s)<([A-Z][A-Z0-9 \r?\n]*)\b[^>]*>(.*?)</([A-Z][A-Z0-9]*)>', '\2')
$sOutput = StringRegExpReplace($sInput, '(?i)(?-s)<.*?>', '')

Zohar (Author) Posted October 14, 2013

Hi Jury,

Thank you. The problem is that removing tags will not be enough, since some elements have an opening tag and a closing tag with content in between that must also be removed, like:

<SCRIPT>
....
....
</SCRIPT>

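The point about element content can be sketched as follows: remove whole script and style elements first, and only then remove the remaining tags. The function name _StripHtmlSketch is hypothetical, and regular expressions are a rough approximation that will not cope with every malformed page.

```autoit
; Hypothetical helper, not from the thread: strips tags from an HTML string.
Func _StripHtmlSketch($sHtml)
    ; 1. Remove <script> and <style> elements together with their contents.
    $sHtml = StringRegExpReplace($sHtml, '(?si)<(script|style)\b[^>]*>.*?</\1\s*>', '')
    ; 2. Remove HTML comments.
    $sHtml = StringRegExpReplace($sHtml, '(?s)<!--.*?-->', '')
    ; 3. Replace every remaining tag with a space so that words do not run together.
    $sHtml = StringRegExpReplace($sHtml, '(?s)<[^>]*>', ' ')
    ; 4. Collapse runs of whitespace and trim both ends (flag 3 = leading + trailing).
    $sHtml = StringRegExpReplace($sHtml, '\s+', ' ')
    Return StringStripWS($sHtml, 3)
EndFunc   ;==>_StripHtmlSketch

; Made-up sample: the script body contains something that looks like a tag.
Local $sSample = '<p>Hello</p><SCRIPT>var s = "<b>";</SCRIPT><p>World</p>'
ConsoleWrite(_StripHtmlSketch($sSample) & @CRLF) ; prints: Hello World
```

Doing step 1 before step 3 matters: if only the tags were removed, the script source itself (var s = ...) would be left behind in the text.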
Alexxander Posted October 14, 2013 (edited)

My idiot way:

WinActivate("Google Chrome", "")
Send("^a") ; Ctrl+A, select all
Send("^c") ; Ctrl+C, copy
WinActivate("NOTEPAD", "")
Send("^v") ; Ctrl+V, paste

Zohar (Author) Posted October 14, 2013

Gianni Posted October 14, 2013 (edited)

Quoting Alexxander:

my idiot way
WinActivate("Google Chrome", "")
Send("^a")
Send("^c")
WinActivate("NOTEPAD", "")
Send("^v")

I agree with you .....

Zohar (Author) Posted October 15, 2013

OK. BTW, does no one have an idea for my other topic?