HTML to TEXT


Hi everybody, I'm looking for a way to cleanly convert HTML to text.

I found a few examples here () and tried both scripts, but:

1. The first script uses the StringRegExpReplace function, which gives me a fatal error when I use it on big websites.

2. The second script uses the _IECreate function, which works too slowly, and I don't want to create any new IE processes.

Here is my script, which sometimes gives me a fatal error:

#include <INet.au3>
#include <Constants.au3>
#Include <String.au3>
#include <Array.au3>
#Include <Misc.au3>
#include <file.au3>
#include <IE.au3>

$DATA = _INetGetSource("any web site")
checkcode()


Func checkcode()
local $x,$y,$lnx,$Content
;if StringLen($DATA)<90000 Then
$Content = $DATA
;MsgBox(0,"XXX",$LINE&"    "&StringLen($DATA))

$Content = StringStripCr($Content)
; (?s) lets "." match line feeds, replacing the (.|\n)+? group, which can exhaust the regex engine's stack on large pages.
; (?i) makes the tag names case-insensitive.
$Content = StringRegExpReplace($Content, '(?is)<head\b.+?</head>','')
$Content = StringRegExpReplace($Content, '(?is)<script.+?/script>','')
$Content = StringRegExpReplace($Content, '(?s)<!--.+?-->','')
$Content = StringRegExpReplace($Content, '(?s)<.+?>','')
; Remove URLs up to the next space.
$Content = StringRegExpReplace($Content, '(?s)http://.+? ','')
$Content = StringRegExpReplace($Content, '(?s)ftp://.+? ','')
$Content = StringRegExpReplace($Content, '(?s)https://.+? ','')
$Content = StringRegExpReplace($Content, '(?s)www\..+? ','')

$Content = StringReplace($Content, '<','')
$Content = StringReplace($Content, '>','')
$Content = StringReplace($Content, '&lt;','<')
$Content = StringReplace($Content, '&gt;','>')
$Content = StringReplace($Content, '&nbsp;',' ')
$Content = StringReplace($Content, '&copy;','©')
$Content = StringReplace($Content, '&ldquo;','"')
$Content = StringReplace($Content, '&raquo;','»')
$Content = StringReplace($Content, '&laquo;','«')
$Content = StringReplace($Content, '&rdquo;','"')
$Content = StringReplace($Content, '&quot;','"')
$Content = StringReplace($Content, '&amp;','&')
$Content = StringReplace($Content, '&#149;','•')
$Content = StringReplace($Content, '&bull;','•')
$Content = StringReplace($Content, '&#8249;','')
$Content = StringReplace($Content, '&#8250;','')
$Content = StringReplace($Content, "&#8217;","'")
$Content = StringReplace($Content, "&#39;","'")

$Content = StringReplace($Content, '^[',' [')
$Content = StringReplace($Content, ']^',' ]')
$Content = StringReplace($Content, ' , ',', ')
$Content = StringReplace($Content, ' : ',': ')
$Content = StringReplace($Content, ' . ','. ')
$Content = StringReplace($Content, ' ? ','? ')
$Content = StringReplace($Content, ' ! ','! ')
$Content = StringReplace($Content, ' ; ','; ')

$Content = StringStripWS($Content, 4)
FileWriteLine("DUMP.txt",$Content)
Endfunc

Any ideas on how to do this HTML to text conversion?
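
For reference, here is a minimal sketch of a regex-only variant that avoids both _IECreate and the (.|\n)+? groups. The function name _HtmlToTextSketch is hypothetical, and the sketch assumes the fatal error comes from the regex engine running out of stack on those groups, which has not been confirmed here. It also decodes numeric entities generically by code point instead of listing them one by one; this is an illustration under those assumptions, not a complete HTML parser.

```autoit
; Hypothetical helper, not part of the original script.
Func _HtmlToTextSketch($sHtml)
    Local $sText = StringStripCR($sHtml)

    ; (?s) lets "." match line feeds; (?i) ignores tag case.
    $sText = StringRegExpReplace($sText, '(?is)<head\b.*?</head>', '')
    $sText = StringRegExpReplace($sText, '(?is)<script\b.*?</script>', '')
    $sText = StringRegExpReplace($sText, '(?is)<style\b.*?</style>', '')
    $sText = StringRegExpReplace($sText, '(?s)<!--.*?-->', '')

    ; Replace each remaining tag with a space so adjacent words stay separated.
    $sText = StringRegExpReplace($sText, '<[^>]*>', ' ')

    ; Decode numeric entities such as &#8217; by code point.
    ; Flag 3 returns every captured number; @error is set when nothing matches.
    ; Note: ChrW only covers code points up to 65535, and Windows-1252 style
    ; references such as &#149; will not map to the bullet character.
    Local $aCodes = StringRegExp($sText, '&#(\d+);', 3)
    If Not @error Then
        For $i = 0 To UBound($aCodes) - 1
            $sText = StringReplace($sText, '&#' & $aCodes[$i] & ';', ChrW(Number($aCodes[$i])))
        Next
    EndIf

    ; A few named entities; &amp; goes last so "&amp;lt;" is decoded only once.
    $sText = StringReplace($sText, '&nbsp;', ' ')
    $sText = StringReplace($sText, '&lt;', '<')
    $sText = StringReplace($sText, '&gt;', '>')
    $sText = StringReplace($sText, '&quot;', '"')
    $sText = StringReplace($sText, '&amp;', '&')

    ; Flag 7 = strip leading (1), trailing (2) and repeated (4) whitespace.
    Return StringStripWS($sText, 7)
EndFunc

; Usage: convert a small fragment and print the result to the console.
ConsoleWrite(_HtmlToTextSketch('<p>Tom &amp; Jerry</p>') & @CRLF)
```

The negated class in '<[^>]*>' stops at the first ">", so it needs neither backtracking across the whole page nor a dot that matches newlines.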

Edited by Enforcer

What about this >>

