
Help stripping HTML from text



This is what came out of an AIM instant message. The only text I need from it is "dfg".

I tried StringSplit but it just won't work right. Please help.

<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR>
<BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML>
<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR>
<BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML

&Send
&Warn
Bloc&k
Send &Instant Message...
Get AIM E&xpressions
Play Games
Start Live Video
&Talk
&Chat

AutoIT + Finger Print Reader/Scanner = COOL STUFF -> Check Out Topic!
Check out ConsultingJoe.com
My Scripts: Web Protocol Managing - Simple WiFi Scanner - AutoTunes - Remote PC Control V2 - Audio Spectrascope - Pie Chart UDF - At&t's TTS - Custom Progress Bar - Windows Media Player Embed


I wasn't able to get this working completely, and it only handles one line, but maybe it will help:

#include <Array.au3>

$file = '<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)-->'
$split = StringSplit($file, ">")
For $l = 1 To $split[0]
    ; Pieces that start with "<" are pure tags, so skip them
    If StringInStr($split[$l], "<") = 1 Then ContinueLoop
    ; Keep only the text in front of the next "<"
    $pos = StringInStr($split[$l], "<")
    If $pos Then $split[$l] = StringLeft($split[$l], $pos - 1)
    If $split[$l] <> "" Then MsgBox(0, "", $split[$l])
Next
_ArrayDisplay($split, "")
Edited by ACalcutt

Andrew Calcutt

Http://www.Vistumbler.net

Http://www.TechIdiots.net

Its not an error, its a undocumented feature


  • Moderators

Rather crude, but this works:

#include <file.au3>
Local $aArray = ''
Local $sFilePath = @DesktopDir & '\test.txt'
Local $eXclude[3]; add to the array, the particular items you don't want to show on the return
$eXclude[1] = 'JoE DA HoE  6900' 
$eXclude[2] = ':'
_FileReadToArray($sFilePath, $aArray)
GetNonHtml($aArray, $eXclude)


Func GetNonHtml(ByRef $aArray, $eXclude = '')
    Local $nArray = ''
    For $i = 1 To UBound($aArray) - 1
        $StringBetween = StripHtml($aArray[$i], '>', '<', $eXclude)
        If Not IsArray($StringBetween) Then ContinueLoop ; StripHtml returns 0 when the line has no text
        For $x = 1 To UBound($StringBetween) - 1
            $nArray = $nArray & $StringBetween[$x] & Chr(01)
        Next
    Next

    Local $spSp = StringSplit(StringTrimRight($nArray, 1), Chr(01))
    Local $nfOpen = '' 
    $nfOpen = FileOpen(StringTrimRight($sFilePath, 4) & '_stripped.txt', 9)
    For $x = 1 To UBound($spSp) - 1
        FileWriteLine($nfOpen, $spSp[$x])
    Next

    FileClose($nfOpen)
EndFunc
Func StripHtml($LineToRead, $start, $end, $eXclude = '1')
    Local $sPlit = StringSplit($LineToRead, $start)
    Local $nArray = ''
    For $i = 1 To UBound($sPlit) - 1
        If StringMid($sPlit[$i], 1, 1) <> '' And StringMid($sPlit[$i], 1, 1) <> $end Then
            For $x = 1 To StringLen($sPlit[$i])
                Local $SMid = StringMid($sPlit[$i], $x, 1)
                If $SMid = $end Then
                    Local $fOund = StringLeft($sPlit[$i], $x - 1)
                    Local $ErrorCheck = 0
                    For $y = 1 To UBound($eXclude) - 1
                        If $fOund = $eXclude[$y] Then 
                            $ErrorCheck = 1
                            ExitLoop
                        EndIf
                    Next
                    If $ErrorCheck = 0 Then
                        $nArray = $nArray & StringStripWS($fOund, 7) & Chr(01)
                        ExitLoop
                    EndIf
                EndIf
            Next
        EndIf
    Next
    If $nArray <> '' Then
        Return StringSplit(StringTrimRight($nArray, 1), Chr(01))
    Else
        Return 0
    EndIf
EndFunc

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


My attempt. It searches for the unique ">:<" substring and gets its location. Then it counts 32 characters ahead, to where the text starts. It then converts the string into an array of characters and searches for the first "<" following the beginning of the text. It takes all the elements of the array between those two points and builds a single new array from them. Finally it converts that array back into a string, which gives the text!

The script loops through the string until no more text is found.

It may not be the best method, but it's 2:14 AM and I whipped it up rather quickly.

#include <array.au3>

$String = '<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML><HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML&Send&WarnBloc&kSend &Instant Message...Get AIM E&xpressionsPlay GamesStart Live Video&Talk&Chat'
$Var = 1

While 1
    
    $TextStart = StringInStr($String, ">:<", 0, $Var) + 32
    If $TextStart = 32 Then ExitLoop
    
    $rString = StringSplit($String, "")
    $TextEnd = _ArraySearch($rString, "<", $TextStart) - 1

    $New = _ArrayCreate("Success")
    For $T = $TextStart To $TextEnd
        _ArrayAdd($New, $rString[$T])
    Next

    _ArrayDelete($New, 0)
    $New = _ArrayToString($New, "|")
    $New = StringReplace($New, "|", "")
    
    MsgBox(0, "Occurrence "&$Var&" of text!", "Text #"&$Var&": "&$New)

    $Var += 1

WEnd

MsgBox(0, "Done", "Done: Found "&$Var-1&" occurrences of text.")
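
The fixed offset of 32 characters only holds while the tags after the colon stay exactly the same length. A minimal sketch of an alternative, assuming the message body is always wrapped in <FONT COLOR="#000000"> ... </FONT> as in the sample (that is a guess from this one capture, not something AIM guarantees), using _StringBetween from String.au3. $sSample is just a shortened copy of the text from the first post:

```autoit
#include <String.au3>

; Sketch only: $sSample is a shortened copy of the AIM text from the first post
Local $sSample = '<B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR>'

; Assumption: every message body sits between these two markers
Local $aText = _StringBetween($sSample, '<FONT COLOR="#000000">', '</FONT>')
If @error Then
    MsgBox(0, "Result", "No message text found.")
Else
    ; _StringBetween returns a 0-based array with one element per match
    For $i = 0 To UBound($aText) - 1
        ; Flag 3 strips leading and trailing whitespace
        MsgBox(0, "Text #" & ($i + 1), StringStripWS($aText[$i], 3))
    Next
EndIf
```

If the sender changes the text colour, the start marker no longer matches, so this is only as robust as that assumption.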

Hi @zerocool60544,

This one is a brute-force method, and it can be used as a general-purpose HTML code stripper too. It isn't elegant, but it is quick and writes the results to a timestamped log. It uses an INI file to take care of the current boilerplate items, and if any new ones turn up you can easily add them. There is a ZIP file attached.

Gene

Edit: I used the $char var in debugging and forgot to take it out.

Global $sWorkVar, $sWorkVar2, $iCodeFlag, $var, $i, $char 



$char = 0

$sFilePath = FileOpenDialog ( "Select HTML file to strip.", "My Computer", "HTML (*.html)|HTM (*.htm)" , 1 )
$begin = TimerInit()
$sFileContent = FileRead($sFilePath)
$sWorkVar = $sFileContent
While 1
    If StringLeft($sWorkVar, 1) = "<" Then
        $iCodeFlag = 1
    EndIf
    If StringLeft($sWorkVar, 1) = ">" Then
        $iCodeFlag = 0
        $sWorkVar = StringTrimLeft($sWorkVar, 1)
    EndIf
    ; Inside a tag: discard characters up to the closing ">"
    While $iCodeFlag = 1
        If StringLeft($sWorkVar, 1) = ">" Then ExitLoop
        $sWorkVar = StringTrimLeft($sWorkVar, 1)
    WEnd
    ; Outside a tag: copy characters until the next "<"
    While $iCodeFlag = 0
        If StringLeft($sWorkVar, 1) = "<" Then ExitLoop
        If Not StringInStr($sWorkVar, ">") Then ExitLoop
        $sWorkVar2 = $sWorkVar2 & StringLeft($sWorkVar, 1)
        $sWorkVar = StringTrimLeft($sWorkVar, 1)
    WEnd
    ; No more closing brackets: keep whatever is left and stop
    If Not StringInStr($sWorkVar, ">") Then
        $sWorkVar2 = $sWorkVar2 & $sWorkVar
        ExitLoop
    EndIf
WEnd

$var = IniReadSection(@ScriptDir & "\Strip HTML.ini", "BoilerPlate")
If @error Then
    MsgBox(4096, "", "Error occurred, probably no INI file.")
Else
    ; Remove each boilerplate string listed in the INI file
    For $i = 1 To $var[0][0]
        $sWorkVar2 = StringReplace($sWorkVar2, $var[$i][1], "")
    Next
    ; Remove all line breaks
    While StringInStr($sWorkVar2, @CRLF) Or StringInStr($sWorkVar2, @CR) Or StringInStr($sWorkVar2, @LF)
        $sWorkVar2 = StringReplace($sWorkVar2, @CRLF, "")
        $sWorkVar2 = StringReplace($sWorkVar2, @CR, "")
        $sWorkVar2 = StringReplace($sWorkVar2, @LF, "")
    WEnd
EndIf

FileWrite ( @ScriptDir & "\Stripped.TXT", @YEAR & "/" & @MON & "/" & @MDAY & "  " & @HOUR & ":" & @MIN & ":" & @SEC & $sWorkVar2 & @CRLF )
$dif = Round ( (TimerDiff($begin)/1000) , 4 )
MsgBox(0,"Time To Process The File",$dif & " seconds...", 5)
MsgBox(0,"Result","The stripped data is " & $sWorkVar2, 5 )
Exit

Edited by Gene

Thanks for the response.
Gene
Yes, I know the punctuation is not right...


While I was making dinner I had a better idea. This one is a little more elegant and about a factor of 10 faster than the one I posted above. I considered adding code to eliminate duplicates of the text of interest, but decided against it.

Gene

A ZIP of the improved .Au3 and the INI file is attached.

Global $sWorkVar, $sCodeStr, $var, $i 

$sFilePath = FileOpenDialog("Select HTML file to strip.", "My Computer", "HTML (*.html)|HTM (*.htm)", 1)
$begin = TimerInit()
$sFileContent = FileRead($sFilePath)
$sWorkVar = $sFileContent


$var = IniReadSection(@ScriptDir & "\Strip HTML.ini", "BoilerPlate")
If @error Then
 MsgBox(4096, "", "Error occurred, probably no INI file.")
Else
 For $i = 1 To $var[0][0]
  $sWorkVar = StringReplace($sWorkVar, $var[$i][1], "")
 Next
 While StringInStr($sWorkVar, ">:<")
  $sWorkVar = StringReplace($sWorkVar, ">:<", "><")
 WEnd
 
 While StringInStr($sWorkVar, @CRLF) Or StringInStr($sWorkVar, @CR) Or StringInStr($sWorkVar, @LF)
  $sWorkVar = StringReplace($sWorkVar, @CRLF, "")
  $sWorkVar = StringReplace($sWorkVar, @CR, "")
  $sWorkVar = StringReplace($sWorkVar, @LF, "")
 WEnd
EndIf

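While 1
 $iBegin = StringInStr($sWorkVar, "<")
 If $iBegin = 0 Then ExitLoop
 ; Look for the closing ">" only after the "<" that was just found
 $iEnd = StringInStr($sWorkVar, ">", 0, 1, $iBegin)
 ; An unterminated tag (such as the trailing "</HTML" in the sample) would
 ; otherwise never be removed and the loop would not end
 If $iEnd = 0 Then ExitLoop
 $sCodeStr = StringMid($sWorkVar, $iBegin, $iEnd - $iBegin + 1)
 ; StringReplace removes every occurrence of this tag in one call
 $sWorkVar = StringReplace($sWorkVar, $sCodeStr, "")
WEnd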

FileWrite(@ScriptDir & "\Stripped.TXT", @YEAR & "/" & @MON & "/" & @MDAY & "  " & @HOUR & ":" & @MIN & ":" & @SEC & $sWorkVar & @CRLF)
$dif = Round((TimerDiff($begin) / 1000), 4)
MsgBox(0, "Time To Process The File", $dif & " seconds...", 5)
MsgBox(0, "Result", "The stripped data is " & $sWorkVar, 5)
Exit
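
For comparison, the tag-removal loop can probably be collapsed into a single StringRegExpReplace call. This is only a sketch, not Gene's method: $sHtml is a shortened line in the same shape as the AIM text, and the pattern assumes tags never contain a ">" inside an attribute value.

```autoit
; Sketch only: $sHtml is a shortened line in the same shape as the AIM text
Local $sHtml = '<B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT>'

; The comment alternative is listed first so a whole <!-- ... --> block is
; removed in one piece; the second alternative removes ordinary tags.
; (?s) lets "." match line breaks inside a comment.
Local $sPlain = StringRegExpReplace($sHtml, "(?s)<!--.*?-->|<[^>]*>", "")

; What remains is the screen name, the colon and the message text
MsgBox(0, "Result", $sPlain)
```

Like the loop above, this leaves the screen name and the colon in place, so the boilerplate list from the INI file would still be needed to get down to just the message. An unterminated tag such as the trailing "</HTML" is left untouched because it has no closing ">".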



Here's a pretty general function that should work using regular expressions:

$test = '<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML><HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML'
$id = "JoE DA HoE  6900"       ;I assume this is your screenname, id, etc.

$texts = ""    ;You could easily transform this into an array

$check = StripText($test, $id)
While $check <> ""
    If $check <> $id Then        ;If you got rid of this check, it would display all ids as entries
        $texts &= $check & @CRLF       ;If you wanted to store each entry in an array you would do it here.
    EndIf
    $test = StringTrimLeft($test, StringInStr($test, $check) + StringLen($check) - 1)
    $check = StripText($test, $id)
WEnd
MsgBox(0, "Results", $texts)      ;Will display all results

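Func StripText($test, $id = "")
    ; Flag 1 returns the groups of the first match: [0] = ">", [1] = the text, [2] = "<"
    Local $results = StringRegExp($test, "(>)([a-zA-Z 0-9]+)(<)", 1)     ;Note: will not capture blank entries
    If IsArray($results) Then
        If $results[1] <> "" Then
            Return $results[1]
        EndIf
    EndIf
    Return ""    ;No match: return an empty string so the While loop above stops
EndFunc  ;==>StripText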
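
As the comments in the script suggest, the results could go into an array instead of one string. A rough sketch of that idea, assuming StringRegExp's flag 3 (return all global matches) is available in your AutoIt version; $sSample and $sId are stand-ins for the $test and $id values above:

```autoit
#include <Array.au3>

; Sketch only: $sSample and $sId stand in for the $test and $id values above
Local $sSample = '<B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT>'
Local $sId = "JoE DA HoE  6900"

; Flag 3 returns every capture of the group as a 0-based array
Local $aAll = StringRegExp($sSample, ">([a-zA-Z 0-9]+)<", 3)
If Not @error Then
    Local $aTexts[UBound($aAll)]
    Local $iCount = 0
    For $i = 0 To UBound($aAll) - 1
        ; Skip the screen name, keep everything else
        If $aAll[$i] = $sId Then ContinueLoop
        $aTexts[$iCount] = $aAll[$i]
        $iCount += 1
    Next
    If $iCount > 0 Then
        ; Shrink the array to the number of entries actually kept
        ReDim $aTexts[$iCount]
        _ArrayDisplay($aTexts, "Message texts")
    EndIf
EndIf
```

This does the whole search in one call, so the StringTrimLeft bookkeeping in the While loop is not needed. The character class is the same as in StripText, so messages containing punctuation would still be missed.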

My UDFs: Coroutine Multithreading UDF Library, StringRegExp Guide, Random Encryptor, ArrayToDisplayString
"The Brain, expecting disaster, fails to find the obvious solution." -- neogia


Thank you, everyone. I liked them all, but especially the last two. Sorry, I thought no one had replied because I was expecting an e-mail notification. Again, thanks. I might try to make it into a UDF for my program; it would use AIM for remote control purposes, but mostly for text to speech.

