
Help Stripping html from text



This is what came out of an AIM instant message. The only text I need from it is "dfg".

I tried StringSplit, but it just won't work right. Please help.

<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR>
<BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML>
<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR>
<BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML

&Send
&Warn
Bloc&k
Send &Instant Message...
Get AIM E&xpressions
Play Games
Start Live Video
&Talk
&Chat
Check out ConsultingJoe.com

I wasn't able to get this working completely, and it only handles one line, but maybe it will help.

#include <Array.au3>

$file = '<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)-->'
$split = StringSplit($file, ">")
For $l = 1 To $split[0]
    ; Pieces that start with "<" are pure tags, so skip them.
    If StringInStr($split[$l], "<") = 1 Then ContinueLoop
    ; Otherwise keep only the text in front of the next "<".
    $pos = StringInStr($split[$l], "<")
    If $pos Then $split[$l] = StringLeft($split[$l], $pos - 1)
    ; The piece after the final ">" is empty, so don't show it.
    If $split[$l] <> "" Then MsgBox(0, "", $split[$l])
Next
_ArrayDisplay($split, "")
Edited by ACalcutt

Andrew Calcutt

Http://www.Vistumbler.net

Http://www.TechIdiots.net

It's not an error, it's an undocumented feature


  • Moderators

Rather crude, but this works:

#include <file.au3>
Local $aArray = ''
Local $sFilePath = @DesktopDir & '\test.txt'
Local $eXclude[3]; add to the array, the particular items you don't want to show on the return
$eXclude[1] = 'JoE DA HoE  6900' 
$eXclude[2] = ':'
_FileReadToArray($sFilePath, $aArray)
GetNonHtml($aArray, $eXclude)


Func GetNonHtml(ByRef $aArray, $eXclude = '')
    Local $nArray = ''
    For $i = 1 To UBound($aArray) - 1
        $StringBetween = StripHtml($aArray[$i], '>', '<', $eXclude)
        If Not IsArray($StringBetween) Then ContinueLoop ; StripHtml returns 0 when a line has no text
        For $x = 1 To UBound($StringBetween) - 1
            $nArray = $nArray & $StringBetween[$x] & Chr(01)
        Next
    Next

    Local $spSp = StringSplit(StringTrimRight($nArray, 1), Chr(01))
    Local $nfOpen = '' 
    $nfOpen = FileOpen(StringTrimRight($sFilePath, 4) & '_stripped.txt', 9)
    For $x = 1 To UBound($spSp) - 1
        FileWriteLine($nfOpen, $spSp[$x])
    Next

    FileClose($nfOpen)
EndFunc
Func StripHtml($LineToRead, $start, $end, $eXclude = '1')
    Local $sPlit = StringSplit($LineToRead, $start)
    Local $nArray = ''
    For $i = 1 To UBound($sPlit) - 1
        If StringMid($sPlit[$i], 1, 1) <> '' And StringMid($sPlit[$i], 1, 1) <> $end Then
            For $x = 1 To StringLen($sPlit[$i])
                Local $SMid = StringMid($sPlit[$i], $x, 1)
                If $SMid = $end Then
                    Local $fOund = StringLeft($sPlit[$i], $x - 1)
                    Local $ErrorCheck = 0
                    For $y = 1 To UBound($eXclude) - 1
                        If $fOund = $eXclude[$y] Then 
                            $ErrorCheck = 1
                            ExitLoop
                        EndIf
                    Next
                    If $ErrorCheck = 0 Then
                        $nArray = $nArray & StringStripWS($fOund, 7) & Chr(01)
                        ExitLoop
                    EndIf
                EndIf
            Next
        EndIf
    Next
    If $nArray <> '' Then
        Return StringSplit(StringTrimRight($nArray, 1), Chr(01))
    Else
        Return 0
    EndIf
EndFunc

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


My attempt. It searches for the unique ">:<" substring and gets its position, then counts 32 characters ahead to where the text starts. It then converts the string into an array of characters and searches for the first "<" following the beginning of the text. It takes all the elements of the array between those two points and builds a new array from them. Finally, it converts that array back into a string, which gives the text!

The script loops through the string until no more text is found.

May not be the best method, but it's 2:14 AM and I whipped it up rather quickly.

#include <array.au3>

$String = '<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML><HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML&Send&WarnBloc&kSend &Instant Message...Get AIM E&xpressionsPlay GamesStart Live Video&Talk&Chat'
$Var = 1

While 1
    
    $TextStart = StringInStr($String, ">:<", 0, $Var) + 32
    If $TextStart = 32 Then ExitLoop
    
    $rString = StringSplit($String, "")
    $TextEnd = _ArraySearch($rString, "<", $TextStart) - 1

    $New = _ArrayCreate("Success")
    For $T = $TextStart To $TextEnd
        _ArrayAdd($New, $rString[$T])
    Next

    _ArrayDelete($New, 0)
    $New = _ArrayToString($New, "|")
    $New = StringReplace($New, "|", "")
    
    MsgBox(0, "Occurrence "&$Var&" of text!", "Text #"&$Var&": "&$New)

    $Var += 1

WEnd

MsgBox(0, "Done", "Done: Found "&$Var-1&" occurrences of text.")
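
The same marker-based idea can be sketched without the fixed 32-character offset or the character array, by letting StringInStr locate each boundary. This is only an illustration: _GetMessages is a made-up name, and it assumes the message text always sits in the first <FONT ...> tag after a ">:<" marker, as it does in the sample. It also relies on the optional start parameter of StringInStr, which current AutoIt versions provide.

```autoit
; Hypothetical helper (not a standard UDF): returns every message found
; after a ">:<" marker, one per line.
Func _GetMessages($sHtml)
    Local $sOut = "", $iPos = 1
    Local $iMark, $iOpen, $iStart, $iEnd
    While 1
        ; Find the next ">:<" marker at or after the current position.
        $iMark = StringInStr($sHtml, ">:<", 0, 1, $iPos)
        If $iMark = 0 Then ExitLoop
        ; Assumption: the message is inside the next opening <FONT ...> tag.
        $iOpen = StringInStr($sHtml, "<FONT", 0, 1, $iMark)
        If $iOpen = 0 Then ExitLoop
        ; The text starts right after that tag's closing ">".
        $iStart = StringInStr($sHtml, ">", 0, 1, $iOpen)
        If $iStart = 0 Then ExitLoop
        ; ...and ends just before the next "<".
        $iEnd = StringInStr($sHtml, "<", 0, 1, $iStart)
        If $iEnd = 0 Then ExitLoop
        ; Flag 3 strips leading and trailing whitespace.
        $sOut &= StringStripWS(StringMid($sHtml, $iStart + 1, $iEnd - $iStart - 1), 3) & @CRLF
        $iPos = $iEnd
    WEnd
    Return $sOut
EndFunc   ;==>_GetMessages
```

Calling MsgBox(0, "Messages", _GetMessages($String)) with the $String from the script above should list each message on its own line. Because the tag boundaries are searched for rather than counted, a different colour value or an extra attribute in the FONT tag would not throw the offset off.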

Hi @zerocool60544,

This one is a brute-force method, and it can be used as a general-purpose HTML code stripper too. It isn't elegant, but it is quick and writes the results to a timestamped log. It uses an INI file to take care of the current boilerplate items, and if any new ones turn up you can easily add them. There is a ZIP file attached.

Gene

Edit: I used the $char var in debugging and forgot to take it out.

Global $sWorkVar, $sWorkVar2, $iCodeFlag, $var, $i, $char 



$char = 0

$sFilePath = FileOpenDialog ( "Select HTML file to strip.", "My Computer", "HTML (*.html)|HTM (*.htm)" , 1 )
$begin = TimerInit()
$sFileContent = FileRead($sFilePath)
$sWorkVar = $sFileContent
While 1
If StringLeft($sWorkVar, 1) = "<" Then
  $iCodeFlag = 1
EndIf 
If StringLeft($sWorkVar, 1) = ">" Then
  $iCodeFlag = 0
  $sWorkVar = StringTrimLeft($sWorkVar, 1)
EndIf
While $iCodeFlag = 1
  If StringLeft($sWorkVar, 1) = ">" Then ExitLoop
  $sWorkVar = StringTrimLeft($sWorkVar, 1)
WEnd
While $iCodeFlag = 0
  If StringLeft($sWorkVar, 1) = "<" Then ExitLoop
  If Not StringInStr($sWorkVar, ">") Then ExitLoop
  $sWorkVar2 = $sWorkVar2 & StringLeft($sWorkVar, 1)
  $sWorkVar = StringTrimLeft($sWorkVar, 1)
WEnd
If Not StringInStr($sWorkVar, ">") Then
  $sWorkVar2 = $sWorkVar2 & $sWorkVar
  ExitLoop
EndIf
WEnd

$var = IniReadSection(@ScriptDir & "\Strip HTML.ini", "BoilerPlate")
If @error Then 
    MsgBox(4096, "", "Error occurred, probably no INI file.")
Else
    For $i = 1 To $var[0][0]
  $sWorkVar2 = StringReplace($sWorkVar2,$var[$i][1],"")
;StringReplace ( "string", "searchstring", "replacestring")
    Next
While StringInStr($sWorkVar2,@CRLF) Or StringInStr($sWorkVar2,@CR) Or StringInStr($sWorkVar2,@LF)
  $sWorkVar2 = StringReplace($sWorkVar2,@CRLF,"")
  $sWorkVar2 = StringReplace($sWorkVar2,@CR,"")
  $sWorkVar2 = StringReplace($sWorkVar2,@LF,"")
WEnd 
EndIf

FileWrite ( @ScriptDir & "\Stripped.TXT", @YEAR & "/" & @MON & "/" & @MDAY & "  " & @HOUR & ":" & @MIN & ":" & @SEC & $sWorkVar2 & @CRLF )
$dif = Round ( (TimerDiff($begin)/1000) , 4 )
MsgBox(0,"Time To Process The File",$dif & " seconds...", 5)
MsgBox(0,"Result","The stripped data is " & $sWorkVar2, 5 )
Exit
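
The attached INI file is not reproduced in this thread, so the following is only a guess at what a matching "Strip HTML.ini" could look like. The script reads the [BoilerPlate] section with IniReadSection and removes every value it finds there, so the key names (numbered here purely for illustration) do not matter; the values are the AIM button captions from the original question.

```ini
; Assumed layout for "Strip HTML.ini" in @ScriptDir; the real file was in the ZIP.
[BoilerPlate]
1=&Send
2=&Warn
3=Bloc&k
4=Send &Instant Message...
5=Get AIM E&xpressions
6=Play Games
7=Start Live Video
8=&Talk
9=&Chat
```

Adding a new boilerplate item would then just mean adding another key=value line to the section.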

Edited by Gene

Thanks for the response. Gene. Yes, I know the punctuation is not right...


While I was making dinner I had a better idea. This one is a little more elegant and about a factor of 10 faster than the one I posted above. I considered adding code to eliminate duplicates of the text of interest, but decided against it.

Gene

A ZIP of the improved .Au3 and the INI file is attached.

Global $sWorkVar, $sCodeStr, $var, $i 

$sFilePath = FileOpenDialog("Select HTML file to strip.", "My Computer", "HTML (*.html)|HTM (*.htm)", 1)
$begin = TimerInit()
$sFileContent = FileRead($sFilePath)
$sWorkVar = $sFileContent


$var = IniReadSection(@ScriptDir & "\Strip HTML.ini", "BoilerPlate")
If @error Then
 MsgBox(4096, "", "Error occurred, probably no INI file.")
Else
 For $i = 1 To $var[0][0]
  $sWorkVar = StringReplace($sWorkVar, $var[$i][1], "")
 Next
 While StringInStr($sWorkVar, ">:<")
  $sWorkVar = StringReplace($sWorkVar, ">:<", "><")
 WEnd
 
 While StringInStr($sWorkVar, @CRLF) Or StringInStr($sWorkVar, @CR) Or StringInStr($sWorkVar, @LF)
  $sWorkVar = StringReplace($sWorkVar, @CRLF, "")
  $sWorkVar = StringReplace($sWorkVar, @CR, "")
  $sWorkVar = StringReplace($sWorkVar, @LF, "")
 WEnd
EndIf

While 1
 $iBegin = StringInStr($sWorkVar, "<")
 $iEnd = StringInStr($sWorkVar, ">") + 1
 If $iBegin = 0 And $iEnd = 1 Then
  ExitLoop
 EndIf
 $sCodeStr = StringMid($sWorkVar, $iBegin, $iEnd - $iBegin)
 While StringInStr($sWorkVar, $sCodeStr)
  $sWorkVar = StringReplace($sWorkVar, $sCodeStr, "")
 WEnd
 $sCodeStr = ""
WEnd

FileWrite(@ScriptDir & "\Stripped.TXT", @YEAR & "/" & @MON & "/" & @MDAY & "  " & @HOUR & ":" & @MIN & ":" & @SEC & $sWorkVar & @CRLF)
$dif = Round((TimerDiff($begin) / 1000), 4)
MsgBox(0, "Time To Process The File", $dif & " seconds...", 5)
MsgBox(0, "Result", "The stripped data is " & $sWorkVar, 5)
Exit


Here's a pretty general function that should work using regular expressions:

$test = '<HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML><HTML><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#0000ff" LANG="0">JoE DA HoE  6900<!-- (10:12:38 PM)--></B></FONT><FONT COLOR="#0000ff" BACK="#ffffff">:</FONT><FONT COLOR="#000000"> dfg</FONT><BR><BODY BGCOLOR="#ffffff"><B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT></BODY></HTML'
$id = "JoE DA HoE  6900"       ;I assume this is your screenname, id, etc.

$texts = ""    ;You could easily transform this into an array

$check = StripText($test, $id)
While $check <> ""
    If $check <> $id Then        ;If you got rid of this check, it would display all ids as entries
        $texts &= $check & @CRLF       ;If you wanted to store each entry in an array you would do it here.
    EndIf
    $test = StringTrimLeft($test, StringInStr($test, $check) + StringLen($check) - 1)
    $check = StripText($test, $id)
WEnd
MsgBox(0, "Results", $texts)      ;Will display all results

Func StripText($test, $id = "")
    $results = StringRegExp($test, "(>)([a-zA-Z 0-9]+)(<)", 1)     ;Note: will not capture blank entries
    If IsArray($results) Then
        If $results[1] <> "" Then
            Return $results[1]
        EndIf
    EndIf
EndFunc  ;==>StripText
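
If the goal is simply to drop every tag rather than pick out one entry at a time, StringRegExpReplace can do it in a single pass. The sketch below is an illustration, not part of the function above: _StripTags and _MessageText are made-up names, and _MessageText assumes the screen name itself contains no ":" so that the first colon left after stripping separates the name from the message.

```autoit
; Sketch only: _StripTags and _MessageText are hypothetical names, not standard UDFs.
Func _StripTags($sHtml)
    ; "<[^>]*>" matches "<", any run of characters other than ">", then ">".
    ; This also removes the <!-- (time) --> comments in the sample.
    Return StringRegExpReplace($sHtml, "<[^>]*>", "")
EndFunc   ;==>_StripTags

Func _MessageText($sLine)
    Local $sPlain = _StripTags($sLine)
    ; Assumption: the screen name has no ":", so the first ":" ends the name.
    Local $iColon = StringInStr($sPlain, ":")
    If $iColon = 0 Then Return StringStripWS($sPlain, 3)
    ; Drop everything up to and including the colon, then trim whitespace.
    Return StringStripWS(StringTrimLeft($sPlain, $iColon), 3)
EndFunc   ;==>_MessageText

Local $sLine = '<B><FONT COLOR="#ff0000">JoE DA HoE  6900<!-- (10:12:39 PM)--></B>:</FONT><FONT COLOR="#000000"> dfg</FONT>'
ConsoleWrite(_MessageText($sLine) & @CRLF)
```

For the sample line, stripping the tags leaves "JoE DA HoE  6900: dfg", and cutting at the colon leaves "dfg". This works on one message line at a time; a pattern this simple would be fooled by a ">" inside an attribute value, which the AIM output shown here does not contain.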

My UDFs: Coroutine Multithreading UDF Library, StringRegExp Guide, Random Encryptor, ArrayToDisplayString. "The Brain, expecting disaster, fails to find the obvious solution." -- neogia


Thank you, everyone. I liked them all, but especially the last two. Sorry, I thought no one had replied because I was expecting an e-mail notification. Thanks again. I might try to make it into a UDF for my program; it would use AIM for remote control purposes, but mostly for text to speech.
