Text from HTML



Yes, this is similar to the other topic about "Stripping HTML from text", but rather than getting a single string such as "dfg", I want to get rid of all the HTML and JavaScript and keep only the text.

I figured the easiest way would be to remove everything between the "<" and ">" characters, and I got that working.

But I have no idea how to get rid of the JavaScript parts.

Using Gene's Strip HTML script, I got it down to:

flawblure

function mpd(x,y) {
    document.df.dx.value=x;
    document.df.dy.value=y;
    document.df.submit();
}

function delmes(m) {
    document.message.action='/';
    document.message.DelMessage.value=m;
    document.message.submit();
}

function replym(m) {
    document.message.action='message.php';
    document.message.ReplyMes.value=m;
    document.message.submit();
}

function newmsg(p) {
    document.message.action='message.php';
    document.message.WriteTo.value=p;
    document.message.submit();
}

inventing (x8) (54 minutes)

function move(dir) {
    if(!dir) return;
    document.form1.Act2.value = dir;
    if(dir!='center')
        document.form1.Action.value = "move";
    document.form1.submit();
}

Display:

Terrain

Height

Roads

Brief

reload

build

war

knowledge

magic

messages

possessions

map

forums

skills

options

overview

contract(19)

(4)

(3)

(40)Chimpy

(4)

(3)Adrian

(3)

(110)

(120)

(120)

SpankyRedhillPagezeuschumbawumbaflawblure 80

(120)

(120)

(10)lordgarion

from jwrtolkien, 1h32m ago: Snork? reply delete

3h19m ago: You drift into fanciful reverie... delete

5h50m ago: You had a glimmer of insight, but couldn't shape it into an invention... delete

Feb 11th, 6:58: How unfortunate, their head is already empty. delete

(not shown: 815 messages)

19:25

(This is an online game, by the way.) But I have no idea how to get it further than that. I want to get rid of the JavaScript so that the end result looks something like:

flawblure

inventing (x8) (54 minutes)

Chimpy

Adrian

SpankyRedhillPagezeuschumbawumbaflawblure 80

lordgarion

from jwrtolkien, 1h32m ago: Snork? delete

3h19m ago: You drift into fanciful reverie... delete

5h50m ago: You had a glimmer of insight, but couldn't shape it into an invention... delete

Feb 11th, 6:58: How unfortunate, their head is already empty delete

(not shown: 815 messages)

19:25

The other problem is that removing just that exact text won't work: since it's a game, what needs to be removed and what needs to be kept varies with the situation. Is there a way to do this?


Well, to help with the JavaScript, have it destroy everything between <script language="javascript"> (or whatever the script tag is) and </script>.

AutoIt Console written in C#. Write au3 code right at the console :D
_FileWriteToLine: Write to a specific line in a file.
My UDF Libraries: MySQL UDF Library version 1.6, MySQL Database UDFs for AutoIt
I have stopped updating the MySQL thread above; all future updates will be on my SVN. The SVN location is: kan2.sytes.net/publicsvn/mysql
Note: This will still be available, but due to my new job and school hours, I am no longer developing this UDF.
My business: www.hirethebrain.com Hire The Brain HireTheBrain.com Computer Consulting, Design, Assembly and Repair
Oh no! I've committed Scriptocide!

How would I do that? The HTML strip script already removes the <script language="javascript"> and </script> tags themselves, without removing what's in between.


Well, I haven't seen the HTML strip script, so I don't know whether this is useful, but you could edit it to skip <script> tags while it loops through all the tags, and then afterwards have it delete everything between the <script> tags. Sorry, I know I have horrible grammar.

--hope this helps

~cdkid


Bah, grammar doesn't matter :o

I'd edit the HTML strip script, but I didn't write it, so I don't really know how it works. lol

This is it, though.

Global $sWorkVar, $sWorkVar2, $iCodeFlag, $var, $i, $char
$char = 0
$sFilePath = FileOpenDialog ( "Select HTML file to strip.", "My Computer", "HTML (*.html)|HTM (*.htm)" , 1 )
$begin = TimerInit()
$sFileContent = FileRead($sFilePath)
$sWorkVar = $sFileContent

; Walk the input one character at a time. $iCodeFlag = 1 means "inside a tag".
While 1
    If StringLeft($sWorkVar, 1) = "<" Then
        $iCodeFlag = 1
    EndIf
    If StringLeft($sWorkVar, 1) = ">" Then
        $iCodeFlag = 0
        $sWorkVar = StringTrimLeft($sWorkVar, 1)
    EndIf
    ; Inside a tag: discard characters up to the closing ">".
    While $iCodeFlag = 1
        If StringLeft($sWorkVar, 1) = ">" Then ExitLoop
        $sWorkVar = StringTrimLeft($sWorkVar, 1)
    WEnd
    ; Outside a tag: copy characters to $sWorkVar2 until the next "<".
    While $iCodeFlag = 0
        If StringLeft($sWorkVar, 1) = "<" Then ExitLoop
        If Not StringInStr($sWorkVar, ">") Then ExitLoop
        $sWorkVar2 = $sWorkVar2 & StringLeft($sWorkVar, 1)
        $sWorkVar = StringTrimLeft($sWorkVar, 1)
    WEnd
    ; No ">" left means no more tags: keep the remainder and stop.
    If Not StringInStr($sWorkVar, ">") Then
        $sWorkVar2 = $sWorkVar2 & $sWorkVar
        ExitLoop
    EndIf
WEnd

; Remove every boilerplate string listed in the [BoilerPlate] section of the INI file.
$var = IniReadSection(@ScriptDir & "\Strip HTML.ini", "BoilerPlate")
If @error Then
    MsgBox(4096, "", "Error occurred, probably no INI file.")
Else
    For $i = 1 To $var[0][0]
        $sWorkVar2 = StringReplace($sWorkVar2,$var[$i][1],"")
        ;StringReplace ( "string", "searchstring", "replacestring")
    Next
    ; Strip all line breaks from the result.
    While StringInStr($sWorkVar2,@CRLF) Or StringInStr($sWorkVar2,@CR) Or StringInStr($sWorkVar2,@LF)
        $sWorkVar2 = StringReplace($sWorkVar2,@CRLF,"")
        $sWorkVar2 = StringReplace($sWorkVar2,@CR,"")
        $sWorkVar2 = StringReplace($sWorkVar2,@LF,"")
    WEnd
EndIf

; Write a timestamp followed by the stripped text.
FileWrite ( @ScriptDir & "\Stripped.TXT", @YEAR & "/" & @MON & "/" & @MDAY & " " & @HOUR & ":" & @MIN & ":" & @SEC & $sWorkVar2 & @CRLF )
$dif = Round ( (TimerDiff($begin)/1000) , 4 )
MsgBox(0,"Time To Process The File",$dif & " seconds...", 5)
MsgBox(0,"Result","The stripped data is " & $sWorkVar2, 5 )
Exit


Alright, I'll see what I can do with this. Give me a few minutes to look it over.


Erg... having a major brain fart, this could take a bit longer.

Could you just add a "stringreplace(putallthejavascriptcodehere,'')" to your script?


Hmm... I've got an idea:

$file = FileOpenDialog('HTML FILE',@DESKTOPDIR,'HTML files (*.html)|htm files (*.htm)')
$file = FileRead($file)
$jsstart = StringInStr($file, "<script")
$jsstop = StringInStr($file, "</script")
StringReplace($file, $jsstart, '', $jsstop - $jsstart)

Put this before the HTML stripper and I think it should work. I haven't tested it.

--hope this helps

~cdkid

I ran it, got the dialog box twice, and nothing changed. Is that because it's not saving the result after it gets rid of the JavaScript?

Yes, that's why. This is just something that you can build on.


Ok, got it running using this:

$file = _IEBodyReadHTML($oGK)
$jsstart = StringInStr($file, "<script")
$jsstop = StringInStr($file, "</script")
$JSstripped = StringReplace($file, $jsstart, '', $jsstop - $jsstart)
FileWrite ( @ScriptDir & "\JSStripped.TXT", @YEAR & "/" & @MON & "/" & @MDAY & " " & @HOUR & ":" & @MIN & ":" & @SEC & $JSstripped & @CRLF )

The problem is that what it writes to JSStripped.txt is no different from the actual source of the page; there are still a lot of <script> and </SCRIPT> tags. Did I do anything to the code that would make it not work?
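
A likely reason the output is unchanged, going by the StringReplace documentation: its fourth parameter is an occurrence count, not a length. When the second parameter is a number, it is treated as a start position and the characters there are overwritten with the replacement string, so an empty replacement changes nothing. The sketch below removes a single script block by joining the text before the opening tag to the text after the closing tag. The sample string and the variable names are made up for illustration.

```autoit
; Hypothetical sample input, not taken from the game page.
Local $sHtml = 'before<script language="javascript">alert(1);</script>after'

Local $iStart = StringInStr($sHtml, "<script")
Local $iStop = StringInStr($sHtml, "</script>")

If $iStart And $iStop > $iStart Then
    ; Keep what precedes the opening tag and what follows the closing tag.
    ; 9 is the length of "</script>".
    $sHtml = StringLeft($sHtml, $iStart - 1) & StringMid($sHtml, $iStop + 9)
EndIf

ConsoleWrite($sHtml & @CRLF) ; prints "beforeafter"
```

This handles only the first script block; a page with several blocks needs the same step repeated in a loop.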


Add this just before your While loop.

Do
    $start_js = StringInStr($sWorkVar, '<script language=')
    $end_js = StringInStr($sWorkVar, '</script>')
    If $start_js And $end_js Then
        $sScriptline = StringMid($sWorkVar, $start_js, ($end_js - $start_js) + 9)
        If $sScriptline Then $sWorkVar = StringReplace($sWorkVar, $sScriptline, '')
    EndIf
Until Not $start_js Or Not $end_js

I don't have a While loop... Erm, that would completely screw it all up, wouldn't it? :/

EDIT: Oh, you're talking about the HTML strip one. Never mind.

EDIT2: It works!! Thanks!


You could consider letting IE and the DOM do the heavy lifting for you...

something like:

#include <IE.au3>
$oIE = _IECreate()
_IENavigate($oIE,"c:\yourfile")
$myText = $oIE.document.body.innerText

Dale

edit: typo
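
For reference, IE.au3 also provides _IEBodyReadText, which wraps the same idea in one call. The following is a minimal sketch rather than a tested solution; it assumes a current IE.au3, and "c:\yourfile" is the same placeholder path as above.

```autoit
#include <IE.au3>

; "c:\yourfile" is a placeholder; point it at a saved copy of the page.
; Second argument 0 = do not attach to an existing window, third argument 0 = keep the window hidden.
Local $oIE = _IECreate("c:\yourfile", 0, 0)

; Read the text of the document body, leaving the markup to IE.
Local $sText = _IEBodyReadText($oIE)
_IEQuit($oIE)

ConsoleWrite($sText & @CRLF)
```

The appeal of this route is that IE does the parsing, so nothing in the script has to know what a tag or a script block looks like.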


Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

Doesn't work needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble


Could you post the raw HTML file before, and what you would want it to look like after? Maybe a couple examples, and then I could work something up. Regular Expressions reign supreme in this area.
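
As a rough idea of what the regular-expression route could look like, here is a sketch that assumes a current AutoIt build with PCRE-based StringRegExpReplace. The sample string is invented, not the real page source, and the patterns are one possible choice rather than a definitive answer.

```autoit
; Hypothetical sample input; the real source would come from FileRead or _IEBodyReadHTML.
Local $sHtml = '<b>flawblure</b><SCRIPT language="javascript">function mpd(x,y) { }</SCRIPT> inventing'

; (?is) makes the match case-insensitive and lets "." span line breaks;
; ".*?" is non-greedy, so each match stops at the first closing tag.
Local $sText = StringRegExpReplace($sHtml, "(?is)<script\b.*?</script\s*>", "")

; With the script blocks gone, drop every remaining tag.
$sText = StringRegExpReplace($sText, "<[^>]*>", "")

ConsoleWrite($sText & @CRLF) ; prints "flawblure inventing"
```

Removing the script blocks first matters: once the tags themselves are stripped, nothing marks where the JavaScript starts and ends.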

My UDFs: Coroutine Multithreading UDF Library, StringRegExp Guide, Random Encryptor, ArrayToDisplayString
"The Brain, expecting disaster, fails to find the obvious solution." -- neogia
