natedog102

HTML Pretty Print UDF


Hi everyone. I want to pretty-print the output of _INetGetSource so that it is indented and readable.

Example google.com source output: 

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script>(function(){window.google={kEI:'DJtTWvCOI6WGjwSE9JrICg',kEXPI:'18167,1354277,1354916,1355218,1355675,1355793,1356171,1356806,1357219,1357326,3700304,3700519,3700521,4003510,4029815,4031109,4043492,4045841,4048347,4081038,4081164,4095909,4096834,4097153,4097195,4097922,4097929,4098733,4098740,4098752,4102237,4102827,4103475,4103845,4106084,4107914,4109316,4109490,4112770,4113217,4115697,4116349,4116724,4116731,4116926,4116927,4116935,4117980,4118798,4119032,4119034,4119036,4120285,4120286,4120660,4121175,4121518,4122511,4123830,4123850,4124091,4124850,4125837,4126202,4126754,4126869,4127262,4127418,4127473,4127744,4127863,4128586,4128622,4129001,4129520,4129556,4129633,4130362,4130783,4131247,4131834,4132956,4133114,4133509,4135025,4135088,4135249,4135934,4136073,4136092,4136137,4137597,4137646,4140792,4140849,4141281,4141707,4141915,4142071,4142328,4142420,4142443,4142503,4142678,4142729,4142829,4142834,4142847,4143278,4143527,4143902,4144442,4144550,4144704,4145074,4145075,4145082,4145088,4145461,4145485,4145622,4145688,4145713,4145836,4146146,4146183,4146874,4147032,4147043,4147096,4147443,4147800,4147951,4148257,4148304,4148436,4148498,4148573,6512220,10200083,10202524,10202562,15807763,19000288,19000423,19000427,19001999,19002287,19002288,19002366,19002548,19002880,19003321,19003323,19003325,19003326,19003328,19003329,19003330,19003407,19003408,19003409,19004309,19004516,19004517,19004518,19004519,19004520,19004521,19004531,19004656,19004668,19004670,19004692,4131
7155',authuser:0,kscs:'c9c918f0_DJtTWvCOI6WGjwSE9JrICg',u:'c9c918f0',kGL:'US'};google.kHL='en';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||google.kEI};google.getLEI=function(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b};google.https=function(){return"https:"==window.location.protocol};google.ml=function(){return null};google.wl=function(a,b){try{google.ml(Error(a),!1,b)}catch(d){}};google.time=function(){return(new Date).getTime()};google.log=function(a,b,d,c,g){if(a=google.logUrl(a,b,d,c,g)){b=new Image;var e=google.lc,f=google.li;e[f]=b;b.onerror=b.onload=b.onabort=function(){delete e[f]};google.vel&&google.vel.lu&&google.vel.lu(a);b.src=a;google.li=f+1}};google.logUrl=function(a,b,d,c,g){var e="",f=google.ls||"";d||-1!=b.search("&ei=")||(e="&ei="+google.getEI(c),-1==b.search("&lei=")&&(c=google.getLEI(c))&&(e+="&lei="+c));c="";!d&&google.cshid&&-1==b.search("&cshid=")&&(c="&cshid="+google.cshid);a=d||"/"+(g||"gen_204")+"?atyp=i&ct="+a+"&cad="+b+e+f+"&zx="+google.time()+c;/^http:/i.test(a)&&google.https()&&(google.ml(Error("a"),!1,{src:a,glmm:1}),a="");return a};}).call(this);(function(){google.y={};google.x=function(a,b){if(a)var c=a.id;else{do c=Math.random();while(google.y[c])}google.y[c]=[a,b];return!1};google.lm=[];google.plm=function(a){google.lm.push.apply(google.lm,a)};google.lq=[];google.load=function(a,b,c){google.lq.push([[a],b,c])};google.loadAll=function(a,b){google.lq.push([a,b])};}).call(this);google.f={};var a=window.location,b=a.href.indexOf("#");if(0<=b){var c=a.href.substring(b+1);/(^|&)q=/.test(c)&&-1==c.indexOf("#")&&a.replace("/search?"+c.replace(/(^|&)fp=[^&]*/g,"")+"&cad=h")};</script><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid 
#c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}

But I want the output to look like this:

<!doctype html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en">

<head>
    <meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description">
    <meta content="noodp" name="robots">
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
    <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image">
    <title>Google</title>
    <script>
        (function() {
            window.google = {
                kEI: 'DJtsdfgWGjwSE9JrICg',
                kEXPI: '18167,1354277,1354916,1355218,1355675,1355793,1356171,1356806,1357219,1357326,37sdfg0304,3700519,3700521,4003510,4029815,4031109,4043492,4045841,4048347,4081038,4081164,4095909,4096834,4097153,4097195,4097922,4097929,4098733,4098740,4098752,4102237,4102827,4103475,4103845,4106084,4107914,4109316,4109490,4112770,4113217,4115697,4116349,4116724,4116731,4116926,4116927,4116935,4117980,4118798,4119032,4119034,4119036,4120285,4120286,4120660,4121175,4121518,4122511,4123830,4123850,4124091,4124850,4125837,4126202,4126754,4126869,4127262,4127418,4127473,4127744,4127863,4128586,4128622,4129001,4129520,4129556,4129633,4130362,4130783,4131247,4131834,413sdfg56,4133114,4133509,4135025,4135088,4135249,4135934,4136073,4136092,4136137,4137597,4137646,4140792,4140849,4141281,4141707,4141915,4142071,4142328,4142420,4142443,4142503,4142678,4142729,4142829,4142834,4142847,4143278,4143527,4143902,4144442,4144550,4144704,4145074,4145075,4145082,4145088,4145461,4145485,4145622,4145688,4145713,4145836,4146146,4146183,4146874,4147032,4147043,4147096,4147443,4147800,4147951,4148257,4148304,4148436,4148498,4148573,6512220,10200083,10202524,10202562,15807763,19000288,190sdfg23,19000427,19001999,19002287,19002288,19002366,19002548,19002880,19003321,19003323,19003325,19003326,19003328,19003329,19003330,19003407,19003408,19003409,19004309,19004516,19004517,19004518,19004519,19004520,19004521,19004531,19004656,19004668,19004670,19004692,41317155',
                authuser: 0,
                kscs: 'c9c918f0_DJtTWvCOI6WGjwSE9JrICg',
                u: 'c9c918f0',
                kGL: 'US'
            };
            google.kHL = 'en';
        })();
        
.......

I checked the forums and did not see any UDFs that do this. I found the Chilkat UDF, but it only supports JSON. Any help would be greatly appreciated.


Hi @natedog102.

So I took a stab at it for 30 minutes and got it to work with the Google HTML. (I was doing something related anyway, and it let me address a problem in my HTMLParser.au3 library; I can implement the fix once I find a way to make it less messy.)

Run prettyhtml.au3 from the same folder as the other two files.

The HTML to parse currently has to be in a file named prettyhtml.txt.

The output is written to the same folder as prettyhtml_output.txt.

Hope you can use it.

By the way, there may still be some strange cases that give you trouble. If you find any, be sure to let me know; I would appreciate it.

prettyhtml.au3

HTMLParser.au3

TokenList.au3

Edit: credit to @Zedna for the StringRepeat Function
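
The attached scripts are not reproduced in this thread, so the following is only a minimal sketch of the workflow described above, written in Python rather than AutoIt. It is not genius257's implementation. It reads prettyhtml.txt, puts each tag and text run on its own line, indents by nesting depth, and writes prettyhtml_output.txt. The names `pretty_html` and `format_file` are hypothetical. The sketch assumes every start tag has a matching end tag and that `<` and `>` appear only as tag delimiters, so void elements such as `<meta>` and the contents of `<script>` or `<style>` are not handled correctly.

```python
import re
from pathlib import Path

INDENT = "    "  # four spaces per nesting level, as in the desired output


def pretty_html(source: str) -> str:
    """Put every tag and text run on its own line, indented by nesting depth."""
    # Split the input into tags ("<...>") and the text between them.
    tokens = re.findall(r"<[^>]+>|[^<]+", source)
    depth = 0
    lines = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue  # whitespace-only text between tags
        if token.startswith("</"):
            # End tag: step back out one level before printing it.
            depth = max(depth - 1, 0)
            lines.append(INDENT * depth + token)
        elif (token.startswith("<")
              and not token.startswith("<!")
              and not token.endswith("/>")):
            # Start tag: print it, then indent whatever follows.
            lines.append(INDENT * depth + token)
            depth += 1
        else:
            # Text, doctype, comment, or self-closing tag: depth unchanged.
            lines.append(INDENT * depth + token)
    return "\n".join(lines) + "\n"


def format_file(folder: Path) -> None:
    """Read prettyhtml.txt from folder and write prettyhtml_output.txt next to it."""
    text = (folder / "prettyhtml.txt").read_text(encoding="utf-8")
    (folder / "prettyhtml_output.txt").write_text(pretty_html(text), encoding="utf-8")
```

Calling `format_file(Path("."))` would process a prettyhtml.txt in the current folder.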

Edited by genius257


Thanks for the quick response! If the HTML is already partially formatted, it doubles the whitespace and line breaks. If the HTML contains JavaScript, the script sometimes doesn't appear in the formatted text file. The same happens with CSS.

Hope that helps. Let me know if you want me to post any examples.

2 minutes ago, natedog102 said:

If the HTML is already partially formatted, it doubles the whitespace and line breaks.

Hmm, I imagine that might be an easy fix with StringStripWS(..., 1+2), which strips leading and trailing whitespace.
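
To illustrate the idea, here is a rough sketch in Python rather than AutoIt. It assumes the fix is applied line by line before re-indenting: each line loses its leading and trailing whitespace (roughly what StringStripWS with flags 1+2 does) and blank lines are dropped, so existing layout cannot stack on top of the new indentation. The function name `strip_existing_layout` is hypothetical.

```python
def strip_existing_layout(source: str) -> str:
    """Remove leading/trailing whitespace from every line and drop blank lines."""
    # splitlines() handles both "\n" and "\r\n" line endings.
    stripped = (line.strip() for line in source.splitlines())
    return "\n".join(line for line in stripped if line)
```

One caveat: whitespace inside `<pre>` elements is significant, so a real formatter would need to leave those untouched.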

3 minutes ago, natedog102 said:

If the HTML contains JavaScript, the script sometimes doesn't appear in the formatted text file. The same happens with CSS.

Hmm, I suspect those are cases I have a tough time testing for myself ^^ Examples would be greatly appreciated :)
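
The actual cause is not identified in this thread. One plausible cause, offered only as an assumption, is that a tag-based splitter mistakes a `<` inside a script or stylesheet for the start of a tag. In HTML, `<script>` and `<style>` are raw text elements, so their contents should be kept verbatim. The Python sketch below (hypothetical names, not the code from HTMLParser.au3) extracts those elements as single tokens before splitting the rest into tags and text.

```python
import re

# A whole <script>...</script> or <style>...</style> element, matched lazily
# so that each element ends at its own closing tag.
RAW_TEXT = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                      re.IGNORECASE | re.DOTALL)
# Ordinary markup: either a tag or the text between tags.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")


def tokenize(source: str) -> list:
    """Split HTML into tokens, keeping script/style elements as single tokens."""
    tokens = []
    pos = 0
    for match in RAW_TEXT.finditer(source):
        # Markup before the raw text element is split normally.
        tokens.extend(TAG_OR_TEXT.findall(source[pos:match.start()]))
        # The raw text element itself is kept verbatim.
        tokens.append(match.group(0))
        pos = match.end()
    # Whatever follows the last raw text element.
    tokens.extend(TAG_OR_TEXT.findall(source[pos:]))
    return tokens
```

A formatter could then hand the script or style tokens to a separate JavaScript or CSS formatter, or simply emit them unchanged.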

4 minutes ago, natedog102 said:

Hope that helps.

Oh yeah, it helps :) The more bugs I know of, the more I can try to improve it ^^


Hey @natedog102.

So here's the most I'll do on the script for now: prettyhtml.au3

What's still missing, as far as I know without your special-case examples, is handling of start tags without end tags. There are just too many cases for me to cover without having some use for the end product myself ^^ See https://html.spec.whatwg.org/multipage/syntax.html#syntax-tag-omission

The "An ... element's ... tag may be omitted if ..." rules are numerous, and each one is very specific :)
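
The simplest subset of "start tags without end tags" is the void elements, which never have an end tag at all. A rough Python sketch of how an indenter might skip them follows. The function name `indent_tags` is hypothetical and the input is assumed to be already split into tags and text. The context-dependent omission rules (for example an omitted `</p>` or `</li>`) are not covered here.

```python
import re

# Void elements as listed in the HTML standard: they never have an end tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


def indent_tags(tokens, indent="    "):
    """Indent tokens by nesting depth without nesting under void elements."""
    depth = 0
    lines = []
    for token in tokens:
        # Extract the element name; text, doctypes and comments yield "".
        name_match = re.match(r"</?\s*([A-Za-z][A-Za-z0-9-]*)", token)
        name = name_match.group(1).lower() if name_match else ""
        if token.startswith("</"):
            # End tag: close one level, then print.
            depth = max(depth - 1, 0)
            lines.append(indent * depth + token)
        elif name and name not in VOID_ELEMENTS and not token.endswith("/>"):
            # Ordinary start tag: print, then open one level.
            lines.append(indent * depth + token)
            depth += 1
        else:
            # Void element, self-closing tag, doctype, comment, or text.
            lines.append(indent * depth + token)
    return lines
```

With this check, a run of `<meta>` tags inside `<head>` stays at one indentation level instead of drifting further right with each tag.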

Anyway, I hope the updated script helps a little.

 
