Creating a text to HTML converter


Hi,

My first post here on the forum, so first of all hello! :D

I'm trying to build a script that takes text and wraps HTML around it. I get lots of Word documents handed to me to build into web pages, and most of the content is simply headings, paragraphs of text and unordered lists. I figured that if a script could do this tedious task for me, or at least most of it so I can check the result afterwards, it would save time and effort.

This is my first script. I've given it a go, but it doesn't appear to work. The idea is that I copy the text from MS Word to my clipboard myself, then run the program, which converts symbols such as £ to &pound; and wraps <p></p> tags around paragraphs and <ul></ul> around unordered lists. Once the script has done this, it puts the result back on the clipboard so I can just Ctrl + V into my text editor. I'm not sure how to do the unordered list part, though. Below is my attempt; any help would be appreciated.

#cs
gets clipboard data
#ce
$clipboard = ClipGet()

#cs
if the clipboard holds the path of an existing file, read that file
otherwise use the clipboard text itself
#ce
If FileExists($clipboard) Then
$text = FileRead($clipboard)
Else
$text = $clipboard
EndIf

#cs
cuts two line breaks down to one
add paragraph tags at start and end of paragraph
replace symbols with html codes x 5
#ce
$html = StringRegExpReplace($text, "(\r\n){2,}", "\1")
$html = StringRegExpReplace($text, "(.+)(\r\n|\z)", "<p>\1</p>\2")
$html = StringReplace($text, "&", "&amp;")
$html = StringReplace($text, "£", "&pound;")
$html = StringReplace($text, "€", "&euro;")
$html = StringReplace($text, "“", "&#8220;")
$html = StringReplace($text, "”", "&#8221;")

#cs
writes text to clipboard
displays a message
#ce
ClipPut ($html)
MsgBox(0, "Text to HTML converter", "Text converted to HTML and copied to clipboard")
Exit

Not sure why, but this line - $html = StringReplace($text, "£", "&pound;") - is the culprit. AutoIt does not like that symbol, and it is breaking the conversion process.
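
If the symbol really is the problem, one possible cause is how the script file is saved rather than AutoIt itself, since a literal £ in the source depends on the file's encoding. A small sketch that sidesteps the question by building the characters from their Unicode code points with ChrW(). The sample $text is made up for illustration and is not the clipboard content:

```autoit
; Hypothetical sample text; in the real script this would come from ClipGet().
Local $text = "Price: " & ChrW(163) & "5 or " & ChrW(8364) & "6"

; ChrW(163) is the pound sign (U+00A3), ChrW(8364) is the euro sign (U+20AC).
Local $html = StringReplace($text, ChrW(163), "&pound;")
$html = StringReplace($html, ChrW(8364), "&euro;")

ConsoleWrite($html & @CRLF)
```

The same idea would apply to the curly quotes (ChrW(8220) and ChrW(8221)) if they turn out to be affected too.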


Doesn't Word already have the ability to save a document as an HTML document?


Thanks for the help guys. I've tried amending this part, but my code doesn't actually appear to be doing anything. Am I passing through the correct variables? Have any of you actually run the code and got it working?

BrewManNH - yes, Word does have the option to save the document as an HTML document, but it puts a ridiculous amount of inline styles in the code. I just want basic, simple HTML.


Figured it out, appeared to be this line which was stopping the code from running.....

$html = StringRegExpReplace($text, "(\r\n){2,}", "\1")

Anyone know how to get bulleted lists wrapped in HTML? I made a start but I'm not sure how to do this. I figured I could match the bullet point dot, but then didn't know how to write the pattern for "whatever text comes after the bullet point". Any ideas? The line below will kind of make the <li> tags, but I don't know how to recognise the start and end of the unordered list so that I can add the <ul> tags.

$html = StringRegExpReplace($text, "(• +)( )(\r\n|\z)", "<li>\1</li>\2")

Any help would really be appreciated.
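
For the <ul> part, one option is to skip the regex and walk the text line by line, opening a <ul> when a run of bullet lines starts and closing it when the run ends. A rough sketch under some assumptions: _WrapLists is a made-up helper name, bullets arrive as the "•" character (ChrW(8226)) at the very start of a line, and non-bullet lines are passed through untouched.

```autoit
; Hypothetical helper: wraps each bullet line in <li> and each run of them in <ul>.
Func _WrapLists($sText)
    Local $sBullet = ChrW(8226) ; assumed bullet character pasted from Word
    ; Drop the CRs so the text can be split on LF alone.
    Local $aLines = StringSplit(StringStripCR($sText), @LF)
    Local $sOut = "", $sLine = "", $bInList = False

    For $i = 1 To $aLines[0]
        $sLine = $aLines[$i]
        If StringLeft($sLine, 1) == $sBullet Then
            ; First bullet of a run opens the list.
            If Not $bInList Then $sOut &= "<ul>" & @CRLF
            $bInList = True
            ; Drop the bullet, trim surrounding whitespace, wrap in <li>.
            $sOut &= "<li>" & StringStripWS(StringTrimLeft($sLine, 1), 3) & "</li>" & @CRLF
        Else
            ; First non-bullet line after a run closes the list.
            If $bInList Then $sOut &= "</ul>" & @CRLF
            $bInList = False
            $sOut &= $sLine & @CRLF
        EndIf
    Next

    ; Close a list that runs to the end of the text.
    If $bInList Then $sOut &= "</ul>" & @CRLF
    Return $sOut
EndFunc
```

It would be called as $html = _WrapLists($text). Note that every output line, including the last, ends with a line break, which may or may not be wanted.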


try this:

$Text = "• test"
$html = StringRegExpReplace($Text, "(?:•[\s]{0,})(.*)(?:\r\n|\z)", "<li>\1</li>")

oh, edited to include the '+':

$html = StringRegExpReplace($Text, "(?:[•|+][\s]{0,})(.*)(?:\r\n|\z)", "<li>\1</li>")

last one, i think:

$html = StringRegExpReplace($sText, "(?:[•|+][\s]{0,})(.*)(\r\n.*)", "<li>\1</li>\2")
Edited by jdelaney

Thanks jdelaney!

The third bit of code you provided appears to wrap the <li> tags around only the odd-numbered list items: the 1st, 3rd, 5th and so on got the tags, but the even-numbered ones did not. The second bit of code works well, but it puts them all on one line, so it just needs a carriage return after each closing </li>. I tried the following, but it didn't work. Any ideas what to try?

$html = StringRegExpReplace($Text, "(?:[•|+][\s]{0,})(.*)(?:\r\n|\z)", "<li>\1</li>\r")
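
As far as I know, the replacement string in StringRegExpReplace only expands back-references, so a "\r" written there is not turned into a line break. One way around that is to build the replacement by concatenating the @CRLF macro. A minimal sketch, where $sText and $sPattern are stand-in names, the pattern is a simplified variant that matches only the "•" bullet, and the sample list is invented:

```autoit
; Hypothetical two-item list; ChrW(8226) is the bullet character.
Local $sText = ChrW(8226) & " first" & @CRLF & ChrW(8226) & " second"

; Bullet, optional whitespace, the item text, then the line break or end of string.
Local $sPattern = "(?:" & ChrW(8226) & "\s*)(.*)(?:\r\n|\z)"

; The line break is appended with @CRLF rather than written as "\r" in the string.
Local $sHtml = StringRegExpReplace($sText, $sPattern, "<li>\1</li>" & @CRLF)

ConsoleWrite($sHtml)
```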

I also spotted that when I use more than one StringRegExpReplace in my code, only the last one seems to work. For example, when I added the code from the last post, my script appeared to skip adding <p> tags to my text. Anyone know how to fix this?


whoops, yeah, that was wrong code...try this one to global replace

$html = StringRegExpReplace($sText, "(?:[•|+][\s]{0,})(.*)(\r\n|\z)", "<li>\1</li>\2", 0)

Thanks! This now works with wrapping <li> tags around them :D

I'm not quite sure why only one of my StringRegExpReplace calls works. My code runs through the following and does all of them except the second-to-last line, which adds the <p> tags. If I swap the last two lines, then the other one works instead. So basically the StringRegExpReplace that comes last in my code works, but the one before it does not. No idea why this is happening! Is it because they need to be in a switch statement? I don't know if you can do this in AutoIt, but in PHP, for example, you can do....

switch (n)
{
case label1:
  code to be executed if n=label1;
  break;
case label2:
  code to be executed if n=label2;
  break;
default:
  code to be executed if n is different from both label1 and label2;
}

My code being below....

$html = StringReplace($text, "&", "&amp;")
$html = StringReplace($text, "“", "&#8220;")
$html = StringReplace($text, "”", "&#8221;")
$html = StringReplace($text, "€", "&euro;")
$html = StringReplace($text, "£", "&pound;")
$html = StringReplace($text, "  ", " ")
$html = StringRegExpReplace($text, "(.+)(\r\n|\z)", "<p>\1</p>\2")
$html = StringRegExpReplace($text, "(?:[•|+][\s]{0,})(.*)(\r\n|\z)", "<li>\1</li>\2", 0)
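
One thing that stands out in the code above: every line reads from $text and writes to $html, and $text never changes, so each call discards the previous result and only the last line's output survives. No switch statement is needed for this. A minimal sketch of chaining the calls instead, assuming the text comes from ClipGet() as in the first post:

```autoit
; Sketch only: $text is assumed to hold the clipboard text, as in the first post.
Local $text = ClipGet()

; Start from $text once, then keep feeding $html back into each call.
Local $html = StringReplace($text, "&", "&amp;") ; "&" first, so the entities added below are not escaped again
$html = StringReplace($html, "“", "&#8220;")
$html = StringReplace($html, "”", "&#8221;")
$html = StringReplace($html, "€", "&euro;")
$html = StringReplace($html, "£", "&pound;")
$html = StringReplace($html, "  ", " ")
$html = StringRegExpReplace($html, "(.+)(\r\n|\z)", "<p>\1</p>\2")

ClipPut($html)
```

The <li> replacement can be chained the same way, but once the calls see each other's output a bullet line will also match the paragraph pattern, so the order of the two patterns (or the patterns themselves) would need adjusting. That part is not shown here.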

what is the second to last stringregexp trying to get to? so there is some char(), followed by a + immediately followed by a CRLF/end? (that's how it's written, currently)

please explain it


It is finding a full stop followed by a carriage return; if found, it puts a </p> straight after the full stop and an opening <p> at the start of the sentence. If that makes sense? It works when run as the only StringRegExpReplace in the code, so I'm guessing there is an error somewhere in my code that makes the two StringRegExpReplace calls conflict. I'm not sure why; I assume you should be able to have more than one StringRegExpReplace in your code? I can show you all the code I have if that would make it easier to see what is happening.

Thanks for all the help on this, really appreciate it.
