Posted

Hello to all,

I'm scripting a PDF-to-text conversion (ebook, EPUB, Flash related).

I have some PDFs: I extract the text, clean it up a bit and use it for other tasks.

I need help removing some text from the resulting txt file. Here are a few rows:

#<font 10 ""> 3d hdtv: ready for primetime? #<font 7 ""> Tablets 2.0 Why 2010 Could (Finally) Be Their Year nexus oneHan
#<font 16 ""> www.storemags.com & www.fantamag.com #<font 8 ""> MARCH 2010 MARCH 2010 vol. 29 no. 3 44CovER SToRY Though
#<font 47 ""> www.storemags.com & www.fantamag.com #<font 26 "">#<font 24 ""> PC Magazine Digital Edition, #<font 8 ""> 
#<font 24 ""> www.storemags.com & www.fantamag.com #<font 21 ""> The iPad: A Must-Have? #<font 17 ""> The New York Times

I have 2 tasks to accomplish:

- 1st: remove text of the form ""> R4Nd0M #<font

- 2nd: remove text of the form #<font R4Nd0M "">

Note that the markers can contain random text (digits or letters),

and I can't find a solution or function that helps.

Any hint is appreciated, thank you.

m.

  • Moderators
Posted

myspacee,

I am confused about what you want to do - can you show us how you want the lines to look after the 2 deletions?

If you want to get rid of the 2 font tags, then this pattern should work: :mellow:

StringRegExpReplace($sText, "(?U)(#<.*>)", "")

(?U) = Inverse greediness, look for the shortest match (otherwise you lose the text between the tags as well!)

(#<.*>) = Look for #<, followed by any number of characters, followed by >
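
For anyone who wants to check the greediness point outside AutoIt, here is a minimal sketch in Python. It is a port, not the AutoIt call above: Python's re module does not support the (?U) option, so the lazy quantifier .*? is written out explicitly, which I am assuming behaves the same for this pattern. The sample line is a shortened copy of the first posted row.

```python
import re

# Shortened copy of the first sample row from the question.
line = '#<font 10 ""> 3d hdtv: ready for primetime? #<font 7 ""> Tablets 2.0'

# Greedy: .* runs on to the LAST '>' in the line, so the text between the
# two tags is swallowed together with the tags.
greedy = re.sub(r'#<.*>', '', line)

# Lazy: .*? stops at the FIRST '>' after each '#<', so only the tags go.
lazy = re.sub(r'#<.*?>', '', line)

print(repr(greedy))
print(repr(lazy))
```

The lazy version keeps both pieces of text but leaves the spaces that surrounded the tags, so some whitespace clean-up may still be wanted afterwards.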

M23

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

My UDFs:

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ---------- Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Posted

Thank you for the reply.

I have

#<font 10 ""> 3d hdtv: ready for primetime? #<font 7 ""> Tablets 2.0 Why 2010 Could (Finally) Be Their Year nexus oneHan
#<font 16 ""> www.storemags.com & www.fantamag.com #<font 8 ""> MARCH 2010 MARCH 2010 vol. 29 no. 3 44CovER SToRY Though
#<font 47 ""> www.storemags.com & www.fantamag.com #<font 26 "">#<font 24 ""> PC Magazine Digital Edition, #<font 8 ""> 
#<font 24 ""> www.storemags.com & www.fantamag.com #<font 21 ""> The iPad: A Must-Have? #<font 17 ""> The New York Times

I want to obtain

Tablets 2.0 Why 2010 Could (Finally) Be Their Year nexus oneHan
MARCH 2010 MARCH 2010 vol. 29 no. 3 44CovER SToRY Though
PC Magazine Digital Edition,
The iPad: A Must-Have? The New York Times

I tried a few StringRegExpReplace combinations, but I'm not as smart as I thought :]

m.

  • Moderators
Posted

myspacee,

I can do it in 2 passes. :mellow:

First pass - get rid of the initial #<font tag> text #<font tag>:

StringRegExpReplace($sText, "(?U)(?m:^)(#<.+>.+#<.+>)", "")

(?U) = Inverse greediness, look for shortest match

(?m:^) = Start at beginning of a line

(#<.+>.+#<.+>) = Match the first 2 tags and any text between them

Then a second pass to get rid of the remaining #<font tag>:

StringRegExpReplace($sText, "(?U)(#<.+>)", "")

(?U) = Inverse greediness

(#<.+>) = match the remaining tags
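
The two passes can be tried against the four posted rows with the sketch below. It is written in Python rather than AutoIt, so treat it as an approximation: lazy quantifiers (.+?) stand in for (?U), and re.MULTILINE stands in for (?m:^). The final whitespace tidy-up is my own assumption and is not part of the two patterns above. The rows are copied as posted, including their cut-off endings.

```python
import re

# The four sample rows exactly as posted in the question.
rows = [
    '#<font 10 ""> 3d hdtv: ready for primetime? #<font 7 ""> Tablets 2.0 Why 2010 Could (Finally) Be Their Year nexus oneHan',
    '#<font 16 ""> www.storemags.com & www.fantamag.com #<font 8 ""> MARCH 2010 MARCH 2010 vol. 29 no. 3 44CovER SToRY Though',
    '#<font 47 ""> www.storemags.com & www.fantamag.com #<font 26 "">#<font 24 ""> PC Magazine Digital Edition, #<font 8 ""> ',
    '#<font 24 ""> www.storemags.com & www.fantamag.com #<font 21 ""> The iPad: A Must-Have? #<font 17 ""> The New York Times',
]
text = "\n".join(rows)

# Pass 1: at the start of each line, drop the first tag, the text after it
# and the tag that follows.
pass1 = re.sub(r'^#<.+?>.+?#<.+?>', '', text, flags=re.MULTILINE)

# Pass 2: drop every remaining tag.
pass2 = re.sub(r'#<.+?>', '', pass1)

# Extra tidy-up (an assumption, not in the original patterns): collapse runs
# of spaces left where tags were removed and trim each line.
cleaned = [re.sub(r' {2,}', ' ', ln).strip() for ln in pass2.split("\n")]

for ln in cleaned:
    print(ln)
```

With the tidy-up included, the four printed lines match the "I want to obtain" rows from the previous post.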

I am sure an SRE guru will come along in a minute and laugh until his sides hurt - but that should get you started! :(

M23


Posted (edited)

I was thinking of using 2 steps too,

but not by discarding #<font tag> text #<font tag>:

$chars = StringRegExpReplace($chars, "(?U)(> .*#<)", "")
$chars = StringRegExpReplace($chars, "(?U)(#<.*>)", "")

Is this a bad idea? [EDIT: after testing, it is a very bad idea!]
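
A rough sketch of why this goes wrong, ported to Python (lazy .*? in place of (?U), which I am assuming is equivalent here) and run on the fourth sample row: the first replacement removes everything from each "> " up to the next "#<", so it eats the text that should be kept and glues the tag fragments together.

```python
import re

# Fourth sample row from the question.
row = '#<font 24 ""> www.storemags.com & www.fantamag.com #<font 21 ""> The iPad: A Must-Have? #<font 17 ""> The New York Times'

# Step 1: remove from each '> ' up to the next '#<'. This deletes ALL text
# that sits between two tags, wanted or not.
step1 = re.sub(r'> .*?#<', '', row)

# Step 2: remove what is now one long merged tag.
step2 = re.sub(r'#<.*?>', '', step1)

print(repr(step1))
print(repr(step2))
```

Only the text after the last tag survives, so "The iPad: A Must-Have?" is lost along with the unwanted URLs.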

m.

(now testing yours...) :mellow:

Edited by myspacee
