DOTCOMmunications

Replace multiple blank lines with 1 blank line

Hi all

I am looking for a way to process strings that contain single and/or multiple newlines. Some of the strings have only single line spacing, e.g.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas gravida nisi augue, vitae aliquet augue suscipit nec. Vivamus placerat lacinia tellus, ac laoreet ante vehicula nec. Praesent eleifend dapibus accumsan. Aenean dictum felis a tristique pretium. Morbi eget placerat ex. Phasellus purus ligula, malesuada a sapien vitae, mattis bibendum eros. Nullam elementum vehicula tellus, nec egestas magna tempus nec. Aenean ultricies lacinia mollis. Integer euismod felis nec nisl vestibulum, nec tincidunt lorem pellentesque. Vivamus auctor mauris bibendum mattis euismod. Quisque maximus tristique nulla, in tempor nunc euismod vestibulum. Mauris accumsan id ligula quis consectetur. Donec aliquet, nunc a fermentum convallis, ipsum elit semper libero, at cursus turpis quam ut velit. Nam in nibh sed erat sodales gravida.

Praesent fermentum nulla ut viverra tristique. Suspendisse in mauris mollis, suscipit enim sit amet, faucibus lorem. Vivamus erat ante, accumsan sed leo et, viverra fringilla sem. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nunc ac ullamcorper orci. Nam accumsan orci quis lacus tempor, at gravida neque consequat. Morbi nulla ex, eleifend sit amet dictum ut, ultricies at massa. Pellentesque in posuere sapien. Phasellus nec consectetur ligula, eu sagittis odio. Suspendisse bibendum, metus et sollicitudin aliquet, neque libero efficitur ante, non condimentum nulla ipsum vitae metus. Proin at viverra massa, id porta augue. Ut maximus viverra metus eu accumsan. Phasellus et rutrum tortor.

Etiam facilisis dui at leo porta, elementum accumsan nibh fermentum. Maecenas ultricies eget neque ac scelerisque. Donec ac mauris et ex suscipit cursus. In a eros in nisl consequat placerat quis eu tortor. Vestibulum justo justo, sollicitudin id ligula sed, euismod cursus quam. Integer varius sapien a nulla faucibus ornare. Morbi bibendum, nisl in placerat dapibus, est massa vehicula mi, nec ultrices mauris odio eu augue. Proin a finibus nibh. Ut pharetra velit at ligula sodales, sagittis accumsan sapien molestie.

Others have double or more line spacing, e.g.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas gravida nisi augue, vitae aliquet augue suscipit nec. Vivamus placerat lacinia tellus, ac laoreet ante vehicula nec. Praesent eleifend dapibus accumsan. Aenean dictum felis a tristique pretium. Morbi eget placerat ex. Phasellus purus ligula, malesuada a sapien vitae, mattis bibendum eros. Nullam elementum vehicula tellus, nec egestas magna tempus nec. Aenean ultricies lacinia mollis. Integer euismod felis nec nisl vestibulum, nec tincidunt lorem pellentesque. Vivamus auctor mauris bibendum mattis euismod. Quisque maximus tristique nulla, in tempor nunc euismod vestibulum. Mauris accumsan id ligula quis consectetur. Donec aliquet, nunc a fermentum convallis, ipsum elit semper libero, at cursus turpis quam ut velit. Nam in nibh sed erat sodales gravida.

 

Praesent fermentum nulla ut viverra tristique. Suspendisse in mauris mollis, suscipit enim sit amet, faucibus lorem. Vivamus erat ante, accumsan sed leo et, viverra fringilla sem. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nunc ac ullamcorper orci. Nam accumsan orci quis lacus tempor, at gravida neque consequat. Morbi nulla ex, eleifend sit amet dictum ut, ultricies at massa. Pellentesque in posuere sapien. Phasellus nec consectetur ligula, eu sagittis odio. Suspendisse bibendum, metus et sollicitudin aliquet, neque libero efficitur ante, non condimentum nulla ipsum vitae metus. Proin at viverra massa, id porta augue. Ut maximus viverra metus eu accumsan. Phasellus et rutrum tortor.

 

Etiam facilisis dui at leo porta, elementum accumsan nibh fermentum. Maecenas ultricies eget neque ac scelerisque. Donec ac mauris et ex suscipit cursus. In a eros in nisl consequat placerat quis eu tortor. Vestibulum justo justo, sollicitudin id ligula sed, euismod cursus quam. Integer varius sapien a nulla faucibus ornare. Morbi bibendum, nisl in placerat dapibus, est massa vehicula mi, nec ultrices mauris odio eu augue. Proin a finibus nibh. Ut pharetra velit at ligula sodales, sagittis accumsan sapien molestie.

Unfortunately the lines that look blank appear to have spaces or other characters on them, so the regular expressions I have tried so far don't work all that well.

#include <Array.au3>

$string = FileRead("TestSingle.txt") ; Top example text
$string2 = FileRead("TestTriple.txt") ; Bottom example text
MsgBox(0, "", $string)
MsgBox(0, "", $string2)

$var = StringRegExp($string, "(?s)\r\n(.+?)\r\n(.+?)\r\n(.+?)", $STR_REGEXPARRAYGLOBALFULLMATCH)
If @error Then
    ConsoleWrite("Error: " & @error)
EndIf
$var2 = StringRegExp($string2, "(?s)\r\n(.+?)\r\n(.+?)\r\n(.+?)", $STR_REGEXPARRAYGLOBALFULLMATCH)
If @error Then
    ConsoleWrite("Error: " & @error)
EndIf
_ArrayDisplay($var, "VAR 1")
_ArrayDisplay($var2, "VAR 2")

Both arrays return multiple matches on my text strings, whereas I am trying to get to a point where I can either replace all the multiple newlines with a single newline while leaving the existing single newlines intact, or differentiate between the two somehow and use StringStripWS (with a flag of 7) to remove them. That works fine on the multiple newlines but squashes everything together on the single newlines.

Any help is much appreciated

P.S. The strings originate from the plain text body of e-mails, if that makes any difference, but I have that part working okay.


StringRegExpReplace($string, '\R{2,}',@CRLF)
StringRegExpReplace($string,'(?i)(\R\s*)',@CRLF)

Edited by mLipok


StringRegExpReplace($string, '\s*\R', @CRLF&@CRLF)

; 1 @crlf = no blank line left
; 2 @crlf = 1 blank line left
; etc
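To see what the replacement string does in practice, here is a small sketch run on made-up in-memory samples rather than the real files (the variable names and sample strings are assumptions for illustration only). Line breaks are shown as visible markers so they can be counted in the MsgBox.

```autoit
; Hypothetical samples: one single-spaced, one with two blank lines between paragraphs.
Local $sSingle = "Para 1" & @CRLF & "Para 2"
Local $sTriple = "Para 1" & @CRLF & @CRLF & @CRLF & "Para 2"

; One @CRLF as the replacement: the whole run of line breaks collapses to a single break.
Local $sOne = StringRegExpReplace($sTriple, '\s*\R', @CRLF)
; Two @CRLF: the same run collapses to one blank line ...
Local $sTwo = StringRegExpReplace($sTriple, '\s*\R', @CRLF & @CRLF)
; ... but a lone line break in the single-spaced sample is matched too and grows into a blank line.
Local $sGrown = StringRegExpReplace($sSingle, '\s*\R', @CRLF & @CRLF)

; Show each line break as "<CRLF>" so the results are easy to compare.
MsgBox(0, "Result", StringReplace($sOne, @CRLF, "<CRLF>") & @CRLF & _
        StringReplace($sTwo, @CRLF, "<CRLF>") & @CRLF & _
        StringReplace($sGrown, @CRLF, "<CRLF>"))
```

Note that the pattern matches single line breaks as well, so text that is already single-spaced is rewritten too.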


Hi both,

Thanks for your quick replies.

Unfortunately neither of them achieved what I was looking for, so perhaps I didn't explain myself that well.

The first one (from mLipok) removes all the newlines from the single-spaced file but does process the multi-line file correctly. The one from mikell, when I run it against both strings, actually adds extra blank lines to the single-spaced sample and doesn't appear to change the multi-line sample. When I removed one of the two @CRLFs it produced the same result as the one from mLipok.

Basically I want the strings with only single newline spacing to remain untouched and keep that single newline spacing. Those with multiple newline spacing I want to convert to single newline spacing.

The only other way I can think of is a button on the GUI that lets the user specify the type, but I was hoping to automate it somehow and I don't know that much about regular expressions.

Thanks for your help so far


Can you upload examples of the files that you are working with?

kylomas



Hi

Hopefully these have retained their formatting, since I had to edit them to remove user/confidential details (they are e-mails from customers).

Hope they help.

TestSingle.txt is the single-newline-spaced file

TestTriple.txt is the triple-newline-spaced file

Obviously Outlook usually does a relatively good job of removing these, but because the e-mails are being accessed through MAPI that processing doesn't happen.

TestTriple.txt

TestSingle.txt


These are the regex patterns that I use for removing empty or apparently empty lines. 

$sFileT = "\\RBKNAS02\Data\Downloads\autoitscript.com\2015\January\TestTriple.txt"
$sFileS = "\\RBKNAS02\Data\Downloads\autoitscript.com\2015\January\TestSingle.txt"
$sPattern1 = "(?m)^[^[:graph:]]*"  ; Matches empty lines and lines that contain only characters with no displayable symbol/glyph
$sPattern2 = "(?m)\R^[^[:graph:]]*"  ; Matches 2 or more consecutive empty lines, or lines that contain only characters with no displayable symbol/glyph

$sData = FileRead($sFileS)
$sData = StringRegExpReplace($sData,$sPattern1,'')
MsgBox(0,"TestSingle.txt with all blank lines removed",$sData)

$sData = FileRead($sFileT)
$sData = StringRegExpReplace($sData,$sPattern1,'')
MsgBox(0,"TestTriple.txt with all blank lines removed",$sData)

$sData = FileRead($sFileT)
$sData = StringRegExpReplace($sData,$sPattern2,@CRLF & @CRLF)
MsgBox(0,"TestTriple.txt with multiple blank lines replaced with a single blank line",$sData)

Bowmore,

The correct syntax to negate a posix class is [[:^graph:]]

Anyway this pattern is not good, as it removes also the displayable chars > 128

In the example below the ‰ char should not be removed (not blank)

$sData = "‰"   ; Chr(137)
$sData = StringRegExpReplace($sData, "(?m)^[^[:graph:]]*", "")
MsgBox(0,"", $sData)

To remove white space chars including those > 126, like the non-break space Chr(160) in the "TestTriple.txt" above, one of these should be used:

'(*UCP)\s'  or  '(*UCP)[[:space:]]'     because by default (UTF mode) \s and [:space:] don't match Chr(160)

'[\h\v]'     because both don't need (*UCP)

'\p{Xps}'

;)
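Putting the horizontal-whitespace class to work, here is one possible sketch for the original request: collapse any run of two or more line breaks (with optional blanks, including Chr(160), on the lines in between) into a single line break, and leave lone line breaks alone. The function name and sample strings are assumptions, and it has not been run against the attached files.

```autoit
; Hypothetical helper: collapse runs of blank-looking lines into one line break.
Func _CollapseBlankLines($sText)
    ; \R      one line break (CRLF counts as a single unit)
    ; \h*\R   optional horizontal whitespace (\h also covers the no-break space), then another break
    ; The group must occur at least once, so a lone line break is never matched.
    Return StringRegExpReplace($sText, '\R(?:\h*\R)+', @CRLF)
EndFunc

; Made-up samples mimicking the two files.
Local $sSingle = "Line 1" & @CRLF & "Line 2"
Local $sTriple = "Line 1" & @CRLF & ChrW(160) & @CRLF & "Line 2"

MsgBox(0, "Single (unchanged)", _CollapseBlankLines($sSingle))
MsgBox(0, "Triple (collapsed)", _CollapseBlankLines($sTriple))
```

Trailing blanks at the end of a text line (before the first line break) are deliberately left alone here; only the blank-looking lines between paragraphs are collapsed.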


Would it instead work to parse the file line by line, and ignore any line which does not contain non-whitespace?

Something like (untested)

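; $file is assumed to be a handle returned by FileOpen()
$singleSpace = ""
While True
   $line = FileReadLine($file)
   If @error Then ExitLoop ; end of file or read error
   ; (*UCP) lets \S treat Chr(160) as whitespace, as noted earlier in the thread
   If StringRegExp($line, "(*UCP)\S") Then
      ; Keep the line unless it contains only whitespace.
      $singleSpace &= $line & @CRLF
   EndIf
WEnd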
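Since the strings come from e-mail bodies rather than files, the same idea could also be applied to a string that is already in memory. A rough sketch, untested against the real data; the function name and the sample string are made up for illustration:

```autoit
; Hypothetical helper: drop every line that holds nothing but horizontal whitespace.
Func _DropBlankLines($sText)
    ; Normalise all line-break styles to @LF so StringSplit can use a single delimiter.
    Local $aLines = StringSplit(StringRegExpReplace($sText, '\R', @LF), @LF)
    Local $sOut = ""
    For $i = 1 To $aLines[0] ; element 0 holds the number of lines
        ; \H matches any character that is not horizontal whitespace.
        If StringRegExp($aLines[$i], '\H') Then $sOut &= $aLines[$i] & @CRLF
    Next
    Return $sOut
EndFunc

MsgBox(0, "", _DropBlankLines("Line 1" & @CRLF & " " & @CRLF & @CRLF & "Line 2"))
```

This removes every blank-looking line, so the result is always single-spaced and ends with a trailing @CRLF.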
