kosamja

How to use RegExp to make this work faster?

5 posts in this topic

#1 ·  Posted (edited)

Hi, I hope someone can help me with this problem. I am trying to:
1) read the content of a text file
2) fix the formatting, which means:
   a) remove empty lines at the beginning of the file and spaces at the beginning of lines
   b) replace multiple empty lines between paragraphs with one empty line, and multiple spaces inside a line with one space
   c) after that, if the first character of a line is lowercase and the previous line is not empty, merge the line with the previous one; otherwise keep the line unchanged
3) convert the letters (transliterate between Cyrillic and Latin)
4) remove duplicate lines and write the result to RTF
But it is currently slow for bigger text files (1 MB+). Is there any chance to make it faster with RegExp? Thanks.

Example: this

$Cyrillic     =  'љ|Љ|њ|Њ|џ|Џ|
   a|A|б|Б|в|В|г|Г|д|Д|ђ|Ђ|е|Е|ж|Ж|з|З|и|И|ј|Ј|к|
К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С|



т|Т|ћ|Ћ
  |у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш'

should be changed to

$Cyrillic = 'љ|Љ|њ|Њ|џ|Џ|a|A|б|Б|в|В|г|Г|д|Д|ђ|Ђ|е|Е|ж|Ж|з|З|и|И|ј|Ј|к|
К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С|

т|Т|ћ|Ћ
|у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш'

My current script:

#NoTrayIcon
#RequireAdmin
#include <File.au3>
#include <Constants.au3>
#include <GUIConstants.au3>
#include <WinAPI.au3>
#include <Array.au3>

Opt("WinWaitDelay", 0)
Opt("MouseClickDelay", 0)
Opt("MouseClickDownDelay", 0)
Opt("MouseClickDragDelay", 0)
Opt("SendKeyDelay", 0)
Opt("SendKeyDownDelay", 0)
Opt("WinTitleMatchMode", 3)
FileChangeDir(StringRegExpReplace(@ScriptDir, '\\+$', ''))

Global $Convert = 'Cyrillic'
;$Convert = 'Latin'
Global $Cyrillic = 'љ|Љ|њ|Њ|џ|Џ|a|A|б|Б|в|В|г|Г|д|Д|ђ|Ђ|е|Е|ж|Ж|з|З|и|И|ј|Ј|к|К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С|т|Т|ћ|Ћ|у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш'
Global $Latin = 'lj|Lj|nj|Nj|dž|Dž|a|A|b|B|v|V|g|G|d|D|đ|Đ|e|E|ž|Ž|z|Z|i|I|j|J|k|K|l|L|m|M|n|N|o|O|p|P|r|R|s|S|t|T|ć|Ć|u|U|f|F|h|H|c|C|č|Č|š|Š'
Global $CyrillicCharList = StringSplit($Cyrillic, '|')
Global $LatinCharList = StringSplit($Latin, '|')

;txt file
_Convert($CmdLine[1])

Func _Convert($sPath)
   $sConvertedText = _FormattingFix(FileRead($sPath))
   $sConvertedText = _Transliterate($sConvertedText, $Convert)
   Return $sConvertedText
EndFunc

Func _Transliterate($sText, $sConversion = 'Latin')
   $sText = StringReplace($sText, 'dz', 'dž', 0, $STR_CASESENSE)
   $sText = StringReplace($sText, 'Dz', 'Dž', 0, $STR_CASESENSE)
   $sText = StringReplace($sText, 'DZ', 'Dž', 0, $STR_CASESENSE)
   $sText = StringReplace($sText, 'DŽ', 'Dž', 0, $STR_CASESENSE)
   $sText = StringReplace($sText, 'LJ', 'Lj', 0, $STR_CASESENSE)
   $sText = StringReplace($sText, 'NJ', 'Nj', 0, $STR_CASESENSE)
   For $i = 1 To $CyrillicCharList[0] ; StringSplit stores the element count in index 0
      If $sConversion = 'Latin' Then
         $sText = StringReplace($sText, $CyrillicCharList[$i], $LatinCharList[$i], 0, $STR_CASESENSE)
      Else
         $sText = StringReplace($sText, $LatinCharList[$i], $CyrillicCharList[$i], 0, $STR_CASESENSE)
      EndIf
   Next
   Return $sText
EndFunc

Func _FormattingFix($sText)
   $sFixedText = ''
   $IsFirstNonWhitespaceLineFound = False
   $sLines = StringSplit($sText, @LF)
   For $i = 1 to $sLines[0]
      $sString = StringStripWS(StringStripCR($sLines[$i]), $STR_STRIPLEADING + $STR_STRIPTRAILING + $STR_STRIPSPACES)
      $sFirstChar = StringLeft($sString, 1)
      Select
      Case $IsFirstNonWhitespaceLineFound = False and not StringIsSpace($sFirstChar)
         $sFixedText = $sString
         $IsFirstNonWhitespaceLineFound = True
      Case StringIsUpper($sFirstChar)
         If StringIsUpper(StringLeft(StringStripWS(StringStripCR($sLines[$i-1]), $STR_STRIPLEADING + $STR_STRIPTRAILING), 1)) Then
            $sFixedText = $sFixedText & @CRLF & $sString
         Else
            $sFixedText = $sFixedText & @CRLF & @CRLF & $sString
         EndIf
      Case StringIsLower($sFirstChar)
         $sFixedText = $sFixedText & ' ' & $sString
      Case StringIsSpace($sFirstChar)
         ;ignore empty lines
      Case Else
         $sFixedText = $sFixedText & @CRLF & $sString
      EndSelect
   Next
   $sAppendAtEnd = @CRLF
   If StringIsSpace(StringStripCR($sLines[$sLines[0]])) Then $sAppendAtEnd = @CRLF & @CRLF
   Return $sFixedText & $sAppendAtEnd
EndFunc

 

#2 ·  Posted

For 2), here is one way:

; remove empty lines at the beginning of the file and spaces at the beginning of lines
$newString = StringRegExpReplace($string, "^\R+\h*|\R\K\h+", "")

; replace multiple empty lines between paragraphs with one empty line, and multiple spaces inside a line with one space
$newString = StringRegExpReplace($newString, "\R{2}\K\R+|\h\K\h+", "")

; if the first character of a line is now lowercase and the previous line is not empty, merge it with the previous line; otherwise keep the line unchanged
$newString = StringRegExpReplace($newString, "\V+\K\R(?=[[:lower:]])", "")
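
The three replacements above can be wrapped into a drop-in replacement for _FormattingFix. This is only a minimal sketch: the function name _FormattingFixRegExp is made up for illustration, and the patterns are copied unchanged from this post. Note that the third replacement joins lines with no separator, which matches the example in post #1, whereas the original _FormattingFix inserted a space; use " " as the replacement string there if a space is wanted. Whether [[:lower:]] also covers Cyrillic letters may depend on the Unicode mode (a (*UCP) prefix on the pattern is one thing to try), so that is worth testing on real input.

```autoit
; Hypothetical wrapper (the name is not from the thread) around the three patterns above.
Func _FormattingFixRegExp($sText)
    ; a) remove empty lines at the start of the text and leading horizontal whitespace on later lines
    $sText = StringRegExpReplace($sText, "^\R+\h*|\R\K\h+", "")
    ; b) keep at most one empty line between paragraphs; squeeze runs of horizontal whitespace to one character
    $sText = StringRegExpReplace($sText, "\R{2}\K\R+|\h\K\h+", "")
    ; c) join a line that starts with a lowercase letter onto a non-empty previous line (no separator)
    $sText = StringRegExpReplace($sText, "\V+\K\R(?=[[:lower:]])", "")
    Return $sText
EndFunc

; Usage: takes the file path from the command line, like the original script
If $CmdLine[0] > 0 Then ConsoleWrite(_FormattingFixRegExp(FileRead($CmdLine[1])))
```

Each call scans the whole string once, so this avoids the repeated string concatenation in the line-by-line loop, which is a likely cause of the slowdown on large files.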

 


#3 ·  Posted (edited)

Hi jguinch, thanks for answering, it works perfectly. I have 2 more questions:
1) To remove duplicate lines with AutoIt, do I need to use _ArrayUnique?
2) What would be the RegExp version of this:

$sLines = StringSplit($sText, @LF)
   $sFixedText = ''
   For $i = 1 to $sLines[0]
      If not StringIsSpace(StringStripCR($sLines[$i])) Then
         $sAppendBetween = ''
         If StringIsSpace(StringStripCR($sLines[$i-1])) Then $sAppendBetween = '\line '
         $sFixedText = $sFixedText & '{' & $sAppendBetween & '\pard \fs24 \ql \f0 \li0 \fi0 ' & StringStripCR($sLines[$i]) & '\par}' & @CRLF
      EndIf
   Next

a) if a line is not empty, replace it with {\pard \fs24 \ql \f0 \li0 \fi0 (content of line) \par}
(that is, add {\pard \fs24 \ql \f0 \li0 \fi0 at the beginning of each non-empty line and \par} at its end)
b) if a line is empty, replace it with {\line \pard \fs24 \ql \f0 \li0 \fi0 \par}

 


#4 ·  Posted (edited)

Is this the correct way to do it?

;insert at begin of non empty lines
$newString = StringRegExpReplace($newString, "(?m)(.+)","{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 \0")
;insert at end of non empty lines
$newString = StringRegExpReplace($newString, "(?m)(\R+)"," \\par}\0")
;insert at empty lines
$newString = StringRegExpReplace($newString, "(?m)(^\R)","{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 \\par}\0")

One more question: how do I remove spaces from the end of each line with RegExp? Is this the correct way to do it:

$newString = StringRegExpReplace($newString, "(?m)^[ \t]+|[ \t]+(\R)","\1")

 


#5 ·  Posted

;remove spaces from end of each line
$newString = StringRegExpReplace($newString, "\h+(?=\R)","")

;insert at begin of non empty lines
$newString = StringRegExpReplace($newString, "(?:^|\R)\K(?!\R)","{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 \\0")

;insert at end of non empty lines
$newString = StringRegExpReplace($newString, "\N+\K"," \\par}\\0")

;insert at empty lines
$newString = StringRegExpReplace($newString, "(?:^|\R)\K(?=\R)","{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 \\par}\\0")
ConsoleWrite($newString)
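
The _ArrayUnique question from post #3 is not answered in the thread. One possible approach, as a rough sketch only: split the text into lines, pass the array to _ArrayUnique, and join the result again. The helper name _RemoveDuplicateLines is made up for illustration. Be aware that this treats empty lines like any other line, so repeated empty lines between paragraphs collapse to a single one as well; it should therefore run before the RTF wrapping, or the empty lines need separate handling.

```autoit
#include <Array.au3>
#include <StringConstants.au3>

; Hypothetical helper (the name is not from the thread): remove repeated lines from a text.
; Empty lines count as lines too, so only one empty line survives.
Func _RemoveDuplicateLines($sText)
    ; Strip carriage returns, then split on line feeds without a count element in index 0
    Local $aLines = StringSplit(StringStripCR($sText), @LF, $STR_NOCOUNT)
    ; Column 0, base 0, case-sensitive comparison, result array without a count element
    Local $aUnique = _ArrayUnique($aLines, 0, 0, 1, $ARRAYUNIQUE_NOCOUNT)
    ; On failure, hand back the input unchanged and flag the error
    If @error Then Return SetError(1, 0, $sText)
    ; Join the remaining lines with Windows line endings
    Return _ArrayToString($aUnique, @CRLF)
EndFunc

; Usage on a small inline sample
ConsoleWrite(_RemoveDuplicateLines("one" & @CRLF & "two" & @CRLF & "one") & @CRLF)
```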

 
