8 posts in this topic
I have a script that takes a large excel file, pulls out and reorganizes certain information I need, and spits out a trimmed down csv file which I uses to upload the information on my website. Some of this information contains characters with accents or em dashes. By default it would create a csv file in ANSI which I then uploaded but had to tell my website import system it was windows-1252 in order for it to look correct.
This was all working fine except now I need to add in a non-breaking space and non-breaking hyphen into parts of my output. At first I tried using ChrW(0xA0) and ChrW(0x2011) as replacements. A quick test in the console looked correct, however opening the csv output in notepad++ showed the space correctly but a ? for the hyphen and the file was still encoded as ANSI. I tried to view it as UTF-8 instead but this just made the space appear as xAO and also other characters appeared that way like my em dashes appeared as x97 and another symbol as xA7 etc.
If I instead do a convert to UTF-8 from notepad++ then those problems go away except the hyphen still displays as ?. I then noticed on the page I linked for the non-breaking hyphen it lists the UTF-8 hex as 0xE2 0x80 0x91 (e28091). I was unsure how to enter this in autoit but several things i tried all failed to get the hyphen inserted.
I need a way to get both the space and hyphen added correctly as either ANSI or UTF-8, but if it is UTF-8 then I need a way to convert all of the other data I extracted from the excel file.
I've included a test excel file with a single line and test script to create a csv demonstrating the problem.
Hello . How to do that
$regexp = starts from "abcdef" and after this could be anything in name
Well the plan is to use the power of regular expressions engine of AutoIT for patching binary data.
Something like this: StringRegExp( $BinaryData, "(?s)\x55\x8B.."
<cut> ... Okay straight to question/problem
... certain bytes that are in the range from 0x80 to 0xA0 won't match.
Hmm seem to be a char encoding problem. In detail these are 27 chars: 0x80, 0x82~8C, 0x8E, 0x91~9C, 0x9E,0x9F
Here's a small code snippet to explore / explain this problem:
#include "StringConstants.au3" $TestData = BinaryToString("0x7E7F808182") ;Okay $match = StringRegExp( $TestData ,'\x7E' ,$STR_REGEXPARRAYFULLMATCH) ConsoleWrite('@extended = ' & @extended & ' $match = ' & $match & @CRLF) ;Okay $match = StringRegExp( $TestData ,'\x7F' ,$STR_REGEXPARRAYFULLMATCH) ConsoleWrite('@extended = ' & @extended & ' $match = ' & $match & @CRLF) ;Error no match $match = StringRegExp( $TestData ,'\x80' ,$STR_REGEXPARRAYFULLMATCH) ConsoleWrite('@extended = ' & @extended & ' $match = ' & $match & @CRLF) ;Okay $match = StringRegExp( $TestData ,'\x81' ,$STR_REGEXPARRAYFULLMATCH) ConsoleWrite('@extended = ' & @extended & ' $match = ' & $match & @CRLF) ;Error no match $match = StringRegExp( $TestData ,'\x82' ,$STR_REGEXPARRAYFULLMATCH) ConsoleWrite('@extended = ' & @extended & ' $match = ' & $match & @CRLF) ;~ output: ;~ @extended = 2 $match = ;~ @extended = 3 $match = ;~ @extended = 0 $match = 1 ;~ @extended = 5 $match = ;~ @extended = 0 $match = 1 Hmm what to do? Go back and use the 'numberstring monster' implementation or just omit that range of 'unsafe bytes'. What is the root of this problem?
Any idea how to fix this?
Update: Okay I know a byte is not a character.
But StringRegExp operates on String and so character level.
Okay as long as you stay at Ansi encoding and only use /x00 - /X7F in the search pattern using StringRegExp works well to search for binary data.
What bytes can be matched that are in the range from /X7F - /xFF is also depending on the code page.
So this avoid to search for bytes in the range from 0x80-0xa0 only applies to Germany.
I just change this country setting:
to Thai and now near all bytes from /X7F - /xFF fails to match.
Text in a file, read into var with fileread:
<> <> <> <> < J please look > <> <> <> Hi,
I want a RegExp to select around 'please', back to the previous < and forward to the next >. I can select the line of text. Then I add in (?s) and it selects the whole text. I think I want to make it not greedy, (?U) , that seems to make it ungreedy after, but it still selects all the previous lines.
$sPattern = "(?s)<.*please.*>" ; 1 $sPattern = "(?s)<(?U).*please.*>" ; 2 $sPattern = "(?s)<(?U).*please(?U).*>" ; 3 $sAry = StringRegExp($sHTML, $sPattern, 3)
I have a csv file with delimiters "," and @CRLF
After trying several different examples (simple ones!) I still haven't resolved an answer
#include <Array.au3> #include <File.au3> ;#include "ArrayMultiColSort.au3" Global $BatchDir = "C:\ncat\" FileChangeDir($BatchDir) $sFile = "mod1.csv" ; Read file into a 2D array Global $aArray _FileReadToArray($sFile, $aArray, $FRTA_NOCOUNT, ",") ; And here it is _ArrayDisplay($aArray, "Original", Default, 8) Firstly it throws a MsgBox error "No array variable passed to function" _ArrayDisplay(), so no displayed data
Second and probably more important how do I get _FileReadToArray() to split the imported array into rows and columns?
I tried "," & @CRLF without success.