Jump to content

Need help with File Encoding


ShawnW
 Share

Recommended Posts

I have a script that takes a large excel file, pulls out and reorganizes certain information I need, and spits out a trimmed down csv file which I uses to upload the information on my website. Some of this information contains characters with accents or em dashes. By default it would create a csv file in ANSI which I then uploaded but had to tell my website import system it was windows-1252 in order for it to look correct.

This was all working fine except now I need to add in a non-breaking space and non-breaking hyphen into parts of my output. At first I tried using ChrW(0xA0) and ChrW(0x2011) as replacements. A quick test in the console looked correct, however opening the csv output in notepad++ showed the space correctly but a ? for the hyphen and the file was still encoded as ANSI. I tried to view it as UTF-8 instead but this just made the space appear as xAO and also other characters appeared that way like my em dashes appeared as x97 and another symbol as xA7 etc.

If I instead do a convert to UTF-8 from notepad++ then those problems go away except the hyphen still displays as ?. I then noticed on the page I linked for the non-breaking hyphen it lists the UTF-8 hex as 0xE2 0x80 0x91 (e28091). I was unsure how to enter this in autoit but several things i tried all failed to get the hyphen inserted.

I need a way to get both the space and hyphen added correctly as either ANSI or UTF-8, but if it is UTF-8 then I need a way to convert all of the other data I extracted from the excel file.

I've included a test excel file with a single line and test script to create a csv demonstrating the problem.

test.xlsx

test.au3

Link to comment
Share on other sites

What about:

#include <Array.au3>
#include <Excel.au3>

$excelFile = FileOpenDialog("Select test file", @DesktopDir, "Excel files (*.xls*)")

$oExcel = _Excel_Open(False, False, False, False, True)
$oWorkbook = _Excel_BookOpen($oExcel, $excelFile, True, False)
$table = _Excel_RangeRead($oWorkbook, Default, Default, 1, True)
_Excel_BookClose($oWorkbook)

$table[0][0] &= ". pp." & ChrW(0xA0) & $table[0][1] & ChrW(0x2014) & $table[0][2]
_ArrayColDelete($table, 2)
_ArrayColDelete($table, 1)

$sTable = _ArrayToString($table, ",")
$hFileOpen = FileOpen(@DesktopDir & "\test.csv", 258)
FileWrite($hFileOpen, $sTable)
FileClose($hFileOpen)

 

Link to comment
Share on other sites

Thanks for the suggestion.

The first problem there is that 0x2014 is the Em Dash that although present in the excel file data, is not what i need when I add the hyphen in that line. I need the shorter dash like on the keyboard except one that is treated as part of the word when wrapping. I'm thinking it is 0x2011 but I cannot get that to display. In researching the dashes I should honestly be using an En Dash (0x2013) for showing a range of numbers, but that breaks in wrapping, and I didn't get to set the requirements for this, I just have to come up with the solution.

Secondly, FileWrite on the array does add commas for the csv but that is all it does. It does not add quotes around cells that contain a comma already, and if it did, it wouldn't escape existing quotes either. I didn't have a line like this in my example because I did not consider a different way of creating the csv, but I do have lines that would require a more accurate csv creation.

Link to comment
Share on other sites

That didn't seem to change the encoding. I did play with it some more though based on this idea. It seems neither that nor writing to the file in that method changed it to UTF8. I did find that if i read it, then open it in overwrite mode it and write it then it changes to UTF8.

$str = FileRead(@DesktopDir & "\test.csv")
$hFileOpen = FileOpen(@DesktopDir & "\test.csv",130)
FileWrite($hFileOpen, $str)
FileClose($hFileOpen)

But that still leaves the problem of the hyphen.

Link to comment
Share on other sites

Sorry was in a rush to get out the door this morning, have you tried uploading the file using ChrW(0x2011)?  If you open it in Excel the spreadsheet renders it correctly, in Notepad it will show special characters because not all fonts support ChrW(0x2011) for example Sans-Serif.

Link to comment
Share on other sites

I highly recommend DejaVu Sans Mono to be used in Notepad++ (and Scite as well). It covers a large range of Unicode and is particularly readable for programming and general text edit. Both mentionned codepoints show correctly, among many others.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

15 hours ago, Subz said:

Sorry was in a rush to get out the door this morning, have you tried uploading the file using ChrW(0x2011)?  If you open it in Excel the spreadsheet renders it correctly, in Notepad it will show special characters because not all fonts support ChrW(0x2011) for example Sans-Serif.

If try and upload it still shows as '?'. Also, I don't get the same result after opening in Excel 2016. Running on test file and opening in Excel shows the '?' on both the UTF8 and ANSI versions.

I tried something simpler and copied the actual hyphen and just added it in string form. It looks correct in the scite editor, and if I output it to the console when running it shows correctly there as well. Yet after Excel writes it to the csv it breaks, and no matter where I open it, or what font I use, it shows as a '?'.

It seems the problem is in the way excel makes the csv, I may just do a more complex csv manual creation building on your 1st example. It seems the display problem i had with the hyphen after writing the file without excel was indeed the font problem you refer to. If I can't do it that way though I may just have to fight every OCD bone in my body and add it as the html code &#x2011; since it is only ever going to be displayed on a webpage that way.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...