Jump to content

CSV file and white space


Hobbyist
 Share

Recommended Posts

First I know there is a function out there that does better than this method, but I will leave it for later.

Attached is the code I have been using and it has worked until today.

This issue, I think is related to the "white space" which I thought I understood but perhaps not.

While reading the data it works EXCEPT for in the array element [x] [2] if the data is "abc hij xyz" it starts a new row BUT if the data is "abc hij xyz " it completes the existing row and then continues executing properly.  I have experimented with changing the data in different columns(moving the quote marks) and it always works (except for [x] [2] position). I am missing something. The quote mark placement is an issue and it is something I have no control over in terms of data received.

The data is usually 5 columns wide.

Thanks

Hobbyist

Local Const $sMessage = "Select a single file .CSV file."
    $sFileOpenDialog = FileOpenDialog($sMessage, "" & "c:\Dash Board\" & $Source & "\Statement", "Text & Commas(*.csv)", $FD_FILEMUSTEXIST)
    If @error Then
        MsgBox($MB_SYSTEMMODAL, " Status", "Error Code  " & @error & "  File Not Selected")
        Return 0
    EndIf
    $statementfile = $sFileOpenDialog

    local $aCard[200][8] 
    local $aAmerican[200][8]
    $csv = FileRead($statementfile)
    $csv = StringStripWS($csv, 7)
    $rows = StringSplit($csv, @CRLF)
    Dim $aCard[$rows[0] + 1][5 + 1]
    $aCard[0][0] = $rows[0] 

    For $i = 0 To $rows[0]
        $temp = StringSplit($rows[$i], ",", 2)

        For $j = 0 To UBound($temp) - 1
            $aCard[$i][$j] = $temp[$j] ;
        Next
    Next

    _ArrayDisplay($aCard, "4894")

 

Link to comment
Share on other sites

Thank you, I will look at that.

I am also trying to learn the "why" for this not working.  I thought in a CSV file, the delimiter impacted the column placing.

If that is true, then why would the placement of " cause the data to jump to a new row?  Shouldn't the file be read from delimiter to delimiter?  Here is an example of data. Pls note the placement of " in the all the lines for column position #2. All the entries have a space after the last alpha character and then the  "  ....but in the last line has in column position #2 the " immediately after the last alpha character and then the delimiter.

"Status","Date","Description","Debit","Credit"
"Cleared","01/16/2016","CHINA STAR Fire            My Town AA ","19.92",""
"Cleared","01/27/2016","THE HOME DEPOT 8029    Sometown  XX" ,"9.09",""
"Cleared","02/05/2016","PAYMENT THANK YOU", "","25.00"

If I go into the data and put a space between YOU and the "  it reads it fine, but why should that impact the delimiter behavior for column placement.  Additionally if I go into the data and put a space after the 2 in "19.92" there is no negative impact, it still reads it correctly and would be of no concern at input time.

This is probably very basic to most out there, so I apologize for such a newbie question, but if I stand a chance at learn this stuff I definitely need to understand the why.  I thought I had the delimiter topic nailed down until this popped up.

Attached is example CSV file reflected above.

 

Thanks again

Hobbyist

February2016_205xxxx.csv

Link to comment
Share on other sites

I'm not sure what the problem is, but I wonder why you are stripping out white spaces. This code reads the file you posted to an array.

#include <Array.au3>
#include 'csvSplit.au3'

Local $sFilePath = @ScriptDir & "\February2016_205xxxx.csv"

Local $hFile = FileOpen($sFilePath)
If $hFile = -1 Then
    MsgBox(0, "", "Unable to open file")
    Exit
EndIf

Local $sCSV = FileRead($hFile)
If @error Then
    MsgBox(0, "", "Unable to read file")
    FileClose($hFile)
    Exit
EndIf
FileClose($hFile)

Local $aCSV = _CSVSplit($sCSV) ; Create the main array
If @error Then
    ConsoleWrite("Error = " & @error & @LF)
    Exit
EndIf

_ArrayDisplay($aCSV)

 

Link to comment
Share on other sites

 

PACaleala -The file is just downloaded without any editing.  Here is what I noticed - opening it using Microsoft Works Spreadsheet produces column overlap. I would guess using Excel would produce the same result as it is a CSV file. Opening it using notepad produces a regular readable notepad file and nothing garbled.  IF I moved the as cited above, the file reads into the code I listed above, BUT using the same file AFTER moving the it still is column overlap in the Works spreadsheet.  I have used the script I listed before and with data where the "  is not proceeded by a space ("xyz" versus "xyz " for example) and it worked in both the script and Works.

CZARDAS - To your wondering why I strip out white spaces brings me full circle to my comment on the "why". I just did what somebody in the forum suggest be done.  I really didn't ask or question - I am way too new at this.

Where does it leave me?  I may have script that worked and was just LUCKY before.  I thought I fully understood the concept of a delimiter as well and can't see why the " placement would interfere. 

I want to fully understand this to build a good learning foundation rather than just CUT and PASTE without knowing anything - that may have just gotten me to where I am.

Thank you both for thoughts and any direction. Being a newbie is such a headache.

Just thought of this as well - is there difference in reading a CSV file that is one continuous line versus a file where you see each line? 

Hobbyist

Link to comment
Share on other sites

Some line break character combinations don't necessarily show up in notepad. Stripping white spaces may corrupt the format - I haven't thought very deeply about it but it constitutes a form of data loss regardless. You should read the wiki article about csv format. https://en.wikipedia.org/wiki/Comma-separated_values

Despite the marvelous features of Microsoft Excel, it always seems rather flakey when it comes to reading csv files - don't ask me why, because csv format is one of the simplest possible ways to arrange data in a grid. I can only surmise that it is a design choice.

Edited by czardas
Link to comment
Share on other sites

Yup, just a bonus line feed, this leaves the carraige returns behind to keep it all nice like

$sFile = FileRead("February2016_205xxxx.csv")
$sFile = stringreplace($sFile , @LF , "")
FileWrite("Newcsv.csv" , $sFile)

 

,-. .--. ________ .-. .-. ,---. ,-. .-. .-. .-.
|(| / /\ \ |\ /| |__ __||| | | || .-' | |/ / \ \_/ )/
(_) / /__\ \ |(\ / | )| | | `-' | | `-. | | / __ \ (_)
| | | __ | (_)\/ | (_) | | .-. | | .-' | | \ |__| ) (
| | | | |)| | \ / | | | | | |)| | `--. | |) \ | |
`-' |_| (_) | |\/| | `-' /( (_)/( __.' |((_)-' /(_|
'-' '-' (__) (__) (_) (__)

Link to comment
Share on other sites

Glory Glory Man United!  Descendents of the smarter UK folk who kept walking west until it got warm and flat.

 

edit:  they say our hick accents are very liverpool-ish, but when I watch the 51st State I think we sound more mancunian, and this scene is so epic

 

 

Edited by iamtheky

,-. .--. ________ .-. .-. ,---. ,-. .-. .-. .-.
|(| / /\ \ |\ /| |__ __||| | | || .-' | |/ / \ \_/ )/
(_) / /__\ \ |(\ / | )| | | `-' | | `-. | | / __ \ (_)
| | | __ | (_)\/ | (_) | | .-. | | .-' | | \ |__| ) (
| | | | |)| | \ / | | | | | |)| | `--. | |) \ | |
`-' |_| (_) | |\/| | `-' /( (_)/( __.' |((_)-' /(_|
'-' '-' (__) (__) (_) (__)

Link to comment
Share on other sites

I thought we had solved the problem and answered your question. The file you posted can be read by the code I posted - providing you put the UDF code (and the csv file) in the same folder and name it CSVSplit.au3.

The code you posted begins by stripping out white spaces. I asked why you were doing this and you said someone told you to do it. I can tell you that your code is not a general purpose csv parser. To understand why this is the case, you first need to understand the most commonly used csv syntax, which is described in the article I linked to.

Determining the exact consequences of using code not suited to the task is potentially a lot of effort. Suffice it to say that it will never work without a full rewrite. Coding isn't about combining incompatible resources and observing the results. It might provide some entertainment, but there are faster ways to learn. Don't take this as a rebuff, I fully understand your desire to see where your code fails. Read the article - especially this part:
https://en.wikipedia.org/wiki/Comma-separated_values#Basic_rule
I admit it's a rather detailed, but you should at least be able to understand why your code is unsuitable as a general csv parser.

Edited by czardas
Link to comment
Share on other sites

czardas & iamthekey

You both have greatly helped me with my question. I read the material at the link you provided and already see that I was behind the game. I actually think it was just a struck of luck my code had worked - up until now.  So I have gone from "tweaking" to a blow up and build again.

Thanks for your guidance and patience.  You both get thumbs up.

 

Hobbyist

Link to comment
Share on other sites

It's tricky to design a function that will cover a variety of csv formats. You should learn a lot in the process. My function was written a while ago now and could be improved. It has worked well enough for me so far, and I'm too busy with other projects to spend time on it unfortunately. Good luck with your progress! :)

Edited by czardas
Link to comment
Share on other sites

CZARDAS

BINGO!   I was not happy not being able to answer your question about the white space issue but can now. I went through my journal looking for the +/- of the topic from when I started CSV.

At the time and now I find out erroneously, I thought all CSV files where ONE continuous line of records and got some advice on white space - and being a newbie had no knowledge of WHY.  

So I have read the wiki notes on CSV and no doubt need to read them again. But on the first pass I thought of a question I hope you can lend some advice.

When thinking of rewriting the script above it seems I would need to plan for a.) a file that could be one continuous line and b.) one that is reflected above where each record is on its own line.  Of course at this point I say this because I took two files and opened each in notepad and found them as just described.  Or am I way off on this @CR @ LF   @CRLF stuff.

So it would seem to me that before being able to process a file wouldn't I need to know if its a.)one line of continuous data or b.) multiple and secondly convert one to the other so they always process the same?

Hopefully you can give me a push in the right direction - it might be simpler than I think or complex but I'm drawing a blank as of now. Not looking for you to script it but rather lend  your knowledge.

 

Thanks. Cheers

Hobbyist

 

Link to comment
Share on other sites

Quote

Or am I way off on this @CR @ LF   @CRLF stuff.

you can have any or all of those characters, in various combinations, your issue was in the application the user chooses to view it.  If its excel then it has word wrap by default on cells and a line feed jacks the view up, if its notepad it ignores them and everything looks fine.  I would build a separate sanitization routine for output to excel so you dont mix the cosmetic with the necessary.  But @czardas is the pro, so await his guidance for sure.

,-. .--. ________ .-. .-. ,---. ,-. .-. .-. .-.
|(| / /\ \ |\ /| |__ __||| | | || .-' | |/ / \ \_/ )/
(_) / /__\ \ |(\ / | )| | | `-' | | `-. | | / __ \ (_)
| | | __ | (_)\/ | (_) | | .-. | | .-' | | \ |__| ) (
| | | | |)| | \ / | | | | | |)| | `--. | |) \ | |
`-' |_| (_) | |\/| | `-' /( (_)/( __.' |((_)-' /(_|
'-' '-' (__) (__) (_) (__)

Link to comment
Share on other sites

6 hours ago, iamtheky said:

But @czardas is the pro

Hardly. I was never 100% happy with my approach, but here's more or less what I did (from memory) in plain English. I first identified fields enclosed in double quotes. I then replaced all CRLF outside enclosed fields with LF. I then replaced all CR outside enclosed fields with LF. I then replaced all LF outside enclosed fields with a private range Unicode character and used that to split the string to get rows (leaving the breaks within enclosed fields intact). A similar process was used to replace commas outside enclosed fields to split each row into columns.

Certain double quotes wrapped around (or within) enclosed fields must also be removed, depending on where they appear. Although my code uses a crude method, I managed to get it working back then. There are more elegant solutions using RegExp. \R is meant to target any combination of CRLF, CR and LF, although I haven't used it yet. There are some examples on the forum of parsing csv with RegExp. Each of the methods I have seen has its strengths and weaknesses.

Edited by czardas
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...