alepage
Barite | Level 11

Hello,

 

I have tried PROC IMPORT to import a CSV file that might have some structural problems, and it failed. I have also tried the code generated by the import tool, but without the cleansing option. However, it looks as if some data lines (see the attached Excel file) are spread across multiple lines instead of staying on one line as in the original file structure.

For example, the first three variables should be as below:

Startdate: 2022-12-06 16:56:08

Endate: 2022-12-06 17:00:58

Status: IP Address

And so on…

So each line should start with Startdate, Endate, Status, IP and so on. If that is not the case, the partial line should be concatenated with the previous one. Likewise, if there are a second and a third partial line, the third partial line should be concatenated with the second, and that result concatenated with the first.
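As a rough illustration of that stitching logic, here is a minimal DATA step sketch. The filerefs rawcsv and fixed are placeholders, and it assumes the first three rows are header/metadata rows that arrive intact and that every complete data record starts with a timestamp such as 2022-12-06; none of these assumptions come from the original file.

filename rawcsv "c:\downloads\test.csv";  /* assumed path */
filename fixed temp;

data _null_;
  infile rawcsv end=eof;
  file fixed;
  length buffer $32767;
  retain buffer;
  input;
  /* a new record starts with a date; anything else is a partial line */
  if _n_ le 3 or prxmatch('/^"?\d{4}-\d{2}-\d{2} /', _infile_) then do;
    if not missing(buffer) then put buffer;   /* flush the previous record */
    buffer = _infile_;
  end;
  else buffer = catx(' ', buffer, _infile_);  /* append the partial line */
  if eof and not missing(buffer) then put buffer;
run;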

 

If it is possible to polish the file in that manner, it should then be readable with either an INFILE statement or PROC IMPORT.

 

The difficulty is how to clean (polish) that file first; I have never done that before.

Any suggestions?

 

 

7 REPLIES
Reeza
Super User
Your "CSV" has two header lines, and each line seems to be quoted itself, with quotes internally as well that appear to be double quoted which helps.

If you delete the second, JSON-looking header and use a standard PROC IMPORT, does that get you closer?
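Something like this might be enough (a sketch only, using the file path from the log later in the thread; DATAROW= skips the extra header-like rows instead of deleting them by hand):

proc import datafile="c:\downloads\test.csv" out=want
     dbms=csv replace;
  getnames=yes;      /* variable names from row 1 */
  datarow=4;         /* skip the two extra header rows */
  guessingrows=max;  /* scan the whole file before guessing types */
run;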

Reeza
Super User
Or brute-force it by reading a single set of quotes at a time, if the EOL allows it, and then parsing each field manually.
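A rough sketch of that manual parsing, assuming a fileref csv for the file and that each record already sits on one physical line; the variable names and lengths are placeholders. The 'q' modifier makes SCAN ignore commas inside quotes, 'm' keeps empty fields in position, and DEQUOTE strips the outer quotes:

data parsed;
  infile csv truncover;
  input;
  length startdate enddate status $40;
  startdate = dequote(scan(_infile_, 1, ',', 'mq'));
  enddate   = dequote(scan(_infile_, 2, ',', 'mq'));
  status    = dequote(scan(_infile_, 3, ',', 'mq'));
  /* ...and so on for the remaining fields */
run;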
ballardw
Super User

When I see stuff like this on the third line of your data (way off to the right)

""""}"",""{""""ImportId"""":""""QID24_TEXT_TRANSLATEDen1uk8bwp""""}"",""{""""ImportId"""":""""QID24_TEXT_TRANSLATEDfrxaxooxr""""}"",""{""""

I suspect these bits are saying something about literal text entered by respondents. So one issue may relate to end-of-line characters entered by respondents using the Enter key. Just a guess, but "Text_translated" sounds odd. Since your first header row has a bunch of "- Other (please specify)" entries, one strongly suspects a relationship to those bits.

 

Likely you want to start reading data at line 4, using the FIRSTOBS= option on the INFILE statement.
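For example (a sketch only; the fileref, variable names, and lengths are placeholders):

data want;
  infile csv firstobs=4 dsd truncover lrecl=32767;
  length StartDate EndDate $20 Status $30;
  input StartDate EndDate Status;  /* ...plus the remaining fields */
run;

DSD honors quoted commas, and TRUNCOVER keeps a short line from consuming the next one.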

Do you have any documentation from the source of the file as to the type and content of the fields? I would use that to determine the variable properties for reading the data, length, type and informat.

 

The body of the file seems a tad inconsistent about where quotes are used, and the occasional "" (two double quotes) is problematic. A traditional comma-separated file wants quotes around fields that contain one or more commas as part of the data, and everything inside one pair of quotes is treated as a single value.
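For reference, a well-formed CSV line quotes any field containing commas and doubles any embedded quote characters, for example:

id,name,comment
1,"Smith, John","He said ""hello"" twice"

Everything between the outer quotes, including the comma and the doubled quotes, is one field.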

 

By any chance do you have access to the actual data system these come from? There might be some options available for exporting the data that would make the file a bit more consistent. You can also search this forum for CSV and "end of line" (I think) to find some of the possible solutions for dealing with embedded Enter key presses. Sometimes the TERMSTR= infile option will help, as it sets which combination of carriage return, line feed, or single character is treated as end of line. You may get lucky with the embedded characters actually being different from the rest of the file's line endings.
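For example, if the real records end in carriage return plus line feed but the embedded Enter key presses produced bare line feeds, something like this sketch would let SAS break records only on the CRLF pairs:

data want;
  infile csv dsd truncover termstr=crlf;
  input;  /* embedded bare LF characters no longer end the record */
run;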

Tom
Super User

You did not attach any Excel file. You just attached a CSV file. A CSV file is just a text file and so has nothing to do with Excel.

 

Is this the original CSV file? Or did you accidentally open the original CSV file with Excel and then overwrite the original by saving Excel's attempt to interpret the file over it? Excel will mangle CSV files if you let it open them automatically without telling it how to interpret each column.

 

How many observations should the file have? How many fields should each observation have?

This is what you posted.

1105  data _null_;
1106    infile csv ;
1107    input;
1108    words=countw(_infile_,',','mq');
1109    put _n_= words=;
1110  run;

NOTE: The infile CSV is:
      Filename=c:\downloads\test.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=17513,
      Last Modified=24Jan2023:08:27:01,
      Create Time=24Jan2023:08:27:01

_N_=1 words=132
_N_=2 words=1
_N_=3 words=1
_N_=4 words=1
_N_=5 words=1
_N_=6 words=105
_N_=7 words=2
_N_=8 words=1
_N_=9 words=132
_N_=10 words=132
_N_=11 words=1
NOTE: 11 records were read from the infile CSV.
      The minimum record length was 1.
      The maximum record length was 6148.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

So it looks like there should be 132 fields per line. Embedded end-of-line strings might explain why some lines have fewer, but adding the field counts from lines 2 through 8 does not reach 132.

 

So the file appears to have both embedded end-of-line characters and some lines that have been entirely enclosed in quotes. Let's get some tools and see if we can fix it.

First let's get a tool for dealing with embedded CR and/or LF characters in delimited text files. 

filename rcrlf url "https://raw.githubusercontent.com/sasutils/macros/master/replace_crlf.sas";
%include rcrlf;

Then let's get a better tool for guessing how to read a delimited text file into a SAS dataset.

filename csv2ds url "https://raw.githubusercontent.com/sasutils/macros/master/csv2ds.sas";
%include csv2ds;
filename parmv url "https://raw.githubusercontent.com/sasutils/macros/master/parmv.sas";
%include parmv;

To try to fix the lines that appear to be completely quoted, let's test whether the line has only one field and starts and ends with a quote.

So first let's remove embedded CRLF characters. Then remove extra quotes around single field lines. Then try again to remove embedded CRLF characters.

filename csv "c:\downloads\test.csv";
filename fixcrlf temp;
%replace_crlf(csv,fixcrlf);
filename dequote temp;
data _null_;
  infile fixcrlf ;
  file dequote;
  input;
  /* a fully quoted line counts as a single "field"; strip its outer quotes */
  if countw(_infile_,',','qm')=1 and char(_infile_,1)='"'
    and char(_infile_,length(_infile_))='"' then
    _infile_=dequote(_infile_)
  ;
  put _infile_;
run;
filename fixcrlf2 temp;
%replace_crlf(dequote,fixcrlf2);

So this file begins to look more like a CSV file.

2259  data _null_;
2260    infile fixcrlf2 ;
2261    input;
2262    words=countw(_infile_,',','mq');
2263    put _n_= words= ;
2264  run;

NOTE: The infile FIXCRLF2 is:
      Filename=...\#LN00062,
      RECFM=V,LRECL=32767,File Size (bytes)=16030,
      Last Modified=24Jan2023:09:30:22,
      Create Time=24Jan2023:09:30:22

_N_=1 words=132
_N_=2 words=128
_N_=3 words=133
_N_=4 words=132
_N_=5 words=132
_N_=6 words=27
NOTE: 6 records were read from the infile FIXCRLF2.
      The minimum record length was 431.
      The maximum record length was 6032.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

The last line seems truncated, but perhaps you stopped in the middle of that observation because of the extra line breaks in the file you copied from.

 

Another problem with the file is that there appear to be two extra lines inserted between the header row and the actual data.  That is something that the %CSV2DS() macro can handle.  Just tell it to read the names from row 1 and start reading the data at row 4.

%csv2ds(fixcrlf2,namerow=1,datarow=4,out=want,replace=1)

If we look at the _TYPES_ dataset that %CSV2DS() uses to decide how to define the variables, you can see that it is now making more reasonable choices for how to define them.

[screenshot: the _TYPES_ dataset with the inferred variable definitions]

And the data for the first few variables looks reasonable:

proc print data=want;
  var startdate -- ResponseId;
run;

[screenshot: PROC PRINT output of the first few variables]

The extra two rows seem to be some type of prompt/description and some type of code for pulling the values.

That second extra row seems to have one field split somehow.  You can clean it up with a little work:

%csv2ds(fixcrlf2,out=metadata,replace=1);
proc transpose data=metadata(obs=2) out=vars prefix=ROW;
  var _all_;
run;

data vars;
 set vars;
 /* look ahead one observation to grab the next ROW2 value; the      */
 /* one-obs, zero-variable dataset keeps the step from ending early  */
 set vars(keep=row2 rename=(row2=nextrow2) firstobs=2)  vars(drop=_all_ obs=1);
 /* the split pushed every code one column to the right: rejoin the  */
 /* two halves for the first variable, then shift the rest back up   */
 if _n_=1 then row2=catx(',',row2,nextrow2);
 else row2=nextrow2;
 drop nextrow2;
 if _name_='VAR133' then delete;  /* spurious extra column */
 rename row1=Prompt row2=Code ;
run;

proc print;
run;

[screenshot: PROC PRINT of the cleaned Prompt/Code metadata]


alepage
Barite | Level 11

Hello Tom

The CSV file is obtained by unzipping a file that is downloaded using PROC HTTP.

The test.csv file provided is an extract of the original file. It contains just a few observations.

Indeed, there are 132 fields per observation.

Now let's summarize what you have done:

  1. In the first step, you count the number of words (fields) per observation. This reveals that the file needs to be polished.
  2. After that, the macro %replace_crlf(csv,fixcrlf) is used to get rid of the carriage returns and line feeds that are inside quotes.
  3. Then the quotes at the beginning and at the end of an observation are removed.
  4. Then %replace_crlf is run again to remove any remaining embedded carriage returns and line feeds.
  5. Finally, the %csv2ds macro converts the CSV file into a dataset.

I have tested the proposed script with the complete csv file, and it works very well.

Thank you very much for your help.

Tom
Super User

If Excel can open the full file without getting columns misaligned (broken into multiple lines, for example), then you should be able to skip the data step that tries to remove the extra quoting. I suspect that the extra quoting might have been introduced when you made a subset of the file.

 

Excel can handle embedded end of line characters in fields that are quoted. SAS cannot, so you probably need to do something like the %replace_crlf() macro to make a file that SAS can read.

 

If you have to read multiple versions of this file then it would be better to skip the %CSV2DS() macro and instead run your own data step to read the file directly. You could start with the code generated by the %CSV2DS() macro if you want and adjust it. Note that %CSV2DS() will have written that code to the temporary fileref _CODE_ that it created, so you could add code to copy it to a permanent file. Or you can use the RUN=0 option of the macro; instead of actually running the generated code, it will dump it to the SAS log, where it is easier to copy and paste.
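For instance, a minimal sketch of copying the generated code out of the temporary _CODE_ fileref after a normal %CSV2DS() run (the destination path is a placeholder):

filename mycode "c:\myproject\read_survey.sas";  /* assumed path */
data _null_;
  infile _code_;   /* fileref created by %csv2ds() */
  file mycode;
  input;
  put _infile_;
run;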

alepage
Barite | Level 11
I have read the complete CSV file with the proposed code, and it cleaned up the file and converted it into a nice dataset.

Thank you very much

