Hello,
I have tried PROC IMPORT to read a CSV file that might have some structural problems, and it failed. I have also tried the code generated by the import tool, but without the cleansing option. It looks like some data lines (see the attached Excel file) are spread over multiple lines instead of staying on one line as in the original file structure.
For example, the first three variables should be as below:
Startdate: 2022-12-06 16:56:08
Endate: 2022-12-06 17:00:58
Status: IP Address
And so on…
So each line should start with Startdate, Endate, Status IP, and so on. If that is not the case, the partial line should be concatenated with the previous one. Likewise, if there are a second and a third partial line, then the third partial line should be concatenated with the second partial line, and that result concatenated with the first partial line.
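To illustrate the rule in SAS terms, here is a rough sketch of what I mean (assuming every complete record begins with a timestamp like 2022-12-06 16:56:08; the output path is just an example, and I do not know how to make this robust with quoted fields, which is why I am asking):

data _null_;
  infile "c:\downloads\test.csv" truncover end=eof;
  file "c:\downloads\test_fixed.csv" lrecl=32767;  /* example output path */
  length accum $32767;
  retain accum;
  input;
  /* A timestamp at the start of the line signals a new record */
  if _n_ > 1 and prxmatch('/^"?\d{4}-\d{2}-\d{2} /', _infile_) then do;
    put accum;          /* previous record is complete: write it out */
    accum = _infile_;   /* start accumulating the new record */
  end;
  else accum = catt(accum, _infile_);  /* partial line: append to previous */
  if eof then put accum;  /* flush the last record */
run;

The header rows would not match the timestamp test, so they would need separate handling.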
If it is possible to polish the dataset in that manner, it will then be possible to read it with either an INFILE statement or PROC IMPORT.
The difficulty is how to clean (polish) the dataset in the first place. I have never done that before.
Any suggestions?
When I see stuff like this on the third line of your data (way off to the right)
""""}"",""{""""ImportId"""":""""QID24_TEXT_TRANSLATEDen1uk8bwp""""}"",""{""""ImportId"""":""""QID24_TEXT_TRANSLATEDfrxaxooxr""""}"",""{""""
I suspect that these bits are saying something about literal text entered by respondents. So you have one issue that may relate to end-of-line characters entered by respondents pressing the Enter key. Just a guess, but "Text_translated" sounds odd. Since your first header row has a bunch of "- Other (please specify)" entries, one strongly suspects a relationship to those bits.
Likely you want to start reading data at line 4, using the FIRSTOBS= option on the INFILE statement.
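A minimal sketch of the idea, assuming a fileref CSV for the file; the informats and lengths here are placeholders, since the real file has many more fields:

data want;
  infile csv dsd truncover firstobs=4;
  input Startdate :anydtdtm32. Endate :anydtdtm32. Status :$40.
        /* ... the remaining fields ... */ ;
  format Startdate Endate datetime20.;
run;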
Do you have any documentation from the source of the file as to the type and content of the fields? I would use that to determine the variable properties for reading the data: length, type, and informat.
The body of the file seems to be a tad inconsistent about where quotes are used, and the occasional "" (two double quotes) is problematic. A traditional comma-separated file wants quotes around fields that contain one or more commas as part of the data, and everything inside one pair of " " is a single value.
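For example, a well-formed line with an embedded comma and an embedded quote would look like this, with the inner quote doubled:

2022-12-06 16:56:08,"Smith, John","He said ""hello""",5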
By any chance do you have access to the actual data system these come from? There might be some export options available that would make the file a bit more consistent. You can also search this forum for CSV and "end of line" (I think) to find some of the possible solutions for dealing with embedded Enter key presses. Sometimes the TERMSTR= infile option will help, as it sets which combination of carriage return, line feed, or single character is treated as end of line. You may get lucky and find that the embedded characters are actually different from the rest of the file's end-of-line characters.
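A quick sketch of the idea, assuming a fileref CSV for the file and that the real line endings are CR+LF while the embedded ones are lone line feeds:

data _null_;
  infile csv termstr=crlf;  /* only CR+LF pairs count as end of line */
  input;
  put _n_= _infile_=;       /* embedded lone LFs now stay inside the record */
run;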
You did not attach any Excel file. You just attached a CSV file. A CSV file is just a text file and so has nothing to do with Excel.
Is this the original CSV file? Or did you accidentally open the original CSV file with Excel and then overwrite the original by saving Excel's attempt to interpret the file over it? Excel will mangle CSV files if you let it open them automatically without telling it how to interpret each column.
How many observations should the file have? How many fields should each observation have?
This is what you posted.
1105  data _null_;
1106    infile csv ;
1107    input;
1108    words=countw(_infile_,',','mq');
1109    put _n_= words=;
1110  run;

NOTE: The infile CSV is:
      Filename=c:\downloads\test.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=17513,
      Last Modified=24Jan2023:08:27:01,
      Create Time=24Jan2023:08:27:01

_N_=1 words=132
_N_=2 words=1
_N_=3 words=1
_N_=4 words=1
_N_=5 words=1
_N_=6 words=105
_N_=7 words=2
_N_=8 words=1
_N_=9 words=132
_N_=10 words=132
_N_=11 words=1
NOTE: 11 records were read from the infile CSV.
      The minimum record length was 1.
      The maximum record length was 6148.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
So it looks like perhaps there should be 132 fields per line. Perhaps there are some embedded end-of-line strings that might explain why some lines have fewer. But adding up the field counts in lines 2 through 8 (1+1+1+1+105+2+1 = 112) does not reach 132.
So the file appears to have both embedded end-of-line characters and also some lines appear to have been totally enclosed in quotes. Let's get some tools and see if we can fix it.
First let's get a tool for dealing with embedded CR and/or LF characters in delimited text files.
filename rcrlf url "https://raw.githubusercontent.com/sasutils/macros/master/replace_crlf.sas";
%include rcrlf;
Then let's get a better tool for guessing how to read a delimited text file into a SAS dataset.
filename csv2ds url "https://raw.githubusercontent.com/sasutils/macros/master/csv2ds.sas";
%include csv2ds;
filename parmv url "https://raw.githubusercontent.com/sasutils/macros/master/parmv.sas";
%include parmv;
To try to fix the lines that appear to be completely quoted, let's test whether a line has only one field and starts and ends with a quote.
So first let's remove embedded CRLF characters. Then remove extra quotes around single field lines. Then try again to remove embedded CRLF characters.
filename csv "c:\downloads\test.csv";
filename fixcrlf temp;
%replace_crlf(csv,fixcrlf);
filename dequote temp;
data _null_;
  infile fixcrlf ;
  file dequote;
  input;
  /* If the whole line is one quoted field, strip the outer quotes */
  if countw(_infile_,',','qm')=1 and char(_infile_,1)='"'
     and char(_infile_,length(_infile_))='"' then
    _infile_=dequote(_infile_)
  ;
  put _infile_;
run;
filename fixcrlf2 temp;
%replace_crlf(dequote,fixcrlf2);
So this file begins to look more like a CSV file.
2259  data _null_;
2260    infile fixcrlf2 ;
2261    input;
2262    words=countw(_infile_,',','mq');
2263    put _n_= words= ;
2264  run;

NOTE: The infile FIXCRLF2 is:
      Filename=...\#LN00062,
      RECFM=V,LRECL=32767,File Size (bytes)=16030,
      Last Modified=24Jan2023:09:30:22,
      Create Time=24Jan2023:09:30:22

_N_=1 words=132
_N_=2 words=128
_N_=3 words=133
_N_=4 words=132
_N_=5 words=132
_N_=6 words=27
NOTE: 6 records were read from the infile FIXCRLF2.
      The minimum record length was 431.
      The maximum record length was 6032.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
The last line seems truncated, but perhaps you stopped copying in the middle of that observation because of the extra line breaks in the file you copied from.
Another problem with the file is that there appear to be two extra lines inserted between the header row and the actual data. That is something that the %CSV2DS() macro can handle. Just tell it to read the names from row 1 and start reading the data at row 4.
%csv2ds(fixcrlf2,namerow=1,datarow=4,out=want,replace=1)
If we look at the _TYPES_ dataset that %CSV2DS() uses to decide how to define the variables, you can see that it is now making more reasonable choices for how to define the variables.
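For example, assuming the macro leaves _TYPES_ in the WORK library:

proc print data=_types_;
run;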
And the data for this first few variables looks reasonable:
proc print data=want;
var startdate -- ResponseId;
run;
The extra two rows seem to be some type of prompt/description and some type of code for pulling the values.
That second extra row seems to have one field split somehow. You can clean it up with a little work:
%csv2ds(fixcrlf2,out=metadata,replace=1);

proc transpose data=metadata(obs=2) out=vars prefix=ROW;
  var _all_;
run;

data vars;
  set vars;
  /* Look ahead: read the NEXT observation's ROW2 value. The one-observation
     tail with no variables keeps this SET from ending the step early. */
  set vars(keep=row2 rename=(row2=nextrow2) firstobs=2) vars(drop=_all_ obs=1);
  /* Row 2 had one field split in two: glue the first pair back together,
     then shift the remaining values up by one. */
  if _n_=1 then row2=catx(',',row2,nextrow2);
  else row2=nextrow2;
  drop nextrow2;
  if _name_='VAR133' then delete;
  rename row1=Prompt row2=Code ;
run;

proc print;
run;
Hello Tom
The CSV file is obtained by unzipping a file that is downloaded using PROC HTTP.
The test.csv file provided is an extract of the original file; it contains just a few observations.
Indeed, there are 132 fields per observation.
Now let me summarize what you have done: I tested the proposed script with the complete CSV file, and it works very well.
Thank you very much for your help.
If Excel can open the full file without misaligning any columns (breaking rows into multiple lines, for example), then you should be able to skip the data step that tries to remove the extra quoting. I suspect that the extra quoting might have been introduced when you tried to make a subset of the file.
Excel can handle embedded end-of-line characters in fields that are quoted. SAS cannot, so you probably need something like the %replace_crlf() macro to make a file that SAS can read.
If you have to read multiple versions of this file, then it would be better to skip the %CSV2DS() macro and instead run your own data step to read the file directly. You could start with the code generated by the %CSV2DS() macro and adjust it. Note that %CSV2DS() will have written that code to the temporary fileref _CODE_ that it created, so you could add code to copy it to a permanent file. Or you can use the RUN=0 option of the macro: instead of actually running the generated code, it dumps the code to the SAS log, where it is easier to copy and paste.
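For example:

/* Dump the generated DATA step to the log instead of running it */
%csv2ds(fixcrlf2,namerow=1,datarow=4,out=want,replace=1,run=0)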