Hi All,
I have a data step as below to read a csv file, but the memo field for some records jump to the next row in the output dataset.
I open the csv file in notepad and see the memo field seems to have double space before the quotation mark (not sure if that is the reason). See below data.
I also tried TERMSTR=CRLF or TERMSTR=LF but I am still seeing the memo filed jumps to the next row.
Given there is not an option to fix the data quality from the upstream, is there a way to resolve this in the data step?
Johnson and Johnson,REF35847,Monthly,159,"Sandigo, Luke [INVENTORY MANAGER]","Meet - Standard - Year1/Quarter1 - 35153194 stock 301463
INV#12011088 ",5
data inventory;
infile "/warehouse/users/Jack/inventory/inventory_20201130.csv" dlm= ',' missover DSD lrecl=32767 firstobs=2 ;
input
CompanyName : $300.
Ref_ID : $50.
Freq : $10.
Value : 4.
Salesperson : $100.
Memo :$200.
ID :3.
;
run;
Try removing the missover option.
Thanks for offering your suggestion. Without the missover, the memo value stays in the same row but part of the value moves to the next column (ID), and ID moves to the next row.
I suspect fixing the data quality from upstream might be the only way.
Fixing the issue upstream would be the ideal solution. But if that can't be done, try posting a few lines from the csv file here, so that we can give it a better try.
Thanks PG. I realized there is embedded CRLF in the text after stock 301463, so INV#12011088 moves to the next row when SAS reads the data. I don't know if that is something fixable in SAS code.
@Tom might be able to help here...
Search for the other 25 gazillion times this question has been asked on the forum.
https://www.google.com/search?q=%40sas.com+embedded+line+breaks+in+CSV+file
Thanks Tom. I saw those posts prior to asking my question on the forum. Later I found out my issue is not due to double quotation mark, extra space or line break, etc.
Embedded CRLF is found in the middle of the value for one filed (from the csv file I have), after reading it from SAS, it seems the embedded CRLF breaks the value into two separate parts - 1st part stays in the same row, remaining columns become missing, then 2nd part of the value jumps to next row.
We couldn't resolve this issue in the SAS code and had to explore alternative solution.
The linked post is a solution to handling embedded end of line characters in delimited files. As long as the fields that contain the extra end of line characters are enclosed in double quotes it should work to transform the file into a file that SAS can read.
But the best solution is to fix the problem at the source and generate delimited files that do not contain end of line characters anywhere other than at the end of the lines.
If you have no other tools for examining text files the SAS data step will work just fine.
data _null_;
infile 'myfile.csv' ;
input;
list;
run;
If you really want to see what the end of line character is and if there are any in the middle of lines then you can read the file a fixed length and the end of line characters are treated as part of the data.
data _null_;
infile 'myfile.csv' refcm=f lrecl=100 ;
input;
list;
run;
No. Use the fixed length just to LOOK at the file so you can figure out what garbage you have and how you can read it.
The "garbage" is invisible neither in csv nor sas dataset. I tried the fix length approach but it did not do anything in this situation. The "garbage" is only visible when we use unix command to read the file in putty. Again, I think the subject of this question is confusing because later I found out my issue is not due to space, quotation mark or line break etc. Thanks again for all your advice.
To see how you can use the LIST statement to look at what is in your text file run this example:
filename csv temp;
data _null_;
file csv ;
put '1,2,3,"first part'
/ 'second part",5'
/ '6,7,8,"ok",10'
;
run;
data _null_;
infile csv;
input;
list;
run;
data _null_;
infile csv recfm=f lrecl=20 ;
input;
list;
run;
The first reading will show that the file has three "lines", none of which contain any characters that LIST considers as non-printing.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-- 1 1,2,3,"first part 17 2 second part",5 14 3 6,7,8,"ok",10 13 NOTE: 3 records were read from the infile CSV. The minimum record length was 13. The maximum record length was 17.
The second reading will show you where the CR ('0D'x) and LF ('0A'x) appear in the file. So you can see that there are CRLF pairs in the middle of the first string between FIRST PART and SECOND PART. And also that the file is using CRLF pairs at the end of the lines.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-- 1 CHAR 1,2,3,"first part..s ZONE 32323226677727677007 NUMR 1C2C3C26923400124DA3 2 CHAR econd part",5..6,7,8 ZONE 66666276772230032323 NUMR 53FE4001242C5DA6C7C8 3 CHAR ,"ok",10.. 10 ZONE 2266223300 NUMR C2FB2C10DA NOTE: 3 records were read from the infile CSV.
Other common non-printing characters are TAB '09'x FORM FEED '0C'x non-breaking space 'A0'x.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.