How to read csv file with space followed by quotation mark and line br...

LL5 · Posted 01-07-2021 04:19 PM

Hi All,

I have a data step as below to read a csv file, but the memo field for some records jump to the next row in the output dataset.

I open the csv file in notepad and see the memo field seems to have double space before the quotation mark (not sure if that is the reason). See below data.

I also tried TERMSTR=CRLF or TERMSTR=LF but I am still seeing the memo filed jumps to the next row.

Given there is not an option to fix the data quality from the upstream, is there a way to resolve this in the data step?

Johnson and Johnson,REF35847,Monthly,159,"Sandigo, Luke [INVENTORY MANAGER]","Meet - Standard - Year1/Quarter1 - 35153194 stock 301463
INV#12011088 ",5

data inventory;
infile "/warehouse/users/Jack/inventory/inventory_20201130.csv" dlm= ',' missover DSD lrecl=32767 firstobs=2 ;
  input
    CompanyName      : $300.      
    Ref_ID                    : $50.
    Freq                        : $10.
    Value                       : 4.
    Salesperson            : $100.
    Memo                       :$200.
    ID                            :3.
;
run;

PGStats · Posted 01-07-2021 05:23 PM

Try removing the missover option.

PG

LL5 · Posted 01-07-2021 05:35 PM

Thanks for offering your suggestion. Without the missover, the memo value stays in the same row but part of the value moves to the next column (ID), and ID moves to the next row.

I suspect fixing the data quality from upstream might be the only way.

PGStats · Posted 01-07-2021 05:43 PM

Fixing the issue upstream would be the ideal solution. But if that can't be done, try posting a few lines from the csv file here, so that we can give it a better try.

PG

LL5 · Posted 01-07-2021 07:49 PM

Thanks PG. I realized there is embedded CRLF in the text after stock 301463, so INV#12011088 moves to the next row when SAS reads the data. I don't know if that is something fixable in SAS code.

PGStats · Posted 01-07-2021 07:53 PM

@Tom might be able to help here...

PG

Tom · Posted 01-07-2021 11:13 PM

Search for the other 25 gazillion times this question has been asked on the forum.

https://www.google.com/search?q=%40sas.com+embedded+line+breaks+in+CSV+file

https://communities.sas.com/t5/New-SAS-User/Reading-CSV-file-with-double-quotes-line-breaks-and-spac...

LL5 · Posted 01-08-2021 04:58 PM

Thanks Tom. I saw those posts prior to asking my question on the forum. Later I found out my issue is not due to double quotation mark, extra space or line break, etc.

Embedded CRLF is found in the middle of the value for one filed (from the csv file I have), after reading it from SAS, it seems the embedded CRLF breaks the value into two separate parts - 1st part stays in the same row, remaining columns become missing, then 2nd part of the value jumps to next row.

We couldn't resolve this issue in the SAS code and had to explore alternative solution.

Tom · Posted 01-08-2021 06:55 PM

The linked post is a solution to handling embedded end of line characters in delimited files. As long as the fields that contain the extra end of line characters are enclosed in double quotes it should work to transform the file into a file that SAS can read.

But the best solution is to fix the problem at the source and generate delimited files that do not contain end of line characters anywhere other than at the end of the lines.

LL5 · Posted 01-08-2021 07:18 PM

The end of life character is invisible in the csv file. I looked at the specific row of data in unix and found the CRLF in the middle of the record.

Tom · Posted 01-08-2021 07:22 PM

If you have no other tools for examining text files the SAS data step will work just fine.

data _null_;
   infile 'myfile.csv' ;
   input;
   list;
run;

If you really want to see what the end of line character is and if there are any in the middle of lines then you can read the file a fixed length and the end of line characters are treated as part of the data.

data _null_;
   infile 'myfile.csv'  refcm=f lrecl=100 ;
   input;
   list;
run;

LL5 · Posted 01-08-2021 07:34 PM

I got your point. I though about using fixed length and treat the special/invisible character as part of the data, but I can’t be exactly sure how many invisible characters are there and where are they. I have around 100k obs and 125 variable.
The csv file is an output file/downstream from another upstream process. I suspect upstream data provider copied and pasted some texts from the website that carries CRLF.

Tom · Posted 01-08-2021 08:23 PM

No. Use the fixed length just to LOOK at the file so you can figure out what garbage you have and how you can read it.

LL5 · Posted 01-11-2021 10:27 AM

The "garbage" is invisible neither in csv nor sas dataset. I tried the fix length approach but it did not do anything in this situation. The "garbage" is only visible when we use unix command to read the file in putty. Again, I think the subject of this question is confusing because later I found out my issue is not due to space, quotation mark or line break etc. Thanks again for all your advice.

Tom · Posted 01-11-2021 11:08 AM

To see how you can use the LIST statement to look at what is in your text file run this example:

filename csv temp;
data _null_;
  file csv ;
  put '1,2,3,"first part'
    / 'second part",5'
    / '6,7,8,"ok",10'
  ;
run;

data _null_;
  infile csv;
  input;
  list;
run;

data _null_;
  infile csv recfm=f lrecl=20 ;
  input;
  list;
run;

The first reading will show that the file has three "lines", none of which contain any characters that LIST considers as non-printing.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--
1         1,2,3,"first part 17
2         second part",5 14
3         6,7,8,"ok",10 13
NOTE: 3 records were read from the infile CSV.
      The minimum record length was 13.
      The maximum record length was 17.

The second reading will show you where the CR ('0D'x) and LF ('0A'x) appear in the file. So you can see that there are CRLF pairs in the middle of the first string between FIRST PART and SECOND PART. And also that the file is using CRLF pairs at the end of the lines.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--

1   CHAR  1,2,3,"first part..s
    ZONE  32323226677727677007
    NUMR  1C2C3C26923400124DA3

2   CHAR  econd part",5..6,7,8
    ZONE  66666276772230032323
    NUMR  53FE4001242C5DA6C7C8

3   CHAR  ,"ok",10.. 10
    ZONE  2266223300
    NUMR  C2FB2C10DA
NOTE: 3 records were read from the infile CSV.

Other common non-printing characters are TAB '09'x FORM FEED '0C'x non-breaking space 'A0'x.

How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Re: How to read csv file with space followed by quotation mark and line break

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!