BookmarkSubscribeRSS Feed
LL5
Pyrite | Level 9 LL5
Pyrite | Level 9

Hi All, 

I have a data step as below to read a csv file, but the memo field for some records jump to the next row in the output dataset.

I open the csv file in notepad and see the memo field seems to have double space before the quotation mark (not sure if that is the reason). See below data.

I also tried TERMSTR=CRLF or TERMSTR=LF but I am still seeing the memo filed jumps to the next row.

Given there is not an option to fix the data quality from the upstream, is there a way to resolve this in the data step? 

 

Johnson and Johnson,REF35847,Monthly,159,"Sandigo, Luke [INVENTORY MANAGER]","Meet - Standard - Year1/Quarter1 - 35153194 stock 301463
INV#12011088  ",5

 

data inventory;
infile "/warehouse/users/Jack/inventory/inventory_20201130.csv" dlm= ',' missover DSD lrecl=32767 firstobs=2 ;
  input
    CompanyName      : $300.      
    Ref_ID                    : $50.
    Freq                        : $10.
    Value                       : 4.
    Salesperson            : $100.
    Memo                       :$200.
    ID                            :3.
;
run;

 

 

 

14 REPLIES 14
PGStats
Opal | Level 21

Try removing the missover option.

PG
LL5
Pyrite | Level 9 LL5
Pyrite | Level 9

Thanks for offering your suggestion. Without the missover, the memo value stays in the same row but part of the value moves to the next column (ID), and ID moves to the next row. 

I suspect fixing the data quality from upstream might be the only way.

PGStats
Opal | Level 21

Fixing the issue upstream would be the ideal solution. But if that can't be done, try posting a few lines from the csv file here, so that we can give it a better try.

PG
LL5
Pyrite | Level 9 LL5
Pyrite | Level 9

Thanks PG. I realized there is embedded CRLF in the text after stock 301463, so INV#12011088 moves to the next row when SAS reads the data. I don't know if that is something fixable in SAS code. 

LL5
Pyrite | Level 9 LL5
Pyrite | Level 9

Thanks Tom. I saw those posts prior to asking my question on the forum. Later I found out my issue is not due to double quotation mark, extra space or line break, etc.

Embedded CRLF is found in the middle of the value for one filed (from the csv file I have), after reading it from SAS, it seems the embedded CRLF breaks the value into two separate parts - 1st part stays in the same row, remaining columns become missing, then 2nd part of the value jumps to next row. 

We couldn't resolve this issue in the SAS code and had to explore alternative solution.

Tom
Super User Tom
Super User

The linked post is a solution to handling embedded end of line characters in delimited files. As long as the fields that contain the extra end of line characters are enclosed in double quotes it should work to transform the file into a file that SAS can read.

 

But the best solution is to fix the problem at the source and generate delimited files that do not contain end of line characters anywhere other than at the end of the lines.

LL5
Pyrite | Level 9 LL5
Pyrite | Level 9
The end of life character is invisible in the csv file. I looked at the specific row of data in unix and found the CRLF in the middle of the record.
Tom
Super User Tom
Super User

If you have no other tools for examining text files the SAS data step will work just fine.

data _null_;
   infile 'myfile.csv' ;
   input;
   list;
run;

If you really want to see what the end of line character is and if there are any in the middle of lines then you can read the file a fixed length and the end of line characters are treated as part of the data.

data _null_;
   infile 'myfile.csv'  refcm=f lrecl=100 ;
   input;
   list;
run;
LL5
Pyrite | Level 9 LL5
Pyrite | Level 9
I got your point. I though about using fixed length and treat the special/invisible character as part of the data, but I can’t be exactly sure how many invisible characters are there and where are they. I have around 100k obs and 125 variable.
The csv file is an output file/downstream from another upstream process. I suspect upstream data provider copied and pasted some texts from the website that carries CRLF.
Tom
Super User Tom
Super User

No.  Use the fixed length just to LOOK at the file so you can figure out what garbage you have and how you can read it.

LL5
Pyrite | Level 9 LL5
Pyrite | Level 9

The "garbage" is invisible neither in csv nor sas dataset. I tried the fix length approach but it did not do anything in this situation. The "garbage" is only visible when we use unix command to read the file in putty. Again, I think the subject of this question is confusing because later I found out my issue is not due to space, quotation mark or line break etc. Thanks again for all your advice. 

Tom
Super User Tom
Super User

To see how you can use the LIST statement to look at what is in your text file run this example:

filename csv temp;
data _null_;
  file csv ;
  put '1,2,3,"first part'
    / 'second part",5'
    / '6,7,8,"ok",10'
  ;
run;

data _null_;
  infile csv;
  input;
  list;
run;

data _null_;
  infile csv recfm=f lrecl=20 ;
  input;
  list;
run;

The first reading will show that the file has three "lines", none of which contain any characters that LIST considers as non-printing.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--
1         1,2,3,"first part 17
2         second part",5 14
3         6,7,8,"ok",10 13
NOTE: 3 records were read from the infile CSV.
      The minimum record length was 13.
      The maximum record length was 17.

The second reading will show you where the CR ('0D'x) and LF ('0A'x) appear in the file.  So you can see that there are CRLF pairs in the middle of the first string between FIRST PART and SECOND PART. And also that the file is using CRLF pairs at the end of the lines.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--

1   CHAR  1,2,3,"first part..s
    ZONE  32323226677727677007
    NUMR  1C2C3C26923400124DA3

2   CHAR  econd part",5..6,7,8
    ZONE  66666276772230032323
    NUMR  53FE4001242C5DA6C7C8

3   CHAR  ,"ok",10.. 10
    ZONE  2266223300
    NUMR  C2FB2C10DA
NOTE: 3 records were read from the infile CSV.

Other common non-printing characters are TAB '09'x FORM FEED '0C'x non-breaking space 'A0'x.

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1662 views
  • 4 likes
  • 3 in conversation