SAS Programming

ginak · Posted 09-05-2018 03:10 PM

Hello,

Sometimes I receive .csv files from folks who work in R and change everything that's missing to "NA." What's an easy way to import the .csv while telling SAS to treat "NA" as ".m"? For example,

data have;	
input ID status (start end) (:mmddyy10.) number;
format start end yymmdd10. number ;
cards;
1A	active 	01/02/2011 02/01/2011      9495552233
1A	revoked    NA               NA     9495523344
1A	active 	05/01/2013	06/03/2016 (949)553-4488
;

Except in my data, these character variables appear deep into the data.The data I imported could look something like this and Ireceived an error because of the second line and third line, since SAS assumed two dates and a numeric variable, only to be surprised with characters in those. I usually go into the proc import and manually change each variable to a character to avoid errors, but then it's cumbersome to have to go back and change those back to numeric manually, especially when I have a lot of variables.

Thanks!

Best,

Gina

Tom · Posted 09-05-2018 03:43 PM

One way is to just pre-process the data to remove the NA. That is really easy if you know the number of columns in the source file. So if the file has 20 columns and the longest possible value is 200 characters then a program like this will convert all the NA to blanks. Then SAS will properly read those a missing. If you really wanted to treat them as some type of special missing then convert them into a single letter instead. But why would need to have missing and NA as meaning two different things?

data _null_;
   infile 'from_r.csv' dsd truncover ;
   file 'to_sas.csv' dsd ;
   length word $200 ;
   do i=1 to 20;
      input word @;
      if word='NA' then word=' ';
      put word @;
   end;
   put;
run;

Then you could still try using PROC IMPORT to guess what is actually in the file. Note that it is not hard to figure out how many variables are in the file and put that into a macro variable.

data _null_;
  infile 'from_r.csv' obs=1;
  input;
  call symputx('ncol',countw(_infile_,',','q'));
run;

Another method is to make your own informats to handle the NA strings.

proc format;
invalue from_r
 'NA' = .
other = [comma32.]
;
invalue from_r_d
 'NA' = .
other = [anydtdte32.]
;
run;
data want ;
  infile 'from_r.csv' dsd truncover firstobs=2;
  length col1 $20 col2 8 col3 8;
  informat _numeric_ from_r.;
  informat col3 from_rd. ;
  format col3 yymmdd10.;
run;

View solution in original post

PGStats · Posted 09-05-2018 03:27 PM

Read _INFILE_ buffer, replace NA with . and read again :

data have;
infile datalines dsd;
input @;
_infile_ = transtrn(_infile_, ",NA", ",."); /* Replace NA with . */
input @1 ID $ status $ (start end) (:mmddyy10.);
format start end yymmdd10.;
datalines;
1A,active,01/02/2011,02/01/2011
1A,revoked,NA,NA
1A,active,05/01/2013,06/03/2016
;

PG

ginak · Posted 09-05-2018 03:33 PM

Thank you!! This is very helpful!

If I were to use the proc import code, where could I denote that?

 data MAIN2.combined_dat ;
   %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
   infile 'B:\Gina\combined_data.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
      informat ID best32. ;
      informat phoneNumber best32. ;
      informat strt_dt mmddyy10. ;
      informat end_dt mmddyy10. ;
.....

ballardw · Posted 09-05-2018 05:16 PM

@ginak wrote:

Thank you!! This is very helpful!

If I were to use the proc import code, where could I denote that?

 data MAIN2.combined_dat ;
   %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
   infile 'B:\Gina\combined_data.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
      informat ID best32. ;
      informat phoneNumber best32. ;
      informat strt_dt mmddyy10. ;
      informat end_dt mmddyy10. ;
.....

To use a custom informat you need to use a data step. Proc Import is not able to use such.

If you have multiple files that should have the same structure and always read them with Proc Import you will eventually have an issue with one or more variables changing data type or character variables with different lengths causing issues unless you are very lucky. Proc Import guesses as to variable types, lengths and informats. It can be pretty good but for this specific issue you need to use a data step to use either of the approaches.

Tom · Posted 09-05-2018 03:47 PM

Watch out for real values that happen to start with NA.

_infile_=tranwrd(cats(',',_infile_,','),',NA,',',,');
_infile_=substrn(_infile_,2,length(_infile_)-2);

Tom · Posted 09-05-2018 03:43 PM

One way is to just pre-process the data to remove the NA. That is really easy if you know the number of columns in the source file. So if the file has 20 columns and the longest possible value is 200 characters then a program like this will convert all the NA to blanks. Then SAS will properly read those a missing. If you really wanted to treat them as some type of special missing then convert them into a single letter instead. But why would need to have missing and NA as meaning two different things?

data _null_;
   infile 'from_r.csv' dsd truncover ;
   file 'to_sas.csv' dsd ;
   length word $200 ;
   do i=1 to 20;
      input word @;
      if word='NA' then word=' ';
      put word @;
   end;
   put;
run;

Then you could still try using PROC IMPORT to guess what is actually in the file. Note that it is not hard to figure out how many variables are in the file and put that into a macro variable.

data _null_;
  infile 'from_r.csv' obs=1;
  input;
  call symputx('ncol',countw(_infile_,',','q'));
run;

Another method is to make your own informats to handle the NA strings.

proc format;
invalue from_r
 'NA' = .
other = [comma32.]
;
invalue from_r_d
 'NA' = .
other = [anydtdte32.]
;
run;
data want ;
  infile 'from_r.csv' dsd truncover firstobs=2;
  length col1 $20 col2 8 col3 8;
  informat _numeric_ from_r.;
  informat col3 from_rd. ;
  format col3 yymmdd10.;
run;

ginak · Posted 09-05-2018 05:15 PM

Hi Tom,

Thank you! I ended up doing what you first suggested:

data _null_;
   infile 'B:\Gina\combineddata.csv' dsd truncover ;
   file 'B:\Gina\to_sas.csv' dsd ;
   length word $200 ;
   do i=1 to 140;
      input word @;
      if word='NA' then word=' ';
      put word @;
   end;
   put;
run;

I had 140 variables. This worked! I then used proc import to import my data, and am going through to change the phone numbers to character (most are in 8005525522 format, but there's some that look like (800)555-3344 and it's throwing SAS off). I wonder if there is a better way to do this and have SAS scan more records sot hat it can see that they're longer than $1 or something.
Thanks for your help!

Tom · Posted 09-05-2018 07:55 PM

You can add the GUESSINGROWS=MAX statement to your PROC IMPORT step.

But a better solution would be use to use a format that preserves the data types. There are packages for for R that will allow you write out a SAS dataset or at least a SAS transport file.

SAS Programming

Importing a .CSV file from R with NA's as missing

Re: Importing a .CSV file from R with NA's as missing

Re: Importing a .CSV file from R with NA's as missing

Re: Importing a .CSV file from R with NA's as missing

Re: Importing a .CSV file from R with NA's as missing

Re: Importing a .CSV file from R with NA's as missing

Re: Importing a .CSV file from R with NA's as missing

Re: Importing a .CSV file from R with NA's as missing

Re: Importing a .CSV file from R with NA's as missing

Follow Us

What is...

SAS Programming

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...