Solved: Replacing missing values across multiple variables

tofov2 · Posted 03-09-2022 02:31 PM

Hello everyone,

I have imported a .csv file into SAS. From reviewing the data, missing data is reported predominantly reported as '.', which works for my analysis.

However, the dataset also has import values of 'NA' which also describes missing values. 'NA' occurs across different variables (>100). Moreover, this is causing my variable labels which should be numeric and best12 to be imported as characters and $2.

Is there a quick way to replace NA across multiple variables with '.' during the import or to have SAS recognize 'NA' as a missing variable as well?

Or, after import, is there a way to replace 'NA' across multiple variables to be '.' or handled as missing? I can reformat each variable manually afterwards.

Initial code is:

FILENAME CSV "home/myfile.csv" TERMSTR=LF;

PROC IMPORT DATAFILE=CSV OUT=WORK.MYCSV DBMS=CSV REPLACE; RUN;
PROC CONTENTS DATA=WORK.MYCSV; RUN;

Tom · Posted 03-09-2022 04:07 PM

Would it help to remove those NA values from that data? Do you have any characters variables that are expected the have the symbols for chemical elements in them where Sodium is possibly one of the chemicals?

FILENAME CSV "/home/userid/myfile.csv" TERMSTR=LF;
filename fixed temp TERMSTR=LF;
data _null_;
  infile csv termstr=lf ;
  file fixed termstr=lf ;
  input;
  _infile_=tranwrd(cats(',',_infile_,','),',NA,',',,');
  _infile_=substrn(_infile_,2,length(_infile_)-2);
  put _infile_;
run;

PROC IMPORT DATAFILE=fixed OUT=MYCSV DBMS=CSV REPLACE; 
RUN;
PROC CONTENTS DATA=MYCSV; 
RUN;

View solution in original post

Tom · Posted 03-09-2022 02:41 PM

The easiest way is to NOT use PROC IMPORT. Instead just write your own data step to read the CSV file. Then you can define that variables as numeric and the NA strings in the CSV will become missing values since they are not valid characters for a numeric variable.

Tom · Posted 03-09-2022 04:07 PM

Would it help to remove those NA values from that data? Do you have any characters variables that are expected the have the symbols for chemical elements in them where Sodium is possibly one of the chemicals?

FILENAME CSV "/home/userid/myfile.csv" TERMSTR=LF;
filename fixed temp TERMSTR=LF;
data _null_;
  infile csv termstr=lf ;
  file fixed termstr=lf ;
  input;
  _infile_=tranwrd(cats(',',_infile_,','),',NA,',',,');
  _infile_=substrn(_infile_,2,length(_infile_)-2);
  put _infile_;
run;

PROC IMPORT DATAFILE=fixed OUT=MYCSV DBMS=CSV REPLACE; 
RUN;
PROC CONTENTS DATA=MYCSV; 
RUN;

tofov2 · Posted 03-10-2022 11:03 AM

This is exactly what I was looking for! Thank you! Basically, I had a combination of several data sets. Some used '.' to report missing valves, while others assigned 'NA'. Removing NA worked perfectly. NA was not representative of sodium or anything else except a missing value.

Tom · Posted 03-10-2022 11:13 AM

Great.

The other place to be careful is Last Name fields.

https://forebears.io/surnames?q=NA

moni2 · Posted 03-16-2023 09:22 AM

A follow-up on this question (I am not the original poster but I have the same issue). I am using SAS 9.4.

I am also interested in removing all the "NA" values from my variables in MyFile.csv. I have several variables and they all have several NA values.

I tried the code from above (included below) but it sill contained NA and the variables that should be numeric have remained characters.

Can you help me understand what I am missing?

FILENAME CSV "C:\Users\.....\MyFile.csv" TERMSTR=LF;
FILENAME A TEMP TERMSTR=LF;
DATA _NULL_;
INFILE csv TERMSTR=lf ;
FILE A TERMSTR=lf ;
INPUT;
_infile_=tranwrd(cats(',',_infile_,','),',NA,',',,');
*_infile_=substrn(_infile_,2,length(_infile_)-2);
PUT _INFILE_;
RUN;

PROC IMPORT DATAFILE=A OUT=MYCSV DBMS=CSV REPLACE;
RUN;
PROC CONTENTS DATA=MYCSV;
RUN;

Tom · Posted 03-16-2023 09:29 AM

Look at the CSV file (make sure to look at it with a text editor and not a spreadsheet program).

The code you posted is looking for NA surrounded by commas. If the delimiter is not a comma it will not work. If the NA values are enclosed in quotes it will not work. If the NA is in mixed or lowercase it will not work. If the line length is larger than 32767 bytes it will not work.

ballardw · Posted 03-16-2023 10:32 AM

@moni2 wrote:

A follow-up on this question (I am not the original poster but I have the same issue). I am using SAS 9.4.

I am also interested in removing all the "NA" values from my variables in MyFile.csv. I have several variables and they all have several NA values.

I tried the code from above (included below) but it sill contained NA and the variables that should be numeric have remained characters.

Can you help me understand what I am missing?

FILENAME CSV "C:\Users\.....\MyFile.csv" TERMSTR=LF;
FILENAME A TEMP TERMSTR=LF;
DATA _NULL_;
INFILE csv TERMSTR=lf ;
FILE A TERMSTR=lf ;
INPUT;
_infile_=tranwrd(cats(',',_infile_,','),',NA,',',,');
*_infile_=substrn(_infile_,2,length(_infile_)-2);
PUT _INFILE_;
RUN;

PROC IMPORT DATAFILE=A OUT=MYCSV DBMS=CSV REPLACE;
RUN;
PROC CONTENTS DATA=MYCSV;
RUN;

First, on this forum we recommend that you start a new thread topic and if needed reference a similar one. As the creator of a thread you have some options not available as a responder such as marking a solution to YOUR specific problem.

Second, READ the whole thread. My suggested solution is a couple of posts down from yours that involves a DATA STEP to read the file with a custom Informat, or several if the variables involved are of different types.

Reliance on Proc Import, since each call is treated completely separate even for files of the same structure, means that variables can be of different lengths and even types and in worst case with poor source control different names. Which means that you may have issues when you go to combine data sets for any reason. Or even just run the same analysis on a different set.

ballardw · Posted 03-09-2022 04:34 PM

Personally I generally read CSV with a data step. I may use the code generated by Proc Import as a base but will change informats and such as suggested by @Tom . When there are code values such as the NA you mention I will create a custom INFORMAT that captures that information but still creates numeric variables with missing values so calculations don't get wonky.

Proc format;
invalue myna (upcase)
'NA' = .N
other = [f12.] 
;
run;

data example;
  infile datalines dlm=',' dsd;
  informat  x  y myna. ;
  input x y;
datalines;
1,1
na,2
.,3
,4
5,234.56
;

The above reads values of NA (or nA, Na, na) into the special missing .N . That way I know the value was read with the original value of NA and for purposes such as printing data could replace, using a custom format, .N with text such as "Not applicable" "Not Collected" "Anything I want". It differentiates between values that are just plain not in the data such as the third and 4th rows of the data above.

Note the use of the [ ] around an existing informat with the other to provide how the other values are read.

The use of such an informat, besides resulting in documented values, is that the log is "clean" with no invalid data messages as occur when reading a variable with a simple 8. (or BEST32. as Proc import generated code shows) that have known text like 'NA' occuring.

Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Re: Replacing missing values across multiple variables

Register Today!