SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Set all missing values as 'NA' even in columns whose format allows for one character only

Reply
Occasional Contributor
Posts: 6

Set all missing values as 'NA' even in columns whose format allows for one character only

[ Edited ]

Hey there, as the title says, handling many large and heterogeneous datasets has its difficulties. When applying this:

DATA &name.;  
 SET &name.;
    ARRAY ch(*) _character_;
    DO _n_ = 1 TO dim(ch);
        ATTRIB ch(_n_) INFORMAT=$32. FORMAT=$32.;
        ch(_n_) = coalescec(ch(_n_), 'NA');		/*replace empty cells with NA*/
        ch(_n_) = tranwrd(ch(_n_), "Unknown", 'NA'); 	/*replace "Unknown" entries with NA*/
        ch(_n_) = tranwrd(ch(_n_), '.', 'NA');		/*replace dots with NA*/
END;
RUN;

I am confident that all empty cells, all "Unknown" and all dots get replaced by an "NA".
However, sometimes only an "N" instead of "NA" will appear in a certain column since it is formatted to not have more than that one character. Is there a way to solve this issue elegantly? Maybe even a whole different approach? Thanks

Super User
Posts: 5,085

Re: Set all missing values as 'NA' even in columns whose format allows for one character only

If you want a variable to hold the characters "NA", it can't be defined as 1 character in length.

 

Fixes are not necessarily easy.  You could redefine the length of those variables, by adding a LENGTH statement before the SET statement.

 

You could apply a format to change the display (instead of changing the values):

 

proc format;

value $trouble '.' = 'NA';

run;

 

In that case, don't change the "." values.

Occasional Contributor
Posts: 6

Re: Set all missing values as 'NA' even in columns whose format allows for one character only

Thanks, sounds promising!
I am sorry though, I cannot follow you on everything.

If I were to add a LENGTH statement, wouldn't I be changing the LENGTH of all variables? Because that would not be okay, as only those who are <2 characters long need to be changed to 2 characters.

 

Also, what does $trouble do?

Thanks

Trusted Advisor
Posts: 1,115

Re: Set all missing values as 'NA' even in columns whose format allows for one character only

If I may step in here: With a suitable LENGTH statement you can define the length of each variable separately. But you have to list the names of the variables. (If needed, you could let SAS generate the list, e.g., of all character variables which have length 1, but that would require a bit of advanced code.)

 

$trouble is an arbitrary name of a user-defined character format which maps a single period to 'NA' (in the sense that a value '.' would be displayed as 'NA'). Two caveats:

  1. As it's defined here, the format has default length 2. So, for example put('ABCDE', $trouble.)='AB', but put('ABCDE', $trouble5.)='ABCDE'. (The format definition could be modified to specify a greater default length.)
  2. A character value such as '    .' (i.e. a period with one or more leading blanks) would not be formatted as 'NA'.

I don't know how the single periods got into your character variables, but a typical scenario for this situation is numeric-to-character conversion of missing values. In this case leading blanks occur quite frequently. So, you should make sure that those values are left-aligned before the format is applied.

 

Item 2 above would also interfere with your suggested code: A character variable of length 8 containing a right-aligned period, would be transformed to '       N' (with seven leading blanks) by your last assignment statement, not 'NA' or '      NA'.

 

Other issues might be that 'unknown', 'UNKNOWN' or abbreviations of it would not be replaced and 'Unknown value' would become 'NA value'.

 

But most importantly, the data step would not work at all, but fail with an error message, because the declarative ATTRIB statement cannot be applied to array elements in a DO loop. If you have a good reason for associating informat $32. and format $32. to all character variables of the input dataset (why would you want that?), you could do so by using the following statement after the SET statement:

attrib _character_ informat=$32. format=$32.;
Ask a Question
Discussion stats
  • 3 replies
  • 332 views
  • 1 like
  • 3 in conversation