BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
paulschm
Fluorite | Level 6

Hey there, as the title says, handling many large and heterogeneous datasets has its difficulties. When applying this:

DATA &name.;  
 SET &name.;
    ARRAY ch(*) _character_;
    DO _n_ = 1 TO dim(ch);
        ATTRIB ch(_n_) INFORMAT=$32. FORMAT=$32.;
        ch(_n_) = coalescec(ch(_n_), 'NA');		/*replace empty cells with NA*/
        ch(_n_) = tranwrd(ch(_n_), "Unknown", 'NA'); 	/*replace "Unknown" entries with NA*/
        ch(_n_) = tranwrd(ch(_n_), '.', 'NA');		/*replace dots with NA*/
END;
RUN;

I am confident that all empty cells, all "Unknown" and all dots get replaced by an "NA".
However, sometimes only an "N" instead of "NA" will appear in a certain column since it is formatted to not have more than that one character. Is there a way to solve this issue elegantly? Maybe even a whole different approach? Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

If I may step in here: With a suitable LENGTH statement you can define the length of each variable separately. But you have to list the names of the variables. (If needed, you could let SAS generate the list, e.g., of all character variables which have length 1, but that would require a bit of advanced code.)

 

$trouble is an arbitrary name of a user-defined character format which maps a single period to 'NA' (in the sense that a value '.' would be displayed as 'NA'). Two caveats:

  1. As it's defined here, the format has default length 2. So, for example put('ABCDE', $trouble.)='AB', but put('ABCDE', $trouble5.)='ABCDE'. (The format definition could be modified to specify a greater default length.)
  2. A character value such as '    .' (i.e. a period with one or more leading blanks) would not be formatted as 'NA'.

I don't know how the single periods got into your character variables, but a typical scenario for this situation is numeric-to-character conversion of missing values. In this case leading blanks occur quite frequently. So, you should make sure that those values are left-aligned before the format is applied.

 

Item 2 above would also interfere with your suggested code: A character variable of length 8 containing a right-aligned period, would be transformed to '       N' (with seven leading blanks) by your last assignment statement, not 'NA' or '      NA'.

 

Other issues might be that 'unknown', 'UNKNOWN' or abbreviations of it would not be replaced and 'Unknown value' would become 'NA value'.

 

But most importantly, the data step would not work at all, but fail with an error message, because the declarative ATTRIB statement cannot be applied to array elements in a DO loop. If you have a good reason for associating informat $32. and format $32. to all character variables of the input dataset (why would you want that?), you could do so by using the following statement after the SET statement:

attrib _character_ informat=$32. format=$32.;

View solution in original post

3 REPLIES 3
Astounding
PROC Star

If you want a variable to hold the characters "NA", it can't be defined as 1 character in length.

 

Fixes are not necessarily easy.  You could redefine the length of those variables, by adding a LENGTH statement before the SET statement.

 

You could apply a format to change the display (instead of changing the values):

 

proc format;

value $trouble '.' = 'NA';

run;

 

In that case, don't change the "." values.

paulschm
Fluorite | Level 6

Thanks, sounds promising!
I am sorry though, I cannot follow you on everything.

If I were to add a LENGTH statement, wouldn't I be changing the LENGTH of all variables? Because that would not be okay, as only those who are <2 characters long need to be changed to 2 characters.

 

Also, what does $trouble do?

Thanks

FreelanceReinh
Jade | Level 19

If I may step in here: With a suitable LENGTH statement you can define the length of each variable separately. But you have to list the names of the variables. (If needed, you could let SAS generate the list, e.g., of all character variables which have length 1, but that would require a bit of advanced code.)

 

$trouble is an arbitrary name of a user-defined character format which maps a single period to 'NA' (in the sense that a value '.' would be displayed as 'NA'). Two caveats:

  1. As it's defined here, the format has default length 2. So, for example put('ABCDE', $trouble.)='AB', but put('ABCDE', $trouble5.)='ABCDE'. (The format definition could be modified to specify a greater default length.)
  2. A character value such as '    .' (i.e. a period with one or more leading blanks) would not be formatted as 'NA'.

I don't know how the single periods got into your character variables, but a typical scenario for this situation is numeric-to-character conversion of missing values. In this case leading blanks occur quite frequently. So, you should make sure that those values are left-aligned before the format is applied.

 

Item 2 above would also interfere with your suggested code: A character variable of length 8 containing a right-aligned period, would be transformed to '       N' (with seven leading blanks) by your last assignment statement, not 'NA' or '      NA'.

 

Other issues might be that 'unknown', 'UNKNOWN' or abbreviations of it would not be replaced and 'Unknown value' would become 'NA value'.

 

But most importantly, the data step would not work at all, but fail with an error message, because the declarative ATTRIB statement cannot be applied to array elements in a DO loop. If you have a good reason for associating informat $32. and format $32. to all character variables of the input dataset (why would you want that?), you could do so by using the following statement after the SET statement:

attrib _character_ informat=$32. format=$32.;

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 4811 views
  • 1 like
  • 3 in conversation