Hey there, as the title says, handling many large and heterogeneous datasets has its difficulties. When applying this:
DATA &name.;
SET &name.;
ARRAY ch(*) _character_;
DO _n_ = 1 TO dim(ch);
ATTRIB ch(_n_) INFORMAT=$32. FORMAT=$32.;
ch(_n_) = coalescec(ch(_n_), 'NA'); /*replace empty cells with NA*/
ch(_n_) = tranwrd(ch(_n_), "Unknown", 'NA'); /*replace "Unknown" entries with NA*/
ch(_n_) = tranwrd(ch(_n_), '.', 'NA'); /*replace dots with NA*/
END;
RUN;
I am confident that all empty cells, all "Unknown" and all dots get replaced by an "NA".
However, sometimes only an "N" instead of "NA" will appear in a certain column since it is formatted to not have more than that one character. Is there a way to solve this issue elegantly? Maybe even a whole different approach? Thanks
If I may step in here: With a suitable LENGTH statement you can define the length of each variable separately. But you have to list the names of the variables. (If needed, you could let SAS generate the list, e.g., of all character variables which have length 1, but that would require a bit of advanced code.)
$trouble is an arbitrary name of a user-defined character format which maps a single period to 'NA' (in the sense that a value '.' would be displayed as 'NA'). Two caveats:
I don't know how the single periods got into your character variables, but a typical scenario for this situation is numeric-to-character conversion of missing values. In this case leading blanks occur quite frequently. So, you should make sure that those values are left-aligned before the format is applied.
Item 2 above would also interfere with your suggested code: A character variable of length 8 containing a right-aligned period, would be transformed to ' N' (with seven leading blanks) by your last assignment statement, not 'NA' or ' NA'.
Other issues might be that 'unknown', 'UNKNOWN' or abbreviations of it would not be replaced and 'Unknown value' would become 'NA value'.
But most importantly, the data step would not work at all, but fail with an error message, because the declarative ATTRIB statement cannot be applied to array elements in a DO loop. If you have a good reason for associating informat $32. and format $32. to all character variables of the input dataset (why would you want that?), you could do so by using the following statement after the SET statement:
attrib _character_ informat=$32. format=$32.;
If you want a variable to hold the characters "NA", it can't be defined as 1 character in length.
Fixes are not necessarily easy. You could redefine the length of those variables, by adding a LENGTH statement before the SET statement.
You could apply a format to change the display (instead of changing the values):
proc format;
value $trouble '.' = 'NA';
run;
In that case, don't change the "." values.
Thanks, sounds promising!
I am sorry though, I cannot follow you on everything.
If I were to add a LENGTH statement, wouldn't I be changing the LENGTH of all variables? Because that would not be okay, as only those who are <2 characters long need to be changed to 2 characters.
Also, what does $trouble do?
Thanks
If I may step in here: With a suitable LENGTH statement you can define the length of each variable separately. But you have to list the names of the variables. (If needed, you could let SAS generate the list, e.g., of all character variables which have length 1, but that would require a bit of advanced code.)
$trouble is an arbitrary name of a user-defined character format which maps a single period to 'NA' (in the sense that a value '.' would be displayed as 'NA'). Two caveats:
I don't know how the single periods got into your character variables, but a typical scenario for this situation is numeric-to-character conversion of missing values. In this case leading blanks occur quite frequently. So, you should make sure that those values are left-aligned before the format is applied.
Item 2 above would also interfere with your suggested code: A character variable of length 8 containing a right-aligned period, would be transformed to ' N' (with seven leading blanks) by your last assignment statement, not 'NA' or ' NA'.
Other issues might be that 'unknown', 'UNKNOWN' or abbreviations of it would not be replaced and 'Unknown value' would become 'NA value'.
But most importantly, the data step would not work at all, but fail with an error message, because the declarative ATTRIB statement cannot be applied to array elements in a DO loop. If you have a good reason for associating informat $32. and format $32. to all character variables of the input dataset (why would you want that?), you could do so by using the following statement after the SET statement:
attrib _character_ informat=$32. format=$32.;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.