03-01-2016 08:29 AM - edited 03-01-2016 08:30 AM
Hey there, as the title says, handling many large and heterogeneous datasets has its difficulties. When applying this:
DATA &name.; SET &name.; ARRAY ch(*) _character_; DO _n_ = 1 TO dim(ch); ATTRIB ch(_n_) INFORMAT=$32. FORMAT=$32.; ch(_n_) = coalescec(ch(_n_), 'NA'); /*replace empty cells with NA*/ ch(_n_) = tranwrd(ch(_n_), "Unknown", 'NA'); /*replace "Unknown" entries with NA*/ ch(_n_) = tranwrd(ch(_n_), '.', 'NA'); /*replace dots with NA*/ END; RUN;
I am confident that all empty cells, all "Unknown" and all dots get replaced by an "NA".
However, sometimes only an "N" instead of "NA" will appear in a certain column since it is formatted to not have more than that one character. Is there a way to solve this issue elegantly? Maybe even a whole different approach? Thanks
03-01-2016 08:38 AM
If you want a variable to hold the characters "NA", it can't be defined as 1 character in length.
Fixes are not necessarily easy. You could redefine the length of those variables, by adding a LENGTH statement before the SET statement.
You could apply a format to change the display (instead of changing the values):
value $trouble '.' = 'NA';
In that case, don't change the "." values.
03-01-2016 08:49 AM
Thanks, sounds promising!
I am sorry though, I cannot follow you on everything.
If I were to add a LENGTH statement, wouldn't I be changing the LENGTH of all variables? Because that would not be okay, as only those who are <2 characters long need to be changed to 2 characters.
Also, what does $trouble do?
03-01-2016 03:14 PM
If I may step in here: With a suitable LENGTH statement you can define the length of each variable separately. But you have to list the names of the variables. (If needed, you could let SAS generate the list, e.g., of all character variables which have length 1, but that would require a bit of advanced code.)
$trouble is an arbitrary name of a user-defined character format which maps a single period to 'NA' (in the sense that a value '.' would be displayed as 'NA'). Two caveats:
I don't know how the single periods got into your character variables, but a typical scenario for this situation is numeric-to-character conversion of missing values. In this case leading blanks occur quite frequently. So, you should make sure that those values are left-aligned before the format is applied.
Item 2 above would also interfere with your suggested code: A character variable of length 8 containing a right-aligned period, would be transformed to ' N' (with seven leading blanks) by your last assignment statement, not 'NA' or ' NA'.
Other issues might be that 'unknown', 'UNKNOWN' or abbreviations of it would not be replaced and 'Unknown value' would become 'NA value'.
But most importantly, the data step would not work at all, but fail with an error message, because the declarative ATTRIB statement cannot be applied to array elements in a DO loop. If you have a good reason for associating informat $32. and format $32. to all character variables of the input dataset (why would you want that?), you could do so by using the following statement after the SET statement:
attrib _character_ informat=$32. format=$32.;