Hi,
This is a data manipulation exercise. How do I include 0 counts for possible values that aren't in the data set?
The possible values are the colors of the rainbow, ROYGBIV. (red orange yellow green blue indigo violet). Currently indigo is missing.
Thanks
data have;
infile datalines dsd dlm=",";
input color $;
datalines;
blue
red
red
blue
green
red
orange
red
blue
violet
red
orange
yellow
;
run;
Desired output: (sort order doesn't matter)
Try this
data have;
infile datalines dsd dlm=",";
input color $;
datalines;
blue
red
red
blue
green
red
orange
red
blue
violet
red
orange
yellow
;
proc format;
value $ col "red" = "red"
"orange" = "orange"
"yellow" = "yellow"
"green" = "green"
"blue" = "blue"
"indigo" = "indigo"
"violet" = "violet"
;
run;
proc summary data = have nway completetypes;
class color / preloadfmt order = formated missing;
format color $col.;
output out = want(drop = _TYPE_ rename = _FREQ_ = count);
run;
Try this
data have;
infile datalines dsd dlm=",";
input color $;
datalines;
blue
red
red
blue
green
red
orange
red
blue
violet
red
orange
yellow
;
proc format;
value $ col "red" = "red"
"orange" = "orange"
"yellow" = "yellow"
"green" = "green"
"blue" = "blue"
"indigo" = "indigo"
"violet" = "violet"
;
run;
proc summary data = have nway completetypes;
class color / preloadfmt order = formated missing;
format color $col.;
output out = want(drop = _TYPE_ rename = _FREQ_ = count);
run;
Using a format, as shown by @PeterClemmensen , is the recommended way, because just one more step is required.
@andreas_lds wrote:
Using a format, as shown by @PeterClemmensen , is the recommended way, because just one more step is required.
I really do appreciate the two of you sharing your expertise here. ... but isn't the solution using a proc format not more like using a side effect? Isn't there any "official" way of addressing this issue?
@PaigeMiller wrote:
- In SAS, there are often many ways to get to the desired result. None of them are "official". One may be easier than another; one may require less code than another; one may execute faster than another.
That is understandable ... could you name some advantages of the solution using "proc format"? This time just to help me ... 🙂 (as there is an accepted solution already).
@fja wrote:... could you name some advantages of the solution using "proc format"?
I think in many practical use cases, there is a suitable format already available and associated with the categorical variable in question. In this situation, neither the PROC FORMAT step nor the FORMAT statement as shown in PeterClemmensen's solution would be needed. (The existence of a suitable dataset for the CLASSDATA= option is probably less common.) Typically, this format would map numeric or short character codes to longer text descriptions of the categories.
A special feature of the PRELOADFMT approach is available when used in conjunction with the ORDER=DATA option of the CLASS statement (of PROC MEANS, PROC SUMMARY or PROC TABULATE): If the format definition used the NOTSORTED option of PROC FORMAT's VALUE statement, the order of categories in the output will match the order from the format definition -- regardless of the (alphabetic) order of the formatted values, the default order of the unformatted values, the order in which they occur in the input dataset (!) and their frequencies. This is very useful if the PROC FORMAT code was written in view of the output specifications (e.g., table shells in a statistical analysis plan).
With PROC SUMMARY and PROC MEANS you can also use a CLASSDATA= dataset which contains (at least) the missing categories.
Simplified example (creating only printed output with a different header of the count column) using PROC MEANS:
data cd;
color='indigo ';
run;
proc means data=have classdata=cd;
class color;
run;
(Note the two trailing blanks in 'indigo ' to make variable COLOR the same length as in dataset HAVE.)
Thanks, FreelanceReinhard!
This is a very clean way of doing it.
Edit: this post was edited to remove some follow up questions that i had.
@Hello_there wrote:
I edited this post, but originally I had questions about how i would make this method more robust. Especially in particular cases where the data set was updated and the counts changed and if there was a way to macrotize the length value of the data set so i can use it for the cd data set. But after thinking about it some more, my use case involves knowing what the categories are beforehand and I would already know what the maximum length extends out to so it would be ok to hardcode that in the data set cd.
As long as the CD dataset contains all CLASS variable values that would be absent otherwise, the approach should work regardless of changed counts in dataset HAVE. It wouldn't hurt if CD redundantly contained values which are present in HAVE as well.
There is no need to hardcode the length of variable COLOR in dataset CD as you can always retrieve it from dataset HAVE. (Edit: ... assuming that the length there is also sufficient to accommodate the new values in CD.)
Example:
data cd;
if 0 then set have(keep=color);
input color;
cards;
indigo
;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.