BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Hello_there
Lapis Lazuli | Level 10

Hi,

 

This is a data manipulation exercise. How do I include 0 counts for possible values that aren't in the data set?

The possible values are the colors of the rainbow, ROYGBIV. (red orange yellow green blue indigo violet). Currently indigo is missing.

Thanks

data have;
infile datalines dsd dlm=",";
	input color $;
datalines;
blue
red
red
blue 
green
red
orange
red
blue
violet
red
orange
yellow
;
run;

Desired output: (sort order doesn't matter)

Hello_there_0-1672249886464.png

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Try this

 

data have;
infile datalines dsd dlm=",";
input color $;
datalines;
blue   
red    
red    
blue   
green  
red    
orange 
red    
blue   
violet 
red    
orange 
yellow 
;

proc format;
 value $ col   "red"     = "red" 
               "orange"  = "orange" 
               "yellow"  = "yellow" 
               "green"   = "green" 
               "blue"    = "blue" 
               "indigo"  = "indigo" 
               "violet"  = "violet" 
;
run;

proc summary data = have nway completetypes;
   class color / preloadfmt order = formated missing;
   format color $col.;
   output out = want(drop = _TYPE_ rename = _FREQ_ = count);
run;

View solution in original post

16 REPLIES 16
fja
Lapis Lazuli | Level 10 fja
Lapis Lazuli | Level 10
Would you please have a look at this posting: https://communities.sas.com/t5/SAS-Programming/Return-count-of-0-in-a-Group-By-SQL-Statement/m-p/539...
This should be applicable in your case too ...
Hello_there
Lapis Lazuli | Level 10
Thanks for replying, but this solution is slightly different because my data set does not include the possible value that I need.

This solution in the posted link appears to dynamically code the solution based on creating possible combinations from all distinct cross tabulated values of the two groups in the data set.
PeterClemmensen
Tourmaline | Level 20

Try this

 

data have;
infile datalines dsd dlm=",";
input color $;
datalines;
blue   
red    
red    
blue   
green  
red    
orange 
red    
blue   
violet 
red    
orange 
yellow 
;

proc format;
 value $ col   "red"     = "red" 
               "orange"  = "orange" 
               "yellow"  = "yellow" 
               "green"   = "green" 
               "blue"    = "blue" 
               "indigo"  = "indigo" 
               "violet"  = "violet" 
;
run;

proc summary data = have nway completetypes;
   class color / preloadfmt order = formated missing;
   format color $col.;
   output out = want(drop = _TYPE_ rename = _FREQ_ = count);
run;
andreas_lds
Jade | Level 19

Using a format, as shown by @PeterClemmensen , is the recommended way, because just one more step is required.

fja
Lapis Lazuli | Level 10 fja
Lapis Lazuli | Level 10

@andreas_lds wrote:

Using a format, as shown by @PeterClemmensen , is the recommended way, because just one more step is required.


I really do appreciate the two of you sharing your expertise here. ... but isn't the solution using a  proc format not more like using a side effect? Isn't there any "official" way of addressing this issue?

PaigeMiller
Diamond | Level 26
  • In SAS, there are often many ways to get to the desired result. None of them are "official". One may be easier than another; one may require less code than another; one may execute faster than another.
--
Paige Miller
fja
Lapis Lazuli | Level 10 fja
Lapis Lazuli | Level 10

@PaigeMiller wrote:
  • In SAS, there are often many ways to get to the desired result. None of them are "official". One may be easier than another; one may require less code than another; one may execute faster than another.

That is understandable ... could you name some advantages of the solution using "proc format"? This time just to help me ... 🙂  (as there is an accepted solution already).

FreelanceReinh
Jade | Level 19

@fja wrote:

 ... could you name some advantages of the solution using "proc format"?


I think in many practical use cases, there is a suitable format already available and associated with the categorical variable in question. In this situation, neither the PROC FORMAT step nor the FORMAT statement as shown in PeterClemmensen's solution would be needed. (The existence of a suitable dataset for the CLASSDATA= option is probably less common.) Typically, this format would map numeric or short character codes to longer text descriptions of the categories.

 

A special feature of the PRELOADFMT approach is available when used in conjunction with the ORDER=DATA option of the CLASS statement (of PROC MEANS, PROC SUMMARY or PROC TABULATE): If the format definition used the NOTSORTED option of PROC FORMAT's VALUE statement, the order of categories in the output will match the order from the format definition -- regardless of the (alphabetic) order of the formatted values, the default order of the unformatted values, the order in which they occur in the input dataset (!) and their frequencies. This is very useful if the PROC FORMAT code was written in view of the output specifications (e.g., table shells in a statistical analysis plan).

FreelanceReinh
Jade | Level 19

With PROC SUMMARY and PROC MEANS you can also use a CLASSDATA= dataset which contains (at least) the missing categories.

 

Simplified example (creating only printed output with a different header of the count column) using PROC MEANS:

data cd;
color='indigo  ';
run;

proc means data=have classdata=cd;
class color;
run;

(Note the two trailing blanks in 'indigo  ' to make variable COLOR the same length as in dataset HAVE.)

Hello_there
Lapis Lazuli | Level 10

Thanks, FreelanceReinhard!

 

This is a very clean way of doing it.

 

Edit: this post was edited to remove some follow up questions that i had.

Hello_there
Lapis Lazuli | Level 10
I edited this post, but originally I had questions about how i would make this method more robust. Especially in particular cases where the data set was updated and the counts changed and if there was a way to macrotize the length value of the data set so i can use it for the cd data set. But after thinking about it some more, my use case involves knowing what the categories are beforehand and I would already know what the maximum length extends out to so it would be ok to hardcode that in the data set cd.
FreelanceReinh
Jade | Level 19

@Hello_there wrote:
I edited this post, but originally I had questions about how i would make this method more robust. Especially in particular cases where the data set was updated and the counts changed and if there was a way to macrotize the length value of the data set so i can use it for the cd data set. But after thinking about it some more, my use case involves knowing what the categories are beforehand and I would already know what the maximum length extends out to so it would be ok to hardcode that in the data set cd.

As long as the CD dataset contains all CLASS variable values that would be absent otherwise, the approach should work regardless of changed counts in dataset HAVE. It wouldn't hurt if CD redundantly contained values which are present in HAVE as well.

 

There is no need to hardcode the length of variable COLOR in dataset CD as you can always retrieve it from dataset HAVE. (Edit: ... assuming that the length there is also sufficient to accommodate the new values in CD.)

 

Example:

data cd;
if 0 then set have(keep=color);
input color;
cards;
indigo
;

 

Hello_there
Lapis Lazuli | Level 10
Thanks, again.

Could you explain what is 0 in the "if 0 then set have (keep=color);" part?

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 16 replies
  • 3932 views
  • 11 likes
  • 6 in conversation