Hello SAS community,
I have a large dataset with lots of variables - see a sample below
Data have;
input id brain heart liver stomack llung lkidney lovary rlung rkidney rovary ucolon lcolon;
cards
;
1 1 0 1 0 1 0 0 0 1 0 1 1
2 0 0 0 1 0 1 0 0 0 1 0 1
3 0 1 0 0 1 0 1 1 1 0 0 0
4 0 0 0 0 0 1 0 0 0 0 0 1
5 1 1 1 1 1 1 1 1 1 1 1 1
6 0 0 1 0 0 0 0 0 1 0 1 1
7 0 0 1 1 0 0 0 0 0 0 0 0
8 1 1 1 1 0 0 0 1 1 1 0 1
9 0 1 0 1 0 0 0 0 0 0 0 0
10 0 0 0 0 1 1 1 1 0 1 1 0
;
run;
desired output
Body part % with XXX=1
Brain
Heart
Liver
Stomack
Lung
Kidney
Ovary
Colon
Appreciate any help!
Maggie
Do you expect the totals for variables like Lkidney and Rkidney as a single value?
You really aren't clear whether you need a data set or report. I am providing a report done one of the many possible:
Data have; input id brain heart liver stomack llung lkidney lovary rlung rkidney rovary ucolon lcolon; cards ; 1 1 0 1 0 1 0 0 0 1 0 1 1 2 0 0 0 1 0 1 0 0 0 1 0 1 3 0 1 0 0 1 0 1 1 1 0 0 0 4 0 0 0 0 0 1 0 0 0 0 0 1 5 1 1 1 1 1 1 1 1 1 1 1 1 6 0 0 1 0 0 0 0 0 1 0 1 1 7 0 0 1 1 0 0 0 0 0 0 0 0 8 1 1 1 1 0 0 0 1 1 1 0 1 9 0 1 0 1 0 0 0 0 0 0 0 0 10 0 0 0 0 1 1 1 1 0 1 1 0 ; run; /* assumes data have is sorted by ID AND that there are no duplicate values of ID */ proc transpose data=have out=trans; by id; run; proc format; value $bodyparts 'llung','rlung' ='Lung' 'lkidney','rkidney'='Kidney' 'lovary','rovary' ='Ovary' 'brain' ='Brain' 'heart' ='Heart' 'liver' ='Liver' 'stomack' ='Stomack' 'ucolon','lcolon' ='Colon' ; run; Proc tabulate data=trans; class _name_; format _name_ $bodyparts.; var col1; table _name_=' ', col1=' '*(Sum='Count'*f=best6. mean='%'*f=percent8.1) ; run;
Caution: The format assumes the current names of the variables are exactly as you have provided them here (stomach is the more typical English spelling if I recall correctly) and including the case. The defaults of the Proc Transpose create a variable named "_name_" containing the value of the name. You may need to double check spelling, upper and lower case included, to make the format text on the left side of each = to match your data. The default will also return the numeric value in Col1. These names can be changed but for an example like this should suffice.
Formats assigning common values to multiples or groups of values create groups that will be honored by most of the SAS report, analysis and graphing procedures. So I do not change any actual values of the _name_ but use the custom format to create the groups (and provide capitalization of the result).
When you have variables that take a value of 1/0 then the SUM statistic gives you the count of 1's and the Mean is a decimal percentage OF the observations with values. Missing values will be excluded for both the Sum and Mean calculations. So if you have 10 observations and one of the bodypart variable value is missing then that part will only have 9 values. If you expect a percentage based on 10 values then you have more work involved defining what the numerator and denominator for the rate should be.
Do you want this as a report or a SAS data set?
a report would be great
proc summary data=have;
var brain -- lcolon;
output out=want mean=;
run;
You can transpose it if you want the information displayed vertically.
Thanks PaigeMiller for your reply.
The data has variables brain heart liver stomack llung lkidney lovary rlung rkidney rovary ucolon lcolon
llung, rlung need to be put as lung, same for lkidney, rkidney, etc.
How do do that before calculating the frequencies?
Thanks,
Maggie
@urban58 wrote:
The data has variables brain heart liver stomack llung lkidney lovary rlung rkidney rovary ucolon lcolon
llung, rlung need to be put as lung, same for lkidney, rkidney, etc.
This detail was not mentioned originally. Could you please re-write the problem to include this information and all other relevant information (including other relevant points you have not yet mentioned), so the entire problem is in one problem statement? Please include in your explanation what happens if llung and rlung are both 1, is the frequency = 1 or is the frequency = 2?
Do you expect the totals for variables like Lkidney and Rkidney as a single value?
You really aren't clear whether you need a data set or report. I am providing a report done one of the many possible:
Data have; input id brain heart liver stomack llung lkidney lovary rlung rkidney rovary ucolon lcolon; cards ; 1 1 0 1 0 1 0 0 0 1 0 1 1 2 0 0 0 1 0 1 0 0 0 1 0 1 3 0 1 0 0 1 0 1 1 1 0 0 0 4 0 0 0 0 0 1 0 0 0 0 0 1 5 1 1 1 1 1 1 1 1 1 1 1 1 6 0 0 1 0 0 0 0 0 1 0 1 1 7 0 0 1 1 0 0 0 0 0 0 0 0 8 1 1 1 1 0 0 0 1 1 1 0 1 9 0 1 0 1 0 0 0 0 0 0 0 0 10 0 0 0 0 1 1 1 1 0 1 1 0 ; run; /* assumes data have is sorted by ID AND that there are no duplicate values of ID */ proc transpose data=have out=trans; by id; run; proc format; value $bodyparts 'llung','rlung' ='Lung' 'lkidney','rkidney'='Kidney' 'lovary','rovary' ='Ovary' 'brain' ='Brain' 'heart' ='Heart' 'liver' ='Liver' 'stomack' ='Stomack' 'ucolon','lcolon' ='Colon' ; run; Proc tabulate data=trans; class _name_; format _name_ $bodyparts.; var col1; table _name_=' ', col1=' '*(Sum='Count'*f=best6. mean='%'*f=percent8.1) ; run;
Caution: The format assumes the current names of the variables are exactly as you have provided them here (stomach is the more typical English spelling if I recall correctly) and including the case. The defaults of the Proc Transpose create a variable named "_name_" containing the value of the name. You may need to double check spelling, upper and lower case included, to make the format text on the left side of each = to match your data. The default will also return the numeric value in Col1. These names can be changed but for an example like this should suffice.
Formats assigning common values to multiples or groups of values create groups that will be honored by most of the SAS report, analysis and graphing procedures. So I do not change any actual values of the _name_ but use the custom format to create the groups (and provide capitalization of the result).
When you have variables that take a value of 1/0 then the SUM statistic gives you the count of 1's and the Mean is a decimal percentage OF the observations with values. Missing values will be excluded for both the Sum and Mean calculations. So if you have 10 observations and one of the bodypart variable value is missing then that part will only have 9 values. If you expect a percentage based on 10 values then you have more work involved defining what the numerator and denominator for the rate should be.
Thank you ballardw, I will use this code on my data and ask if I don't understand something or accept as the solution your response.
Maggie
your code is just great ballardw, it answered my question exactly - sorry for the late response. Is there a way I can display the %s so the largest is 1st, etc.
Maggie
@urban58 wrote:
your code is just great ballardw, it answered my question exactly - sorry for the late response. Is there a way I can display the %s so the largest is 1st, etc.
Maggie
One way is to calculate the numbers needed and sort.
proc summary data=trans nway; class _name_; format _name_ $bodyparts.; var col1; output out=summary sum= mean= /autoname; run; proc sort data=summary; by descending col1_mean; run; proc print data=summary noobs label; var _name_ col1_sum col1_mean; format col1_mean percent8.1; label col1_sum='Count' col1_mean='%' ; run;
Caution: this summary data set only has one value of each of the two part elements and you can't always be sure which one it should be.
Proc summary (or Means) will only keep one of the values of the Class variable(s). The format still applies though. The autoname options on the requested statistics appends the _ and statistic to the variable (assumes the result is less than 32 characters, longer variables will get truncated to fit 32 characters).
Note that by default you will get two additional variables in the result of Proc Summary: _type_ , which indicates which combination(s) of class variables are present, the NWAY option means only the largest type is in the result, and _freq_ which is a raw count of the number of values used in the summary for each row of output.
Proc Tabulate does not have a "sort order by statistic" option so a different approach is needed.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.