Hello SAS community,
I am working on a project and would appreciate thoughts on an efficient way to summarize some data sets that I am building. I will have 30-40 data sets with each data set tracking a unique study. Within each data set there will be one row per person participating in the study. There is one column for every day of the year with the column value indicating the persons status of the study such that:
(null) - not calculated in study (for that day)
0 - failing study
1 - passing study
For each day of the year (all numeric columns) I want to total all persons participating in the study (all '1's and '0's), all people passing study (all '1's) as well and calculating the percent of population passing (count of '1's / count of '1's and '0's).
Example of data:
STUDY | ID | d0101 | d0102 | d0103 | d0104 | … | d1231 |
study1 | person1 | . | . | 0 | 1 | 1 | |
study1 | person2 | 0 | 0 | 0 | 0 | 1 | |
study1 | person3 | 0 | 0 | 1 | 1 | . | |
study1 | person4 | 1 | 1 | 1 | . | . | |
study1 | person5 | . | . | 1 | 1 | . | |
study1 | person6 | 0 | 0 | 1 | 1 | 1 | |
study1 | person7 | 1 | 1 | . | . | . | |
study1 | person8 | 1 | 1 | 0 | 1 | . | |
study1 | person9 | 1 | 0 | 1 | 0 | 1 |
Data want:
STUDY | d0101 | d0102 | d0103 | d0104 | … | d1231 | |
study1 | pass | 4 | 3 | 5 | 5 | 4 | |
study1 | total | 7 | 7 | 8 | 7 | 4 | |
study1 | percent pass | 57.14% | 42.86% | 62.50% | 71.43% | 100.00% |
Again there will be 30-40 studies (or data sets) that will need to be summarized, each with 1-3 million rows of data. Any suggestions on an efficient way to summarize the data? From my understanding proc means will not be able to calculate the percent passing. Is a data step my best option here?
edit: desired output is a data set.
data have; input STUDY $ ID $ d0101 d0102 d0103 d0104 d1231 ; datalines; study1 person1 . . 0 1 1 study1 person2 0 0 0 0 1 study1 person3 0 0 1 1 . study1 person4 1 1 1 . . study1 person5 . . 1 1 . study1 person6 0 0 1 1 1 study1 person7 1 1 . . . study1 person8 1 1 0 1 . study1 person9 1 0 1 0 1 ; proc tabulate data=have; class study ; var d: ; table study=''*(sum='Pass'*f=best5. n='Total' mean='Percent Pass'*f=percent8.2), d: /misstext=' ' box='Study'; run;
Thanks Cynthia. I did have a typo in my original post as the %'s were not correct. I have corrected it. I will look into the procedures you listed.
Please try the below code
data have;
input STUDY$ ID$ d0101 d0102 d0103;
cards;
study1 person1 . . 0
study1 person2 0 0 0
study1 person3 0 0 1
study1 person4 1 1 1
study1 person5 . . 1
study1 person6 0 0 1
study1 person7 1 1 .
study1 person8 1 1 0
study1 person9 1 0 1
;
proc sql noprint;
select name, count(name) into: vars separated by '|', :cnt from dictionary.columns where libname='WORK' and memname='HAVE' and name like 'd%';
quit;
%put &vars &cnt;
options mprint;
%macro test;
proc sql;
create table want as select study,
%do i = 1 %to &cnt;
sum(%scan(&vars, &i, "|")=1) as %scan(&vars, &i, "|"),
%end;
'Pass' as status from have
union
select study,
%do i = 1 %to &cnt;
sum(%scan(&vars, &i, "|") ne .) as %scan(&vars, &i, "|"),
%end;
'Total' as status from have
union
select study,
%do i = 1 %to &cnt;
(sum(%scan(&vars, &i, "|") =1)/sum(%scan(&vars, &i, "|") ne .))*100 as %scan(&vars, &i, "|") ,
%end;
'Percent Pass' as status from have group by study;
quit;
%mend;
%test;
Thanks Jag, I ran the code you provided on about 1.2 million people. It provided the results as expected and took about 2.5 minutes real time to run. Not too bad! Here is some of the log from running the macro:
NOTE: PROCEDURE SQL used (Total process time):
real time 2:28.29
user cpu time 1:48.30
system cpu time 58.03 seconds
memory 1068867.39k
OS Memory 1097204.00k
Do you need a report or a dataset as result?
First idea: transpose + tabulate, unfortunately i don't have the time to finish the code today, the percent-pass row is missing:
data have; length study id $ 10 d0101-d0105 8; input STUDY ID d0101 d0102 d0103 d0104 d0105; datalines; study1 person1 . . 0 1 1 study1 person2 0 0 0 0 1 study1 person3 0 0 1 1 . study1 person4 1 1 1 . . study1 person5 . . 1 1 . study1 person6 0 0 1 1 1 study1 person7 1 1 . . . study1 person8 1 1 0 1 . study1 person9 1 0 1 0 1 ; run; proc sort data=have presorted; by study id; run; proc transpose data=work.have out=work.transposed(rename=(col1=passed)) name= day ; by study id; var d0101 d0102 d0103 d0104 d0105; run; proc sort data=transposed out=sorted; by study day; run; proc tabulate data=work.sorted ; var passed; class day / order=unformatted missing; class study / order=unformatted missing; table study* passed=' '*( sum='pass' n='total' ), day ; run;
@supp wrote:
Hello SAS community,
I am working on a project and would appreciate thoughts on an efficient way to summarize some data sets that I am building. I will have 30-40 data sets with each data set tracking a unique study. Within each data set there will be one row per person participating in the study. There is one column for every day of the year with the column value indicating the persons status of the study such that:
(null) - not calculated in study (for that day)
0 - failing study
1 - passing study
For each day of the year (all numeric columns) I want to total all persons participating in the study (all '1's and '0's), all people passing study (all '1's) as well and calculating the percent of population passing (count of '1's / count of '1's and '0's).
Example of data:
STUDY ID d0101 d0102 d0103 d0104 … d1231 study1 person1 . . 0 1 1 study1 person2 0 0 0 0 1 study1 person3 0 0 1 1 . study1 person4 1 1 1 . . study1 person5 . . 1 1 . study1 person6 0 0 1 1 1 study1 person7 1 1 . . . study1 person8 1 1 0 1 . study1 person9 1 0 1 0 1
I would suggest that the first step would be to put the data in a forma that is better conducive to analysis. You actually have actual data in the variable name, the date, and that can add complications when attempting to answer things like "values by month" or similar.
I would suggest transposing the data so that you have one record per date and that the date variable hold an actual SAS Date value.
Proc means or summary CAN do the percent passing because of the coding scheme you have. The N statistic for any group will give the number of participants, the SUM statistic will give the number of 1 values and the MEAN statistic will give the percentage 1's in a decimal form: .578 =>57.8%.
But do consider transposing the data a mentioned. If you have a date variable you can get the summary by calendar month or quarter by assigning a proper format to the date value.
data have; input STUDY $ ID $ d0101 d0102 d0103 d0104 d1231 ; datalines; study1 person1 . . 0 1 1 study1 person2 0 0 0 0 1 study1 person3 0 0 1 1 . study1 person4 1 1 1 . . study1 person5 . . 1 1 . study1 person6 0 0 1 1 1 study1 person7 1 1 . . . study1 person8 1 1 0 1 . study1 person9 1 0 1 0 1 ; proc tabulate data=have; class study ; var d: ; table study=''*(sum='Pass'*f=best5. n='Total' mean='Percent Pass'*f=percent8.2), d: /misstext=' ' box='Study'; run;
Thanks Ballardw. Seems like proc tabulate might be the best option. I ran the code you provided on the same 1.2 million people/rows as the SQL marco solution. It looks like I got the same results and ran very quickly, about 6 seconds real time (compared to 2.5 minutes for the SQL macro solution).
I have not used proc tabulate before so I will spend some time understanding the syntax. Again thanks for the assist!
I know you got brilliant solution. Just for fun for saving output as a dataset.
data have;
input STUDY $ ID $ d0101 d0102 d0103 d0104 d1231 ;
datalines;
study1 person1 . . 0 1 1
study1 person2 0 0 0 0 1
study1 person3 0 0 1 1 .
study1 person4 1 1 1 . .
study1 person5 . . 1 1 .
study1 person6 0 0 1 1 1
study1 person7 1 1 . . .
study1 person8 1 1 0 1 .
study1 person9 1 0 1 0 1
;
proc sql;
create table want as
select study,' pass' as flag,sum(d0101) as d0101, sum(d0102) as d0102
from have
group by study
union
select study,' total' ,count(d0101) as d0101, count(d0102) as d0102
from have
group by study
union
select study,'percent',sum(d0101)/count(d0101) as d0101, sum(d0102)/count(d0102) as d0102
from have
group by study
order by 1,2;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.