BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
supp
Pyrite | Level 9

Hello SAS community,

 

I am working on a project and would appreciate thoughts on an efficient way to summarize some data sets that I am building. I will have 30-40 data sets with each data set tracking a unique study. Within each data set there will be one row per person participating in the study. There is one column for every day of the year with the column value indicating the persons status of the study such that:

 

(null) - not calculated in study (for that day)

0 - failing study

1 -  passing study

 

For each day of the year (all numeric columns) I want to total all persons participating in the study (all '1's and '0's), all people passing study (all '1's) as well and calculating the percent of population passing (count of '1's / count of '1's and '0's).

 

Example of data:

 

STUDY ID d0101 d0102 d0103 d0104 d1231
study1 person1 . . 0 1   1
study1 person2 0 0 0 0   1
study1 person3 0 0 1 1   .
study1 person4 1 1 1 .   .
study1 person5 . . 1 1   .
study1 person6 0 0 1 1   1
study1 person7 1 1 . .   .
study1 person8 1 1 0 1   .
study1 person9 1 0 1 0   1

 

Data want:

 

STUDY   d0101 d0102 d0103 d0104 d1231
study1 pass 4 3 5 5   4
study1 total 7 7 8 7   4
study1 percent pass 57.14% 42.86% 62.50% 71.43%   100.00%

 

Again there will be 30-40 studies (or data sets) that will need to be summarized, each with 1-3 million rows of data. Any suggestions on an efficient way to summarize the data? From my understanding proc means will not be able to calculate the percent passing. Is a data step my best option here? 

 

edit: desired output is a data set.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User
data have;
input STUDY $ ID $ d0101 d0102 d0103 d0104  d1231 ;
datalines;
study1 person1 . . 0 1   1 
study1 person2 0 0 0 0   1 
study1 person3 0 0 1 1   . 
study1 person4 1 1 1 .   . 
study1 person5 . . 1 1   . 
study1 person6 0 0 1 1   1 
study1 person7 1 1 . .   . 
study1 person8 1 1 0 1   . 
study1 person9 1 0 1 0   1 
;

proc tabulate data=have;
  class study ;
  var d: ;
  table study=''*(sum='Pass'*f=best5. n='Total' mean='Percent Pass'*f=percent8.2),
        d:
        /misstext=' ' box='Study';
 run;

View solution in original post

10 REPLIES 10
Cynthia_sas
SAS Super FREQ
Hi:
A DATA step program would be my last choice. PROC FREQ, PROC TABULATE and PROC REPORT are all able to calculate percents. The key will be to get the program working as you want for 1 data set, then to look at "macro-izing" the program so it can be run on multiple data sets.

In your desired output, as you show, are you sure that the % you show is correct? I don't think it should be 57.14% for every column based on the numbers you show.

Also, for some of the procedures, it might be easier to restructure the data otherwise, you'll have to list the column names for each day.

Cynthia
supp
Pyrite | Level 9

Thanks Cynthia. I did have a typo in my original post as the %'s were not correct. I have corrected it. I will look into the procedures you listed.

Jagadishkatam
Amethyst | Level 16

Please try the below code

 

data have;
input STUDY$ ID$ 	d0101 d0102 d0103;
cards;
study1 person1 .  . 0  
study1 person2 0  0 0  
study1 person3 0  0 1  
study1 person4 1  1 1  
study1 person5 .  . 1  
study1 person6 0  0 1  
study1 person7 1  1 .  
study1 person8 1  1 0  
study1 person9 1  0 1   
;

proc sql noprint;
select name, count(name) into: vars separated by '|', :cnt from dictionary.columns where libname='WORK' and memname='HAVE' and name like 'd%';
quit;

%put &vars &cnt;

options mprint;
%macro test;
proc sql;
create table want as select study, 
%do i = 1 %to &cnt;
sum(%scan(&vars, &i, "|")=1) as %scan(&vars, &i, "|"),
%end;
 'Pass' as status from have 
union 
select study, 
%do i = 1 %to &cnt;
sum(%scan(&vars, &i, "|") ne .) as %scan(&vars, &i, "|"),
%end; 
'Total' as status from have 
union
select study, 
%do i = 1 %to &cnt;
(sum(%scan(&vars, &i, "|") =1)/sum(%scan(&vars, &i, "|") ne .))*100 as %scan(&vars, &i, "|") ,
%end; 
'Percent Pass' as status from have group by study;
quit;
%mend;

%test;
Thanks,
Jag
supp
Pyrite | Level 9

Thanks Jag, I ran the code you provided on about 1.2 million people. It provided the results as expected and took about 2.5 minutes real time to run. Not too bad! Here is some of the log from running the macro:

 

NOTE: PROCEDURE SQL used (Total process time):
real time 2:28.29
user cpu time 1:48.30
system cpu time 58.03 seconds
memory 1068867.39k
OS Memory 1097204.00k

Jagadishkatam
Amethyst | Level 16
Thanks for testing the code, happy that it worked
Thanks,
Jag
andreas_lds
Jade | Level 19

Do you need a report or a dataset as result?

 

First idea: transpose + tabulate, unfortunately i don't have the time to finish the code today, the percent-pass row is missing:

 

data have;
   length study id $ 10 d0101-d0105 8;   
   input STUDY ID d0101 d0102 d0103 d0104 d0105;
   datalines;
study1 person1 . . 0 1 1
study1 person2 0 0 0 0 1
study1 person3 0 0 1 1 .
study1 person4 1 1 1 . .
study1 person5 . . 1 1 .
study1 person6 0 0 1 1 1
study1 person7 1 1 . . .
study1 person8 1 1 0 1 .
study1 person9 1 0 1 0 1
;
run;

proc sort data=have presorted;
   by study id;
run;

proc transpose data=work.have
      out=work.transposed(rename=(col1=passed))
      name= day
   ;
   by study id;
   var d0101 d0102 d0103 d0104 d0105;
run;

proc sort data=transposed out=sorted;
   by study day;
run;

proc tabulate data=work.sorted ;  
   var passed;
   class day / order=unformatted missing;
   class study / order=unformatted missing;
   table
      study*
      passed=' '*(
         sum='pass' 
         n='total'
      ),
      day
;
run;
ballardw
Super User

@supp wrote:

Hello SAS community,

 

I am working on a project and would appreciate thoughts on an efficient way to summarize some data sets that I am building. I will have 30-40 data sets with each data set tracking a unique study. Within each data set there will be one row per person participating in the study. There is one column for every day of the year with the column value indicating the persons status of the study such that:

 

(null) - not calculated in study (for that day)

0 - failing study

1 -  passing study

 

For each day of the year (all numeric columns) I want to total all persons participating in the study (all '1's and '0's), all people passing study (all '1's) as well and calculating the percent of population passing (count of '1's / count of '1's and '0's).

 

Example of data:

 

STUDY ID d0101 d0102 d0103 d0104 d1231
study1 person1 . . 0 1   1
study1 person2 0 0 0 0   1
study1 person3 0 0 1 1   .
study1 person4 1 1 1 .   .
study1 person5 . . 1 1   .
study1 person6 0 0 1 1   1
study1 person7 1 1 . .   .
study1 person8 1 1 0 1   .
study1 person9 1 0 1 0   1

 

I would suggest that the first step would be to put the data in a forma that is better conducive to analysis. You actually have actual data in the variable name, the date, and that can add complications when attempting to answer things like "values by month" or similar.

I would suggest transposing the data so that you have one record per date and that the date variable hold an actual SAS Date value.

 

Proc means or summary CAN do the percent passing because of the coding scheme you have. The N statistic for any group will give the number of participants, the SUM statistic will give the number of 1 values and the MEAN statistic will give the percentage 1's in a decimal form: .578 =>57.8%.

 

But do consider transposing the data a mentioned. If you have a date variable you can get the summary by calendar month or quarter by assigning a proper format to the date value.

ballardw
Super User
data have;
input STUDY $ ID $ d0101 d0102 d0103 d0104  d1231 ;
datalines;
study1 person1 . . 0 1   1 
study1 person2 0 0 0 0   1 
study1 person3 0 0 1 1   . 
study1 person4 1 1 1 .   . 
study1 person5 . . 1 1   . 
study1 person6 0 0 1 1   1 
study1 person7 1 1 . .   . 
study1 person8 1 1 0 1   . 
study1 person9 1 0 1 0   1 
;

proc tabulate data=have;
  class study ;
  var d: ;
  table study=''*(sum='Pass'*f=best5. n='Total' mean='Percent Pass'*f=percent8.2),
        d:
        /misstext=' ' box='Study';
 run;
supp
Pyrite | Level 9

Thanks Ballardw. Seems like proc tabulate might be the best option. I ran the code you provided on the same 1.2 million people/rows as the SQL marco solution. It looks like I got the same results and ran very quickly, about 6 seconds real time (compared to 2.5 minutes for the SQL macro solution). 

 

I have not used proc tabulate before so I will spend some time understanding the syntax. Again thanks for the assist!

Ksharp
Super User

I know you got brilliant solution. Just for fun for saving output as a dataset.

 

data have;
input STUDY $ ID $ d0101 d0102 d0103 d0104  d1231 ;
datalines;
study1 person1 . . 0 1   1 
study1 person2 0 0 0 0   1 
study1 person3 0 0 1 1   . 
study1 person4 1 1 1 .   . 
study1 person5 . . 1 1   . 
study1 person6 0 0 1 1   1 
study1 person7 1 1 . .   . 
study1 person8 1 1 0 1   . 
study1 person9 1 0 1 0   1 
;


proc sql;
create table want as
select 	study,'  pass' as flag,sum(d0101) as d0101, sum(d0102) as d0102
 from have 
  group by study

union

select 	study,' total' ,count(d0101) as d0101, count(d0102) as d0102
 from have 
  group by study

union

select 	study,'percent',sum(d0101)/count(d0101) as d0101, sum(d0102)/count(d0102) as d0102
 from have 
  group by study

order by 1,2;
quit;


SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1501 views
  • 5 likes
  • 6 in conversation