Solved: Re: Thoughts on efficient approach to problem

supp · Posted 09-03-2019 09:46 AM

Hello SAS community,

I am working on a project and would appreciate thoughts on an efficient way to summarize some data sets that I am building. I will have 30-40 data sets with each data set tracking a unique study. Within each data set there will be one row per person participating in the study. There is one column for every day of the year with the column value indicating the persons status of the study such that:

(null) - not calculated in study (for that day)

0 - failing study

1 - passing study

For each day of the year (all numeric columns) I want to total all persons participating in the study (all '1's and '0's), all people passing study (all '1's) as well and calculating the percent of population passing (count of '1's / count of '1's and '0's).

Example of data:

STUDY	ID	d0101	d0102	d0103	d0104	…	d1231
study1	person1	.	.	0	1		1
study1	person2	0	0	0	0		1
study1	person3	0	0	1	1		.
study1	person4	1	1	1	.		.
study1	person5	.	.	1	1		.
study1	person6	0	0	1	1		1
study1	person7	1	1	.	.		.
study1	person8	1	1	0	1		.
study1	person9	1	0	1	0		1

Data want:

STUDY		d0101	d0102	d0103	d0104	…	d1231
study1	pass	4	3	5	5		4
study1	total	7	7	8	7		4
study1	percent pass	57.14%	42.86%	62.50%	71.43%		100.00%

Again there will be 30-40 studies (or data sets) that will need to be summarized, each with 1-3 million rows of data. Any suggestions on an efficient way to summarize the data? From my understanding proc means will not be able to calculate the percent passing. Is a data step my best option here?

edit: desired output is a data set.

ballardw · Posted 09-03-2019 11:50 AM

data have;
input STUDY $ ID $ d0101 d0102 d0103 d0104  d1231 ;
datalines;
study1 person1 . . 0 1   1 
study1 person2 0 0 0 0   1 
study1 person3 0 0 1 1   . 
study1 person4 1 1 1 .   . 
study1 person5 . . 1 1   . 
study1 person6 0 0 1 1   1 
study1 person7 1 1 . .   . 
study1 person8 1 1 0 1   . 
study1 person9 1 0 1 0   1 
;

proc tabulate data=have;
  class study ;
  var d: ;
  table study=''*(sum='Pass'*f=best5. n='Total' mean='Percent Pass'*f=percent8.2),
        d:
        /misstext=' ' box='Study';
 run;

View solution in original post

Cynthia_sas · Posted 09-03-2019 10:02 AM

Hi:
A DATA step program would be my last choice. PROC FREQ, PROC TABULATE and PROC REPORT are all able to calculate percents. The key will be to get the program working as you want for 1 data set, then to look at "macro-izing" the program so it can be run on multiple data sets.

In your desired output, as you show, are you sure that the % you show is correct? I don't think it should be 57.14% for every column based on the numbers you show.

Also, for some of the procedures, it might be easier to restructure the data otherwise, you'll have to list the column names for each day.

Cynthia

supp · Posted 09-03-2019 11:09 AM

Thanks Cynthia. I did have a typo in my original post as the %'s were not correct. I have corrected it. I will look into the procedures you listed.

Jagadishkatam · Posted 09-03-2019 10:30 AM

Please try the below code

data have;
input STUDY$ ID$ 	d0101 d0102 d0103;
cards;
study1 person1 .  . 0  
study1 person2 0  0 0  
study1 person3 0  0 1  
study1 person4 1  1 1  
study1 person5 .  . 1  
study1 person6 0  0 1  
study1 person7 1  1 .  
study1 person8 1  1 0  
study1 person9 1  0 1   
;

proc sql noprint;
select name, count(name) into: vars separated by '|', :cnt from dictionary.columns where libname='WORK' and memname='HAVE' and name like 'd%';
quit;

%put &vars &cnt;

options mprint;
%macro test;
proc sql;
create table want as select study, 
%do i = 1 %to &cnt;
sum(%scan(&vars, &i, "|")=1) as %scan(&vars, &i, "|"),
%end;
 'Pass' as status from have 
union 
select study, 
%do i = 1 %to &cnt;
sum(%scan(&vars, &i, "|") ne .) as %scan(&vars, &i, "|"),
%end; 
'Total' as status from have 
union
select study, 
%do i = 1 %to &cnt;
(sum(%scan(&vars, &i, "|") =1)/sum(%scan(&vars, &i, "|") ne .))*100 as %scan(&vars, &i, "|") ,
%end; 
'Percent Pass' as status from have group by study;
quit;
%mend;

%test;

Thanks,
Jag

supp · Posted 09-03-2019 11:41 AM

Thanks Jag, I ran the code you provided on about 1.2 million people. It provided the results as expected and took about 2.5 minutes real time to run. Not too bad! Here is some of the log from running the macro:

NOTE: PROCEDURE SQL used (Total process time):
real time 2:28.29
user cpu time 1:48.30
system cpu time 58.03 seconds
memory 1068867.39k
OS Memory 1097204.00k

Jagadishkatam · Posted 09-03-2019 11:49 AM

Thanks for testing the code, happy that it worked

Thanks,
Jag

andreas_lds · Posted 09-03-2019 10:35 AM

Do you need a report or a dataset as result?

First idea: transpose + tabulate, unfortunately i don't have the time to finish the code today, the percent-pass row is missing:

data have;
   length study id $ 10 d0101-d0105 8;   
   input STUDY ID d0101 d0102 d0103 d0104 d0105;
   datalines;
study1 person1 . . 0 1 1
study1 person2 0 0 0 0 1
study1 person3 0 0 1 1 .
study1 person4 1 1 1 . .
study1 person5 . . 1 1 .
study1 person6 0 0 1 1 1
study1 person7 1 1 . . .
study1 person8 1 1 0 1 .
study1 person9 1 0 1 0 1
;
run;

proc sort data=have presorted;
   by study id;
run;

proc transpose data=work.have
      out=work.transposed(rename=(col1=passed))
      name= day
   ;
   by study id;
   var d0101 d0102 d0103 d0104 d0105;
run;

proc sort data=transposed out=sorted;
   by study day;
run;

proc tabulate data=work.sorted ;  
   var passed;
   class day / order=unformatted missing;
   class study / order=unformatted missing;
   table
      study*
      passed=' '*(
         sum='pass' 
         n='total'
      ),
      day
;
run;

ballardw · Posted 09-03-2019 10:44 AM

@supp wrote:

Hello SAS community,

I am working on a project and would appreciate thoughts on an efficient way to summarize some data sets that I am building. I will have 30-40 data sets with each data set tracking a unique study. Within each data set there will be one row per person participating in the study. There is one column for every day of the year with the column value indicating the persons status of the study such that:

(null) - not calculated in study (for that day)

0 - failing study

1 - passing study

For each day of the year (all numeric columns) I want to total all persons participating in the study (all '1's and '0's), all people passing study (all '1's) as well and calculating the percent of population passing (count of '1's / count of '1's and '0's).

Example of data:

STUDY ID d0101 d0102 d0103 d0104 … d1231

study1 person1 . . 0 1 1

study1 person2 0 0 0 0 1

study1 person3 0 0 1 1 .

study1 person4 1 1 1 . .

study1 person5 . . 1 1 .

study1 person6 0 0 1 1 1

study1 person7 1 1 . . .

study1 person8 1 1 0 1 .

study1 person9 1 0 1 0 1

I would suggest that the first step would be to put the data in a forma that is better conducive to analysis. You actually have actual data in the variable name, the date, and that can add complications when attempting to answer things like "values by month" or similar.

I would suggest transposing the data so that you have one record per date and that the date variable hold an actual SAS Date value.

Proc means or summary CAN do the percent passing because of the coding scheme you have. The N statistic for any group will give the number of participants, the SUM statistic will give the number of 1 values and the MEAN statistic will give the percentage 1's in a decimal form: .578 =>57.8%.

But do consider transposing the data a mentioned. If you have a date variable you can get the summary by calendar month or quarter by assigning a proper format to the date value.

ballardw · Posted 09-03-2019 11:50 AM

data have;
input STUDY $ ID $ d0101 d0102 d0103 d0104  d1231 ;
datalines;
study1 person1 . . 0 1   1 
study1 person2 0 0 0 0   1 
study1 person3 0 0 1 1   . 
study1 person4 1 1 1 .   . 
study1 person5 . . 1 1   . 
study1 person6 0 0 1 1   1 
study1 person7 1 1 . .   . 
study1 person8 1 1 0 1   . 
study1 person9 1 0 1 0   1 
;

proc tabulate data=have;
  class study ;
  var d: ;
  table study=''*(sum='Pass'*f=best5. n='Total' mean='Percent Pass'*f=percent8.2),
        d:
        /misstext=' ' box='Study';
 run;

supp · Posted 09-03-2019 12:03 PM

Thanks Ballardw. Seems like proc tabulate might be the best option. I ran the code you provided on the same 1.2 million people/rows as the SQL marco solution. It looks like I got the same results and ran very quickly, about 6 seconds real time (compared to 2.5 minutes for the SQL macro solution).

I have not used proc tabulate before so I will spend some time understanding the syntax. Again thanks for the assist!

Ksharp · Posted 09-04-2019 09:19 AM

I know you got brilliant solution. Just for fun for saving output as a dataset.

data have;
input STUDY $ ID $ d0101 d0102 d0103 d0104  d1231 ;
datalines;
study1 person1 . . 0 1   1 
study1 person2 0 0 0 0   1 
study1 person3 0 0 1 1   . 
study1 person4 1 1 1 .   . 
study1 person5 . . 1 1   . 
study1 person6 0 0 1 1   1 
study1 person7 1 1 . .   . 
study1 person8 1 1 0 1   . 
study1 person9 1 0 1 0   1 
;


proc sql;
create table want as
select 	study,'  pass' as flag,sum(d0101) as d0101, sum(d0102) as d0102
 from have 
  group by study

union

select 	study,' total' ,count(d0101) as d0101, count(d0102) as d0102
 from have 
  group by study

union

select 	study,'percent',sum(d0101)/count(d0101) as d0101, sum(d0102)/count(d0102) as d0102
 from have 
  group by study

order by 1,2;
quit;

Register Today!

SAS Training: Just a Click Away