"Coalesce" a value across groups of observations

Accepted Solution Solved
Reply
Super Contributor
Super Contributor
Posts: 266
Accepted Solution

"Coalesce" a value across groups of observations

 

 

Here is some sample made up data to show what I want.

 

data successes;
   input code group something success;
datalines;
1104 1 8 0
1104 1 7 0
1105 1 2 0
1105 1 3 1
1105 1 4 0
1106 1 5 0
1106 1 6 1
1106 1 7 0
1106 1 8 0
1106 1 9 0
1106 1 1 0
1107 1 1 0
1107 1 3 0
1107 1 5 0
1107 1 7 0
1107 1 9 1
1108 1 2 0
1109 1 4 0
1109 1 6 0
1110 1 3 1
1110 1 2 1
1110 1 1 1
1111 2 2 0
1111 2 3 1
1111 2 4 0
1112 2 5 0
1112 2 6 1
1112 2 7 0
1112 2 8 0
1112 2 9 0
;
run;

For this data set, there are a number of observations for each code (in the real data anywhere between 1 and 25). A "1" in success means that that code was successful. In this data, codes 1104, 1108, and 1109 had no success in group 1. 1105, 1106, 1107, and 1110 had success.

 

What I want, is to adjust the data in such a way that I can do a Proc Freq to come up with a success percent by group. Group 1 has a success rate of 4/7 = 57%.

 

My thoughts were that I would make a successflag of some sort that was a 1 if any observation of a code had a success value of 1. Then de-duplicate based on code and successflag to create a data set I could run the FREQ against.

 

I tried

data mysuccesses;
	mysuccessflag = 0;
	if group = 1;
	do until (last.code);
		set successes;
		by code;
		if success = 1 then mysuccessflag = 1;
	end;
run;

but that is wrong in many ways. Looping in the data step error. Help here appreciated.

  

I was trying to figure out how to loop through all the observations for a particular code and then set my flag to 1 if any observation had a success of 1.

 

I think I want a data set that looks like

 

data successes;
   input code group something success mysuccessflag;
datalines;
1104 1 8 0 0
1104 1 7 0 0
1105 1 2 0 1
1105 1 3 1 1
1105 1 4 0 1
1106 1 5 0 1
1106 1 6 1 1
1106 1 7 0 1
1106 1 8 0 1
1106 1 9 0 1
1106 1 1 0 1
1107 1 1 0 1
1107 1 3 0 1
1107 1 5 0 1
1107 1 7 0 1
1107 1 9 1 1
1108 1 2 0 0
1109 1 4 0 0
1109 1 6 0 0
1110 1 3 1 1
1110 1 2 1 1
1110 1 1 1 1
1111 2 2 0 1
1111 2 3 1 1
1111 2 4 0 1
1112 2 5 0 1
1112 2 6 1 1
1112 2 7 0 1
1112 2 8 0 1
1112 2 9 0 1
;
run;

With that I could

proc sort data = successes out = cleanedsuccesses nodupkey;
	by code;
run;

proc freq data = cleanedsuccesses;
	where group = 1;
	tables mysuccessflag;
run;

which gets me where I want to be.

  

But I'm open to suggestions. I can do this with SQL but I want to do it something like this.

 

Thank you.


Accepted Solutions
Solution
‎03-02-2018 12:47 PM
Super User
Posts: 13,583

Re: "Coalesce" a value across groups of observations

Assuming you want a data set with the percent of any success by group, success is always 0/1 as described then this may work:

proc summary data=successes nway ;
   class group code;
   var success;
   output out=successmax max=;
run;

proc summary data=successmax nway;
   class group;
   var success;
   output out= perc mean=;
run;

The value of Success in the final set Perc is decimal percentage so instead of 57% it has 0.5714285714. Display as needed.

 

Note that use of Class variables to group the data doesn't require explicit sorting though the output data will be.

You can use data set options to drop the _type_ and _freq_ variables. _freq_ may help with verifying the number of codes used to calculate the percentage if concerned.

View solution in original post


All Replies
Solution
‎03-02-2018 12:47 PM
Super User
Posts: 13,583

Re: "Coalesce" a value across groups of observations

Assuming you want a data set with the percent of any success by group, success is always 0/1 as described then this may work:

proc summary data=successes nway ;
   class group code;
   var success;
   output out=successmax max=;
run;

proc summary data=successmax nway;
   class group;
   var success;
   output out= perc mean=;
run;

The value of Success in the final set Perc is decimal percentage so instead of 57% it has 0.5714285714. Display as needed.

 

Note that use of Class variables to group the data doesn't require explicit sorting though the output data will be.

You can use data set options to drop the _type_ and _freq_ variables. _freq_ may help with verifying the number of codes used to calculate the percentage if concerned.

Super Contributor
Super Contributor
Posts: 266

Re: "Coalesce" a value across groups of observations

proc summary data=successes nway ;

class group code;

var success;

output out=successmax max=;

run;

 

 

So, we are summarizing the data by group and code. 

The nway option collapses that down to one observation per combo of group and code.

Output creates a file.

Max= is where I kind of get lost.  That somehow magically assigns the maximum value of the variable named (var = success) as the value of the summarized success variable.  I think.

 

I don't understand it fully but it works.   If I add

 

PROC FORMAT;

   VALUE success_fmt

   0 = "No"

   1 = "Yes";

RUN;

 

proc freq data = successmax;

   where group = 1;

   format success success_fmt.;

   tables success;

run;

 

 

working right off the first created data set I get

                                          The SAS System           09:10 Friday, March 2, 2018   8

                                        The FREQ Procedure

                                                       Cumulative    Cumulative
                   success    Frequency     Percent     Frequency      Percent
                   
                       No            3       42.86             3        42.86
                       Yes           4       57.14             7       100.00

 

 

 

which gets me there. 

 

Very cool.  Thank you.

Super User
Posts: 13,583

Re: "Coalesce" a value across groups of observations


HB wrote:

proc summary data=successes nway ;

class group code;

var success;

output out=successmax max=;

run;

 

 

So, we are summarizing the data by group and code. 

The nway option collapses that down to one observation per combo of group and code.

Output creates a file.

Max= is where I kind of get lost.  That somehow magically assigns the maximum value of the variable named (var = success) as the value of the summarized success variable.  I think.

 

 

That is exactly what the MAX statistic does, report the maximum value of the variable(s) within the combinations of the class variable(s).

Since your description was sort of "I want a 'success' for the group/code combination if any of the individual records were a success" then one way to think of that is the maximum value.

0/1 coded variables are very nice for use with some of the summary statistics: n is the number of valid values; mean is percent and sum is the number of 1 values (successes, valid, true, what have you for interpretation).

Super User
Posts: 23,776

Re: "Coalesce" a value across groups of observations

Try a DoW loop instead. 

 

 

PROC Star
Posts: 1,843

Re: "Coalesce" a value across groups of observations

Dow fun:

data successes;
   input code group something success;
datalines;
1104 1 8 0
1104 1 7 0
1105 1 2 0
1105 1 3 1
1105 1 4 0
1106 1 5 0
1106 1 6 1
1106 1 7 0
1106 1 8 0
1106 1 9 0
1106 1 1 0
1107 1 1 0
1107 1 3 0
1107 1 5 0
1107 1 7 0
1107 1 9 1
1108 1 2 0
1109 1 4 0
1109 1 6 0
1110 1 3 1
1110 1 2 1
1110 1 1 1
1111 2 2 0
1111 2 3 1
1111 2 4 0
1112 2 5 0
1112 2 6 1
1112 2 7 0
1112 2 8 0
1112 2 9 0
;
run;

data want;
_success=0;__success=0;
do until(last.group);
	_found=0;
	do until(last.code);
	set successes;
	by group code;
	if not _found and success=1 then _success+success;
	if success=1 then do;__success+success;_found=1;end;
	end;
	if last.group then success_pct=divide(_success,__success)*100;
end;
drop _: code something success;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 242 views
  • 6 likes
  • 4 in conversation