Solved: Comparison of groups within groups

lezgin · Posted 11-09-2017 11:49 PM

I have the following dataset.

id1	id2	date	value
1	A	2014	0.5
1	A	2014	0.7
1	B	2014	0.8
1	C	2014	0.1
1	C	2014	1.1
1	D	2014	1.2
1	A	2015	1.3
1	A	2015	2.4
1	C	2015	5.6
1	D	2015	0.4
1	E	2015	0.7
1	F	2015	0.3
1	C	2016	1.3
1	D	2016	2.5
1	D	2016	2
1	F	2016	4.1
1	G	2016	3.1
1	H	2016	2.1
2	B	2014	5.4

I want to compare the distinct id2s by each id1 and year group. For instance, I want to compare id2s where id1=1 and year=2014 with id2s where id1=1 and year=2015. So, in the first case(id1=1 and year=2014), the distinct values of id2 are A, B, C, D and for the second case (id1=1 and year=2015) distinct values are A, C, D, E. I can co this by using first.id1, first.year etc. After this step the dataset will look like this:

id1	id2	date	value
1	A	2014	0.5
1	B	2014	0.8
1	C	2014	0.1
1	D	2014	1.2
1	A	2015	1.3
1	C	2015	5.6
1	D	2015	0.4
1	E	2015	0.7
1	F	2015	0.3
1	C	2016	1.3
1	D	2016	2.5
1	F	2016	4.1
1	G	2016	3.1
1	H	2016	2.1

Now, I want to compare these three groups and get the id2 that exist in first case but not exist in the second case (i.e. B exists in first id1xyear group but not in second id1xyear group). In the end, the measure I want will divide the total number of distinct id2s in first id1xyear with total number of nonexisting id2s in the second id1xyear (1/4 in the first comparison from 2014 to 2015).

It will repeat the same procedure with id1=2 and so on.

id1	date	measure
1	2014	.
1	2015	1/4
1	2016	2/5
2	2014

I hope I was able to convey what I need. I would appreciate any help. Thanks.

Ksharp · Posted 11-13-2017 09:19 AM

The following code is the fixed my original code. Sorry .

/*******For measure1*********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
 set have;
 by id1 date;
 if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;

call missing(of x{*}); /* <--*/
k=0;
do until(last.date);
 set have;
 by id1 date;
 k+1;x{k}=id2;
end;
run;

/******For measure2***********/

data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
 set have;
 by id1 date;
 if id2 not in x then m+1; /*********/
end;
want=divide(m,k);  /*********/
if id1 ne lag(id1) then call missing(want);
output;

call missing(of x{*}); /*<---*/
k=0;
do until(last.date);
 set have;
 by id1 date;
 k+1;x{k}=id2;
end;

View solution in original post

Ksharp · Posted 11-10-2017 08:46 AM

I think it should be

id1	date	measure
1	2014	.
1	2015	1/5

Use your second dataset.

data have;
infile cards expandtabs ;
input id1	id2	$ date	value;
cards;
1	A	2014	0.5
1	B	2014	0.8
1	C	2014	0.1
1	D	2014	1.2
1	A	2015	1.3
1	C	2015	5.6
1	D	2015	0.4
1	E	2015	0.7
1	F	2015	0.3
1	C	2016	1.3
1	D	2016	2.5
1	F	2016	4.1
1	G	2016	3.1
1	H	2016	2.1
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do n=1 by 1 until(last.date);
 set have;
 by id1 date;
 if id2 in x then m+1;
end;
want=divide(k-m,n);
if id1 ne lag(id1) then call missing(want);
output;

k=0;
do until(last.date);
 set have;
 by id1 date;
 k+1;x{k}=id2;
end;
run;

lezgin · Posted 11-10-2017 01:47 PM

Thank you very much. I appreciate your help but I think it is not giving what I am looking for. From 2014 to 2015 A,B,C,D becomes A,C,D,E,F. From 2015 to 2016 A,C,D,E,F becomes C,D,F,G,H.

So what I need is from 2014 to 2015: identify that B is missing in 2015 and divide this (1 missing) with all four (A,B,C,D) in 2014=1/4

From 2015 to 2016: identify that A and E is missing (2 missing). Divide 2 by what I have for 2015 (A,C,D,E,F) by 5= 2/5.

It should repeat this by id1. When next id1 comes the code should compare next first id1xdate group with the next date of same id1. I hope this is more clear now. Again, thank you very much.

Ksharp · Posted 11-11-2017 05:27 AM

OK. I get confused with denominator.

data have;
infile cards expandtabs ;
input id1	id2	$ date	value;
cards;
1	A	2014	0.5
1	B	2014	0.8
1	C	2014	0.1
1	D	2014	1.2
1	A	2015	1.3
1	C	2015	5.6
1	D	2015	0.4
1	E	2015	0.7
1	F	2015	0.3
1	C	2016	1.3
1	D	2016	2.5
1	F	2016	4.1
1	G	2016	3.1
1	H	2016	2.1
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
 set have;
 by id1 date;
 if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;

k=0;
do until(last.date);
 set have;
 by id1 date;
 k+1;x{k}=id2;
end;
run;

lezgin · Posted 11-11-2017 10:01 PM

Thanks for your patience. I understand if you don't have time for this. You have already been very helpful. I appreciate that. I realized that I need to decompose the measure into two parts. Let me give you a new example of each measure. There are only subtle changes.

Measure1: Suppose we have A,B,C,D in id1xdate1 and A,B,C,E in id1xdate2 and C,D,E,F in id1xdate3. Measure1 finds that D from id1xdate1 is missing in id1xdate2 and divides D (1) by A,B,C,D (4) of id1xdate1 and returns 1/4. Similarly, from id1xdate2 to id1xdate3, this measure finds that A,B are missing in id1xdate2 and divides (2) by A,B,C,E (4) and returns 2/4. This measure makes the second comparison between id2s of id1xdate2 and id1xdate3. The change from id1xdate2 to id1xdate3 matters now.

Measure 2: Again, suppose we have A,B,C,D in id1xdate1 and A,B,C,E in id1xdate2 and C,D,E,F in id1xdate3. Measure2 finds that E is not present in id1xdate1 but is present in id1xdate2. This time measure2 divides E (1) by A,B,C,D (4) in id1xdate1 and return 1/4. From id1xdate2 to id1xdate3, the code finds that D,F are not present in id1xdate2 but are present in id1xdate3. Measure divides D,F (2) by A,B,C,E (4) and return 2/4. Similarly, this measure makes the second comparison between id2s of id1xdate2 and id1xdate3. The change from id1xdate2 to id1xdate3 matters now.

Dataset for this example is below. I tried to modify your code to separate the two measures but I couldn't get the code to make the comparison between only two id1xdates. Not from the beginning.

id1	id2	date	value
1	A	20140530	0.5
1	B	20140530	0.8
1	C	20140530	0.1
1	D	20140530	1.2
1	A	20140831	1.3
1	B	20140831	5.6
1	C	20140831	0.4
1	E	20140831	0.7
1	C	20141130	0.3
1	D	20141130	1.3
1	E	20141130	2.5
1	F	20141130	4.1
2	A	19990130	3

Ksharp · Posted 11-12-2017 06:00 AM

OK. the following code should work for your second measure.

the code I have already posted is for your first measure.

Note: use your second table in your first post .

data have;
infile cards expandtabs ;
input id1	id2	$ date	value;
cards;
1	A	20140530	0.5
1	B	20140530	0.8
1	C	20140530	0.1
1	D	20140530	1.2
1	A	20140831	1.3
1	B	20140831	5.6
1	C	20140831	0.4
1	E	20140831	0.7
1	C	20141130	0.3
1	D	20141130	1.3
1	E	20141130	2.5
1	F	20141130	4.1
2	A	19990130	3
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
 set have;
 by id1 date;
 if id2 not in x then m+1; /*********/
end;
want=divide(m,k);  /*********/
if id1 ne lag(id1) then call missing(want);
output;

k=0;
do until(last.date);
 set have;
 by id1 date;
 k+1;x{k}=id2;
end;
run;

lezgin · Posted 11-12-2017 01:01 PM

In the first sheet of attached excel file I included a portion of the data. The first measure mostly works fine but sometimes returns negative values where it should return 0. The second measure does not work properly and I wasn't able to detect the problem. In the second sheet of the attached excel file I included the dateset transposed to show you how it should work. I included what the measure should return and what it returns when you use the dataset in the first sheet. The discrepancies are in red. Thank you very much.

Ksharp · Posted 11-13-2017 09:07 AM

I feel very Sorry. My bad. Add CALL MISSING(of x{*});

call missing(of x{*}) ;
k=0;
do until(last.date);
 set have;
 by id1 date;
 k+1;x{k}=id2;
end;
run;

Ksharp · Posted 11-13-2017 09:13 AM

Hi. I rewrite it by Hash Table. Hope you like it.

/*****For measure1 *******/
proc import datafile='/folders/myfolders/sascommunity.xlsx' out=have dbms=xlsx replace;
run;

proc sort data=have;
 by id1 date;
run;

data want;
if _n_=1 then do;
  if 0 then set have;
  declare hash h1();
  h1.definekey('id2');
  h1.definedone();
end;

n=h1.num_items;
do until(last.date);
 set have;
 by id1 date;
 rc=h1.remove();
end;
m=h1.num_items;
want=divide(m,n);
if id1 ne lag(id1) then call missing(want);
output;
h1.clear();
do until(last.date);
 set have;
 by id1 date;
 h1.ref();
end;
keep id1 id2 date want m n;
run;

/******For measure2 ***********/
proc import datafile='/folders/myfolders/sascommunity.xlsx' out=have dbms=xlsx replace;
run;

proc sort data=have;
 by id1 date;
run;

data want;
if _n_=1 then do;
  if 0 then set have;
  declare hash h1();
  h1.definekey('id2');
  h1.definedone();
end;

m=0;
n=h1.num_items;
do until(last.date);
 set have;
 by id1 date;
 if h1.check() ne 0 then m+1;
end;
want=divide(m,n);
if id1 ne lag(id1) then call missing(want);
output;
h1.clear();
do until(last.date);
 set have;
 by id1 date;
 h1.ref();
end;
keep id1 id2 date want m n;
run;

Ksharp · Posted 11-13-2017 09:19 AM

The following code is the fixed my original code. Sorry .

/*******For measure1*********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
 set have;
 by id1 date;
 if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;

call missing(of x{*}); /* <--*/
k=0;
do until(last.date);
 set have;
 by id1 date;
 k+1;x{k}=id2;
end;
run;

/******For measure2***********/

data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
 set have;
 by id1 date;
 if id2 not in x then m+1; /*********/
end;
want=divide(m,k);  /*********/
if id1 ne lag(id1) then call missing(want);
output;

call missing(of x{*}); /*<---*/
k=0;
do until(last.date);
 set have;
 by id1 date;
 k+1;x{k}=id2;
end;

lezgin · Posted 11-13-2017 12:47 PM

This has been very helpful. Thank you very much for your kind and quick replies.

Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

Re: Comparison of groups within groups

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away