I have the following dataset.
id1 | id2 | date | value |
1 | A | 2014 | 0.5 |
1 | A | 2014 | 0.7 |
1 | B | 2014 | 0.8 |
1 | C | 2014 | 0.1 |
1 | C | 2014 | 1.1 |
1 | D | 2014 | 1.2 |
1 | A | 2015 | 1.3 |
1 | A | 2015 | 2.4 |
1 | C | 2015 | 5.6 |
1 | D | 2015 | 0.4 |
1 | E | 2015 | 0.7 |
1 | F | 2015 | 0.3 |
1 | C | 2016 | 1.3 |
1 | D | 2016 | 2.5 |
1 | D | 2016 | 2 |
1 | F | 2016 | 4.1 |
1 | G | 2016 | 3.1 |
1 | H | 2016 | 2.1 |
2 | B | 2014 | 5.4 |
I want to compare the distinct id2s by each id1 and year group. For instance, I want to compare id2s where id1=1 and year=2014 with id2s where id1=1 and year=2015. So, in the first case(id1=1 and year=2014), the distinct values of id2 are A, B, C, D and for the second case (id1=1 and year=2015) distinct values are A, C, D, E. I can co this by using first.id1, first.year etc. After this step the dataset will look like this:
id1 | id2 | date | value |
1 | A | 2014 | 0.5 |
1 | B | 2014 | 0.8 |
1 | C | 2014 | 0.1 |
1 | D | 2014 | 1.2 |
1 | A | 2015 | 1.3 |
1 | C | 2015 | 5.6 |
1 | D | 2015 | 0.4 |
1 | E | 2015 | 0.7 |
1 | F | 2015 | 0.3 |
1 | C | 2016 | 1.3 |
1 | D | 2016 | 2.5 |
1 | F | 2016 | 4.1 |
1 | G | 2016 | 3.1 |
1 | H | 2016 | 2.1 |
Now, I want to compare these three groups and get the id2 that exist in first case but not exist in the second case (i.e. B exists in first id1xyear group but not in second id1xyear group). In the end, the measure I want will divide the total number of distinct id2s in first id1xyear with total number of nonexisting id2s in the second id1xyear (1/4 in the first comparison from 2014 to 2015).
It will repeat the same procedure with id1=2 and so on.
id1 | date | measure |
1 | 2014 | . |
1 | 2015 | 1/4 |
1 | 2016 | 2/5 |
2 | 2014 |
I hope I was able to convey what I need. I would appreciate any help. Thanks.
The following code is the fixed my original code. Sorry .
/*******For measure1*********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;
call missing(of x{*}); /* <--*/
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
/******For measure2***********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 not in x then m+1; /*********/
end;
want=divide(m,k); /*********/
if id1 ne lag(id1) then call missing(want);
output;
call missing(of x{*}); /*<---*/
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
I think it should be
id1 | date | measure |
1 | 2014 | . |
1 | 2015 | 1/5 |
Use your second dataset.
data have;
infile cards expandtabs ;
input id1 id2 $ date value;
cards;
1 A 2014 0.5
1 B 2014 0.8
1 C 2014 0.1
1 D 2014 1.2
1 A 2015 1.3
1 C 2015 5.6
1 D 2015 0.4
1 E 2015 0.7
1 F 2015 0.3
1 C 2016 1.3
1 D 2016 2.5
1 F 2016 4.1
1 G 2016 3.1
1 H 2016 2.1
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do n=1 by 1 until(last.date);
set have;
by id1 date;
if id2 in x then m+1;
end;
want=divide(k-m,n);
if id1 ne lag(id1) then call missing(want);
output;
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
Thank you very much. I appreciate your help but I think it is not giving what I am looking for. From 2014 to 2015 A,B,C,D becomes A,C,D,E,F. From 2015 to 2016 A,C,D,E,F becomes C,D,F,G,H.
So what I need is from 2014 to 2015: identify that B is missing in 2015 and divide this (1 missing) with all four (A,B,C,D) in 2014=1/4
From 2015 to 2016: identify that A and E is missing (2 missing). Divide 2 by what I have for 2015 (A,C,D,E,F) by 5= 2/5.
It should repeat this by id1. When next id1 comes the code should compare next first id1xdate group with the next date of same id1. I hope this is more clear now. Again, thank you very much.
OK. I get confused with denominator.
data have;
infile cards expandtabs ;
input id1 id2 $ date value;
cards;
1 A 2014 0.5
1 B 2014 0.8
1 C 2014 0.1
1 D 2014 1.2
1 A 2015 1.3
1 C 2015 5.6
1 D 2015 0.4
1 E 2015 0.7
1 F 2015 0.3
1 C 2016 1.3
1 D 2016 2.5
1 F 2016 4.1
1 G 2016 3.1
1 H 2016 2.1
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
Thanks for your patience. I understand if you don't have time for this. You have already been very helpful. I appreciate that. I realized that I need to decompose the measure into two parts. Let me give you a new example of each measure. There are only subtle changes.
Measure1: Suppose we have A,B,C,D in id1xdate1 and A,B,C,E in id1xdate2 and C,D,E,F in id1xdate3. Measure1 finds that D from id1xdate1 is missing in id1xdate2 and divides D (1) by A,B,C,D (4) of id1xdate1 and returns 1/4. Similarly, from id1xdate2 to id1xdate3, this measure finds that A,B are missing in id1xdate2 and divides (2) by A,B,C,E (4) and returns 2/4. This measure makes the second comparison between id2s of id1xdate2 and id1xdate3. The change from id1xdate2 to id1xdate3 matters now.
Measure 2: Again, suppose we have A,B,C,D in id1xdate1 and A,B,C,E in id1xdate2 and C,D,E,F in id1xdate3. Measure2 finds that E is not present in id1xdate1 but is present in id1xdate2. This time measure2 divides E (1) by A,B,C,D (4) in id1xdate1 and return 1/4. From id1xdate2 to id1xdate3, the code finds that D,F are not present in id1xdate2 but are present in id1xdate3. Measure divides D,F (2) by A,B,C,E (4) and return 2/4. Similarly, this measure makes the second comparison between id2s of id1xdate2 and id1xdate3. The change from id1xdate2 to id1xdate3 matters now.
Dataset for this example is below. I tried to modify your code to separate the two measures but I couldn't get the code to make the comparison between only two id1xdates. Not from the beginning.
id1 | id2 | date | value |
1 | A | 20140530 | 0.5 |
1 | B | 20140530 | 0.8 |
1 | C | 20140530 | 0.1 |
1 | D | 20140530 | 1.2 |
1 | A | 20140831 | 1.3 |
1 | B | 20140831 | 5.6 |
1 | C | 20140831 | 0.4 |
1 | E | 20140831 | 0.7 |
1 | C | 20141130 | 0.3 |
1 | D | 20141130 | 1.3 |
1 | E | 20141130 | 2.5 |
1 | F | 20141130 | 4.1 |
2 | A | 19990130 | 3 |
OK. the following code should work for your second measure.
the code I have already posted is for your first measure.
Note: use your second table in your first post .
data have;
infile cards expandtabs ;
input id1 id2 $ date value;
cards;
1 A 20140530 0.5
1 B 20140530 0.8
1 C 20140530 0.1
1 D 20140530 1.2
1 A 20140831 1.3
1 B 20140831 5.6
1 C 20140831 0.4
1 E 20140831 0.7
1 C 20141130 0.3
1 D 20141130 1.3
1 E 20141130 2.5
1 F 20141130 4.1
2 A 19990130 3
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 not in x then m+1; /*********/
end;
want=divide(m,k); /*********/
if id1 ne lag(id1) then call missing(want);
output;
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
In the first sheet of attached excel file I included a portion of the data. The first measure mostly works fine but sometimes returns negative values where it should return 0. The second measure does not work properly and I wasn't able to detect the problem. In the second sheet of the attached excel file I included the dateset transposed to show you how it should work. I included what the measure should return and what it returns when you use the dataset in the first sheet. The discrepancies are in red. Thank you very much.
I feel very Sorry. My bad. Add CALL MISSING(of x{*});
call missing(of x{*}) ;
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
Hi. I rewrite it by Hash Table. Hope you like it.
/*****For measure1 *******/
proc import datafile='/folders/myfolders/sascommunity.xlsx' out=have dbms=xlsx replace;
run;
proc sort data=have;
by id1 date;
run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h1();
h1.definekey('id2');
h1.definedone();
end;
n=h1.num_items;
do until(last.date);
set have;
by id1 date;
rc=h1.remove();
end;
m=h1.num_items;
want=divide(m,n);
if id1 ne lag(id1) then call missing(want);
output;
h1.clear();
do until(last.date);
set have;
by id1 date;
h1.ref();
end;
keep id1 id2 date want m n;
run;
/******For measure2 ***********/
proc import datafile='/folders/myfolders/sascommunity.xlsx' out=have dbms=xlsx replace;
run;
proc sort data=have;
by id1 date;
run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h1();
h1.definekey('id2');
h1.definedone();
end;
m=0;
n=h1.num_items;
do until(last.date);
set have;
by id1 date;
if h1.check() ne 0 then m+1;
end;
want=divide(m,n);
if id1 ne lag(id1) then call missing(want);
output;
h1.clear();
do until(last.date);
set have;
by id1 date;
h1.ref();
end;
keep id1 id2 date want m n;
run;
The following code is the fixed my original code. Sorry .
/*******For measure1*********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;
call missing(of x{*}); /* <--*/
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
/******For measure2***********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 not in x then m+1; /*********/
end;
want=divide(m,k); /*********/
if id1 ne lag(id1) then call missing(want);
output;
call missing(of x{*}); /*<---*/
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
This has been very helpful. Thank you very much for your kind and quick replies.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.