I have the following dataset.
id1 | id2 | date | value |
1 | A | 2014 | 0.5 |
1 | A | 2014 | 0.7 |
1 | B | 2014 | 0.8 |
1 | C | 2014 | 0.1 |
1 | C | 2014 | 1.1 |
1 | D | 2014 | 1.2 |
1 | A | 2015 | 1.3 |
1 | A | 2015 | 2.4 |
1 | C | 2015 | 5.6 |
1 | D | 2015 | 0.4 |
1 | E | 2015 | 0.7 |
1 | F | 2015 | 0.3 |
1 | C | 2016 | 1.3 |
1 | D | 2016 | 2.5 |
1 | D | 2016 | 2 |
1 | F | 2016 | 4.1 |
1 | G | 2016 | 3.1 |
1 | H | 2016 | 2.1 |
2 | B | 2014 | 5.4 |
I want to compare the distinct id2s by each id1 and year group. For instance, I want to compare id2s where id1=1 and year=2014 with id2s where id1=1 and year=2015. So, in the first case(id1=1 and year=2014), the distinct values of id2 are A, B, C, D and for the second case (id1=1 and year=2015) distinct values are A, C, D, E. I can co this by using first.id1, first.year etc. After this step the dataset will look like this:
id1 | id2 | date | value |
1 | A | 2014 | 0.5 |
1 | B | 2014 | 0.8 |
1 | C | 2014 | 0.1 |
1 | D | 2014 | 1.2 |
1 | A | 2015 | 1.3 |
1 | C | 2015 | 5.6 |
1 | D | 2015 | 0.4 |
1 | E | 2015 | 0.7 |
1 | F | 2015 | 0.3 |
1 | C | 2016 | 1.3 |
1 | D | 2016 | 2.5 |
1 | F | 2016 | 4.1 |
1 | G | 2016 | 3.1 |
1 | H | 2016 | 2.1 |
Now, I want to compare these three groups and get the id2 that exist in first case but not exist in the second case (i.e. B exists in first id1xyear group but not in second id1xyear group). In the end, the measure I want will divide the total number of distinct id2s in first id1xyear with total number of nonexisting id2s in the second id1xyear (1/4 in the first comparison from 2014 to 2015).
It will repeat the same procedure with id1=2 and so on.
id1 | date | measure |
1 | 2014 | . |
1 | 2015 | 1/4 |
1 | 2016 | 2/5 |
2 | 2014 |
I hope I was able to convey what I need. I would appreciate any help. Thanks.
The following code is the fixed my original code. Sorry .
/*******For measure1*********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;
call missing(of x{*}); /* <--*/
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
/******For measure2***********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 not in x then m+1; /*********/
end;
want=divide(m,k); /*********/
if id1 ne lag(id1) then call missing(want);
output;
call missing(of x{*}); /*<---*/
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
I think it should be
id1 | date | measure |
1 | 2014 | . |
1 | 2015 | 1/5 |
Use your second dataset.
data have;
infile cards expandtabs ;
input id1 id2 $ date value;
cards;
1 A 2014 0.5
1 B 2014 0.8
1 C 2014 0.1
1 D 2014 1.2
1 A 2015 1.3
1 C 2015 5.6
1 D 2015 0.4
1 E 2015 0.7
1 F 2015 0.3
1 C 2016 1.3
1 D 2016 2.5
1 F 2016 4.1
1 G 2016 3.1
1 H 2016 2.1
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do n=1 by 1 until(last.date);
set have;
by id1 date;
if id2 in x then m+1;
end;
want=divide(k-m,n);
if id1 ne lag(id1) then call missing(want);
output;
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
Thank you very much. I appreciate your help but I think it is not giving what I am looking for. From 2014 to 2015 A,B,C,D becomes A,C,D,E,F. From 2015 to 2016 A,C,D,E,F becomes C,D,F,G,H.
So what I need is from 2014 to 2015: identify that B is missing in 2015 and divide this (1 missing) with all four (A,B,C,D) in 2014=1/4
From 2015 to 2016: identify that A and E is missing (2 missing). Divide 2 by what I have for 2015 (A,C,D,E,F) by 5= 2/5.
It should repeat this by id1. When next id1 comes the code should compare next first id1xdate group with the next date of same id1. I hope this is more clear now. Again, thank you very much.
OK. I get confused with denominator.
data have;
infile cards expandtabs ;
input id1 id2 $ date value;
cards;
1 A 2014 0.5
1 B 2014 0.8
1 C 2014 0.1
1 D 2014 1.2
1 A 2015 1.3
1 C 2015 5.6
1 D 2015 0.4
1 E 2015 0.7
1 F 2015 0.3
1 C 2016 1.3
1 D 2016 2.5
1 F 2016 4.1
1 G 2016 3.1
1 H 2016 2.1
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
Thanks for your patience. I understand if you don't have time for this. You have already been very helpful. I appreciate that. I realized that I need to decompose the measure into two parts. Let me give you a new example of each measure. There are only subtle changes.
Measure1: Suppose we have A,B,C,D in id1xdate1 and A,B,C,E in id1xdate2 and C,D,E,F in id1xdate3. Measure1 finds that D from id1xdate1 is missing in id1xdate2 and divides D (1) by A,B,C,D (4) of id1xdate1 and returns 1/4. Similarly, from id1xdate2 to id1xdate3, this measure finds that A,B are missing in id1xdate2 and divides (2) by A,B,C,E (4) and returns 2/4. This measure makes the second comparison between id2s of id1xdate2 and id1xdate3. The change from id1xdate2 to id1xdate3 matters now.
Measure 2: Again, suppose we have A,B,C,D in id1xdate1 and A,B,C,E in id1xdate2 and C,D,E,F in id1xdate3. Measure2 finds that E is not present in id1xdate1 but is present in id1xdate2. This time measure2 divides E (1) by A,B,C,D (4) in id1xdate1 and return 1/4. From id1xdate2 to id1xdate3, the code finds that D,F are not present in id1xdate2 but are present in id1xdate3. Measure divides D,F (2) by A,B,C,E (4) and return 2/4. Similarly, this measure makes the second comparison between id2s of id1xdate2 and id1xdate3. The change from id1xdate2 to id1xdate3 matters now.
Dataset for this example is below. I tried to modify your code to separate the two measures but I couldn't get the code to make the comparison between only two id1xdates. Not from the beginning.
id1 | id2 | date | value |
1 | A | 20140530 | 0.5 |
1 | B | 20140530 | 0.8 |
1 | C | 20140530 | 0.1 |
1 | D | 20140530 | 1.2 |
1 | A | 20140831 | 1.3 |
1 | B | 20140831 | 5.6 |
1 | C | 20140831 | 0.4 |
1 | E | 20140831 | 0.7 |
1 | C | 20141130 | 0.3 |
1 | D | 20141130 | 1.3 |
1 | E | 20141130 | 2.5 |
1 | F | 20141130 | 4.1 |
2 | A | 19990130 | 3 |
OK. the following code should work for your second measure.
the code I have already posted is for your first measure.
Note: use your second table in your first post .
data have;
infile cards expandtabs ;
input id1 id2 $ date value;
cards;
1 A 20140530 0.5
1 B 20140530 0.8
1 C 20140530 0.1
1 D 20140530 1.2
1 A 20140831 1.3
1 B 20140831 5.6
1 C 20140831 0.4
1 E 20140831 0.7
1 C 20141130 0.3
1 D 20141130 1.3
1 E 20141130 2.5
1 F 20141130 4.1
2 A 19990130 3
;
run;
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 not in x then m+1; /*********/
end;
want=divide(m,k); /*********/
if id1 ne lag(id1) then call missing(want);
output;
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
In the first sheet of attached excel file I included a portion of the data. The first measure mostly works fine but sometimes returns negative values where it should return 0. The second measure does not work properly and I wasn't able to detect the problem. In the second sheet of the attached excel file I included the dateset transposed to show you how it should work. I included what the measure should return and what it returns when you use the dataset in the first sheet. The discrepancies are in red. Thank you very much.
I feel very Sorry. My bad. Add CALL MISSING(of x{*});
call missing(of x{*}) ;
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
Hi. I rewrite it by Hash Table. Hope you like it.
/*****For measure1 *******/
proc import datafile='/folders/myfolders/sascommunity.xlsx' out=have dbms=xlsx replace;
run;
proc sort data=have;
by id1 date;
run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h1();
h1.definekey('id2');
h1.definedone();
end;
n=h1.num_items;
do until(last.date);
set have;
by id1 date;
rc=h1.remove();
end;
m=h1.num_items;
want=divide(m,n);
if id1 ne lag(id1) then call missing(want);
output;
h1.clear();
do until(last.date);
set have;
by id1 date;
h1.ref();
end;
keep id1 id2 date want m n;
run;
/******For measure2 ***********/
proc import datafile='/folders/myfolders/sascommunity.xlsx' out=have dbms=xlsx replace;
run;
proc sort data=have;
by id1 date;
run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h1();
h1.definekey('id2');
h1.definedone();
end;
m=0;
n=h1.num_items;
do until(last.date);
set have;
by id1 date;
if h1.check() ne 0 then m+1;
end;
want=divide(m,n);
if id1 ne lag(id1) then call missing(want);
output;
h1.clear();
do until(last.date);
set have;
by id1 date;
h1.ref();
end;
keep id1 id2 date want m n;
run;
The following code is the fixed my original code. Sorry .
/*******For measure1*********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 in x then m+1;
end;
want=divide(k-m,k);
if id1 ne lag(id1) then call missing(want);
output;
call missing(of x{*}); /* <--*/
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
run;
/******For measure2***********/
data want;
array x{9999} $ 32 _temporary_;
m=0;
do until(last.date);
set have;
by id1 date;
if id2 not in x then m+1; /*********/
end;
want=divide(m,k); /*********/
if id1 ne lag(id1) then call missing(want);
output;
call missing(of x{*}); /*<---*/
k=0;
do until(last.date);
set have;
by id1 date;
k+1;x{k}=id2;
end;
This has been very helpful. Thank you very much for your kind and quick replies.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.