I want to compare sets (that is the mathematical notion of a set and not the SAS statement) in two different data sets to determine which identifiers are not unique.
For example, consider the two data sets below: data ‘common’ and data ‘address’. Data common groups identifiers that have some arbitrary variables in common (for example, they might have exactly the same income with a very high level of precision). Data ‘common’ looks like the table below, and you would interpret that id ‘A’ and id ‘B’ share the same variable (e.g. income) in common.
Data ‘common’
Obs |
id1 |
id2 |
id3 |
1 |
A |
B |
|
2 |
C |
D |
|
3 |
E |
F |
G |
In the second data set (data ‘address’) I have address information. I can use this data set to determine which identifiers share the same address. For example, in the table below, id ‘B’ and id ‘A’ have the same address ‘Fake 1’.
Data ‘address’
Address |
id1 |
id2 |
id3 |
Fake 1 |
B |
A |
|
Fake 2 |
C |
D |
E |
Fake 3 |
Y |
K |
Z |
In this specific example, id ‘A’ and ‘B’ are actually the same person. One of the observations needs to be dropped because they are not unique. How would you use SAS to address this problem?
One of the challenges I face includes the order of the set. For example, even though id ‘A’ and id ‘B’ are in the same row of data, they are clearly arranged differently in data ‘common’ and data ‘address’. In addition, id ‘C’ and id ‘D’ (which actually represents the same person) have an additional person (id ‘E’) at the address ‘Fake 2’.
Any help would be greatly appreciated!
I'm using SAS 9.4 enterprise guide.
Is this enough for your purpose?
data c;
infile datalines missover;
input Obs (id1 id2 id3) (:$1.);
datalines;
1 A B
2 C D
3 E F G
;
data a;
infile datalines missover;
input Address &$ (id1 id2 id3) (:$1.);
datalines;
Fake 1 B A
Fake 2 C D E
Fake 3 Y K Z
;
data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep obs same;
run;
data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep Address same;
run;
proc sql;
create table same as
select same from samec
intersect
select same from samea;
quit;
Is this enough for your purpose?
data c;
infile datalines missover;
input Obs (id1 id2 id3) (:$1.);
datalines;
1 A B
2 C D
3 E F G
;
data a;
infile datalines missover;
input Address &$ (id1 id2 id3) (:$1.);
datalines;
Fake 1 B A
Fake 2 C D E
Fake 3 Y K Z
;
data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep obs same;
run;
data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep Address same;
run;
proc sql;
create table same as
select same from samec
intersect
select same from samea;
quit;
Thanks PGStats!
What a great little program. I have never used the ‘cmiss’ function before or the ‘sortc’ call routine. I also love the loop to find all unique pairs of identifiers. I certainly got more than I bargained for!
Can you help me extend this a little more?
The proc sql creates the table ‘same’ which looks like this:
data ‘same’
Obs |
same |
1 |
a-b |
2 |
c-d |
imagine instead that the output looked like this
data ‘same_more_likely’
obs |
same |
1 |
a-b |
2 |
a-h |
3 |
c-d |
4 |
b-h |
In this scenario, a=b=h. There are situations where there are more than two ids related to the same person, so this output is more likely. I want to manipulate the output to look like this:
data ‘new_same’
id |
group |
a |
1 |
b |
1 |
h |
1 |
c |
2 |
d |
2 |
Any thoughts?
Yes, my thoughts involve the SAS/OR network solver. A group forms a network where every id is connected (same) with all other ids in the group. That's the definition of a clique in network theory. Finding all cliques in a network is intrinsically difficult, but SAS implemented an algorithm to do it. If you have SAS/OR, you can do:
data c;
infile datalines missover;
input Obs (id1 id2 id3) (:$1.);
datalines;
1 A B H
2 C D
3 E F G
4 B H
;
data a;
infile datalines missover;
input Address &$ (id1 id2 id3) (:$1.);
datalines;
Fake 1 B A H
Fake 2 C D E
Fake 3 Y K Z
;
data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep obs same fromId toId;
run;
data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep Address same fromId toId;
run;
proc sql;
create table same as
select fromId, toId from samec
intersect
select fromId, toId from samea;
select * from same;
quit;
proc optnet data_links=same;
data_links_var from=fromId to=toId;
clique out=groups(rename=(clique=group node=id));
run;
A simpler solution is almost obvious once you flip your data structure inside-out:
data c;
infile datalines missover;
input Obs (id1 id2 id3) (:$1.);
datalines;
1 A B H
2 C D
3 E F G
;
data a;
infile datalines missover;
input Address &$ (id1 id2 id3) (:$1.);
datalines;
Fake 1 B A H
Fake 2 C D E
Fake 3 Y K Z
;
data revc;
set c;
array a id:;
do i = 1 to dim(a);
if not missing(a{i}) then do;
id = a{i};
output;
end;
end;
keep id obs;
run;
data reva;
set a;
array a id:;
do i = 1 to dim(a);
if not missing(a{i}) then do;
id = a{i};
output;
end;
end;
keep id Address;
run;
proc sql;
create table want as
select
revc.id,
catx("@", obs, Address) as group
from
revc inner join reva on revc.id=reva.id
group by group
having count(distinct id) > 1
order by group, id;
select * from want;
quit;
Hi PGStats,
Thanks for the suggestions!
Unfortunately, my workplace does not have SAS/OR. In addition, I can’t follow through with your second ‘simpler’ solution because the data isn’t quite as neatly setup as your example data set ‘c’. In your data set ‘c’, observation 1 looks like this:
data ‘pgstats_c’
obs |
id1 |
id2 |
id3 |
1 |
A |
B |
H |
Even though it is true that A=B=H, the data actually looks like this:
data ‘actual_c’
obs |
id1 |
id2 |
id3 |
1 |
A |
B |
|
2 |
B |
H |
|
In this context, the ‘simpler’ solution does not work, because ‘H’ is not identified as being part of the same group as ‘A’ and ‘B’.
I noticed that you published a macro back in 2015:
This works! Although I confess that I am using it without fully understanding it.
Could you prescribe an introductory text on ‘network theory’?
Thanks for all your help!
actual_c doesn't say that A=H.Given your initial description of your data, A=B=H would require
obs |
id1 |
id2 |
id3 |
1 |
A |
B |
|
1 |
B |
H |
|
1 |
A |
H |
Hi PGStats,
You've identified a limitation of my motivating example, rather than a mistake about what the data actually represents. I apologies if this has caused confusion.
In the motivating example, I said that each observation represented ids that shared a common variable, such as income. If this was strictly true, then you are right, actual_c doesn’t say that A=H.
If it helps, I’ll be more abstract. The actual_data looks like this:
data 'actual_data'
obs |
id1 |
id2 |
1 |
A |
B |
2 |
B |
H |
The interpretation of this data is that A=B and B=H. By the properties of transitivity, A=H. The final output I want is:
data 'final_output'
id |
clust |
A |
1 |
B |
1 |
H |
1 |
Thanks for all your help!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.