Comparing ‘sets’ in two data sets using SAS

Solved
Occasional Contributor
Posts: 15

Comparing ‘sets’ in two data sets using SAS

I want to compare sets (that is the mathematical notion of a set and not the SAS statement) in two different data sets to determine which identifiers are not unique.

For example, consider the two data sets below: data ‘common’ and data ‘address’. Data common groups identifiers that have some arbitrary variables in common (for example, they might have exactly the same income with a very high level of precision). Data ‘common’ looks like the table below, and you would interpret that id ‘A’ and id ‘B’ share the same variable (e.g. income) in common.

Data ‘common’

 Obs id1 id2 id3 1 A B 2 C D 3 E F G

In the second data set (data ‘address’) I have address information. I can use this data set to determine which identifiers share the same address. For example, in the table below, id ‘B’ and id ‘A’ have the same address ‘Fake 1’.

 Address id1 id2 id3 Fake 1 B A Fake 2 C D E Fake 3 Y K Z

In this specific example, id ‘A’ and ‘B’ are actually the same person. One of the observations needs to be dropped because they are not unique. How would you use SAS to address this problem?

One of the challenges I face includes the order of the set. For example, even though id ‘A’ and id ‘B’ are in the same row of data, they are clearly arranged differently in data ‘common’ and data ‘address’. In addition, id ‘C’ and id ‘D’ (which actually represents the same person) have an additional person (id ‘E’) at the address ‘Fake 2’.

Any help would be greatly appreciated!

I'm using SAS 9.4 enterprise guide.

Accepted Solutions
Solution
‎09-07-2017 06:49 PM
Posts: 5,528

Re: Comparing ‘sets’ in two data sets using SAS

Is this enough for your purpose?

``````data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:\$1.);
datalines;
1	A	B
2	C	D
3	E	F	G
;

data a;
infile datalines missover;
input Address &\$ (id1	id2	id3) (:\$1.);
datalines;
Fake 1	B	A
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same \$8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep obs same;
run;

data samea;
set a;
array id id:;
length same \$8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
run;

proc sql;
create table same as
select same from samec
intersect
select same from samea;
quit;
``````
PG

All Replies
Solution
‎09-07-2017 06:49 PM
Posts: 5,528

Re: Comparing ‘sets’ in two data sets using SAS

Is this enough for your purpose?

``````data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:\$1.);
datalines;
1	A	B
2	C	D
3	E	F	G
;

data a;
infile datalines missover;
input Address &\$ (id1	id2	id3) (:\$1.);
datalines;
Fake 1	B	A
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same \$8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep obs same;
run;

data samea;
set a;
array id id:;
length same \$8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
run;

proc sql;
create table same as
select same from samec
intersect
select same from samea;
quit;
``````
PG
Occasional Contributor
Posts: 15

Re: Comparing ‘sets’ in two data sets using SAS

Thanks PGStats!

What a great little program. I have never used the ‘cmiss’ function before or the ‘sortc’ call routine. I also love the loop to find all unique pairs of identifiers. I certainly got more than I bargained for!

Can you help me extend this a little more?

The proc sql creates the table ‘same’ which looks like this:

data ‘same’

 Obs same 1 a-b 2 c-d

imagine instead that the output looked like this

data ‘same_more_likely’

 obs same 1 a-b 2 a-h 3 c-d 4 b-h

In this scenario, a=b=h. There are situations where there are more than two ids related to the same person, so this output is more likely. I want to manipulate the output to look like this:

data ‘new_same’

 id group a 1 b 1 h 1 c 2 d 2

Any thoughts?

Posts: 5,528

Re: Comparing ‘sets’ in two data sets using SAS

Yes, my thoughts involve the SAS/OR network solver. A group forms a network where every id is connected (same) with all other ids in the group. That's the definition of a clique in network theory. Finding all cliques in a network is intrinsically difficult, but SAS implemented an algorithm to do it. If you have SAS/OR, you can do:

``````data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:\$1.);
datalines;
1	A	B	H
2	C	D
3	E	F	G
4   B   H
;

data a;
infile datalines missover;
input Address &\$ (id1	id2	id3) (:\$1.);
datalines;
Fake 1	B	A	H
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same \$8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
keep obs same fromId toId;
run;

data samea;
set a;
array id id:;
length same \$8;
do i = 1 to dim(id) - 1;
do j = i + 1 to dim(id);
fromId = id{i};
toId = id{j};
if cmiss(fromId, toId) = 0 then do;
call sortc(fromId, toId);
same = catx("-", fromId, toId);
output;
end;
end;
end;
run;

proc sql;
create table same as
select fromId, toId from samec
intersect
select fromId, toId from samea;
select * from same;
quit;

clique out=groups(rename=(clique=group node=id));
run;``````
PG
Posts: 5,528

Re: Comparing ‘sets’ in two data sets using SAS

A simpler solution is almost obvious once you flip your data structure inside-out:

``````data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:\$1.);
datalines;
1	A	B	H
2	C	D
3	E	F	G
;

data a;
infile datalines missover;
input Address &\$ (id1	id2	id3) (:\$1.);
datalines;
Fake 1	B	A	H
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data revc;
set c;
array a id:;
do i = 1 to dim(a);
if not missing(a{i}) then do;
id = a{i};
output;
end;
end;
keep id obs;
run;

data reva;
set a;
array a id:;
do i = 1 to dim(a);
if not missing(a{i}) then do;
id = a{i};
output;
end;
end;
run;

proc sql;
create table want as
select
revc.id,
from
revc inner join reva on revc.id=reva.id
group by group
having count(distinct id) > 1
order by group, id;
select * from want;
quit;``````
PG
Occasional Contributor
Posts: 15

Re: Comparing ‘sets’ in two data sets using SAS

Hi PGStats,

Thanks for the suggestions!

Unfortunately, my workplace does not have SAS/OR. In addition, I can’t follow through with your second ‘simpler’ solution because the data isn’t quite as neatly setup as your example data set ‘c’. In your data set ‘c’, observation 1 looks like this:

data ‘pgstats_c’

 obs id1 id2 id3 1 A B H

Even though it is true that A=B=H, the data actually looks like this:

data ‘actual_c’

 obs id1 id2 id3 1 A B 2 B H

In this context, the ‘simpler’ solution does not work, because ‘H’ is not identified as being part of the same group as ‘A’ and ‘B’.

I noticed that you published a macro back in 2015:

https://communities.sas.com/t5/General-SAS-Programming/Finding-all-connected-components-in-a-graph/m...

This works! Although I confess that I am using it without fully understanding it.

Could you prescribe an introductory text on ‘network theory’?

Posts: 5,528

Re: Comparing ‘sets’ in two data sets using SAS

actual_c doesn't say that A=H.Given your initial description of your data, A=B=H would require

 obs id1 id2 id3 1 A B 1 B H 1 A H
PG
Occasional Contributor
Posts: 15

Re: Comparing ‘sets’ in two data sets using SAS

Hi PGStats,

You've identified a limitation of my motivating example, rather than a mistake about what the data actually represents. I apologies if this has caused confusion.

In the motivating example, I said that each observation represented ids that shared a common variable, such as income. If this was strictly true, then you are right, actual_c doesn’t say that A=H.

If it helps, I’ll be more abstract. The actual_data looks like this:

data 'actual_data'

 obs id1 id2 1 A B 2 B H

The interpretation of this data is that A=B and B=H. By the properties of transitivity, A=H. The final output I want is:

data 'final_output'

 id clust A 1 B 1 H 1