Solved: Re: Comparing ‘sets’ in two data sets using SAS

haydn89 · Posted 09-01-2017 08:17 PM

I want to compare sets (that is the mathematical notion of a set and not the SAS statement) in two different data sets to determine which identifiers are not unique.

For example, consider the two data sets below: data ‘common’ and data ‘address’. Data common groups identifiers that have some arbitrary variables in common (for example, they might have exactly the same income with a very high level of precision). Data ‘common’ looks like the table below, and you would interpret that id ‘A’ and id ‘B’ share the same variable (e.g. income) in common.

Data ‘common’

Obs	id1	id2	id3
1	A	B
2	C	D
3	E	F	G

In the second data set (data ‘address’) I have address information. I can use this data set to determine which identifiers share the same address. For example, in the table below, id ‘B’ and id ‘A’ have the same address ‘Fake 1’.

Data ‘address’

Address	id1	id2	id3
Fake 1	B	A
Fake 2	C	D	E
Fake 3	Y	K	Z

In this specific example, id ‘A’ and ‘B’ are actually the same person. One of the observations needs to be dropped because they are not unique. How would you use SAS to address this problem?

One of the challenges I face includes the order of the set. For example, even though id ‘A’ and id ‘B’ are in the same row of data, they are clearly arranged differently in data ‘common’ and data ‘address’. In addition, id ‘C’ and id ‘D’ (which actually represents the same person) have an additional person (id ‘E’) at the address ‘Fake 2’.

Any help would be greatly appreciated!

I'm using SAS 9.4 enterprise guide.

PGStats · Posted 09-01-2017 11:15 PM

Is this enough for your purpose?

data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:$1.);
datalines;
1	A	B	 
2	C	D	 
3	E	F	G
;

data a;
infile datalines missover;
input Address &$ (id1	id2	id3) (:$1.);
datalines;
Fake 1	B	A	 
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep obs same;
run;

data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep Address same;
run;

proc sql;
create table same as
select same from samec
intersect
select same from samea;
quit;

PG

View solution in original post

PGStats · Posted 09-01-2017 11:15 PM

Is this enough for your purpose?

data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:$1.);
datalines;
1	A	B	 
2	C	D	 
3	E	F	G
;

data a;
infile datalines missover;
input Address &$ (id1	id2	id3) (:$1.);
datalines;
Fake 1	B	A	 
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep obs same;
run;

data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep Address same;
run;

proc sql;
create table same as
select same from samec
intersect
select same from samea;
quit;

PG

haydn89 · Posted 09-02-2017 08:01 PM

Thanks PGStats!

What a great little program. I have never used the ‘cmiss’ function before or the ‘sortc’ call routine. I also love the loop to find all unique pairs of identifiers. I certainly got more than I bargained for!

Can you help me extend this a little more?

The proc sql creates the table ‘same’ which looks like this:

data ‘same’

Obs	same
1	a-b
2	c-d

imagine instead that the output looked like this

data ‘same_more_likely’

obs	same
1	a-b
2	a-h
3	c-d
4	b-h

In this scenario, a=b=h. There are situations where there are more than two ids related to the same person, so this output is more likely. I want to manipulate the output to look like this:

data ‘new_same’

id	group
a	1
b	1
h	1
c	2
d	2

Any thoughts?

PGStats · Posted 09-03-2017 01:06 AM

Yes, my thoughts involve the SAS/OR network solver. A group forms a network where every id is connected (same) with all other ids in the group. That's the definition of a clique in network theory. Finding all cliques in a network is intrinsically difficult, but SAS implemented an algorithm to do it. If you have SAS/OR, you can do:

data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:$1.);
datalines;
1	A	B	H
2	C	D	 
3	E	F	G
4   B   H
;

data a;
infile datalines missover;
input Address &$ (id1	id2	id3) (:$1.);
datalines;
Fake 1	B	A	H 
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep obs same fromId toId;
run;

data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep Address same fromId toId;
run;

proc sql;
create table same as
select fromId, toId from samec
intersect
select fromId, toId from samea;
select * from same;
quit;

proc optnet data_links=same;
data_links_var from=fromId to=toId;
clique out=groups(rename=(clique=group node=id));
run;

PG

PGStats · Posted 09-03-2017 04:24 PM

A simpler solution is almost obvious once you flip your data structure inside-out:

data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:$1.);
datalines;
1	A	B	H
2	C	D	 
3	E	F	G
;

data a;
infile datalines missover;
input Address &$ (id1	id2	id3) (:$1.);
datalines;
Fake 1	B	A	H 
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data revc;
set c;
array a id:;
do i = 1 to dim(a);
    if not missing(a{i}) then do;
        id = a{i};
        output;
        end;
    end;
keep id obs;
run;

data reva;
set a;
array a id:;
do i = 1 to dim(a);
    if not missing(a{i}) then do;
        id = a{i};
        output;
        end;
    end;
keep id Address;
run;

proc sql;
create table want as
select
    revc.id,
    catx("@", obs, Address) as group
from 
    revc inner join reva on revc.id=reva.id
group by group
having count(distinct id) > 1
order by group, id;
select * from want;
quit;

PG

haydn89 · Posted 09-04-2017 02:27 AM

Hi PGStats,

Thanks for the suggestions!

Unfortunately, my workplace does not have SAS/OR. In addition, I can’t follow through with your second ‘simpler’ solution because the data isn’t quite as neatly setup as your example data set ‘c’. In your data set ‘c’, observation 1 looks like this:

data ‘pgstats_c’

obs	id1	id2	id3
1	A	B	H

Even though it is true that A=B=H, the data actually looks like this:

data ‘actual_c’

obs	id1	id2	id3
1	A	B
2	B	H

In this context, the ‘simpler’ solution does not work, because ‘H’ is not identified as being part of the same group as ‘A’ and ‘B’.

I noticed that you published a macro back in 2015:

https://communities.sas.com/t5/General-SAS-Programming/Finding-all-connected-components-in-a-graph/m...

This works! Although I confess that I am using it without fully understanding it.

Could you prescribe an introductory text on ‘network theory’?

Thanks for all your help!

PGStats · Posted 09-04-2017 12:26 PM

actual_c doesn't say that A=H.Given your initial description of your data, A=B=H would require

obs	id1	id2	id3
1	A	B
1	B	H
1	A	H

PG

haydn89 · Posted 09-04-2017 06:55 PM

Hi PGStats,

You've identified a limitation of my motivating example, rather than a mistake about what the data actually represents. I apologies if this has caused confusion.

In the motivating example, I said that each observation represented ids that shared a common variable, such as income. If this was strictly true, then you are right, actual_c doesn’t say that A=H.

If it helps, I’ll be more abstract. The actual_data looks like this:

data 'actual_data'

obs	id1	id2
1	A	B
2	B	H

The interpretation of this data is that A=B and B=H. By the properties of transitivity, A=H. The final output I want is:

data 'final_output'

id	clust
A	1
B	1
H	1

Thanks for all your help!

Classroom Training Available!