BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
haydn89
Obsidian | Level 7

 

I want to compare sets (that is the mathematical notion of a set and not the SAS statement) in two different data sets to determine which identifiers are not unique.

 

For example, consider the two data sets below: data ‘common’ and data ‘address’. Data common groups identifiers that have some arbitrary variables in common (for example, they might have exactly the same income with a very high level of precision). Data ‘common’ looks like the table below, and you would interpret that id ‘A’ and id ‘B’ share the same variable (e.g. income) in common.

 

Data ‘common’

Obs

id1

id2

id3

1

A

B

 

2

C

D

 

3

E

F

G

 

In the second data set (data ‘address’) I have address information. I can use this data set to determine which identifiers share the same address. For example, in the table below, id ‘B’ and id ‘A’ have the same address ‘Fake 1’.

 

Data ‘address’

Address

id1

id2

id3

Fake 1

B

A

 

Fake 2

C

D

E

Fake 3

Y

K

Z

 

In this specific example, id ‘A’ and ‘B’ are actually the same person. One of the observations needs to be dropped because they are not unique. How would you use SAS to address this problem?

 

One of the challenges I face includes the order of the set. For example, even though id ‘A’ and id ‘B’ are in the same row of data, they are clearly arranged differently in data ‘common’ and data ‘address’. In addition, id ‘C’ and id ‘D’ (which actually represents the same person) have an additional person (id ‘E’) at the address ‘Fake 2’.

 

Any help would be greatly appreciated!

 

I'm using SAS 9.4 enterprise guide.

 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Is this enough for your purpose?

 

data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:$1.);
datalines;
1	A	B	 
2	C	D	 
3	E	F	G
;

data a;
infile datalines missover;
input Address &$ (id1	id2	id3) (:$1.);
datalines;
Fake 1	B	A	 
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep obs same;
run;

data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep Address same;
run;

proc sql;
create table same as
select same from samec
intersect
select same from samea;
quit;
PG

View solution in original post

7 REPLIES 7
PGStats
Opal | Level 21

Is this enough for your purpose?

 

data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:$1.);
datalines;
1	A	B	 
2	C	D	 
3	E	F	G
;

data a;
infile datalines missover;
input Address &$ (id1	id2	id3) (:$1.);
datalines;
Fake 1	B	A	 
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep obs same;
run;

data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep Address same;
run;

proc sql;
create table same as
select same from samec
intersect
select same from samea;
quit;
PG
haydn89
Obsidian | Level 7

Thanks PGStats!

 

What a great little program. I have never used the ‘cmiss’ function before or the ‘sortc’ call routine. I also love the loop to find all unique pairs of identifiers. I certainly got more than I bargained for!

 

Can you help me extend this a little more?

 

The proc sql creates the table ‘same’ which looks like this:

 

data ‘same’

Obs

same

1

a-b

2

c-d

 

imagine instead that the output looked like this

 

data ‘same_more_likely’

obs

same

1

a-b

2

a-h

3

c-d

4

b-h

 

In this scenario, a=b=h. There are situations where there are more than two ids related to the same person, so this output is more likely. I want to manipulate the output to look like this:

 

data ‘new_same’

id

group

a

1

b

1

h

1

c

2

d

2

 

Any thoughts?

PGStats
Opal | Level 21

Yes, my thoughts involve the SAS/OR network solver. A group forms a network where every id is connected (same) with all other ids in the group. That's the definition of a clique in network theory. Finding all cliques in a network is intrinsically difficult, but SAS implemented an algorithm to do it. If you have SAS/OR, you can do:

 

data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:$1.);
datalines;
1	A	B	H
2	C	D	 
3	E	F	G
4   B   H
;

data a;
infile datalines missover;
input Address &$ (id1	id2	id3) (:$1.);
datalines;
Fake 1	B	A	H 
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data samec;
set c;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep obs same fromId toId;
run;

data samea;
set a;
array id id:;
length same $8;
do i = 1 to dim(id) - 1;
    do j = i + 1 to dim(id);
        fromId = id{i};
        toId = id{j}; 
        if cmiss(fromId, toId) = 0 then do;
            call sortc(fromId, toId);
            same = catx("-", fromId, toId);
            output;
            end;
        end;
    end;
keep Address same fromId toId;
run;

proc sql;
create table same as
select fromId, toId from samec
intersect
select fromId, toId from samea;
select * from same;
quit;

proc optnet data_links=same;
data_links_var from=fromId to=toId;
clique out=groups(rename=(clique=group node=id));
run;
PG
PGStats
Opal | Level 21

A simpler solution is almost obvious once you flip your data structure inside-out:

 

data c;
infile datalines missover;
input Obs	(id1	id2	id3) (:$1.);
datalines;
1	A	B	H
2	C	D	 
3	E	F	G
;

data a;
infile datalines missover;
input Address &$ (id1	id2	id3) (:$1.);
datalines;
Fake 1	B	A	H 
Fake 2	C	D	E
Fake 3	Y	K	Z
;

data revc;
set c;
array a id:;
do i = 1 to dim(a);
    if not missing(a{i}) then do;
        id = a{i};
        output;
        end;
    end;
keep id obs;
run;

data reva;
set a;
array a id:;
do i = 1 to dim(a);
    if not missing(a{i}) then do;
        id = a{i};
        output;
        end;
    end;
keep id Address;
run;

proc sql;
create table want as
select
    revc.id,
    catx("@", obs, Address) as group
from 
    revc inner join reva on revc.id=reva.id
group by group
having count(distinct id) > 1
order by group, id;
select * from want;
quit;
PG
haydn89
Obsidian | Level 7

Hi PGStats,

 

Thanks for the suggestions!

 

Unfortunately, my workplace does not have SAS/OR. In addition, I can’t follow through with your second ‘simpler’ solution because the data isn’t quite as neatly setup as your example data set ‘c’. In your data set ‘c’, observation 1 looks like this:

 

data ‘pgstats_c’

obs

id1

id2

id3

1

A

B

H

 

Even though it is true that A=B=H, the data actually looks like this:

 

data ‘actual_c’

obs

id1

id2

id3

1

A

B

 

2

B

H

 

 

In this context, the ‘simpler’ solution does not work, because ‘H’ is not identified as being part of the same group as ‘A’ and ‘B’.

 

I noticed that you published a macro back in 2015:

 

https://communities.sas.com/t5/General-SAS-Programming/Finding-all-connected-components-in-a-graph/m...

 

This works! Although I confess that I am using it without fully understanding it.

 

Could you prescribe an introductory text on ‘network theory’?

 

Thanks for all your help!

PGStats
Opal | Level 21

actual_c doesn't say that A=H.Given your initial description of your data, A=B=H would require

 

obs

id1

id2

id3

1

A

B

 

1

B

H

1

A

H

PG
haydn89
Obsidian | Level 7

Hi PGStats,

 

You've identified a limitation of my motivating example, rather than a mistake about what the data actually represents. I apologies if this has caused confusion.

 

In the motivating example, I said that each observation represented ids that shared a common variable, such as income. If this was strictly true, then you are right, actual_c doesn’t say that A=H.

 

If it helps, I’ll be more abstract. The actual_data looks like this:

 

data 'actual_data'

obs

id1

id2

1

A

B

2

B

H

 

The interpretation of this data is that A=B and B=H. By the properties of transitivity, A=H. The final output I want is:

 

data 'final_output'

id

clust

A

1

B

1

H

1

 

Thanks for all your help!

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1362 views
  • 4 likes
  • 2 in conversation