Desktop productivity for business analysts and programmers

How to identify unique records in two data sets

Reply
Occasional Contributor
Posts: 19

How to identify unique records in two data sets

[ Edited ]

Would like to do this in SAS EG 6.1 Query builder but if Proc SQL or BASE SAS makes more sense, I'll go in either direction. 

I have two datasets:

 

data one;

input id country $ revenue commission;

datalines;

1 Germany 500 100

2 Germany 500 100

4 Germany 500 100

9 Germany 500 100

11 Mexico 500 100

12 Mexico 500 100

14 Mexico 500 100

17 Mexico 500 100

run;

 

data two;

input id country $ revenue commission;

datalines;

2 Germany 500 100

3 Germany 500 100

4 Germany 500 100

9 Germany 500 100

11 Mexico 500 100

13 Mexico 500 100

15 Mexico 500 100

17 Mexico 500 100

run;

 

Expected Output

1 Germany 500 100

3 Germany 500 100

12 Mexico 500 100

13 Mexico 500 100

14 Mexico 500 100

15 Mexico 500 100

 

Have tried various combinations of inner and left joins with new variables added without success, any help is much appreciated

Respected Advisor
Posts: 4,998

Re: How to identify unique reords in two data sets

Questions to get started ...

 

(1) Are you matching on ID and COUNTRY only, or are you also matching on the last two variables as well?  (What if ID and COUNTRY match, but REVENUE doesn't?  Would that be part of the output?)

 

(2) Why is 14 Mexico not part of the expected output?

Occasional Contributor
Posts: 19

Re: How to identify unique reords in two data sets

I am matching on id, for simplicity just listing two countries would prefer not to match on country as I would need to clean up spelling and abbreviations accoss the tables.

 

Although I do not need in the output I can use other variables that are unique to each table for filtering.

Division(int) for data one, and Risk_Profile $ for data two. 

 

Tried various combinations of is not missing, in a list etc. using the unique variables without success.

Occasional Contributor
Posts: 19

Re: How to identify unique reords in two data sets

sorry 14 Mexico 500 100 should be listed in expected output 

Grand Advisor
Posts: 10,239

Re: How to identify unique reords in two data sets

For clean data sorted by ID country this works pretty good:

proc compare base=one compare=two out=comp outnoequal outbase outcomp  noprint ;
id id country;
run;

But if your country variable has as much issue with correct values as you imply then how do we know not to consider these as "matches"?

 

 

11 Mexico 500 100

11 Mecixo 500 100

Occasional Contributor
Posts: 19

Re: How to identify unique reords in two data sets

[ Edited ]

You raise a good point Ballard, the data is less than perfect, what I can say for sure is the id is accurate in both data sets and will either match or not match, this exercise is ultimately trying to verify the amount of $$$ variation when there is a match and when there is a non match.  When there is a non match, identifying the unique data set of origination.

 

Respected Advisor
Posts: 4,998

Re: How to identify unique reords in two data sets

OK, here's a standard way to find mismatches.  It assumes your data sets are sorted by ID:

 

data mismatch;

merge one (in=in1) two (in=in2);

by id;

if in1=0 or in2=0;

run;

 

Note that this will find mismatches on ID.  However, it will not find disagreements (for example, same ID in both data sets, but COUNTRY is different).  That's a different issue entirely.

Grand Advisor
Posts: 9,593

Re: How to identify unique records in two data sets


data one;
input id country $ revenue commission;
datalines;
1 Germany 500 100
2 Germany 500 100
4 Germany 500 100
9 Germany 500 100
11 Mexico 500 100
12 Mexico 500 100
14 Mexico 500 100
17 Mexico 500 100
run;
 
data two;
input id country $ revenue commission;
datalines;
2 Germany 500 100
3 Germany 500 100
4 Germany 500 100
9 Germany 500 100
11 Mexico 500 100
13 Mexico 500 100
15 Mexico 500 100
17 Mexico 500 100
run;

proc sql;
(
select *
 from one
except
select *
 from two
)
 
union

(
select *
 from two
except
select *
 from one
)
;
quit;
Occasional Contributor
Posts: 19

Re: How to identify unique records in two data sets

Thinking this through in my head, would need to standardize the variables in the tables and labels for variables across the tables to do the union based on wildcard or just select and standardize the variables I need.

Grand Advisor
Posts: 9,593

Re: How to identify unique records in two data sets

You don't need standardize variable or the same variable name .

Only need the the same position and same type variable.

Ask a Question
Discussion stats
  • 9 replies
  • 141 views
  • 1 like
  • 4 in conversation