Re: How to identify unique records in two data sets

RickyS · Posted 05-18-2017 03:24 PM

Would like to do this in SAS EG 6.1 Query builder but if Proc SQL or BASE SAS makes more sense, I'll go in either direction.

I have two datasets:

data one;

input id country $ revenue commission;

datalines;

1 Germany 500 100

2 Germany 500 100

4 Germany 500 100

9 Germany 500 100

11 Mexico 500 100

12 Mexico 500 100

14 Mexico 500 100

17 Mexico 500 100

run;

data two;

input id country $ revenue commission;

datalines;

2 Germany 500 100

3 Germany 500 100

4 Germany 500 100

9 Germany 500 100

11 Mexico 500 100

13 Mexico 500 100

15 Mexico 500 100

17 Mexico 500 100

run;

Expected Output

1 Germany 500 100

3 Germany 500 100

12 Mexico 500 100

13 Mexico 500 100

14 Mexico 500 100

15 Mexico 500 100

Have tried various combinations of inner and left joins with new variables added without success, any help is much appreciated

Astounding · Posted 05-18-2017 03:39 PM

Questions to get started ...

(1) Are you matching on ID and COUNTRY only, or are you also matching on the last two variables as well? (What if ID and COUNTRY match, but REVENUE doesn't? Would that be part of the output?)

(2) Why is 14 Mexico not part of the expected output?

RickyS · Posted 05-18-2017 03:46 PM

I am matching on id, for simplicity just listing two countries would prefer not to match on country as I would need to clean up spelling and abbreviations accoss the tables.

Although I do not need in the output I can use other variables that are unique to each table for filtering.

Division(int) for data one, and Risk_Profile $ for data two.

Tried various combinations of is not missing, in a list etc. using the unique variables without success.

RickyS · Posted 05-18-2017 03:46 PM

sorry 14 Mexico 500 100 should be listed in expected output

ballardw · Posted 05-18-2017 07:17 PM

For clean data sorted by ID country this works pretty good:

proc compare base=one compare=two out=comp outnoequal outbase outcomp  noprint ;
id id country;
run;

But if your country variable has as much issue with correct values as you imply then how do we know not to consider these as "matches"?

11 Mexico 500 100

11 Mecixo 500 100

RickyS · Posted 05-19-2017 09:55 AM

You raise a good point Ballard, the data is less than perfect, what I can say for sure is the id is accurate in both data sets and will either match or not match, this exercise is ultimately trying to verify the amount of $$$ variation when there is a match and when there is a non match. When there is a non match, identifying the unique data set of origination.

Astounding · Posted 05-18-2017 03:50 PM

OK, here's a standard way to find mismatches. It assumes your data sets are sorted by ID:

data mismatch;

merge one (in=in1) two (in=in2);

by id;

if in1=0 or in2=0;

run;

Note that this will find mismatches on ID. However, it will not find disagreements (for example, same ID in both data sets, but COUNTRY is different). That's a different issue entirely.

Abbie · Posted 02-08-2019 10:31 AM

This worked like a charm for my problem. Thank you.

Ksharp · Posted 05-19-2017 09:59 AM


data one;
input id country $ revenue commission;
datalines;
1 Germany 500 100
2 Germany 500 100
4 Germany 500 100
9 Germany 500 100
11 Mexico 500 100
12 Mexico 500 100
14 Mexico 500 100
17 Mexico 500 100
run;
 
data two;
input id country $ revenue commission;
datalines;
2 Germany 500 100
3 Germany 500 100
4 Germany 500 100
9 Germany 500 100
11 Mexico 500 100
13 Mexico 500 100
15 Mexico 500 100
17 Mexico 500 100
run;

proc sql;
(
select *
 from one
except
select *
 from two
)
 
union

(
select *
 from two
except
select *
 from one
)
;
quit;

RickyS · Posted 05-19-2017 10:30 AM

Thinking this through in my head, would need to standardize the variables in the tables and labels for variables across the tables to do the union based on wildcard or just select and standardize the variables I need.

Ksharp · Posted 05-19-2017 10:35 AM

You don't need standardize variable or the same variable name .

Only need the the same position and same type variable.

Patrick · Posted 02-09-2019 09:06 PM

@RickyS

If you use UNION CORR then columns in your source tables don't need to be in the same order. All you need is to ensure that same named variable in different source tables are of the same data type (Char or Num).

You can also use a SELECT *. The result will then contain all the variables which exist in all source tables.

And another nice feature of a SAS SQL UNION:

If lengths for character variables in source tables differ then the resulting column will have the max length from all source tables (=no truncation or warning/error messages).

How to identify unique records in two data sets

Re: How to identify unique reords in two data sets

Re: How to identify unique reords in two data sets

Re: How to identify unique reords in two data sets

Re: How to identify unique reords in two data sets

Re: How to identify unique reords in two data sets

Re: How to identify unique reords in two data sets

Re: How to identify unique reords in two data sets