compare two datasets unique values

Q1983 · Posted 12-11-2018 03:31 PM

data a;
input my1:$2. my2:$2. clm drg tin;
datalines;
aa bb 1 2 3
bb cc 2 3 4
dd ee 3 4 5
ff gg 4 5 6
hh ii 6 7 8
;
run;

data b;
input cust1:$2. cust2:$2. clm drg tin;
datalines;
aa bb 1 2 3
bb cc 2 3 4
zz xx 3 4 5
yy ww 4 5 6
hh ii 2 7 8
;
run;

I am attempting to create a dataset of the rows where the contents of the two tables DO NOT match.

In this case the first two rows are the same in both tables, so I want to eliminate them. I then want to create a dataset of the remaining rows, i.e. the ones that are unique to one table or the other.

Can this be done in PROC SQL or a DATA step?

Reeza · Posted 12-11-2018 03:38 PM

Can you please show what you'd like as output? Is it OK to rename the first data set? Is this a one-time thing, or something that needs to scale to other data sets or be run multiple times?

Astounding · Posted 12-11-2018 03:45 PM

It's not clear why the variables have different names in the two data sets, but let's take that as a given. Then:

proc sort data=a;
by my1 my2 clm drg tin;
run;

proc sort data=b;
by cust1 cust2 clm drg tin;
run;

data unique;
set a (in=ina rename=(my1=cust1 my2=cust2))
b;
by cust1 cust2 clm drg tin;
/* keep only rows whose full key occurs exactly once across both data sets */
if first.tin and last.tin;
if ina then source='Dataset A';
else source='Dataset B';
run;

SuryaKiran · Posted 12-11-2018 03:49 PM

Hi,

proc sql;
select * from 
(select * from a
	except 
 select * from b)
union 
select * from 
(select * from b
	except 
 select * from a)
;
quit;

Also, try PROC COMPARE, which gives more information that might help you.

proc compare data=a compare=b;
run;

Thanks,
Suryakiran
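
One caveat on the PROC COMPARE suggestion: by default it pairs variables by name, so my1/my2 in A and cust1/cust2 in B would be reported as variables found in only one data set rather than compared. Here is a minimal sketch, assuming you want to pair them explicitly with the VAR and WITH statements. Without an ID statement, PROC COMPARE matches observations by position, so this reports row-by-row differences rather than building the "rows in one table but not the other" data set.

```sas
/* Sketch only: pair A's variables with B's differently named variables. */
/* VAR lists the variables in the base data set (A); WITH lists the      */
/* variables in the comparison data set (B), matched by position.        */
proc compare base=a compare=b;
  var  my1   my2   clm drg tin;
  with cust1 cust2 clm drg tin;
run;
```

This is useful as a diagnostic report alongside the SQL approach, not as a replacement for it.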

novinosrin · Posted 12-11-2018 03:51 PM


data a;

input my1:$2. my2:$2. clm drg tin;

datalines;
aa bb 1 2 3
bb cc 2 3 4
dd ee 3 4 5
ff gg 4 5 6
hh ii 6 7 8
;

run;

data b;

input cust1:$2. cust2:$2. clm drg tin;

datalines;
aa bb 1 2 3
bb cc 2 3 4
zz xx 3 4 5
yy ww 4 5 6
hh ii 2 7 8
;

proc sql;
create table want as
(select * from a
except all
select * from b)
union all
(select * from b
except all
select * from a);
quit;

novinosrin · Posted 12-11-2018 03:52 PM

Oh well @Q1983 I failed to notice the similar approach by @SuryaKiran that was posted earlier. Ignore mine! Thanks

SuryaKiran · Posted 12-11-2018 04:25 PM

@novinosrin Although the two are almost the same, you covered one extra point by specifying ALL, which will also keep duplicate records if there are any.

Thanks,
Suryakiran
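
To see the effect of the ALL keyword for yourself, here is a small sketch using two hypothetical one-column tables (x and y are made up for illustration and are not part of the original question). Table x holds the value 1 twice; running both queries side by side shows how each form treats that repeated row.

```sas
/* Hypothetical tables, used only to illustrate EXCEPT versus EXCEPT ALL */
data x;
  input k;
datalines;
1
1
2
;
run;

data y;
  input k;
datalines;
1
;
run;

proc sql;
  title 'EXCEPT';
  select * from x
  except
  select * from y;

  title 'EXCEPT ALL';
  select * from x
  except all
  select * from y;
quit;
title;  /* clear the title */
```

Compare the two listings to check whether the extra copy of k=1 in x survives. That is the situation where the two versions of the query in this thread could give different results.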

novinosrin · Posted 12-11-2018 04:29 PM

Yes, thank you, but you posted it first and I missed it, which I generally don't. Anyway, in my opinion it is probably the best approach by any measure: it is convenient to code and easy for anybody to maintain or change, regardless of skill level. Well done, and cheers!

novinosrin · Posted 12-11-2018 04:20 PM

data a;

input my1:$2. my2:$2. clm drg tin;

datalines;
aa bb 1 2 3
bb cc 2 3 4
dd ee 3 4 5
ff gg 4 5 6
hh ii 6 7 8
;

run;

data b;

input cust1:$2. cust2:$2. clm drg tin;

datalines;
aa bb 1 2 3
bb cc 2 3 4
zz xx 3 4 5
yy ww 4 5 6
hh ii 2 7 8
;

data _null_;
if _n_=1 then do;
if 0 then set a;
   dcl hash H (dataset:'a',multidata:'y',ordered:'y') ;
   h.definekey  ('my1','my2','clm','drg','tin') ;
   h.definedone () ;
end;
set b end=l;
rc= h.check(key:cust1,key:cust2,key:clm,key:drg,key:tin) ;
if rc=0 then rc1=h.remove(key:cust1,key:cust2,key:clm,key:drg,key:tin);
else h.add(key:cust1,key:cust2,key:clm,key:drg,key:tin,data:cust1,data:cust2,data:clm,data:drg,data:tin);
if l then h.output(dataset:'want');
run;

mkeintz · Posted 12-11-2018 04:28 PM

Here's a single data step solution that doesn't require sorting.

data a;
  input my1:$2. my2:$2. clm drg tin;
datalines;
aa bb 1 2 3
bb cc 2 3 4
dd ee 3 4 5
ff gg 4 5 6
hh ii 6 7 8
run;

data b;
  input cust1:$2. cust2:$2. clm drg tin;
datalines;
aa bb 1 2 3
bb cc 2 3 4
zz xx 3 4 5
yy ww 4 5 6
hh ii 2 7 8
run;

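data want1 want2;
  /* rename B's variables so both data sets share the hash key names */
  set a b (rename=(cust1=my1 cust2=my2));
  if _n_=1 then do;
    /* obs=0: take the variable definitions from A but load no rows */
    declare hash h (dataset:'a (obs=0)');
      h.definekey(all:'Y');
      h.definedata(all:'Y');
      h.definedone();
  end;
  /* nonzero return code: this combination of values is not in the hash yet */
  if h.check()^=0 then do;
    output want1;
    h.add();
  end;
  else output want2;
run;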

The strategy here is

read a record

if this is first time for these values (i.e. not in the hash object) then output want1 and put it in the hash object for later checking

but if it's already in the hash then output want2.


novinosrin · Posted 12-11-2018 05:08 PM

data a;
input my1:$2. my2:$2. clm drg tin;
datalines;
aa bb 1 2 3
bb cc 2 3 4
dd ee 3 4 5
ff gg 4 5 6
hh ii 6 7 8
;

run;

data b;
input cust1:$2. cust2:$2. clm drg tin;
datalines;
aa bb 1 2 3
bb cc 2 3 4
zz xx 3 4 5
yy ww 4 5 6
hh ii 2 7 8
;

proc sql;
create table want(drop=t) as
select *
from (select *, 1 as t from a
union all
select *, 2 as t from b)
group by my1,my2,clm,drg,tin
having count(distinct t)=1;
quit;
