topic Re: proc compare in New SAS User

proc compare

VDD — Thu, 01 Nov 2018 16:20:27 GMT

Hello Community,

I am working with proc compare and would like to output records that are in the base dataset and not found in the compare dataset and the same if records are found in the compare dataset and not in the base dataset.

In my example have1 dataset is the master dataset never to be updated, if something in have2 is different then an issue accrued during data entry, The datasets from the proc compare output H1_H2_diff_id and H2_H1_diff_id give me what I would expect for records needing attention in have2 that differ from have1.

How can I make proc compare output records found in one dataset and not the other dataset.

data have1;
input id $ date mmddyy10.;
cards;
12 11/1/2009
1456 1/1/2008
1 12/12/2012
2158974 4/4/2018
2158984 5/4/2018
2458974 9/4/2018
2158972 4/1/2018
2587911 10/1/2018
2587915 10/8/2018
;
proc sort data=have1;
	by id;
run;

data have2;
input id $ date mmddyy10.;
cards;
1456 1/1/2008
1 12/12/2012
2158974 4/4/2018
2158984 5/4/2018
2587915 10/7/2018
2458974 9/4/2018
2158972 4/1/2018
987588121 10/31/2018
;
proc sort data=have2;
	by id;
run;
/*  With the by ID I get what I want when records have different values.
	but I would like to output records found in base that are not 
    found in compare to dataset inbase_not_compare
	and if records are in compare that are not
    in base to another dataset incompare_not_base.
	Can this be done in the proc compare process?
*/ 
proc compare base=have1 compare=have2 out=h1_h2_diff_id outall outnoequal;
by id;
run;
proc compare base=have2 compare=have1 out=h2_h1_diff_id outall outnoequal;
by id;
run;
proc compare base=have1 compare=have2 out=h1_h2_diff outall outnoequal;

run;
proc compare base=have2 compare=have1 out=h2_h1_diff outall outnoequal;

run;

Thank you for you assistance.

Re: proc compare

ballardw — Thu, 01 Nov 2018 16:32:40 GMT

While Proc Compare can be handy for some things I think you may be looking form something more like this which creates two datasets with the unique records or different values. Note that this does not work with data sets with different variables and would require explicitly sub-setting each data set to the common variables.

proc sql;
   create table in1not2 as
   select * from have1
   except
   select * from have2;
   create table in2not1 as
   select * from have2
   except
   select * from have1;
quit;

Re: proc compare

VDD — Thu, 01 Nov 2018 16:47:50 GMT

thank you @ballardw This works for my request. I see that ID 2587915 shows in both datasets because of the miss-match on date. This would give me the missing records and the miss-matched records. Seems better than proc compare in my case.

Thanks again.

Re: proc compare

novinosrin — Thu, 01 Nov 2018 16:54:37 GMT

The distinction in comparison is whether based on ID or based on the entire record.

If it's on id, subsetting using merge, join or subsquery, hash etc should be the way to go

if it's the complete record, set operators in sql is very handy like @ballardw demonstrated

Re: proc compare

VDD — Thu, 01 Nov 2018 17:01:08 GMT

thanks for all the assistance. My sample was only on ID. but the real world case is the full record compare of 881 variables so @ballardw example code works the best for my case at this time. I was trying to make the horse drink water by using proc compare in this case.

Re: proc compare

novinosrin — Thu, 01 Nov 2018 17:13:02 GMT

Nice & I am glad. To spice up a little bit for knowledge share, Hash can even mimic set operators by loading full record loading entire record i.e all columns are keys.

Ah, well and the consequences?, effect on RAM,? will it even fit in memory?

This leads us to jump into creating composite keys using partial keys with MD5 or ~~SHA256~~ algorithm to make the keys fit into memory.

The reason for mentioning and scoring off ~~SHA256~~ , is because it is several times slower than MD5 because of its very nature as explained by guru PD

B-tree index is slightly easy to implement as opposed AVL trees is what I am guessing.

The the next question is why? Why can't the implementation be perfected. Let's see what happens in next release.