BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
zdassu
Quartz | Level 8

Hi I have two tables, table A and Table B, I want to match these tables and I want to bring back everything where the data in Table B does not match table A by the name field. See sample data below.

 

TableA

name                   date1             date2

7328914          01/10/2008      31/12/2099

7338572          27/07/2015      31/12/2099

7400000          01/10/2010      31/12/2099

 

TableB

name                   date1             date2

7328914          01/10/2008      31/12/2099

7338572          27/07/2015      31/12/2099

7400000          01/10/2010      31/12/2099

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

In that case, you need to use both variables in the by statements:

(stealing the example data from @PeterClemmensen)

data table_A;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
7400000 03/10/2010 31/12/2099
;
run;

data table_B;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
7400000 02/10/2010 31/12/2099
;
run;

/* sort both datasets by the important variables */
proc sort data=table_a;
by name date1;
run;

proc sort
  data=table_b (keep=name date1)
  out=exclude
  nodupkey
;
by name date1;
run;

/* now, a simple merge by the same variables */
data want;
merge
  table_a (in=a) /* creates automatic boolean variable a that indicates an observation from table_a was read */
  exclude (in=b) /* same for the exclude table */
;
by name date1;
if a and not b; /* take only observations from table_a for which no match was found in exclude dataset */
/* this is called a "subsetting if" */
run;

Since your example data had only perfect matches, I put in some extra lines to verify the function.

 

Note that the hash solution suggested by @PeterClemmensen will also work (and probably be faster), but be limited by available memory. The sort/data step solution is limited by disk space only.

View solution in original post

12 REPLIES 12
Kurt_Bremser
Super User
proc sort data=table_a;
by name;
run;

proc sort
  data=table_b (keep=name)
  out=exclude
  nodupkey
;
by name;
run;

data want;
merge
  table_a (in=a)
  exclude (in=b)
;
by name;
if a and not b;
run;

The nodupkey is there to prevent unwanted observations if table_b has more obs for a given name than table_a.

zdassu
Quartz | Level 8

Thanks for your replay, what would I do if I want to bring everything back that does not match in table B

By Name and Date1

for example

Table A Name = Table B Name

And Table A Date1 = Table B Date1

Bring back everything that does not match

PeterClemmensen
Tourmaline | Level 20

You could do something like this

 

data A;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
;

data B;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
;

data want;
   if 0 then set B;
   if _N_=1 then do;
      declare hash h(dataset:"B", multidata:"Y");
      h.definekey("name", "date1");
      h.definedata(all:"Y");
      h.definedone();
   end; 
   set A;
   if h.check()>0 then output;
run;
novinosrin
Tourmaline | Level 20

HI  @zdassu Just a nit and FWIW if you play with hashes

if h.check()>0 then output;

will not fetch the accurate records although will not produce any error. Assuming it's a typo, correcting to

 

if h.check()=0 then output;
if check() ne 0 then output;

 or

if h.check()=0 ;
if h.check() ne 0;

or

rc= h.check();
if rc=0;
if rc ne 0;

 

The return code based on object notation method always results in a non zero value when it fails, since it's non zero it's not ideal for inequality comparison operation unless the return code is of course reset to some value knowing that we are gonna use that for inequality comparisons. One of the tricky things is that any hash object would have a non-scalar value however return code being though being scalar in the PDV resulting from a object do notation operation is not particularly useful until some gymnastics is done. 

The scalar vs non scalar in hashes is so similar to arrays but occurs at meta-level as opposed to giving much clarity. Thanks!

Kurt_Bremser
Super User

Quote from your original post:

"where the data in Table B does not match table A by the name field"

(emphasis by me)

My code does exactly that.

Please make up your mind what you really want.

zdassu
Quartz | Level 8

My apologies, when I checked the tables they have millions of lines and I noticed that sometimes the name is the same but date1 is different, therefore I would class this as a mismatch (date 2 is not an issue this does not matter), your code seems a little complicated when trying to do it this way, I don't really understand it

Kurt_Bremser
Super User

In that case, you need to use both variables in the by statements:

(stealing the example data from @PeterClemmensen)

data table_A;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
7400000 03/10/2010 31/12/2099
;
run;

data table_B;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
7400000 02/10/2010 31/12/2099
;
run;

/* sort both datasets by the important variables */
proc sort data=table_a;
by name date1;
run;

proc sort
  data=table_b (keep=name date1)
  out=exclude
  nodupkey
;
by name date1;
run;

/* now, a simple merge by the same variables */
data want;
merge
  table_a (in=a) /* creates automatic boolean variable a that indicates an observation from table_a was read */
  exclude (in=b) /* same for the exclude table */
;
by name date1;
if a and not b; /* take only observations from table_a for which no match was found in exclude dataset */
/* this is called a "subsetting if" */
run;

Since your example data had only perfect matches, I put in some extra lines to verify the function.

 

Note that the hash solution suggested by @PeterClemmensen will also work (and probably be faster), but be limited by available memory. The sort/data step solution is limited by disk space only.

zdassu
Quartz | Level 8

Thanks for your help much appreciated

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Maybe:

proc sql;
  create table want as 
  select * from tablea except select * from tableb
  union all
  select * from tableb except select * from tablea;
quit;
Astounding
PROC Star

Here's a simple approach, assuming both of your data sets are already sorted by NAME:

 

data want;
merge a  b (rename=(date1=date1b date2=date2b));
by name;
if (date1 ne date1b) or (date2 ne date2b);
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 2301 views
  • 1 like
  • 6 in conversation