- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi I have two tables, table A and Table B, I want to match these tables and I want to bring back everything where the data in Table B does not match table A by the name field. See sample data below.
TableA
name date1 date2
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
TableB
name date1 date2
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In that case, you need to use both variables in the by statements:
(stealing the example data from @PeterClemmensen)
data table_A;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
7400000 03/10/2010 31/12/2099
;
run;
data table_B;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
7400000 02/10/2010 31/12/2099
;
run;
/* sort both datasets by the important variables */
proc sort data=table_a;
by name date1;
run;
proc sort
data=table_b (keep=name date1)
out=exclude
nodupkey
;
by name date1;
run;
/* now, a simple merge by the same variables */
data want;
merge
table_a (in=a) /* creates automatic boolean variable a that indicates an observation from table_a was read */
exclude (in=b) /* same for the exclude table */
;
by name date1;
if a and not b; /* take only observations from table_a for which no match was found in exclude dataset */
/* this is called a "subsetting if" */
run;
Since your example data had only perfect matches, I put in some extra lines to verify the function.
Note that the hash solution suggested by @PeterClemmensen will also work (and probably be faster), but be limited by available memory. The sort/data step solution is limited by disk space only.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc sort data=table_a;
by name;
run;
proc sort
data=table_b (keep=name)
out=exclude
nodupkey
;
by name;
run;
data want;
merge
table_a (in=a)
exclude (in=b)
;
by name;
if a and not b;
run;
The nodupkey is there to prevent unwanted observations if table_b has more obs for a given name than table_a.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for your replay, what would I do if I want to bring everything back that does not match in table B
By Name and Date1
for example
Table A Name = Table B Name
And Table A Date1 = Table B Date1
Bring back everything that does not match
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You could do something like this
data A;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
;
data B;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
;
data want;
if 0 then set B;
if _N_=1 then do;
declare hash h(dataset:"B", multidata:"Y");
h.definekey("name", "date1");
h.definedata(all:"Y");
h.definedone();
end;
set A;
if h.check()>0 then output;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
HI @zdassu Just a nit and FWIW if you play with hashes
if h.check()>0 then output;
will not fetch the accurate records although will not produce any error. Assuming it's a typo, correcting to
if h.check()=0 then output;if check() ne 0 then output;
or
if h.check()=0 ;if h.check() ne 0;
or
rc= h.check();if rc=0;if rc ne 0;
The return code based on object notation method always results in a non zero value when it fails, since it's non zero it's not ideal for inequality comparison operation unless the return code is of course reset to some value knowing that we are gonna use that for inequality comparisons. One of the tricky things is that any hash object would have a non-scalar value however return code being though being scalar in the PDV resulting from a object do notation operation is not particularly useful until some gymnastics is done.
The scalar vs non scalar in hashes is so similar to arrays but occurs at meta-level as opposed to giving much clarity. Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Quote from your original post:
"where the data in Table B does not match table A by the name field"
(emphasis by me)
My code does exactly that.
Please make up your mind what you really want.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
My apologies, when I checked the tables they have millions of lines and I noticed that sometimes the name is the same but date1 is different, therefore I would class this as a mismatch (date 2 is not an issue this does not matter), your code seems a little complicated when trying to do it this way, I don't really understand it
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In that case, you need to use both variables in the by statements:
(stealing the example data from @PeterClemmensen)
data table_A;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
7400000 03/10/2010 31/12/2099
;
run;
data table_B;
input name$ (date1 date2)(:ddmmyy10.);
format date: ddmmyy10.;
datalines;
7328914 01/10/2008 31/12/2099
7338572 27/07/2015 31/12/2099
7400000 01/10/2010 31/12/2099
7400000 02/10/2010 31/12/2099
;
run;
/* sort both datasets by the important variables */
proc sort data=table_a;
by name date1;
run;
proc sort
data=table_b (keep=name date1)
out=exclude
nodupkey
;
by name date1;
run;
/* now, a simple merge by the same variables */
data want;
merge
table_a (in=a) /* creates automatic boolean variable a that indicates an observation from table_a was read */
exclude (in=b) /* same for the exclude table */
;
by name date1;
if a and not b; /* take only observations from table_a for which no match was found in exclude dataset */
/* this is called a "subsetting if" */
run;
Since your example data had only perfect matches, I put in some extra lines to verify the function.
Note that the hash solution suggested by @PeterClemmensen will also work (and probably be faster), but be limited by available memory. The sort/data step solution is limited by disk space only.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for your help much appreciated
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Maybe:
proc sql; create table want as select * from tablea except select * from tableb union all select * from tableb except select * from tablea; quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's a simple approach, assuming both of your data sets are already sorted by NAME:
data want;
merge a b (rename=(date1=date1b date2=date2b));
by name;
if (date1 ne date1b) or (date2 ne date2b);
run;