BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
superbug
Quartz | Level 8

I have two datasets, A and B. A has 15353 cases and B has 15234 cases. I used the following code to merge these two datasets

proc sort data=A; by ID form testdate ; run;
proc sort data=B; by ID form testdate; run;

 

data ck;
merge A(in=in1) B(in=in2);
by  ID form testdate;
if in1=1;
run;

 

In  dataset A, I created a variable inA, assigned value "Y" to indicate all cases in dataset A.  In dataset B, I created a variable inB, assigned  value "Y" to indicate all cases in dataset B. Theoretically A is supposed to have all cases of B, so in the merged dataset ck, the variable inB is either "Y" or empty. Then I want to get cases that in A but not in B. I used the following code

 

data ck1; set ck;
if inB="";
run;

 

It turned out ck1 has 137 cases. But the case difference between A and B should be 15353-15234=119. where I was wrong? please help.

thanks!

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
superbug
Quartz | Level 8

@FreelanceReinh 

Thank you! You are right.

In data A, for the same ID, that person took the test on different dates with different form, so the combination of ID-form-testdate are unique. This is the same situation in data B, i.e, combination of ID-form-testdate is unique. 

So what's the best way that I can look into detail on what I got (difference=137 versus "nobs(A) minus nobs(B)" (=119)?

 

View solution in original post

11 REPLIES 11
PaigeMiller
Diamond | Level 26

@superbug wrote:

Theoretically A is supposed to have all cases of B


Invalid assumption here. Clearly not correct, as your results show.

--
Paige Miller
superbug
Quartz | Level 8

@PaigeMiller 

dataset A is the final official data, which contains all cases. Dataset B contains less cases, I want to find out cases that in A but not in B. Any suggestion of how should I get what I wanted? thanks!

Kurt_Bremser
Super User

Also do this for a start:

proc sort
  data=A
  out=_A
  nodupkey
;
by ID form testdate;
run;

proc sort
  data=B
  out=_b
  nodupkey
;
by ID form testdate;
run;

and inspect the log to see if duplicate values were deleted.

superbug
Quartz | Level 8

@Kurt_Bremser 

thank!

I used the code you suggested, in the log, it says "0 observations with duplicate key values were deleted",

that conforms to my understanding of the data. In dataset A, for each case, if by ID form testdate, it should not have any duplicate.

PaigeMiller
Diamond | Level 26

@superbug wrote:

@Kurt_Bremser 

thank!

I used the code you suggested, in the log, it says "0 observations with duplicate key values were deleted",

that conforms to my understanding of the data. In dataset A, for each case, if by ID form testdate, it should not have any duplicate.


You have not ruled out the other problem, that A does not contain all IDs in B.

--
Paige Miller
Kurt_Bremser
Super User

Then you have ID-form-testdate combinations in A that are not present in B. And possibly the other way round.

 

Maxim 3: Know Your Data

FreelanceReinh
Jade | Level 19

Hi @superbug,

 

Results like this are typical for a situation involving duplicate keys (i.e., combinations of BY variables). For example, suppose that B contains more than one observation with the same ID-form-testdate combination, but A contains this (and any other) combination only once. Then the difference "nobs(A) minus nobs(B)" (=119) will underestimate the number of combinations in A which do not occur in B.

 

Of course, if the assumption "B is subset of A" (in terms of BY variable combinations) turned out to be wrong, the result could occur even with unique keys on both sides.

superbug
Quartz | Level 8

@FreelanceReinh 

Thank you! You are right.

In data A, for the same ID, that person took the test on different dates with different form, so the combination of ID-form-testdate are unique. This is the same situation in data B, i.e, combination of ID-form-testdate is unique. 

So what's the best way that I can look into detail on what I got (difference=137 versus "nobs(A) minus nobs(B)" (=119)?

 

FreelanceReinh
Jade | Level 19

If the keys are unique in both A and B, then there must be key values in B which do not occur in A:

data myst;
merge a(in=a) b;
by ID form testdate;
if ~a;
run;
superbug
Quartz | Level 8

@FreelanceReinh 

use the code you provided, I got 18 in B but not in A.

so 119+18=137, which matches what I got.

Very much appreciate your help!

Thank you SO SO MUCH!

ChrisNZ
Tourmaline | Level 20

Please choose the most helpful answer as the solution. Not your reply.

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2007 views
  • 0 likes
  • 5 in conversation