I have two datasets, A and B. A has 15353 cases and B has 15234 cases. I used the following code to merge these two datasets
proc sort data=A; by ID form testdate ; run;
proc sort data=B; by ID form testdate; run;
data ck;
merge A(in=in1) B(in=in2);
by ID form testdate;
if in1=1;
run;
In dataset A, I created a variable inA, assigned value "Y" to indicate all cases in dataset A. In dataset B, I created a variable inB, assigned value "Y" to indicate all cases in dataset B. Theoretically A is supposed to have all cases of B, so in the merged dataset ck, the variable inB is either "Y" or empty. Then I want to get cases that in A but not in B. I used the following code
data ck1; set ck;
if inB="";
run;
It turned out ck1 has 137 cases. But the case difference between A and B should be 15353-15234=119. where I was wrong? please help.
thanks!
Thank you! You are right.
In data A, for the same ID, that person took the test on different dates with different form, so the combination of ID-form-testdate are unique. This is the same situation in data B, i.e, combination of ID-form-testdate is unique.
So what's the best way that I can look into detail on what I got (difference=137 versus "nobs(A) minus nobs(B)" (=119)?
@superbug wrote:
Theoretically A is supposed to have all cases of B
Invalid assumption here. Clearly not correct, as your results show.
dataset A is the final official data, which contains all cases. Dataset B contains less cases, I want to find out cases that in A but not in B. Any suggestion of how should I get what I wanted? thanks!
Also do this for a start:
proc sort
data=A
out=_A
nodupkey
;
by ID form testdate;
run;
proc sort
data=B
out=_b
nodupkey
;
by ID form testdate;
run;
and inspect the log to see if duplicate values were deleted.
thank!
I used the code you suggested, in the log, it says "0 observations with duplicate key values were deleted",
that conforms to my understanding of the data. In dataset A, for each case, if by ID form testdate, it should not have any duplicate.
@superbug wrote:
thank!
I used the code you suggested, in the log, it says "0 observations with duplicate key values were deleted",
that conforms to my understanding of the data. In dataset A, for each case, if by ID form testdate, it should not have any duplicate.
You have not ruled out the other problem, that A does not contain all IDs in B.
Then you have ID-form-testdate combinations in A that are not present in B. And possibly the other way round.
Maxim 3: Know Your Data
Hi @superbug,
Results like this are typical for a situation involving duplicate keys (i.e., combinations of BY variables). For example, suppose that B contains more than one observation with the same ID-form-testdate combination, but A contains this (and any other) combination only once. Then the difference "nobs(A) minus nobs(B)" (=119) will underestimate the number of combinations in A which do not occur in B.
Of course, if the assumption "B is subset of A" (in terms of BY variable combinations) turned out to be wrong, the result could occur even with unique keys on both sides.
Thank you! You are right.
In data A, for the same ID, that person took the test on different dates with different form, so the combination of ID-form-testdate are unique. This is the same situation in data B, i.e, combination of ID-form-testdate is unique.
So what's the best way that I can look into detail on what I got (difference=137 versus "nobs(A) minus nobs(B)" (=119)?
If the keys are unique in both A and B, then there must be key values in B which do not occur in A:
data myst;
merge a(in=a) b;
by ID form testdate;
if ~a;
run;
use the code you provided, I got 18 in B but not in A.
so 119+18=137, which matches what I got.
Very much appreciate your help!
Thank you SO SO MUCH!
Please choose the most helpful answer as the solution. Not your reply.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.