Solved: question about merge

superbug · Posted 04-01-2020 08:57 AM

I have two datasets, A and B. A has 15353 cases and B has 15234 cases. I used the following code to merge these two datasets

proc sort data=A; by ID form testdate ; run;
proc sort data=B; by ID form testdate; run;

data ck;
merge A(in=in1) B(in=in2);
by ID form testdate;
if in1=1;
run;

In dataset A, I created a variable inA, assigned value "Y" to indicate all cases in dataset A. In dataset B, I created a variable inB, assigned value "Y" to indicate all cases in dataset B. Theoretically A is supposed to have all cases of B, so in the merged dataset ck, the variable inB is either "Y" or empty. Then I want to get cases that in A but not in B. I used the following code

data ck1; set ck;
if inB="";
run;

It turned out ck1 has 137 cases. But the case difference between A and B should be 15353-15234=119. where I was wrong? please help.

thanks!

superbug · Posted 04-01-2020 10:03 AM

@FreelanceReinh

Thank you! You are right.

In data A, for the same ID, that person took the test on different dates with different form, so the combination of ID-form-testdate are unique. This is the same situation in data B, i.e, combination of ID-form-testdate is unique.

So what's the best way that I can look into detail on what I got (difference=137 versus "nobs(A) minus nobs(B)" (=119)?

View solution in original post

PaigeMiller · Posted 04-01-2020 09:26 AM

@superbug wrote:

Theoretically A is supposed to have all cases of B

Invalid assumption here. Clearly not correct, as your results show.

--
Paige Miller

superbug · Posted 04-01-2020 09:33 AM

@PaigeMiller

dataset A is the final official data, which contains all cases. Dataset B contains less cases, I want to find out cases that in A but not in B. Any suggestion of how should I get what I wanted? thanks!

Kurt_Bremser · Posted 04-01-2020 09:34 AM

Also do this for a start:

proc sort
  data=A
  out=_A
  nodupkey
;
by ID form testdate;
run;

proc sort
  data=B
  out=_b
  nodupkey
;
by ID form testdate;
run;

and inspect the log to see if duplicate values were deleted.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

superbug · Posted 04-01-2020 09:45 AM

@Kurt_Bremser

thank!

I used the code you suggested, in the log, it says "0 observations with duplicate key values were deleted",

that conforms to my understanding of the data. In dataset A, for each case, if by ID form testdate, it should not have any duplicate.

PaigeMiller · Posted 04-01-2020 09:48 AM

@superbug wrote:

@Kurt_Bremser

thank!

I used the code you suggested, in the log, it says "0 observations with duplicate key values were deleted",

that conforms to my understanding of the data. In dataset A, for each case, if by ID form testdate, it should not have any duplicate.

You have not ruled out the other problem, that A does not contain all IDs in B.

--
Paige Miller

Kurt_Bremser · Posted 04-01-2020 09:57 AM

Then you have ID-form-testdate combinations in A that are not present in B. And possibly the other way round.

Maxim 3: Know Your Data

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

FreelanceReinh · Posted 04-01-2020 09:39 AM

Hi @superbug,

Results like this are typical for a situation involving duplicate keys (i.e., combinations of BY variables). For example, suppose that B contains more than one observation with the same ID-form-testdate combination, but A contains this (and any other) combination only once. Then the difference "nobs(A) minus nobs(B)" (=119) will underestimate the number of combinations in A which do not occur in B.

Of course, if the assumption "B is subset of A" (in terms of BY variable combinations) turned out to be wrong, the result could occur even with unique keys on both sides.

superbug · Posted 04-01-2020 10:03 AM

@FreelanceReinh

Thank you! You are right.

In data A, for the same ID, that person took the test on different dates with different form, so the combination of ID-form-testdate are unique. This is the same situation in data B, i.e, combination of ID-form-testdate is unique.

So what's the best way that I can look into detail on what I got (difference=137 versus "nobs(A) minus nobs(B)" (=119)?

FreelanceReinh · Posted 04-01-2020 10:09 AM

If the keys are unique in both A and B, then there must be key values in B which do not occur in A:

data myst;
merge a(in=a) b;
by ID form testdate;
if ~a;
run;

superbug · Posted 04-01-2020 10:19 AM

@FreelanceReinh

use the code you provided, I got 18 in B but not in A.

so 119+18=137, which matches what I got.

Very much appreciate your help!

Thank you SO SO MUCH!

ChrisNZ · Posted 04-01-2020 04:56 PM

Please choose the most helpful answer as the solution. Not your reply.

High-Performance SAS Coding - Third Edition

question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Re: question about merge

Registration is open

SAS Training: Just a Click Away