topic Re: Matching in SAS Procedures

Matching

gobejo — Wed, 04 Mar 2015 14:50:15 GMT

Hi,

I have a dataset with 250 records and a dataset of 1.4 million. I created a key which is a concatenation of lastname, last4digitstaxid and zipcode. How can i find how many of the 250 records are there in 1.4 million and what they are. I have SAS EG.

Thanks,

Tom

Re: Matching

LinusH — Wed, 04 Mar 2015 14:58:51 GMT

Query Builder is the easiest way to match/join data.

Be sure that you have the corresponding variables/columns in both your tables.

Re: Matching

gobejo — Wed, 04 Mar 2015 16:20:49 GMT

I am looking to see if if the dataset of 1.4 million contains any of 250. So the out put range should be within 1-250. Now when i did an inner join, i get 498.

Re: Matching

TomKari — Wed, 04 Mar 2015 17:24:34 GMT

Not necessarily. If you have duplicate keys in your 1.4 million records, ALL of them will end up in your output table.

Re: Matching

billfish — Wed, 04 Mar 2015 18:11:10 GMT

Getting a count > 250 means there are some duplicates (as far as the key fields are concerned) in the larger dataset.

Suggested query:

proc sql;
   select count (distinct a.last_name||a.ssn4||a.zipcode) as distinct_cnts
   from   table1 a
   inner join table2 b
   on    (a.last_name=b.last_name)
     and (a.ssn4=b.ssn4)
     and (a.zipcode=b.zipcode);
quit;

Exploring using the MERGE statement.

Some sample datasets:

/*******************/
/*** dataset t_a ***/
/*******************/
data t_a(keep=j k);
do i = 1 to 40;
     j = ceil(5*ranuni(3));
     k = ceil(5*ranuni(3));
     output;
end;
run;

proc sort data=t_a nodupkey; by j k; run; /*** get unique sets of (j,k) ***/

/*******************/
/*** dataset t_a ***/
/*******************/
data t_b;
do i = 1 to 4000;
     j = ceil(50*ranuni(3));
     k = ceil(50*ranuni(3));
     output;
end;
run;

proc sort data=t_b; by j k; run;

/******************************************/
/*** calculate the number of (j,k) sets ***/
/*** (cnts) present in dataset t_b.     ***/
/******************************************/
data t_e(keep=cnts);
   merge t_a(in=a) t_b(in=b) end=n_last;
   by j k;
   if (first.j or first.k) and (a and b) then cnts+1;
   if n_last then output;
run;

In the current case, dataset t_e has 1 variable (cnts), 1 row, with cnts = 16.