Help using Base SAS procedures

Matching

Reply
Contributor
Posts: 28

Matching

Hi,

I have a dataset with 250 records and a dataset of 1.4 million. I created a key which is a concatenation of lastname, last4digitstaxid and zipcode. How can i find how many of the 250 records are there in 1.4 million and what they are. I have SAS EG.

Thanks,

Tom

Super User
Posts: 5,257

Re: Matching

Query Builder is the easiest way to match/join data.

Be sure that you have the corresponding variables/columns in both your tables.

Data never sleeps
Contributor
Posts: 28

Re: Matching

I am looking to see if if the dataset of 1.4 million contains any of 250. So the out put range should be within 1-250. Now when i did an inner join, i get 498.

PROC Star
Posts: 1,093

Re: Matching

Not necessarily. If you have duplicate keys in your 1.4 million records, ALL of them will end up in your output table.

Contributor
Posts: 52

Re: Matching

Getting a count > 250 means there are some duplicates (as far as the key fields are concerned) in the larger dataset.

Suggested query:

proc sql;
   select count (distinct a.last_name||a.ssn4||a.zipcode) as distinct_cnts
   from   table1 a
   inner join table2 b
   on    (a.last_name=b.last_name)
     and (a.ssn4=b.ssn4)
     and (a.zipcode=b.zipcode);
quit;


Exploring using the MERGE statement.

Some sample datasets:

/*******************/
/*** dataset t_a ***/
/*******************/
data t_a(keep=j k);
  do i = 1 to 40;
     j = ceil(5*ranuni(3));
     k = ceil(5*ranuni(3));
     output;
  end;
run;

proc sort data=t_a nodupkey; by j k; run; /*** get unique sets of (j,k) ***/


/*******************/
/*** dataset t_a ***/
/*******************/
data t_b;
  do i = 1 to 4000;
     j = ceil(50*ranuni(3));
     k = ceil(50*ranuni(3));
     output;
  end;
run;

proc sort data=t_b; by j k; run;

/******************************************/
/*** calculate the number of (j,k) sets ***/
/*** (cnts) present in dataset t_b.     ***/
/******************************************/
data t_e(keep=cnts);
   merge t_a(in=a) t_b(in=b) end=n_last;
   by j k;
   if (first.j or first.k) and (a and b) then cnts+1;
   if n_last then output;
run;

In the current case, dataset t_e has 1 variable (cnts), 1 row, with cnts = 16.

Ask a Question
Discussion stats
  • 4 replies
  • 221 views
  • 0 likes
  • 4 in conversation