topic Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values. in SAS Programming

Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

meetagupta — Mon, 30 Sep 2019 21:18:29 GMT

I am trying to join 6 datasets using PROC SQL. The result is supposed to have 5000 obs, but I am getting 5020. When I run PROC SORT with NODUPKEY afterwards, I get 5000 obs. Why am I not getting the correct answer even after using DISTINCT in PROC SQL? Can anyone please help? Thanks.

/*Answer should be 5000 obs but below is giving me 5020 obs.*/

proc sql;
create table new as
select distinct(demossubmit.custid), *
from task1.demossubmit join task1.political
on demossubmit.custid=political.custid
join task1.response
on political.custid=response.custid
join task1.gadgets
on response.custid=gadgets.custid
join task1.financial
on gadgets.custid=financial.custid
join task1.pets
on financial.custid=pets.custid;
quit;

/*Below is giving me 5000 obs. but I don't want this extra step*/

proc sort data=new out=new1 nodupkey;
by custid;
run;

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

PGStats — Mon, 30 Sep 2019 21:21:35 GMT

Try dropping the parentheses

distinct demossubmit.custid, ...

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

Reeza — Mon, 30 Sep 2019 21:34:31 GMT

It usually means you have multiple rows for the same CUSTID in one of your 6 data sets.
Usually you want to keep only one of them, so you should first find out which data set has duplicates and then figure out which record should be used.

You can find which tables have duplicates by checking the counts vs count distinct.

select count(*) as N, count(distinct custid) as N_Distinct
from task1.demossubmit;
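
If the two counts differ, a follow-up query can show which keys are repeated. This is only a sketch of that check: it uses the table and column names from the question, and `n_rows` is a made-up alias. The same query would need to be repeated for each of the other five tables.

```sas
proc sql;
  /* List every custid that appears more than once in this table */
  select custid, count(*) as n_rows
  from task1.demossubmit
  group by custid
  having count(*) > 1
  order by n_rows desc;
quit;
```

An empty result means the table has one row per custid and is not the source of the extra rows.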

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

meetagupta — Mon, 30 Sep 2019 21:43:32 GMT

Thanks for replying. All 6 datasets have custid in common. The dataset
demossubmit has duplicate custid values.

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

Reeza — Mon, 30 Sep 2019 21:44:55 GMT

Well, when you have duplicates it brings in all the duplicates. Perhaps you need to add another condition when joining that table? Or is there a specific record that makes sense to bring in?

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

meetagupta — Mon, 30 Sep 2019 21:45:32 GMT

tried doing that but still same answer...

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

meetagupta — Mon, 30 Sep 2019 21:49:32 GMT

What other conditions can I apply while joining? I just don't want any
duplicate observations.

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

Reeza — Mon, 30 Sep 2019 22:11:03 GMT

WHY DO YOU HAVE DUPLICATES in the first place?

You need to figure out which record is the correct one to join on. It's very rare that it won't matter. For example, if a person tried to fill out a survey twice, we default to the last set of values. You first have to understand why you have duplicates, how the duplicates can be uniquely identified and then you'll know how to filter them. It's a subject matter problem, not a technical problem.

If you're 100% sure it doesn't matter, run PROC SORT on the table with duplicates prior to the join and remove duplicates and then join it. You can create a different output data set when using PROC SORT and I highly recommend you do that.
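
A minimal sketch of that approach, assuming demossubmit is the only table with repeated keys (as reported earlier in the thread). The output name `work.demos_unique` and the table aliases are hypothetical. Which of the duplicate rows survives depends on the order of the input data, so this only makes sense when the duplicates really are interchangeable.

```sas
/* De-duplicate into a new data set so the source table stays untouched */
proc sort data=task1.demossubmit out=work.demos_unique nodupkey;
  by custid;
run;

proc sql;
  create table new as
  select *
  from work.demos_unique as d
    join task1.political as p on d.custid = p.custid
    join task1.response  as r on p.custid = r.custid
    join task1.gadgets   as g on r.custid = g.custid
    join task1.financial as f on g.custid = f.custid
    join task1.pets      as t on f.custid = t.custid;
quit;
```

Because `select *` pulls custid from all six tables, SAS is expected to warn that the variable already exists and keep the first one. Listing the wanted columns explicitly avoids that warning.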

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

ChrisNZ — Tue, 01 Oct 2019 03:34:49 GMT

There is no equivalent to NODUPKEY in SQL, i.e. no option that keeps an arbitrary single observation for each key value.

If you cannot avoid the duplicates in the source table, this is a better way to deal with them:

proc sql;
create table new as
select distinct(demossubmit.custid), *
from task1.demossubmit join task1.political
on demossubmit.custid=political.custid
join task1.response
on political.custid=response.custid
join task1.gadgets
on response.custid=gadgets.custid
join task1.financial
on gadgets.custid=financial.custid
join task1.pets
on financial.custid=pets.custid
order by CUSTID;
quit;

data NEW1; 
  set NEW;
  by CUSTID;
  if first.CUSTID;
run;

Re: Join 6 datasets using Proc SQL on custid and dont want any duplicate values.

PGStats — Tue, 01 Oct 2019 03:39:00 GMT

The distinct predicate requires rows that are completely distinct, not just distinct custid. So you must figure out in what way those duplicate custid rows differ. Suppose duplicate custid rows have different timestamps. That makes them distinct rows. To keep only the row with the latest timestamp, you could add to your query:

group by custid
having timestamp = max(timestamp);
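
As a complete step, the idea could look like the sketch below, applied to the table with duplicates before it is joined. The `timestamp` column is the hypothetical one from the example above (the thread never says what distinguishes the duplicate rows), and `work.demos_latest` is a made-up output name.

```sas
proc sql;
  /* Keep, for each custid, only the row(s) carrying the latest timestamp.
     PROC SQL remerges max(timestamp) back onto the detail rows and
     writes a note about that to the log. */
  create table work.demos_latest as
  select *
  from task1.demossubmit
  group by custid
  having timestamp = max(timestamp);
quit;
```

If two rows for the same custid share the same latest timestamp, both are kept, so a tie-breaking rule would still be needed in that case.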