I am trying to join 6 datasets and using proc sql and the obs are supposed to be 5000 but i am getting 5020. When i use proc sort with nodupkey after that, then i get 5000 obs. Why I am not getting the correct answer even after using DISTINCT in proc sql? Can anyone please help? Thanks.
/*Answer should be 5000 obs but below is giving me 5020 obs.*/
proc sql;
create table new as
select distinct(demossubmit.custid), *
from task1.demossubmit join task1.political
on demossubmit.custid=political.custid
join task1.response
on political.custid=response.custid
join task1.gadgets
on response.custid=gadgets.custid
join task1.financial
on gadgets.custid=financial.custid
join task1.pets
on financial.custid=pets.custid;
quit;
/*Below is giving me 5000 obs. but i dont want to this extra step*/
proc sort data=new out=new1 nodupkey;
by custid;
run;
Try dropping the parentheses
distinct demossubmit.custid, ...
There is no equivalent to nodupkey in SQL, where we pick any one observation for a key value.
If you cannot avoid the duplicate in the source table, this is a better way to deal with them:
proc sql;
create table new as
select distinct(demossubmit.custid), *
from task1.demossubmit join task1.political
on demossubmit.custid=political.custid
join task1.response
on political.custid=response.custid
join task1.gadgets
on response.custid=gadgets.custid
join task1.financial
on gadgets.custid=financial.custid
join task1.pets
on financial.custid=pets.custid
order by CUSTID;
quit;
data NEW1;
set NEW;
by CUSTID;
if first.CUSTID;
run;
The distinct predicate requires rows that are completely distinct, not just distinct custid. So you must figure out in what way those duplicate custid rows differ. Suppose duplicate custid rows have different timestamps. That makes them distinct rows. To keep only the row with the latest timestamp, you could add to your query:
group by custid
having timestamp = max(timestamp);
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.