BookmarkSubscribeRSS Feed
meetagupta
Fluorite | Level 6

I am trying to join 6 datasets and using proc sql and the obs are supposed to be 5000 but i am getting 5020. When i use proc sort with nodupkey after that, then i get 5000 obs. Why I am not getting the correct answer even after using DISTINCT in proc sql? Can anyone please help? Thanks.

 

/*Answer should be 5000 obs but below is giving me 5020 obs.*/

proc sql;
create table new as
select distinct(demossubmit.custid), *
from task1.demossubmit join task1.political
on demossubmit.custid=political.custid
join task1.response
on political.custid=response.custid
join task1.gadgets
on response.custid=gadgets.custid
join task1.financial
on gadgets.custid=financial.custid
join task1.pets
on financial.custid=pets.custid;
quit;

 

/*Below is giving me 5000 obs. but i dont want to this extra step*/

proc sort data=new out=new1 nodupkey;
by custid;
run;

9 REPLIES 9
PGStats
Opal | Level 21

Try dropping the parentheses

 

distinct demossubmit.custid, ...

PG
meetagupta
Fluorite | Level 6
tried doing that but still same answer...
Reeza
Super User
It usually means you have mulitple CUSTID in one of your 6 data sets.
Usually you want to remove one from the data set in question so you should first find out which data set has duplicates and then figure out which record should be used.

You can find which tables have duplicates by checking the counts vs count distinct.

select count(*) as N, count(distinct custid) as N_Distinct
from task1.demossubmit;

meetagupta
Fluorite | Level 6
Thanks for replying. All the 6 datasets have custid in common. The dataset
demossubmit have duplicate custid.
Reeza
Super User
Well, when you have duplicates it brings in all the duplicates. Perhaps you need to add another condition when joining that table? Or is there a specific record that makes sense to bring in?
meetagupta
Fluorite | Level 6
What other conditions can I apply while joining? I just dont need any
duplicate observations.
Reeza
Super User
WHY DO YOU HAVE DUPLICATES in the first place?

You need to figure out which record is the correct one to join on. It's very rare that it won't matter. For example, if a person tried to fill out a survey twice, we default to the last set of values. You first have to understand why you have duplicates, how the duplicates can be uniquely identified and then you'll know how to filter them. It's a subject matter problem, not a technical problem.

If you're 100% sure it doesn't matter, run PROC SORT on the table with duplicates prior to the join and remove duplicates and then join it. You can create a different output data set when using PROC SORT and I highly recommend you do that.
ChrisNZ
Tourmaline | Level 20

There is no equivalent to nodupkey in SQLwhere we pick any one observation for a key value.

If you cannot avoid the duplicate in the source table, this is a better way to deal with them:

proc sql;
create table new as
select distinct(demossubmit.custid), *
from task1.demossubmit join task1.political
on demossubmit.custid=political.custid
join task1.response
on political.custid=response.custid
join task1.gadgets
on response.custid=gadgets.custid
join task1.financial
on gadgets.custid=financial.custid
join task1.pets
on financial.custid=pets.custid
order by CUSTID;
quit;

data NEW1; 
  set NEW;
  by CUSTID;
  if first.CUSTID;
run;

 

 

 

PGStats
Opal | Level 21

The distinct predicate requires rows that are completely distinct, not just distinct custid. So you must figure out in what way those duplicate custid rows differ. Suppose duplicate custid rows have different timestamps. That makes them distinct rows. To keep only the row with the latest timestamp, you could add to your query:

 

group by custid

having timestamp = max(timestamp);

PG

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 712 views
  • 0 likes
  • 4 in conversation