remove duplicates but keep all fields (using proc sql)

podarum · Posted 09-05-2013 10:55 AM

Hi, is there a way to remove duplicates but keep all fields.. Example below (but keep in mind there are 30 fields in total, and I want to capture all and the table is a huge data warehouse table) :

HAVE:

Acct_No Name Date

0001 Mike Jun-13

0002 Mike Jun-13

0003 Fred Apr-13

0003 Fred May-13

0003 Fred Jun-13

0004 Will Sep-13

WANT:

Acct_No Name Date

0001 Mike Jun-13

0002 Mike Jun-13

0003 Fred Apr-13

0004 Will Sep-13

esjackso · Posted 09-05-2013 11:05 AM

Does the date matter? IE your example chose the first one ... was that by design?

EJ

podarum · Posted 09-05-2013 11:09 AM

no it doesn't matter, as long as I get only one by Acct_No

and jpstats, the data is from a server, and it's large that I have to use proc sql (rsubmit)

esjackso · Posted 09-05-2013 11:11 AM

Are trying to update the table in place or just get a sas dataset to work with?

EJ

podarum · Posted 09-05-2013 11:14 AM

a dataset to work with

esjackso · Posted 09-05-2013 11:24 AM

This seems to work with your example assuming that the date is numeric:

proc sql;

create table want as

select acct_no, name, min(date) as date

from have

group by acct_no, name

;

quit;

EJ

jpstats · Posted 09-05-2013 11:29 AM

I may have misunderstood, but I think he wants the resulting set to have all the fields...

podarum · Posted 09-05-2013 11:33 AM

that is correct, and since theer are about 30 fields, I don't want to type each one.. a dup may occur though in another field, my example only shows date, but it could happen for any other of the 30 fields... I may have to comprimise.

esjackso · Posted 09-05-2013 11:41 AM

Even taking the min of just the date may not get rid of all the duplicates if the min is in the table multiple time for acctno and name and other fields changing... if the values dont really matter then proc sort with out= option is the best choice as mentioned in his post.

EJ

jpstats · Posted 09-05-2013 11:05 AM

Just use -

proc sort

data = mydataset nodupkey;

by acct_no name;

run;

Or, if you have to use SQL -

proc sql noprint;

create table key_keep as

select distinct acct_no, name, date

from mydataset

group by acct_no, name

having date = min(date);

create table mydataset2 as

select a.*

from mydataset a

INNER JOIN

key_keep b

ON

a.acct_no = b.acct_no and

a.name = b.name and

a.date = b.date;

quit;

I think that should do the trick.

esjackso · Posted 09-05-2013 11:10 AM

That doesnt actually get what they ask for a couple of reasons:

- doesnt use SQL

- it really is only a no dup on acct_no since the dates are actually unique

jpstats · Posted 09-05-2013 11:11 AM

Right, good catch.

PGStats · Posted 09-05-2013 11:13 AM

With SQL you will have somehow to tell the procedure which value of your fields to keep. In your example, it seems that you want to keep the minimum Date. If your choice of value to keep depends on the order of the data in the table (e.g. the first Date), then SQL is not a good tool to perform the operation and proc SORT or a datastep would do a better job.

PG

cooker · Posted 09-05-2013 11:33 AM

You may try the following:

[insert your passthru code here]

proc sql;

create table newtable as

select a.acct_no, a.name, a.date

from dataset a

where a.date = (select min(a1.date)

from dataset a1

where a1.name = a.name

and a1.acct_no = a.acct_no)

quit;

You'll want to list all of your fields of course or use the a.* method. There are other methods which are available to Oracle DB but this should work on most platforms.

Good luck!

cooker · Posted 09-05-2013 11:47 AM

Yes, if you could have differences in all the columns; you'd have to replicate the subquery for each column. This may be something which knocks the snot out of your DB server and may timeout.

remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Re: remove duplicates but keep all fields (using proc sql)

Registration is open

SAS Training: Just a Click Away