Help using Base SAS procedures

remove duplicates but keep all fields (using proc sql)

Reply
Super Contributor
Posts: 396

remove duplicates but keep all fields (using proc sql)

Hi,  is there a way to remove duplicates but keep all fields.. Example below (but keep in mind there are 30 fields in total, and I want to capture all and the table is a huge data warehouse table) :

HAVE:

Acct_No       Name          Date

0001             Mike          Jun-13

0002             Mike          Jun-13

0003             Fred           Apr-13

0003             Fred           May-13

0003             Fred           Jun-13

0004             Will            Sep-13

WANT:

Acct_No       Name          Date

0001             Mike          Jun-13

0002             Mike          Jun-13

0003             Fred           Apr-13

0004             Will            Sep-13

Super Contributor
Posts: 333

Re: remove duplicates but keep all fields (using proc sql)

Does the date matter? IE your example chose the first one ... was that by design?

EJ

Super Contributor
Posts: 396

Re: remove duplicates but keep all fields (using proc sql)

no it doesn't matter, as long as I get only one by Acct_No

and jpstats, the data is from a server, and it's large that I have to use proc sql (rsubmit)

Super Contributor
Posts: 333

Re: remove duplicates but keep all fields (using proc sql)

Are trying to update the table in place or just get a sas dataset to work with?

EJ

Super Contributor
Posts: 396

Re: remove duplicates but keep all fields (using proc sql)

a dataset to work with

Super Contributor
Posts: 333

Re: remove duplicates but keep all fields (using proc sql)

This seems to work with your example assuming that the date is numeric:

proc sql;

  create table want as

  select acct_no, name, min(date) as date

  from have

  group by acct_no, name

  ;

quit;

EJ

Occasional Contributor
Posts: 5

Re: remove duplicates but keep all fields (using proc sql)

I may have misunderstood, but I think he wants the resulting set to have all the fields...

Super Contributor
Posts: 396

Re: remove duplicates but keep all fields (using proc sql)

that is correct, and since theer are about 30 fields, I don't want to type each one.. a dup may occur though in another field, my example only shows date, but it could happen for any other of the 30 fields... I may have to comprimise.

Super Contributor
Posts: 333

Re: remove duplicates but keep all fields (using proc sql)

Even taking the min of just the date may not get rid of all the duplicates if the min is in the table multiple time for acctno and name and other fields changing... if the values dont really matter then proc sort with out= option is the best choice as mentioned in his post.

EJ

Occasional Contributor
Posts: 5

Re: remove duplicates but keep all fields (using proc sql)

Just use -

proc sort

     data = mydataset nodupkey;

     by acct_no name;

run;

Or, if you have to use SQL -

proc sql noprint;

     create table key_keep as

     select distinct acct_no, name, date

     from mydataset

     group by acct_no, name

     having date = min(date);

     create table mydataset2 as

     select a.*

     from mydataset a

     INNER JOIN

     key_keep b

     ON

     a.acct_no = b.acct_no and

     a.name = b.name and

     a.date = b.date;

quit;

I think that should do the trick.

Super Contributor
Posts: 333

Re: remove duplicates but keep all fields (using proc sql)

That doesnt actually get what they ask for a couple of reasons:

- doesnt use SQL

- it really is only a no dup on acct_no since the dates are actually unique

Occasional Contributor
Posts: 5

Re: remove duplicates but keep all fields (using proc sql)

Right, good catch.

Respected Advisor
Posts: 4,663

Re: remove duplicates but keep all fields (using proc sql)

With SQL you will have somehow to tell the procedure which value of your fields to keep. In your example, it seems that you want to keep the minimum Date. If your choice of value to keep depends on the order of the data in the table (e.g. the first Date), then SQL is not a good tool to perform the operation and proc SORT or a datastep would do a better job.

PG

PG
New Contributor
Posts: 4

Re: remove duplicates but keep all fields (using proc sql)

You may try the following:

[insert your passthru code here]

proc sql;

create table newtable as

select a.acct_no, a.name, a.date

from dataset a

where a.date = (select min(a1.date)

    from dataset a1

    where a1.name = a.name

    and a1.acct_no = a.acct_no)

quit;

You'll  want to list all of your fields of course or use the a.* method. There are other methods which are available to Oracle DB but this should work on most platforms.

Good luck!

New Contributor
Posts: 4

Re: remove duplicates but keep all fields (using proc sql)

Yes, if you could have differences in all the columns; you'd have to replicate the subquery for each column. This may be something which knocks the snot out of your DB server and may timeout.

Ask a Question
Discussion stats
  • 27 replies
  • 636 views
  • 6 likes
  • 7 in conversation