Re: remove duplicates but keep all fields (using proc sql) - Page 2

jpstats · Posted 09-05-2013 11:59 AM

Copy & paste the following:

data mydataset;

input acct_no name $ date date9.;

datalines;

0001 Mike 01JUN2013

0002 Mike 01JUN2013

0003 Fred 01APR2013

0003 Fred 01MAY2013

0003 Fred 01JUN2013

0004 Will 01SEP2013

run;

proc sql noprint;

create table key_keep as

select distinct acct_no, name, date

from mydataset

group by acct_no, name

having date = min(date);

create table mydataset2 as

select a.*

from mydataset a

INNER JOIN

key_keep b

ON

a.acct_no = b.acct_no and

a.name = b.name and

a.date = b.date;

quit;

The output (mydataset2) should match your specification.

Haikuo · Posted 09-05-2013 09:13 PM

Well, since nobody has had mentioned, and it seems to work (Proc SQL only, and regardless of the quantity of variables), however, it will have to involve an undocumented feature, so not recommend for production usage.

proc sql;

create table no_dup as

select * from have

group by acct_no, name

having monotonic()=min(monotonic());

quit;

Haikuo

esjackso · Posted 09-06-2013 08:16 AM

Very interesting ... I will have to file this trick away!

EJ

esjackso · Posted 09-06-2013 08:45 AM

So I tried the trick out and must be doing something wrong.

I was expecting the same results from a proc sort nodupkey process but didnt get that. What am I missing? The made up data is attached

LOG:

15 proc sql;

16 create table no_dup as

17 select *

18 from duptest

19 group by name, date

20 having monotonic()=min(monotonic());

NOTE: The query requires remerging summary statistics back with the

original data.

NOTE: Table WORK.NO_DUP created, with 1 rows and 5 columns.

21 quit;

NOTE: PROCEDURE SQL used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

22 proc sort data=duptest out=out nodupkeys; by name date; run;

NOTE: There were 12 observations read from the data set WORK.DUPTEST.

NOTE: 1 observations with duplicate key values were deleted.

NOTE: The data set WORK.OUT has 11 observations and 5 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

Tom · Posted 09-06-2013 09:20 AM

It is probably better to use the MONOTONIC() function in a subquery and assign it to a variable rather than try to call the function in two different places in the same SQL code.

create table no_dup (drop=_n_) as

select *

from (select *,monotonic() as _n_ from duptest)

group by name,date

having _n_ = min(_n_)

order by name,date

;

jpstats · Posted 09-06-2013 09:44 AM

This is better, though the subquery part doesn't appear to be necessary. I think with your example the result of the monotonic() function being executed prior to the having clause is that the group by statement is able to kick in and take effect with the summary function. When it's part of the having clause it appears to evaluate across all observations without any grouping.

esjackso · Posted 09-06-2013 10:22 AM

Thanks that does seem to work for my example. My goal was to understand this method see it was new to me. Something to file away for future use!

23 proc sql;

24 create table no_dup (drop=_n_) as

25 select *

26 from (select *,monotonic() as _n_ from duptest)

27 group by name,date

28 having _n_ = min(_n_)

29 order by name,date

30 ;

NOTE: The query requires remerging summary statistics back with the

original data.

NOTE: Table WORK.NO_DUP created, with 11 rows and 5 columns.

31 quit;

NOTE: PROCEDURE SQL used (Total process time):

real time 0.02 seconds

cpu time 0.04 seconds

32

33 proc sort data=duptest out=out nodupkeys; by name date; run;

NOTE: There were 12 observations read from the data set WORK.DUPTEST.

NOTE: 1 observations with duplicate key values were deleted.

NOTE: The data set WORK.OUT has 11 observations and 5 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

Haikuo · Posted 09-06-2013 09:26 AM

Strongly agree with Tom, and Thanks to Eric for pointing this out. I only did one test run on the data presented, and it seems to work, I guess I just got lucky:

data mydataset;

input acct_no name $ date date9.;

datalines;

0001 Mike 01JUN2013

0002 Mike 01JUN2013

0003 Fred 01APR2013

0003 Fred 01MAY2013

0003 Fred 01JUN2013

0004 Will 01SEP2013

run;

proc sql;

create table no_dup as

select *

from mydataset

group by acct_no, name

having monotonic()=min(monotonic());

quit;

Haikuo

cooker · Posted 09-06-2013 10:00 AM

These are great ideas; but if he's using a passthru query, will any these SAS functions work?

Haikuo · Posted 09-06-2013 10:14 AM

Not good for pass-through. Where did OP mention the target is RDBMS? The thread is getting long, I must have missed it.

cooker · Posted 09-06-2013 10:24 AM

3.Re: remove duplicates but keep all fields (using proc sql)

podarum Sep 5, 2013 11:09 AM (in response to esjackso1)

Currently Being Moderated

no it doesn't matter, as long as I get only one by Acct_No

and jpstats, the data is from a server, and it's large that I have to use proc sql (rsubmit)

esjackso · Posted 09-06-2013 10:28 AM

Its been awhile since I used rsubmit but isnt that only for remote submits to SAS on a server? Which wouldnt be the same as a pass thru query to a different SQL platform.

Eric

Haikuo · Posted 09-06-2013 10:49 AM

Yes, Eric. I use "rsubmit" on a daily basis, mostly to a server, once for while to some PCs which have SAS running (at least some sort of spawner is active, connect spawner, job spawner or object spawner, not sure). I believe "rsubmit" is for SAS/CONNECT.

Haikuo

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!