Copy & paste the following:
data mydataset;
input acct_no name $ date date9.;
datalines;
0001 Mike 01JUN2013
0002 Mike 01JUN2013
0003 Fred 01APR2013
0003 Fred 01MAY2013
0003 Fred 01JUN2013
0004 Will 01SEP2013
run;
proc sql noprint;
create table key_keep as
select distinct acct_no, name, date
from mydataset
group by acct_no, name
having date = min(date);
create table mydataset2 as
select a.*
from mydataset a
INNER JOIN
key_keep b
ON
a.acct_no = b.acct_no and
a.name = b.name and
a.date = b.date;
quit;
The output (mydataset2) should match your specification.
Well, since nobody has had mentioned, and it seems to work (Proc SQL only, and regardless of the quantity of variables), however, it will have to involve an undocumented feature, so not recommend for production usage.
proc sql;
create table no_dup as
select * from have
group by acct_no, name
having monotonic()=min(monotonic());
quit;
Haikuo
So I tried the trick out and must be doing something wrong.
I was expecting the same results from a proc sort nodupkey process but didnt get that. What am I missing? The made up data is attached
LOG:
15 proc sql;
16 create table no_dup as
17 select *
18 from duptest
19 group by name, date
20 having monotonic()=min(monotonic());
NOTE: The query requires remerging summary statistics back with the
original data.
NOTE: Table WORK.NO_DUP created, with 1 rows and 5 columns.
21 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
22 proc sort data=duptest out=out nodupkeys; by name date; run;
NOTE: There were 12 observations read from the data set WORK.DUPTEST.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.OUT has 11 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
It is probably better to use the MONOTONIC() function in a subquery and assign it to a variable rather than try to call the function in two different places in the same SQL code.
create table no_dup (drop=_n_) as
select *
from (select *,monotonic() as _n_ from duptest)
group by name,date
having _n_ = min(_n_)
order by name,date
;
This is better, though the subquery part doesn't appear to be necessary. I think with your example the result of the monotonic() function being executed prior to the having clause is that the group by statement is able to kick in and take effect with the summary function. When it's part of the having clause it appears to evaluate across all observations without any grouping.
Thanks that does seem to work for my example. My goal was to understand this method see it was new to me. Something to file away for future use!
23 proc sql;
24 create table no_dup (drop=_n_) as
25 select *
26 from (select *,monotonic() as _n_ from duptest)
27 group by name,date
28 having _n_ = min(_n_)
29 order by name,date
30 ;
NOTE: The query requires remerging summary statistics back with the
original data.
NOTE: Table WORK.NO_DUP created, with 11 rows and 5 columns.
31 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.02 seconds
cpu time 0.04 seconds
32
33 proc sort data=duptest out=out nodupkeys; by name date; run;
NOTE: There were 12 observations read from the data set WORK.DUPTEST.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.OUT has 11 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Strongly agree with Tom, and Thanks to Eric for pointing this out. I only did one test run on the data presented, and it seems to work, I guess I just got lucky:
data mydataset;
input acct_no name $ date date9.;
datalines;
0001 Mike 01JUN2013
0002 Mike 01JUN2013
0003 Fred 01APR2013
0003 Fred 01MAY2013
0003 Fred 01JUN2013
0004 Will 01SEP2013
run;
proc sql;
create table no_dup as
select *
from mydataset
group by acct_no, name
having monotonic()=min(monotonic());
quit;
Haikuo
These are great ideas; but if he's using a passthru query, will any these SAS functions work?
Not good for pass-through. Where did OP mention the target is RDBMS? The thread is getting long, I must have missed it.
3.Re: remove duplicates but keep all fields (using proc sql)
podarum Sep 5, 2013 11:09 AM (in response to esjackso1)
Currently Being Moderated
no it doesn't matter, as long as I get only one by Acct_No
and jpstats, the data is from a server, and it's large that I have to use proc sql (rsubmit)
Its been awhile since I used rsubmit but isnt that only for remote submits to SAS on a server? Which wouldnt be the same as a pass thru query to a different SQL platform.
Eric
Yes, Eric. I use "rsubmit" on a daily basis, mostly to a server, once for while to some PCs which have SAS running (at least some sort of spawner is active, connect spawner, job spawner or object spawner, not sure). I believe "rsubmit" is for SAS/CONNECT.
Haikuo
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.