Hi, is there a way to remove duplicates but keep all fields.. Example below (but keep in mind there are 30 fields in total, and I want to capture all and the table is a huge data warehouse table) :
HAVE:
Acct_No Name Date
0001 Mike Jun-13
0002 Mike Jun-13
0003 Fred Apr-13
0003 Fred May-13
0003 Fred Jun-13
0004 Will Sep-13
WANT:
Acct_No Name Date
0001 Mike Jun-13
0002 Mike Jun-13
0003 Fred Apr-13
0004 Will Sep-13
Does the date matter? IE your example chose the first one ... was that by design?
EJ
no it doesn't matter, as long as I get only one by Acct_No
and jpstats, the data is from a server, and it's large that I have to use proc sql (rsubmit)
Are trying to update the table in place or just get a sas dataset to work with?
EJ
a dataset to work with
This seems to work with your example assuming that the date is numeric:
proc sql;
create table want as
select acct_no, name, min(date) as date
from have
group by acct_no, name
;
quit;
EJ
I may have misunderstood, but I think he wants the resulting set to have all the fields...
that is correct, and since theer are about 30 fields, I don't want to type each one.. a dup may occur though in another field, my example only shows date, but it could happen for any other of the 30 fields... I may have to comprimise.
Just use -
proc sort
data = mydataset nodupkey;
by acct_no name;
run;
Or, if you have to use SQL -
proc sql noprint;
create table key_keep as
select distinct acct_no, name, date
from mydataset
group by acct_no, name
having date = min(date);
create table mydataset2 as
select a.*
from mydataset a
INNER JOIN
key_keep b
ON
a.acct_no = b.acct_no and
a.name = b.name and
a.date = b.date;
quit;
I think that should do the trick.
That doesnt actually get what they ask for a couple of reasons:
- doesnt use SQL
- it really is only a no dup on acct_no since the dates are actually unique
Right, good catch.
With SQL you will have somehow to tell the procedure which value of your fields to keep. In your example, it seems that you want to keep the minimum Date. If your choice of value to keep depends on the order of the data in the table (e.g. the first Date), then SQL is not a good tool to perform the operation and proc SORT or a datastep would do a better job.
PG
You may try the following:
[insert your passthru code here]
proc sql;
create table newtable as
select a.acct_no, a.name, a.date
from dataset a
where a.date = (select min(a1.date)
from dataset a1
where a1.name = a.name
and a1.acct_no = a.acct_no)
quit;
You'll want to list all of your fields of course or use the a.* method. There are other methods which are available to Oracle DB but this should work on most platforms.
Good luck!
Yes, if you could have differences in all the columns; you'd have to replicate the subquery for each column. This may be something which knocks the snot out of your DB server and may timeout.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.