- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
Please help me to find out 5-6 ways of removing duplicates from a dataset.
Regards
Andy
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Andy,
I may not be much familiar with all ways but trying my best.....
(1) proc sort data=DSN noduprecs;
by _all_ ; Run;
(2) Proc sort data=DSN out=sample nodupkey dupout=Duplicate;
by var1; Run;
(3) Proc Sql noprint;
create table unique as select distinct (*) from DSN;
quit;
(4) proc freq data=DSN noprint;
tables variable/out=unique (Keep=variable count where=(count=1));
Run;
(5) Proc sort data=DSN out=temp;
by variable ; Run;
Data Unique;
set temp;
by variable;
if not first.variable and last.variable;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi..
Normally we use bellow method's to remove duplicate observations from the dataset.
1.Nodup(Proc Sort)
2.nodupkey(Proc Sort)
3.Disticnt(Proc Sql )
4.First. and last . variables(In Datastep)
Regards.
Sanjeev.K
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
5. proc freq ;
6.proc means;
7.proc summary;
8.HashTable
9. proc tabulate
10.proc report
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What is that mean ?
select distinct id,name ?
or
select distinct cats(id,name) ?
proc sql;
select distinct sex,age from sashelp.class;
quit;
Message was edited by: xia keshan
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi..
I will try to elaborate my question with bellow example .
If Dataset looks like this.
ID Fee
100 2000
100 3000
101 4000
101 3200
If we use
Proc Sql;
Select Distinct ID,Fee from XX;
quit;
(OR)
Proc sort data=XX nodup;
by ID;
run;
above code will display all the data.because fee is different with in the same ID.
but in the case of
Proc Sort Data=XX nodupkey;
by ID;
run;
above code will generate only 2 rows (100 and 101).
How can we achieve the same by using Proc SQL??
Regards.
Sanjeev.K
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
select Distinct ID,Fee from XX group by id having fee=max(fee) ;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Ksharp..
Its looks fine only when i have two variables in the dataset.but what in the case of dataset set having 10 or more variables and still i want to remove observation based on one variable in Proc SQL ???
Regards.
Sanjeev.K
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just list it in SELECT clause.
select id,free,xx,xx,.......
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You mean
Select Distinct Id,fee,xx,xx,xx,xx,,,,,?????
Will It remove Duplicates based on ID variable? Not sure..
Regards.
Sanjeev.K
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes. Why not try it in person.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Ksharp..
Let me explain in more detail ...
*****************************METHOD 1************************************
Data Test(keep=Actlevel Fee Height weight) ;
set Sasuser.admit;
run;
Proc Sort Data=test Nodupkey;
by Actlevel;
run;
proc print;
run;
I want to get above output by using Proc SQL. i used bellow method (as mention by you) but not working properly .
*****************************METHOD 2************************************
Proc Sql;
Select Distinct Actlevel,Fee,Height,weight From SASUSER.ADMIT;
quit;
Regards.
Sanjeev.K
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Proc Sql;
Select Distinct Actlevel,Fee,Height,weight From SASUSER.ADMIT
group by Actlevel
having fee=max(fee);
quit;
Ksharp
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Tried with your code.
Still i can able to see duplicates in Actlevel Variable.
Query producing 9 observations instead of 3..
Because Every Actlevel Having more then one MAX fee..:-)
Regards.
Sanjeev.K
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Proc Sql;
Select Distinct Actlevel,Fee,Height,weight From SASUSER.ADMIT
group by Actlevel
having fee=max(fee) and max(height);
quit;
Ksharp