- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi, I am new to Proc SQL, I would like to know how we can remove the duplicates from dataset and achieve the exactly same thing that I can achieve through the PROC SORT. In my example I want achieve the output exactly looks like CLASS1 dataset using PROC SQL. I am trying with sql but having difficulty to achieve the same., I am able to get age, and sex but not the remaining variable. Can you please let me know what I am doing wrong. Thanks in advance.
data class;
set sashelp.class;
run;
proc sort data= class out= class1 nodupkey; by sex age ; run;
proc sql;
create table clsql as select distinct age,sex
from class group by age,sex order by age,sex;
quit;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is an SQL approach for the rule I previously described:
proc sql;
create table want as
select A.*
from sashelp.class as A
inner join
(select age
,sex
,max(height) as Max_Height
from sashelp.class
group by age
,sex
) as B
on A.age = B.age
and A.sex = B.sex
and A.height = B.Max_Height
order by sex, age
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Since Proc Sort will create different datasets given a different order of the data I think you need to consider and describe in much more excruciating detail what your real use case may be.
The initial order of the data set in Proc sort with NODUPKEY will affect the result. Here is an example.
data class; set sashelp.class; run; proc sort data= class out= class1 nodupkey; by sex age ; run; proc sort data=class; by height; run; proc sort data= class out= class2 nodupkey; by sex age ; run;
If Class1 and Class2 are the same when you run this code I would be very surprised.
SQL has a very similar data order issue as it is not designed to process data in any given sequence. Sometimes with large data sets processing the same code on unchanged data sets will result in different results if order is considered.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @ballardw . Thank you for response. I am not pro with SQL, So I am learning how I can I achieve the same that we can do by proc sort to know how we can avoid duplicates.
For your question, When I ran your code, I see Class1 and Class2 are not same.
Is it possible for you provide and example where we can remove the duplicates and achieve same using proc sort and proc sql, That will be greatly appreciated. I think class dataset may not be good .Thanks again.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You need to define a deduplication rule that precisely selects the rows you want so that it will work the same in PROC SORT and PROC SQL.
For example with the CLASS dataset you could say: create a table that contains the tallest student for each age and sex value combination. As long as the data has unique values for height (which I think it has), you now have a precise definition that you can code in PROC SORT and PROC SQL and you will get the same result.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Got it thank you. can you provide an example how to remove duplicates using the Proc sql, may at two variable level. I do have idea using 'distinct' for single variable level duplication removal. Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want the same result in SORT and SQL, you need to design a rule for which duplicate to select, and then you can force both to implement that rule. Without forcing a specific rule, the results will be indeterminate (even in SORT, depending on storage engines).
So we first need that rule.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is an SQL approach for the rule I previously described:
proc sql;
create table want as
select A.*
from sashelp.class as A
inner join
(select age
,sex
,max(height) as Max_Height
from sashelp.class
group by age
,sex
) as B
on A.age = B.age
and A.sex = B.sex
and A.height = B.Max_Height
order by sex, age
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks