SAS Programming

How to remove duplicates using Proc Sql

SASuserlot · Posted 02-17-2022 02:55 PM

Hi, I am new to PROC SQL. I would like to know how to remove duplicates from a dataset and get exactly the same result that I get from PROC SORT. In my example, I want the output to look exactly like the CLASS1 dataset, but produced with PROC SQL. I am trying with SQL but having difficulty: I am able to get age and sex, but not the remaining variables. Can you please let me know what I am doing wrong? Thanks in advance.

data class;
set sashelp.class;
run;

proc sort data= class out= class1 nodupkey; by sex age ; run;

proc sql;
create table clsql as select distinct age,sex
from class  group by age,sex order by age,sex;
quit;

ballardw · Posted 02-17-2022 03:23 PM

Since PROC SORT will create different datasets given a different order of the input data, I think you need to consider, and describe in much more excruciating detail, what your real use case may be.

With NODUPKEY, PROC SORT keeps the first observation it encounters for each BY group, so the initial order of the dataset will affect the result. Here is an example.

data class;
set sashelp.class;
run;

proc sort data= class out= class1 nodupkey; 
   by sex age ; 
run;

proc sort data=class;
   by height;
run;

proc sort data= class out= class2 nodupkey; 
   by sex age ; 
run;

If Class1 and Class2 are the same when you run this code I would be very surprised.

SQL has a very similar data-order issue, as it is not designed to process data in any given sequence. With large datasets, running the same code on unchanged data can sometimes return different rows when the result depends on order.
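
To see exactly where the two PROC SORT outputs differ, one option is PROC COMPARE. This is only a minimal sketch, and it assumes CLASS1 and CLASS2 from the code above are still in WORK. Both are already sorted by sex and age with one row per combination, so those two variables can serve as the matching key.

```sas
/* Assumes CLASS1 and CLASS2 were created by the PROC SORT steps above. */
proc compare base=class1 compare=class2;
   id sex age;   /* match rows on the dedup key instead of row position */
run;
```

The report lists the variables (for example Name, Height, Weight) whose values differ for the same sex and age, which shows that each run kept a different row from the group.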

SASuserlot · Posted 02-17-2022 03:37 PM

Hi @ballardw. Thank you for the response. I am not a pro with SQL, so I am trying to learn how to get the same result in SQL that PROC SORT gives when removing duplicates.

To answer your question: when I ran your code, Class1 and Class2 were not the same.

Could you provide an example where duplicates are removed and PROC SORT and PROC SQL give the same result? That would be greatly appreciated. I think the class dataset may not be a good one for this. Thanks again.

SASKiwi · Posted 02-17-2022 04:07 PM

You need to define a deduplication rule that precisely selects the rows you want so that it will work the same in PROC SORT and PROC SQL.

For example, with the CLASS dataset you could say: create a table that contains the tallest student for each age and sex value combination. As long as height is unique within each age and sex combination (which I think it is), you now have a precise definition that you can code in both PROC SORT and PROC SQL, and you will get the same result.
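
On the PROC SORT side, that rule could be sketched roughly as follows: sort so the tallest student comes first within each sex and age group, then keep only the first row of each group. The dataset names CLASS_SORTED and WANT_SORT are made up for illustration.

```sas
/* Put the tallest student first within each sex/age group. */
proc sort data=sashelp.class out=class_sorted;
   by sex age descending height;
run;

/* Keep only the first (tallest) row of each sex/age group. */
data want_sort;
   set class_sorted;
   by sex age;
   if first.age;
run;
```

If two students in the same group tied for tallest, this step would still keep just one of them, so the rule would need a further tie-breaker (for example adding name to the BY statement of the sort) to stay fully deterministic.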

SASuserlot · Posted 02-17-2022 05:02 PM

Got it, thank you. Can you provide an example of how to remove duplicates using PROC SQL, maybe at the two-variable level? I do have an idea of how to use DISTINCT for removing duplicates on a single variable. Thanks

Kurt_Bremser · Posted 02-17-2022 05:34 PM

If you want the same result in SORT and SQL, you need to design a rule for which duplicate to select, and then you can force both to implement that rule. Without forcing a specific rule, the results will be indeterminate (even in SORT, depending on storage engines).

So we first need that rule.
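
To illustrate why DISTINCT alone cannot stand in for that rule, here is a small sketch (the table names KEYS_ONLY and WHOLE_ROWS are hypothetical). DISTINCT applies to every column in the SELECT list, so it either returns just the key columns or compares entire rows.

```sas
proc sql;
   /* Unique sex/age combinations, but no other columns survive. */
   create table keys_only as
   select distinct sex, age
   from sashelp.class
   order by sex, age;

   /* DISTINCT over all columns compares whole rows. Every student */
   /* in SASHELP.CLASS has a different name, so nothing is removed. */
   create table whole_rows as
   select distinct *
   from sashelp.class;
quit;
```

Neither query says which student to keep for a given sex and age, and that choice is exactly what the rule has to supply.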

SASKiwi · Posted 02-17-2022 06:22 PM

Here is an SQL approach for the rule I previously described:

proc sql;
  create table want as
  select A.*
  from sashelp.class as A
  inner join
  (select  age
          ,sex 
          ,max(height) as Max_Height
   from sashelp.class
   group by age
           ,sex
  ) as B 
  on A.age = B.age
  and A.sex = B.sex
  and A.height = B.Max_Height
  order by sex, age
  ;
quit;
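
A shorter variant of the same rule is possible if you let PROC SQL remerge the group maximum back onto the detail rows. Treat this as a sketch rather than a recommendation: the table name WANT_HAVING is made up, it relies on SAS-specific remerging behaviour (SAS writes a note about remerging summary statistics to the log), and it will not run if the NOREMERGE option is in effect.

```sas
proc sql;
   create table want_having as
   select *
   from sashelp.class
   group by sex, age
   having height = max(height)   /* keep the tallest row per sex/age group */
   order by sex, age;
quit;
```

Like the join version, this returns more than one row for a group if two students tie for the maximum height.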

SASuserlot · Posted 02-18-2022 12:22 PM

Thanks
