DATA Step, Macro, Functions and more

Select distinct on different columns

Reply
Occasional Contributor
Posts: 11

Select distinct on different columns

[ Edited ]
 
Hi,

I am trying to create a query in sas from postgresql which eliminates duplicate rows based on the value on a subset of columns. Let's assume that my database is as follows:

name var1 var2 var3
   a    1   2   10   
a
1 2 26
b
3 56 47
c
4 78 50

For my purposes I would like to drop the second row (or the first one, it doesn't really matter) in order to have the table as:

name var1 var2 var3
   a    1   2   10   
b
3 56 47
c
4 78 50

I have tried withe the following:

SELECT DISTINCT on (name, var1, var2)
name, var1, var2, var3 FROM table;

but without any success, can anyone help?

In Postgresq this would perform the distinct operation considering only the variables name, var1, var2 but keeping name,var1,var2,var3

 

Thanks

PROC Star
Posts: 1,833

Re: Select distinct on different columns

[ Edited ]
Posted in reply to Marco_park

I do not know what is Postgresq

 

In SAS, you can achieve with a proc sort or data step

 

proc sort data=your_dataset nodupkey;

by name;

run;

 

Occasional Contributor
Posts: 11

Re: Select distinct on different columns

Posted in reply to novinosrin

Can I do that into a proc sql? I need it to be done in proc sql?

 

Thank you

PROC Star
Posts: 1,833

Re: Select distinct on different columns

Posted in reply to Marco_park

Something like this?

 

data have;
input name $ var1 var2 var3;
cards;
   a    1   2   10   
   a    1   2   26   
   b    3   56  47   
   c    4   78  50
   ;


proc sql;
create table want as
select distinct name, var1, var2, var3
from have
group by name
having var3=min(var3);
quit;
Frequent Contributor
Posts: 101

Re: Select distinct on different columns

[ Edited ]
Posted in reply to novinosrin

I'm used to that functionality in PostgreSQL as well and had to resort to using PROC SORT.  However it does look like it works with PROC FEDSQL:

 

PROC FEDSQL;
	DROP TABLE example FORCE;
	CREATE TABLE example (
		id INTEGER,
		col1 INTEGER
	);
	INSERT INTO example(id, col1) VALUES (1, 2);
	INSERT INTO example(id, col1) VALUES (1, 3);
	INSERT INTO example(id, col1) VALUES (1, 4);
	SELECT DISTINCT ON (id) *
	FROM example;
QUIT;

 

Result:

 

1  4

Occasional Contributor
Posts: 11

Re: Select distinct on different columns

Posted in reply to tomcmacdonald

Basically I have to translate this code

 

proc sort data=my_db nodupkey;
	by var1 var2 var3 var4;
run;

in a proc sql, however the database has more then only 4 variables.

How can I do it?

Frequent Contributor
Posts: 77

Re: Select distinct on different columns

[ Edited ]
Posted in reply to novinosrin

This might work only if there are distinct values in Var3 for each group in name variable because if the values of Var3 are 10 and 10 against the data values "a" in name variable then you will get both the records.

Occasional Contributor
Posts: 11

Re: Select distinct on different columns

Posted in reply to MadhuKorni

Yes the entries in var1, var2 and var3 are all different. This is not the case in the var 4-10 but I don't care which ones will be dropped.

Is it possible to do it in a proc sql?

 

Thanks

PROC Star
Posts: 1,833

Re: Select distinct on different columns

Posted in reply to MadhuKorni

@MadhuKorni No, you obviously haven't tested my code. Your thought is handled by my having clause. Please test 

Frequent Contributor
Posts: 77

Re: Select distinct on different columns

Posted in reply to novinosrin

If the first three variables of a by group are same then your code works.

What If the value in var2 of the 2nd row is changed to 1. Try executing your code with the below dataset.

data have;
input name $ var1 var2 var3;
cards;
a 1 2 10
a 1 1 10
b 3 56 47
c 4 78 50
;

PROC Star
Posts: 1,833

Re: Select distinct on different columns

[ Edited ]
Posted in reply to MadhuKorni

@MadhuKorni  Fair point, that leads me to now think statistically,

 

Please test with some intuitive combinations and let me know

 

data have;
input name $ var1 var2 var3;
cards;
a 1 2 10 
a 1 1 10 
b 3 56 47 
c 4 78 50
;

proc sql;
create table want(drop=std) as
select  distinct name, var1, var2, var3,std(var1,var2,var3) as std
from have
group by name
having std=min(std)  ;
quit;

 

OK, The above may still cause a problem for values like

a 1 2 1
a 2 1 1

 

This can be fixed by weigting on of the variables multiplied with a constant. For example,

 

proc sql;
create table want(drop=std) as
select  distinct name, var1, var2, var3,std(var1*2,var2,var3) as std
from have
group by name
having min(std)=std;
quit;

 

 

Ask a Question
Discussion stats
  • 10 replies
  • 130 views
  • 0 likes
  • 4 in conversation