topic Re: Select distinct on different columns in SAS Programming

Select distinct on different columns

Marco_park — Mon, 18 Jun 2018 15:25:09 GMT

Hi,

I am trying to create a query in sas from postgresql which eliminates duplicate rows based on the value on a subset of columns. Let's assume that my database is as follows:

name var1 var2 var3
   a    1   2   10   
   a    1   2   26   
   b    3   56  47   
   c    4   78  50

For my purposes I would like to drop the second row (or the first one, it doesn't really matter) in order to have the table as:

name var1 var2 var3
   a    1   2   10   
   b    3   56  47   
   c    4   78  50

I have tried withe the following:

SELECT DISTINCT on (name, var1, var2)
name, var1, var2, var3
FROM table;

but without any success, can anyone help?

In Postgresq this would perform the distinct operation considering only the variables name, var1, var2 but keeping name,var1,var2,var3

Thanks

Re: Select distinct on different columns

novinosrin — Mon, 18 Jun 2018 15:34:00 GMT

I do not know what is Postgresq

In SAS, you can achieve with a proc sort or data step

proc sort data=your_dataset nodupkey;

by name;

run;

Re: Select distinct on different columns

Marco_park — Mon, 18 Jun 2018 15:34:12 GMT

Can I do that into a proc sql? I need it to be done in proc sql?

Thank you

Re: Select distinct on different columns

novinosrin — Mon, 18 Jun 2018 15:42:08 GMT

Something like this?

data have;
input name $ var1 var2 var3;
cards;
   a    1   2   10   
   a    1   2   26   
   b    3   56  47   
   c    4   78  50
   ;


proc sql;
create table want as
select distinct name, var1, var2, var3
from have
group by name
having var3=min(var3);
quit;

Re: Select distinct on different columns

tomcmacdonald — Mon, 18 Jun 2018 15:53:04 GMT

I'm used to that functionality in PostgreSQL as well and had to resort to using PROC SORT. However it does look like it works with PROC FEDSQL:

PROC FEDSQL;
	DROP TABLE example FORCE;
	CREATE TABLE example (
		id INTEGER,
		col1 INTEGER
	);
	INSERT INTO example(id, col1) VALUES (1, 2);
	INSERT INTO example(id, col1) VALUES (1, 3);
	INSERT INTO example(id, col1) VALUES (1, 4);
	SELECT DISTINCT ON (id) *
	FROM example;
QUIT;

Result:

1 4

Re: Select distinct on different columns

Marco_park — Tue, 19 Jun 2018 12:51:41 GMT

Basically I have to translate this code

proc sort data=my_db nodupkey;
	by var1 var2 var3 var4;
run;

in a proc sql, however the database has more then only 4 variables.

How can I do it?

Re: Select distinct on different columns

MadhuKorni — Tue, 19 Jun 2018 14:01:49 GMT

This might work only if there are distinct values in Var3 for each group in name variable because if the values of Var3 are 10 and 10 against the data values "a" in name variable then you will get both the records.

Re: Select distinct on different columns

Marco_park — Tue, 19 Jun 2018 14:02:22 GMT

Yes the entries in var1, var2 and var3 are all different. This is not the case in the var 4-10 but I don't care which ones will be dropped.

Is it possible to do it in a proc sql?

Thanks

Re: Select distinct on different columns

novinosrin — Tue, 19 Jun 2018 14:09:12 GMT

@MadhuKorni No, you obviously haven't tested my code. Your thought is handled by my having clause. Please test

Re: Select distinct on different columns

MadhuKorni — Tue, 19 Jun 2018 14:31:16 GMT

If the first three variables of a by group are same then your code works.

What If the value in var2 of the 2nd row is changed to 1. Try executing your code with the below dataset.

data have;
input name $ var1 var2 var3;
cards;
a 1 2 10
a 1 1 10
b 3 56 47
c 4 78 50
;

Re: Select distinct on different columns

novinosrin — Tue, 19 Jun 2018 14:59:46 GMT

@MadhuKorni Fair point, that leads me to now think statistically,

Please test with some intuitive combinations and let me know

data have;
input name $ var1 var2 var3;
cards;
a 1 2 10 
a 1 1 10 
b 3 56 47 
c 4 78 50
;

proc sql;
create table want(drop=std) as
select  distinct name, var1, var2, var3,std(var1,var2,var3) as std
from have
group by name
having std=min(std)  ;
quit;

OK, The above may still cause a problem for values like

a 1 2 1
a 2 1 1

This can be fixed by weigting on of the variables multiplied with a constant. For example,

proc sql;
create table want(drop=std) as
select  distinct name, var1, var2, var3,std(var1*2,var2,var3) as std
from have
group by name
having min(std)=std;
quit;