Solved: Re: PROC SQL or SAS Distinct Clause

MikeCarter · Posted 06-29-2015 03:16 PM

Hello,

Is there a way in SAS or Proc SQL, where we can not only select Distinct Rows, but also, the individual column values are not equal to each other? Below, an example-

City 1 City 2

A1 B1

A1 B2

A2 B1

A2 B2

Select "Distinct City 1, City 2"- would give me all the 4 rows. But not only the rows to be distinct, but also the values individually should be different from their column values. There may be multiple such combinations. Here, acceptable combinations would be rows 1&3 OR rows 2&4. So two such sets possible here. Is it possible to extract out such sets? Where, not only entire rows are distinct, but individual values are also distinct? SAS or SQL, whichever way.

Ksharp · Posted 06-30-2015 10:07 AM

Interesting Question.

Code: Program

data pairs;
input City1 $ City2 $;
datalines;
A1 B1  
A1 B2  
A2 B1
A2 B2
;
run;
data want;
 if _n_ eq 1 then do;
  declare hash h1();
  h1.definekey('city1');
  h1.definedone();
  declare hash h2();
  h2.definekey('city2');
  h2.definedone();
end;
set pairs;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.replace();h2.replace();output;
end;
run;

Xia Keshan

View solution in original post

Haikuo · Posted 06-29-2015 03:21 PM

What will be your wanted outcome and by what rules?

MikeCarter · Posted 06-29-2015 03:35 PM

Sure, as I said- wanted outcome would be subset of rows, where individual column values are not equal to each other.

City 1 City 2

A1 B1

A2 B2

Or

City 1 City 2

A1 B2

A2 B1

Both these outputs are acceptable, because values in individual columns (City 1, City2) are not repeated. To extract both of them would be great.

Proc SQL "Select Distinct City1, City2"- doesn't necessarily guarantee this, because although it may get us to distinct rows, the individual values will be different. But that's the subset I need.

ballardw · Posted 06-29-2015 06:38 PM

Are any of the values for City1 also in City2? How to handle if so.

Tom · Posted 06-29-2015 07:21 PM

Not sure exactly what you want. Sounds like you want to find all possible values of CITY1 and all possible values of CITY2. But I don't think that in general it would be possible to generate such a set only by selecting rows from the original file. Why not just generate two separate tables?

select distinct city1 from have;

select distinct city2 from have;

Perhaps you could generate an iterative process to delete the rows where both CITY1 and CITY2 are already included in the previous rows selected. That would probably work best by using hash tables in a data step. Not sure how you could find the minimum such set of rows.

PGStats · Posted 06-29-2015 11:04 PM

Looks like an assignment (matching) problem to me. Not trivial. If you have SAS/OR licenced, you could use something like :

data pairs;

input City1 $ City2 $;

datalines;

A1 B1

A1 B2

A2 B1

A2 B2

;

proc optnet

graph_direction = directed

data_links = pairs;

data_links_var

from = City1

to = City2;

linear_assignment

out = uniquePairs(drop=cost);

run;

proc print data=uniquePairs; run;

PG

MikeCarter · Posted 06-30-2015 09:05 PM

Hello PGStats- this is great. Yes, I think it worked. I have never used OPTNET, great application. I think, I'll do a bit more research on this procedure to use it more effectively.

A quick question- is there a way to apply Linear Assignment, if there were 3 or 4 columns? It's like solving, min wgt matching problem on tripartite or qudra-partite graphs? Not sure, if this is even practical. But evetually, I need to extend this technique to 4 columns-- where individual values in those 4 columns should be different. Does that make sense? Basically- the same output needed, with same condition, when applied over 3 or 4 columns.

-Mike

PGStats · Posted 06-30-2015 10:20 PM

Linear assignment is only defined between two sets. The clever hash method proposed by can be extended to more sets, but even for two sets it will not always give optimal results. Try for example :

data pairs;

input City1 $ City2 $;

datalines;

A1 B1

A1 B2

A1 B3

A2 B1

;

proc optnet

graph_direction = directed

data_links = pairs;

data_links_var

from = City1

to = City2;

linear_assignment

out = uniquePairs(drop=cost);

run;

title "Linear assignment pairs";

proc print data=uniquePairs; run;

data HashPairs;

if _n_ eq 1 then do;

declare hash h1();

h1.definekey('city1');

h1.definedone();

declare hash h2();

h2.definekey('city2');

h2.definedone();

end;

set pairs;

if h1.check() ne 0 and h2.check() ne 0 then do;

h1.replace();h2.replace();output;

end;

run;

title "Hash pairs";

proc print data=HashPairs; run;

PG

MikeCarter · Posted 07-01-2015 09:34 AM

@ PGStats, Okay, I see what you mean now. Got it. This makes sense. Important to note, Hash method doesn't always guarantee optimal (max) number of rows, although it can be feasible.

Thanks to you & xia keshan again for your both your ideas, will think on these directions further.

-Mike

Ksharp · Posted 07-02-2015 08:00 AM

PG,

Hash Table also can get such max(optimal) sub-set .

Code: Program

data have;
input a $ b $;
cards;
A1 B1
A1 B2
A1 B3
A2 B1
;
run;
data _null_;
 if 0 then set have;
 declare hash h1();
 h1.definekey('a');
 h1.definedone();
 
 declare hash h2();
 h2.definekey('b');
 h2.definedone();
max=0;
do i=1 to nobs;
 set have nobs=nobs point=i;
 n=0;
 if h1.check() ne 0 and h2.check() ne 0 then do;
  h1.add();h2.add();n+1;
 end;
 do j=1 to nobs;
   set have point=j;
   if h1.check() ne 0 and h2.check() ne 0 then do;
   h1.add();h2.add();n+1;
   end;
 end;
 if n gt max then do;max=n;call symputx('point',i);end;
 h1.clear();h2.clear();
end;
stop;
run;



data want;
 if 0 then set have;
 declare hash h1();
 h1.definekey('a');
 h1.definedone();
 
 declare hash h2();
 h2.definekey('b');
 h2.definedone();

point=&point;
set have point=point;
 if h1.check() ne 0 and h2.check() ne 0 then do;
  h1.add();h2.add();output;
 end;
 do j=1 to nobs;
   set have nobs=nobs point=j;
   if h1.check() ne 0 and h2.check() ne 0 then do;
   h1.add();h2.add();output;
   end;
 end;
stop;
run;

Xia Keshan

MikeCarter · Posted 07-02-2015 10:46 AM

Thanks Xia. A little complicated to understand in first-go, but will study this further. Ability to extract Max subset, for upto 3 or 4 columns, is what's needed. Thanks again for your time,

-Mike

Ksharp · Posted 07-02-2015 10:58 AM

Yeah. That is what you want - optimal(max) number of rows . Code could be shorter.

Code: Program

data have;
input a $ b $;
cards;
A1 B1
A1 B2
A1 B3
A2 B1
;
run;
data _null_;
 if 0 then set have;
 declare hash h1();
 h1.definekey('a');
 h1.definedone();
 
 declare hash h2();
 h2.definekey('b');
 h2.definedone();
max=0;
do i=1 to nobs;
 set have nobs=nobs point=i;
 n=0;
 
  h1.add();h2.add();n+1;
 
 do j=1 to nobs;
   set have point=j;
   if h1.check() ne 0 and h2.check() ne 0 then do;
   h1.add();h2.add();n+1;
   end;
 end;
 if n gt max then do;max=n;call symputx('point',i);end;
 h1.clear();h2.clear();
end;
stop;
run;



data want;
 if 0 then set have;
 declare hash h1();
 h1.definekey('a');
 h1.definedone();
 
 declare hash h2();
 h2.definekey('b');
 h2.definedone();

point=&point;
set have point=point;
 
  h1.add();h2.add();output;
 
 do j=1 to nobs;
   set have nobs=nobs point=j;
   if h1.check() ne 0 and h2.check() ne 0 then do;
   h1.add();h2.add();output;
   end;
 end;
stop;
run;

MikeCarter · Posted 07-02-2015 11:28 AM

Sure, thanks again. I shall extend this to 3 or 4 columns further. This is really helpful.

-Mike

Ksharp · Posted 06-30-2015 10:07 AM

Interesting Question.

Code: Program

data pairs;
input City1 $ City2 $;
datalines;
A1 B1  
A1 B2  
A2 B1
A2 B2
;
run;
data want;
 if _n_ eq 1 then do;
  declare hash h1();
  h1.definekey('city1');
  h1.definedone();
  declare hash h2();
  h2.definekey('city2');
  h2.definedone();
end;
set pairs;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.replace();h2.replace();output;
end;
run;

Xia Keshan

MikeCarter · Posted 06-30-2015 09:07 PM

Hi Xia- this worked. Thank you for neat code, never used hash in SAS, will do a bit more reading on this technique. But thank you so much for your help. This worked for me.

Apprecite it,

-Mike

Code: Program

Code: Program

Code: Program

Code: Program

Registration is open

SAS Training: Just a Click Away