BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MikeCarter
Calcite | Level 5

Hello,

Is there a way in SAS or Proc SQL, where we can not only select Distinct Rows, but also, the individual column values are not equal to each other? Below, an example-


City 1           City 2

A1               B1            

A1               B2           

A2               B1

A2               B2

Select "Distinct City 1, City 2"- would give me all the 4 rows. But not only the rows to be distinct, but also the values individually should be different from their column values. There may be multiple such combinations. Here, acceptable combinations would be rows 1&3 OR rows 2&4. So two such sets possible here. Is it possible to extract out such sets? Where, not only entire rows are distinct, but individual values are also distinct? SAS or SQL, whichever way.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Interesting Question.

Code: Program

data pairs;
input City1 $ City2 $;
datalines;
A1 B1 
A1 B2 
A2 B1
A2 B2
;
run;
data want;
if _n_ eq 1 then do;
  declare hash h1();
  h1.definekey('city1');
  h1.definedone();
  declare hash h2();
  h2.definekey('city2');
  h2.definedone();
end;
set pairs;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.replace();h2.replace();output;
end;
run;

Xia Keshan

View solution in original post

15 REPLIES 15
Haikuo
Onyx | Level 15

What will be your wanted outcome and by what rules?

MikeCarter
Calcite | Level 5

Sure, as I said- wanted outcome would be subset of rows, where individual column values are not equal to each other.

City 1         City 2

A1               B1

A2               B2

Or

City 1          City 2

A1               B2

A2               B1

Both these outputs are acceptable, because values in individual columns (City 1, City2) are not repeated. To extract both of them would be great.

Proc SQL "Select Distinct City1, City2"- doesn't necessarily guarantee this, because although it may get us to distinct rows, the individual values will be different. But that's the subset I need.

ballardw
Super User

Are any of the values for City1 also in City2? How to handle if so.

Tom
Super User Tom
Super User

Not sure exactly what you want.  Sounds like you want to find all possible values of CITY1 and all possible values of CITY2.  But I don't think that in general it would be possible to generate such a set only by selecting rows from the original file.  Why not just generate two separate tables?

select distinct city1 from have;

select distinct city2 from have;

Perhaps you could generate an iterative process to delete the rows where both CITY1 and CITY2 are already included in the previous rows selected.  That would probably work best by using hash tables in a data step.  Not sure how you could find the minimum such set of rows.

PGStats
Opal | Level 21

Looks like an assignment (matching) problem to me. Not trivial. If you have SAS/OR licenced, you could use something like :

data pairs;

input City1 $ City2 $;

datalines;

A1               B1          

A1               B2         

A2               B1

A2               B2

;

proc optnet

   graph_direction = directed

   data_links      = pairs;

   data_links_var

      from      = City1

      to           = City2;

   linear_assignment

      out         = uniquePairs(drop=cost);

run;

proc print data=uniquePairs; run;

PG

PG
MikeCarter
Calcite | Level 5

Hello PGStats- this is great. Yes, I think it worked. I have never used OPTNET, great application. I think, I'll do a bit more research on this procedure to use it more effectively.

A quick question- is there a way to apply Linear Assignment, if there were 3 or 4 columns? It's like solving, min wgt matching problem on tripartite or qudra-partite graphs? Not sure, if this is even practical. But evetually, I need to extend this technique to 4 columns-- where individual values in those 4 columns should be different. Does that make sense? Basically- the same output needed, with same condition, when applied over 3 or 4 columns.

-Mike

PGStats
Opal | Level 21

Linear assignment is only defined between two sets. The clever hash method proposed by can be extended to more sets, but even for two sets it will not always give optimal results. Try for example :

data pairs;

input City1 $ City2 $;

datalines;

A1 B1

A1 B2

A1 B3

A2 B1

;

proc optnet

   graph_direction = directed

   data_links      = pairs;

   data_links_var

      from         = City1

      to           = City2;

   linear_assignment

      out          = uniquePairs(drop=cost);

run;

title "Linear assignment pairs";

proc print data=uniquePairs; run;

data HashPairs;

if _n_ eq 1 then do;

  declare hash h1();

  h1.definekey('city1');

  h1.definedone();

  declare hash h2();

  h2.definekey('city2');

  h2.definedone();

end;

set pairs;

if h1.check() ne 0 and h2.check() ne 0 then do;

h1.replace();h2.replace();output;

end;

run;

title "Hash pairs";

proc print data=HashPairs; run;

PG

PG
MikeCarter
Calcite | Level 5

@ PGStats, Okay, I see what you mean now. Got it. This makes sense. Important to note, Hash method doesn't always guarantee optimal (max) number of rows, although it can be feasible.

Thanks to you & xia keshan again for your both your ideas, will think on these directions further.

-Mike

Ksharp
Super User

PG,

Hash Table also can get such max(optimal) sub-set .

Code: Program

data have;
input a $ b $;
cards;
A1 B1
A1 B2
A1 B3
A2 B1
;
run;
data _null_;
if 0 then set have;
declare hash h1();
h1.definekey('a');
h1.definedone();

declare hash h2();
h2.definekey('b');
h2.definedone();
max=0;
do i=1 to nobs;
set have nobs=nobs point=i;
n=0;
if h1.check() ne 0 and h2.check() ne 0 then do;
  h1.add();h2.add();n+1;
end;
do j=1 to nobs;
   set have point=j;
   if h1.check() ne 0 and h2.check() ne 0 then do;
   h1.add();h2.add();n+1;
   end;
end;
if n gt max then do;max=n;call symputx('point',i);end;
h1.clear();h2.clear();
end;
stop;
run;



data want;
if 0 then set have;
declare hash h1();
h1.definekey('a');
h1.definedone();

declare hash h2();
h2.definekey('b');
h2.definedone();

point=&point;
set have point=point;
if h1.check() ne 0 and h2.check() ne 0 then do;
  h1.add();h2.add();output;
end;
do j=1 to nobs;
   set have nobs=nobs point=j;
   if h1.check() ne 0 and h2.check() ne 0 then do;
   h1.add();h2.add();output;
   end;
end;
stop;
run;

Xia Keshan

MikeCarter
Calcite | Level 5

Thanks Xia. A little complicated to understand in first-go, but will study this further. Ability to extract Max subset, for upto 3 or 4 columns, is what's needed. Thanks again for your time,

-Mike

Ksharp
Super User

Yeah. That is what you want - optimal(max) number of rows . Code could be shorter.

Code: Program

data have;
input a $ b $;
cards;
A1 B1
A1 B2
A1 B3
A2 B1
;
run;
data _null_;
if 0 then set have;
declare hash h1();
h1.definekey('a');
h1.definedone();

declare hash h2();
h2.definekey('b');
h2.definedone();
max=0;
do i=1 to nobs;
set have nobs=nobs point=i;
n=0;

  h1.add();h2.add();n+1;

do j=1 to nobs;
   set have point=j;
   if h1.check() ne 0 and h2.check() ne 0 then do;
   h1.add();h2.add();n+1;
   end;
end;
if n gt max then do;max=n;call symputx('point',i);end;
h1.clear();h2.clear();
end;
stop;
run;



data want;
if 0 then set have;
declare hash h1();
h1.definekey('a');
h1.definedone();

declare hash h2();
h2.definekey('b');
h2.definedone();

point=&point;
set have point=point;

  h1.add();h2.add();output;

do j=1 to nobs;
   set have nobs=nobs point=j;
   if h1.check() ne 0 and h2.check() ne 0 then do;
   h1.add();h2.add();output;
   end;
end;
stop;
run;
MikeCarter
Calcite | Level 5

Sure, thanks again. I shall extend this to 3 or 4 columns further. This is really helpful.

-Mike

Ksharp
Super User

Interesting Question.

Code: Program

data pairs;
input City1 $ City2 $;
datalines;
A1 B1 
A1 B2 
A2 B1
A2 B2
;
run;
data want;
if _n_ eq 1 then do;
  declare hash h1();
  h1.definekey('city1');
  h1.definedone();
  declare hash h2();
  h2.definekey('city2');
  h2.definedone();
end;
set pairs;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.replace();h2.replace();output;
end;
run;

Xia Keshan

MikeCarter
Calcite | Level 5

Hi Xia- this worked. Thank you for neat code, never used hash in SAS, will do a bit more reading on this technique. But thank you so much for your help. This worked for me.

Apprecite it,

-Mike

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 2030 views
  • 6 likes
  • 6 in conversation