Hello,
Is there a way in SAS or Proc SQL, where we can not only select Distinct Rows, but also, the individual column values are not equal to each other? Below, an example-
City 1 City 2
A1 B1
A1 B2
A2 B1
A2 B2
Select "Distinct City 1, City 2"- would give me all the 4 rows. But not only the rows to be distinct, but also the values individually should be different from their column values. There may be multiple such combinations. Here, acceptable combinations would be rows 1&3 OR rows 2&4. So two such sets possible here. Is it possible to extract out such sets? Where, not only entire rows are distinct, but individual values are also distinct? SAS or SQL, whichever way.
Interesting Question.
data pairs;
input City1 $ City2 $;
datalines;
A1 B1
A1 B2
A2 B1
A2 B2
;
run;
data want;
if _n_ eq 1 then do;
declare hash h1();
h1.definekey('city1');
h1.definedone();
declare hash h2();
h2.definekey('city2');
h2.definedone();
end;
set pairs;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.replace();h2.replace();output;
end;
run;
Xia Keshan
What will be your wanted outcome and by what rules?
Sure, as I said- wanted outcome would be subset of rows, where individual column values are not equal to each other.
City 1 City 2
A1 B1
A2 B2
Or
City 1 City 2
A1 B2
A2 B1
Both these outputs are acceptable, because values in individual columns (City 1, City2) are not repeated. To extract both of them would be great.
Proc SQL "Select Distinct City1, City2"- doesn't necessarily guarantee this, because although it may get us to distinct rows, the individual values will be different. But that's the subset I need.
Are any of the values for City1 also in City2? How to handle if so.
Not sure exactly what you want. Sounds like you want to find all possible values of CITY1 and all possible values of CITY2. But I don't think that in general it would be possible to generate such a set only by selecting rows from the original file. Why not just generate two separate tables?
select distinct city1 from have;
select distinct city2 from have;
Perhaps you could generate an iterative process to delete the rows where both CITY1 and CITY2 are already included in the previous rows selected. That would probably work best by using hash tables in a data step. Not sure how you could find the minimum such set of rows.
Looks like an assignment (matching) problem to me. Not trivial. If you have SAS/OR licenced, you could use something like :
data pairs;
input City1 $ City2 $;
datalines;
A1 B1
A1 B2
A2 B1
A2 B2
;
proc optnet
graph_direction = directed
data_links = pairs;
data_links_var
from = City1
to = City2;
linear_assignment
out = uniquePairs(drop=cost);
run;
proc print data=uniquePairs; run;
PG
Hello PGStats- this is great. Yes, I think it worked. I have never used OPTNET, great application. I think, I'll do a bit more research on this procedure to use it more effectively.
A quick question- is there a way to apply Linear Assignment, if there were 3 or 4 columns? It's like solving, min wgt matching problem on tripartite or qudra-partite graphs? Not sure, if this is even practical. But evetually, I need to extend this technique to 4 columns-- where individual values in those 4 columns should be different. Does that make sense? Basically- the same output needed, with same condition, when applied over 3 or 4 columns.
-Mike
Linear assignment is only defined between two sets. The clever hash method proposed by can be extended to more sets, but even for two sets it will not always give optimal results. Try for example :
data pairs;
input City1 $ City2 $;
datalines;
A1 B1
A1 B2
A1 B3
A2 B1
;
proc optnet
graph_direction = directed
data_links = pairs;
data_links_var
from = City1
to = City2;
linear_assignment
out = uniquePairs(drop=cost);
run;
title "Linear assignment pairs";
proc print data=uniquePairs; run;
data HashPairs;
if _n_ eq 1 then do;
declare hash h1();
h1.definekey('city1');
h1.definedone();
declare hash h2();
h2.definekey('city2');
h2.definedone();
end;
set pairs;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.replace();h2.replace();output;
end;
run;
title "Hash pairs";
proc print data=HashPairs; run;
PG
@ PGStats, Okay, I see what you mean now. Got it. This makes sense. Important to note, Hash method doesn't always guarantee optimal (max) number of rows, although it can be feasible.
Thanks to you & xia keshan again for your both your ideas, will think on these directions further.
-Mike
PG,
Hash Table also can get such max(optimal) sub-set .
data have;
input a $ b $;
cards;
A1 B1
A1 B2
A1 B3
A2 B1
;
run;
data _null_;
if 0 then set have;
declare hash h1();
h1.definekey('a');
h1.definedone();
declare hash h2();
h2.definekey('b');
h2.definedone();
max=0;
do i=1 to nobs;
set have nobs=nobs point=i;
n=0;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.add();h2.add();n+1;
end;
do j=1 to nobs;
set have point=j;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.add();h2.add();n+1;
end;
end;
if n gt max then do;max=n;call symputx('point',i);end;
h1.clear();h2.clear();
end;
stop;
run;
data want;
if 0 then set have;
declare hash h1();
h1.definekey('a');
h1.definedone();
declare hash h2();
h2.definekey('b');
h2.definedone();
point=&point;
set have point=point;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.add();h2.add();output;
end;
do j=1 to nobs;
set have nobs=nobs point=j;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.add();h2.add();output;
end;
end;
stop;
run;
Xia Keshan
Thanks Xia. A little complicated to understand in first-go, but will study this further. Ability to extract Max subset, for upto 3 or 4 columns, is what's needed. Thanks again for your time,
-Mike
Yeah. That is what you want - optimal(max) number of rows . Code could be shorter.
data have;
input a $ b $;
cards;
A1 B1
A1 B2
A1 B3
A2 B1
;
run;
data _null_;
if 0 then set have;
declare hash h1();
h1.definekey('a');
h1.definedone();
declare hash h2();
h2.definekey('b');
h2.definedone();
max=0;
do i=1 to nobs;
set have nobs=nobs point=i;
n=0;
h1.add();h2.add();n+1;
do j=1 to nobs;
set have point=j;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.add();h2.add();n+1;
end;
end;
if n gt max then do;max=n;call symputx('point',i);end;
h1.clear();h2.clear();
end;
stop;
run;
data want;
if 0 then set have;
declare hash h1();
h1.definekey('a');
h1.definedone();
declare hash h2();
h2.definekey('b');
h2.definedone();
point=&point;
set have point=point;
h1.add();h2.add();output;
do j=1 to nobs;
set have nobs=nobs point=j;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.add();h2.add();output;
end;
end;
stop;
run;
Sure, thanks again. I shall extend this to 3 or 4 columns further. This is really helpful.
-Mike
Interesting Question.
data pairs;
input City1 $ City2 $;
datalines;
A1 B1
A1 B2
A2 B1
A2 B2
;
run;
data want;
if _n_ eq 1 then do;
declare hash h1();
h1.definekey('city1');
h1.definedone();
declare hash h2();
h2.definekey('city2');
h2.definedone();
end;
set pairs;
if h1.check() ne 0 and h2.check() ne 0 then do;
h1.replace();h2.replace();output;
end;
run;
Xia Keshan
Hi Xia- this worked. Thank you for neat code, never used hash in SAS, will do a bit more reading on this technique. But thank you so much for your help. This worked for me.
Apprecite it,
-Mike
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.