Hi, I need your help!
Suppose I have the following dataset,
data Network;
input Cusip Analyst;
cards;
1 1
1 2
1 3
2 8
2 9
2 10
2 11
2 12
3 45
3 46
;
run;
I want to create a new dataset that looks like below,
__________________________
Cusip Var1 Var2
1 1 2
1 2 3
1 3 1
2 8 9
2 9 10
2 10 11
2 11 12
2 12 8
2 8 11
2 8 10
2 9 12
2 9 11
2 10 12
3 45 46
________________________
In other words, I want to create nC2 for each Cusip.
I tried using Cartesian production (full merging), but cleaning the duplicates again throws me into the mire.
Many thanks, in advance!
Sincerely,
KS -,
If you only care about combinations, and don't care about permutations (ie. (var1,var2) can be either (8,9) or (9,8), then this can be straightforward in a single data step, using hash objects that allow duplicates for each key value:
data Network;
input Cusip Analyst;
cards;
1 1
1 2
1 3
2 8
2 9
2 10
2 11
2 12
3 45
3 46
run;
data want (drop=rc);
set network (rename=(analyst=var1)) network (obs=0 rename=(analyst=var2));
if _n_=1 then do;
declare hash h (dataset:'network (rename=(analyst=var2))',multidata:'y');
h.definekey('cusip');
h.definedata('var2');
h.definedone();
end;
do rc=h.find() by 0 until (h.find_next()^=0);
if var1<var2 then output;
end;
run;
Note that it doesn't matter what order your original data has.
Based on the solution given in next link:
https://communities.sas.com/t5/SAS-Programming/Permutations/m-p/325680
I suggest to solve your issue with next steps:
1) Transpose your data to have one observation per CUSIP with all ANALYST values. Of course, some of them may become missing value.
2) Adapt the proposed code in the link, by defining array of all values.
Make sure all values are at the beginning and missing values at the end of the array. Adapt the DO loop to end at the last non missing value.
Hi @KS99 ,
Just to clarify, you are looking for all 2 element combinations for each group? If yes, then the DFA(Dynamic Function Array) package may be help here. If you have any questions - ask.
All the best
Bart
/* data */
data Network;
input Cusip Analyst;
cards;
1 1
1 2
1 3
2 8
2 9
2 10
2 11
2 12
3 45
3 46
;
run;
/* use the DFA package */
filename packages "%sysfunc(pathname(work))"; /* setup directory for packages */
filename SPFinit url "https://raw.githubusercontent.com/yabwon/SAS_PACKAGES/main/SPF/SPFinit.sas";
%include SPFinit; /* enable the framework */
%installPackage(DFA) /* install the package */
%loadPackage(DFA) /* load the package content into the SAS session */
data want;
call SmpMtbArray("Allocate", 1, 1);
do until(last.Cusip);
set Network;
by Cusip;
i+1;
call SmpMtbArray("Input", i, Analyst);
end;
call SmpMtbArray("Dim", L, H);
put L= H=;
do i = L to H;
do j = i+1 to H;
call SmpMtbArray("Output", i, Var1);
call SmpMtbArray("Output", j, Var2);
output;
end;
end;
keep Cusip Var1 Var2;
call missing (i, j, L, H, Var1, Var2);
run;
here is the code to compare the result and the expected data
proc sort data = want;
by Cusip Var1 Var2;
run;
title "Want";
proc print data = want;
run;
/* your expected data to compare */
data have;
input Cusip x1 x2;
Var1 = x1 min x2;
Var2 = x1 max x2;
keep Cusip Var1 Var2;
cards;
1 1 2
1 2 3
1 3 1
2 8 9
2 9 10
2 10 11
2 11 12
2 12 8
2 8 11
2 8 10
2 9 12
2 9 11
2 10 12
3 45 46
;
run;
title "Have";
proc sort data = have;
by Cusip Var1 Var2;
run;
proc print data = have;
run;
title "Compare";
proc compare base = have compare = want;
run;
Hi @KS99,
Here's a similar array-based solution:
data want(drop=_: analyst);
array _a[100] _temporary_; /* choose dimension>=max(#analysts for a cusip) */
do _k=1 by 1 until(last.cusip);
set network;
by cusip;
_a[_k]=analyst;
end;
do _i=1 to _k-1;
do _j=_i+1 to _k;
var1=_a[_i];
var2=_a[_j];
output;
end;
end;
run;
Note that the above code consistently creates (var1, var2) pairs with var1<var2.
Counterexample to both your: "Note that the above code consistently creates (var1, var2) pairs with var1<var2."
and the "comparing" part of my code:
data Network;
input Cusip Analyst;
cards;
1 3
1 2
1 1
;
run;
🙂
But still both codes provides correct results.
All the best
Bart
Oh yes. I made some tacit assumptions regarding sort order, no duplicates, ...
I think both your code and mine assumes only no duplicates. Order of values is irrelevant since both 1,2,3 and 3,2,1 will generate the same number of pairs.
Bart
An alternative (but slower) solution using PROC SQL:
proc sql;
create table want as
select a.cusip, a.analyst as var1, b.analyst as var2
from network a, network b
where a.cusip=b.cusip & a.analyst<b.analyst
order by 1,2,3;
quit;
@yabwon wrote:
I think both your code and mine assumes only no duplicates. Order of values is irrelevant since both 1,2,3 and 3,2,1 will generate the same number of pairs.
@yabwon: Sure. The "assumptions" referred to my "var1<var2" statement.
Next code is tested. I need replace the ALLPERM() function into LEXPERM() function in order to omit missing values created by PROC TRANSPOSE.
data Network;
input Cusip Analyst;
cards;
1 1
1 2
1 3
2 8
2 9
2 10
2 11
2 12
3 45
3 46
;
run;
proc transpose data=Network out=data1;
by cusip;
var Analyst;
run;
data want;
set data1;
array colx {*} col:;
lasti = dim(colx);
do i=1 to dim(colx);
if missing(colx(i)) then do;
lasti = i-1; leave;
end;
end;
nfact = fact(lasti); put _N_= lasti= nfact=;
do i=1 to nfact;
call LEXPERM (i, of colx[*]);
output;
end;
drop lasti i nfact;
run;
If you only care about combinations, and don't care about permutations (ie. (var1,var2) can be either (8,9) or (9,8), then this can be straightforward in a single data step, using hash objects that allow duplicates for each key value:
data Network;
input Cusip Analyst;
cards;
1 1
1 2
1 3
2 8
2 9
2 10
2 11
2 12
3 45
3 46
run;
data want (drop=rc);
set network (rename=(analyst=var1)) network (obs=0 rename=(analyst=var2));
if _n_=1 then do;
declare hash h (dataset:'network (rename=(analyst=var2))',multidata:'y');
h.definekey('cusip');
h.definedata('var2');
h.definedone();
end;
do rc=h.find() by 0 until (h.find_next()^=0);
if var1<var2 then output;
end;
run;
Note that it doesn't matter what order your original data has.
Mark, great solution!
If I may, with small extension, it is also "doubled-data-proof"
data Network;
input Cusip Analyst;
cards;
1 3
1 3
1 2
1 1
1 3
run;
data _null_;
set network (rename=(analyst=var1)) network (obs=0 rename=(analyst=var2)) end=eof;
if _n_=1 then do;
declare hash h (dataset:'network (rename=(analyst=var2))',multidata:'y');
h.definekey('cusip');
h.definedata('var2');
h.definedone();
declare hash W(ordered:"A");
W.definekey('cusip', 'var1', 'var2');
W.definedone();
end;
do rc=h.find() by 0 until (h.find_next()^=0);
if var1<var2 then W.replace();
end;
if EOF then W.output(dataset:"want");
run;
All the best
Bart
Dear mkeintz,
Thank you for helping me a second time!
Your codes work perfectly.
The codes you wrote are very short and exquisite.
Do you mind if I ask you a further question?
Today I studied hash object for the first time, and looked at your codes and comments again.
But what I don't understand from your comments are:
How does h.find() function work exactly?
It seems to be used in a variety of merging, left full join, etc.
You can give me a quick answer, cos I am ashamed to take your precious time;)
Many thanks!
KS -
@KS99 wrote:
Dear mkeintz,
Thank you for helping me a second time!
Your codes work perfectly.
The codes you wrote are very short and exquisite.
Do you mind if I ask you a further question?
Today I studied hash object for the first time, and looked at your codes and comments again.
But what I don't understand from your comments are:
- The h.find() method returns a zero for success and a non-zero for failures. This will retrieve the first instance of each CUSIP.
- The "until (h.find.next)^=0" performs a h.find_next() method at the bottom of each loop iteration, and returns a zero for successful retrieved of the next item of the same cusip.
How does h.find() function work exactly?
It seems to be used in a variety of merging, left full join, etc.
You can give me a quick answer, cos I am ashamed to take your precious time;)
Many thanks!
KS -
I don't know much about hash functionns - I only know how to use hash objects in SAS. But, according to the wikipedia entry for Hash Function,
Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval, and require an amount of storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a computationally and storage space efficient form of data access which avoids the non-linear access time of ordered and unordered lists and structured trees, and the often exponential storage requirements of direct access of state spaces of large or variable-length keys.
Another thing that might be a bit surprising: the "by 0" in "do rc=h.find by 0 until (h.find_next()^=0". This is just shorthand for
rc=h.find();
do until (rc^=0);
if ... then output;
rc=h.find_next();
end;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.