BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
KS99
Obsidian | Level 7

Hi, I need your help! 

 

Suppose I have the following dataset, 

 

 

data Network;
input  Cusip  Analyst;
cards;
1  1
1  2  
1  3 
2  8
2  9
2  10
2  11
2  12
3  45
3  46
;
run; 

 

I want to create a new dataset that looks like below, 

__________________________

Cusip Var1 Var2 

1   1    2
1   2    3  
1   3    1
2   8    9
2   9  10
2 10  11
2 11  12 
2 12    8

2   8   11
2   8   10
2   9   12

2   9   11
2  10  12 
3  45  46
________________________ 

 

In other words, I want to create nC2 for each Cusip. 

 

I tried using Cartesian production (full merging), but cleaning the duplicates again throws me into the mire. 

 

Many thanks, in advance! 

 

Sincerely, 

KS -, 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

If you only care about combinations, and don't care about permutations (ie.    (var1,var2) can be either (8,9) or (9,8), then this can be straightforward in a single data step, using hash objects that allow duplicates for each key value: 

data Network;
input  Cusip  Analyst;
cards;
1  1
1  2  
1  3 
2  8
2  9
2  10
2  11
2  12
3  45
3  46
run; 

data want (drop=rc);
  set network (rename=(analyst=var1)) network (obs=0 rename=(analyst=var2));

  if _n_=1 then do;
    declare hash h (dataset:'network (rename=(analyst=var2))',multidata:'y');
      h.definekey('cusip');
      h.definedata('var2');
      h.definedone();
  end;
  do rc=h.find() by 0 until (h.find_next()^=0);
    if var1<var2 then output;
  end;
run;
    

Note that it doesn't matter what order your original data has.

 

  1.   The SET statement has data set network specified a second time, but with OBS=0.  That's just to provide the rename of analyst to var2, thereby insuring that var2 has all the attributes of the original variable.
  2. The hash object gets declared once, during the first iteration of the data step.  Think of it as a lookup table, keyed on the CUSIP variable.  But since CUSIP:ANALYST is not 1 to 1, but 1 to many, you have to tell the sas that each CUSIP key can have multiple data items (multiple ANALYST values).
  3. The hash object in this case causes a complete pass through the NETWORK dataset to pre-populate the object.
  4. The definedata method tell sas what vars shold be retrieved (and stored in) the hash object.

  5. The h.find() method returns a zero for success and a non-zero for failures.  This will retrieve the first instance of each CUSIP.
  6. Inside the loop   "if var1<var2" prevent duplicates, regardless of var1,var2 permutation.
  7. The "until (h.find.next)^=0" performs a h.find_next() method at the bottom of each loop iteration, and returns a zero for successful retrieved of the next item of the same cusip.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

16 REPLIES 16
Shmuel
Garnet | Level 18

Based on the solution given in next link:

https://communities.sas.com/t5/SAS-Programming/Permutations/m-p/325680 

I suggest to solve your issue with next steps:

1) Transpose your data to have one observation per CUSIP with all ANALYST values. Of course, some of them may become missing value.

2) Adapt the proposed code in the link, by defining array of all values.

Make sure all values are at the beginning and missing values at the end of the array. Adapt the DO loop to end at the last non missing value.

 

 

yabwon
Onyx | Level 15

Hi @KS99 ,

 

Just to clarify, you are looking for all 2 element combinations for each group? If yes, then the DFA(Dynamic Function Array) package may be help here. If you have any questions - ask.

 

All the best

Bart

 

/* data */
data Network;
input  Cusip  Analyst;
cards;
1  1
1  2  
1  3 
2  8
2  9
2  10
2  11
2  12
3  45
3  46
;
run; 


/* use the DFA package */
filename packages "%sysfunc(pathname(work))"; /* setup directory for packages */
filename SPFinit url "https://raw.githubusercontent.com/yabwon/SAS_PACKAGES/main/SPF/SPFinit.sas";
%include SPFinit; /* enable the framework */
%installPackage(DFA) /* install the package */
%loadPackage(DFA)    /* load the package content into the SAS session */

data want;
  call SmpMtbArray("Allocate", 1, 1);

  do until(last.Cusip);
    set Network;
    by Cusip;  

    i+1;
    call SmpMtbArray("Input", i, Analyst);
  end;

  call SmpMtbArray("Dim", L, H);
  put L= H=;

  do i = L to H;
    do j = i+1 to H;
      call SmpMtbArray("Output", i, Var1);
      call SmpMtbArray("Output", j, Var2);
      output;
    end;
  end;

  keep Cusip Var1 Var2;
  call missing (i, j, L, H, Var1, Var2);
run;

here is the code to compare the result and the expected data

proc sort data = want;
  by Cusip Var1 Var2;
run;
title "Want";
proc print data = want;
run;


/* your expected data  to compare */
data have;
input Cusip x1 x2;
  Var1 = x1 min x2;
  Var2 = x1 max x2;
keep Cusip Var1 Var2;
cards;
1   1    2
1   2    3  
1   3    1
2   8    9
2   9  10
2 10  11
2 11  12 
2 12    8
2   8   11
2   8   10
2   9   12
2   9   11
2  10  12 
3  45  46
;
run;
title "Have";
proc sort data = have;
  by Cusip Var1 Var2;
run;
proc print data = have;
run;

title "Compare";
proc compare base = have compare = want;
run;

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



KS99
Obsidian | Level 7
Hi, yabwon!

Thank you very much, yabwon!
I will download the package and carefully study them.
I really appreciate your help and discussions!

Sincerely,
KS -,
FreelanceReinh
Jade | Level 19

Hi @KS99,

 

Here's a similar array-based solution:

data want(drop=_: analyst);
array _a[100] _temporary_; /* choose dimension>=max(#analysts for a cusip) */
do _k=1 by 1 until(last.cusip);
  set network;
  by cusip;
  _a[_k]=analyst;
end;
do _i=1 to _k-1;
  do _j=_i+1 to _k;
    var1=_a[_i];
    var2=_a[_j];
    output;
  end;
end;
run;

Note that the above code consistently creates (var1, var2) pairs with var1<var2.

yabwon
Onyx | Level 15

@FreelanceReinh 

 

Counterexample to both your: "Note that the above code consistently creates (var1, var2) pairs with var1<var2.

and the "comparing" part of my code:

 

data Network;
input  Cusip  Analyst;
cards;
1  3
1  2  
1  1 
;
run; 

🙂

 

But still both codes provides correct results.

 

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



FreelanceReinh
Jade | Level 19

Oh yes. I made some tacit assumptions regarding sort order, no duplicates, ...

yabwon
Onyx | Level 15

I think both your code and mine assumes only no duplicates. Order of values is irrelevant since both 1,2,3 and 3,2,1 will generate the same number of pairs.

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



KS99
Obsidian | Level 7
Thank you Reinhard!
I will apply them for my future analysis!

Sincerely,
KS -,
FreelanceReinh
Jade | Level 19

An alternative (but slower) solution using PROC SQL:

proc sql;
create table want as
select a.cusip, a.analyst as var1, b.analyst as var2
from network a, network b
where a.cusip=b.cusip & a.analyst<b.analyst
order by 1,2,3;
quit;

 

@yabwon wrote:

I think both your code and mine assumes only no duplicates. Order of values is irrelevant since both 1,2,3 and 3,2,1 will generate the same number of pairs.


@yabwon: Sure. The "assumptions" referred to my "var1<var2" statement.

Shmuel
Garnet | Level 18

Next code is tested. I need replace the ALLPERM() function into LEXPERM() function in order to omit missing values created by PROC TRANSPOSE.

data Network;
input  Cusip  Analyst;
cards;
1  1
1  2  
1  3 
2  8
2  9
2  10
2  11
2  12
3  45
3  46
;
run; 

proc transpose data=Network out=data1;
by cusip;
var Analyst;
run;

data want;
 set data1;
     array colx {*} col:;
     lasti = dim(colx);
     do i=1 to dim(colx);
        if missing(colx(i)) then do;
           lasti = i-1; leave;
        end;
     end;
     
     nfact = fact(lasti); put _N_= lasti= nfact=;
     do i=1 to nfact;
        call LEXPERM (i, of colx[*]);
        output;
     end;
     drop lasti i nfact;
run;
     
     
     
mkeintz
PROC Star

If you only care about combinations, and don't care about permutations (ie.    (var1,var2) can be either (8,9) or (9,8), then this can be straightforward in a single data step, using hash objects that allow duplicates for each key value: 

data Network;
input  Cusip  Analyst;
cards;
1  1
1  2  
1  3 
2  8
2  9
2  10
2  11
2  12
3  45
3  46
run; 

data want (drop=rc);
  set network (rename=(analyst=var1)) network (obs=0 rename=(analyst=var2));

  if _n_=1 then do;
    declare hash h (dataset:'network (rename=(analyst=var2))',multidata:'y');
      h.definekey('cusip');
      h.definedata('var2');
      h.definedone();
  end;
  do rc=h.find() by 0 until (h.find_next()^=0);
    if var1<var2 then output;
  end;
run;
    

Note that it doesn't matter what order your original data has.

 

  1.   The SET statement has data set network specified a second time, but with OBS=0.  That's just to provide the rename of analyst to var2, thereby insuring that var2 has all the attributes of the original variable.
  2. The hash object gets declared once, during the first iteration of the data step.  Think of it as a lookup table, keyed on the CUSIP variable.  But since CUSIP:ANALYST is not 1 to 1, but 1 to many, you have to tell the sas that each CUSIP key can have multiple data items (multiple ANALYST values).
  3. The hash object in this case causes a complete pass through the NETWORK dataset to pre-populate the object.
  4. The definedata method tell sas what vars shold be retrieved (and stored in) the hash object.

  5. The h.find() method returns a zero for success and a non-zero for failures.  This will retrieve the first instance of each CUSIP.
  6. Inside the loop   "if var1<var2" prevent duplicates, regardless of var1,var2 permutation.
  7. The "until (h.find.next)^=0" performs a h.find_next() method at the bottom of each loop iteration, and returns a zero for successful retrieved of the next item of the same cusip.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
yabwon
Onyx | Level 15

@mkeintz 

Mark, great solution!

If I may, with small extension, it is also "doubled-data-proof"

data Network;
input  Cusip  Analyst;
cards;
1  3
1  3
1  2
1  1
1  3
run; 

data _null_;
  set network (rename=(analyst=var1)) network (obs=0 rename=(analyst=var2)) end=eof;

  if _n_=1 then do;
    declare hash h (dataset:'network (rename=(analyst=var2))',multidata:'y');
      h.definekey('cusip');
      h.definedata('var2');
      h.definedone();

      declare hash W(ordered:"A");
      W.definekey('cusip', 'var1', 'var2');
      W.definedone();
  end;
  do rc=h.find() by 0 until (h.find_next()^=0);
    if var1<var2 then W.replace();
  end;

  if EOF then W.output(dataset:"want");
run;

 

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



KS99
Obsidian | Level 7

Dear mkeintz, 

 

Thank you for helping me a second time! 

Your codes work perfectly. 

 

The codes you wrote are very short and exquisite. 

Do you mind if I ask you a further question? 

Today I studied hash object for the first time, and looked at your codes and comments again. 

But what I don't understand from your comments are: 

  1. The h.find() method returns a zero for success and a non-zero for failures.  This will retrieve the first instance of each CUSIP.
  2. The "until (h.find.next)^=0" performs a h.find_next() method at the bottom of each loop iteration, and returns a zero for successful retrieved of the next item of the same cusip. 

How does h.find() function work exactly? 

It seems to be used in a variety of merging, left full join, etc. 

You can give me a quick answer, cos I am ashamed to take your precious time;) 

 

Many thanks! 

KS - 

 

mkeintz
PROC Star

@KS99 wrote:

Dear mkeintz, 

 

Thank you for helping me a second time! 

Your codes work perfectly. 

 

The codes you wrote are very short and exquisite. 

Do you mind if I ask you a further question? 

Today I studied hash object for the first time, and looked at your codes and comments again. 

But what I don't understand from your comments are: 

  1. The h.find() method returns a zero for success and a non-zero for failures.  This will retrieve the first instance of each CUSIP.
  2. The "until (h.find.next)^=0" performs a h.find_next() method at the bottom of each loop iteration, and returns a zero for successful retrieved of the next item of the same cusip. 

How does h.find() function work exactly? 

It seems to be used in a variety of merging, left full join, etc. 

You can give me a quick answer, cos I am ashamed to take your precious time;) 

 

Many thanks! 

KS - 

 


I don't know much about hash functionns - I only know how to use hash objects in SAS.   But, according to the wikipedia entry for Hash Function, 

Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval, and require an amount of storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a computationally and storage space efficient form of data access which avoids the non-linear access time of ordered and unordered lists and structured trees, and the often exponential storage requirements of direct access of state spaces of large or variable-length keys.

  1. This is my crude understanding:  A hash transformation is applied to the key variable(s)  - CUSIP in this case - to assign it a place in the hash table.  This is used when adding data to the hash object, and when retrieving, or even just checking whether data for the specified CUSIP is in the object  The advantage is that as your data collection grows, most other retrieval methods (like binary search) will take increasing amounts of time, whereas hash-based methods take almost no more time - even when the data is larger by an order of magnitude.  The FIND method takes the CUSIP value, applies a hash function to it, and uses the result to effectively directly address the corresponding data item in the hash - i.e. it does not "search" for the item.

    The find method is like a function in that it generates a return code, zero for success, non-zeroes for various types of failure.  As far as I can tell "failure" means only one thing - the particular key value (the hash-transformed value of your CUSIP) is not (yet) in the hash object.  The other thing that a successful find does is to retrieve the all the variables listed in the ".definedata" method from the hash object into those variables in your program.

  2. The default condition for SAS hash objects is one dataitem per key value.  That is, it would hold only one analyst for each cusip.  So the "multidata:'y'" parameter tells SAS to make the hash object provide for multiple analysts per cusip.  But then the problem is that FIND will only ever retrieve whichever analyst is the first in that series of analysts - all the others are unavailable via FIND.  So another method is needed to traverse all the dataitems for a given cusip value - that's find_next.  Like all other hash methods it returns a zero when successful, non-zero otherwise.   That means, when it's not a zero, there are no further analysts for that given cusip in the hash object.  BTW, you can go forward via find_next,  and backwards via find_prev.

    The "do until" is just a way to tell sas to evaluate the "until" expression at the bottom of each loop iteration.  But the nice thing about the find_next method is that not only does it return a zero or non-zero for evaluation by until, it also transcribes data from the object to the corresponding sas variables (the "program data vector").

 

Another thing that might be a bit surprising: the "by 0" in "do rc=h.find by 0 until (h.find_next()^=0".  This is just shorthand for

rc=h.find();
do until (rc^=0);
   if ... then output;
   rc=h.find_next();
end;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1898 views
  • 8 likes
  • 6 in conversation