Obsidian | Level 7

Hello - This is a continuation of an earlier post I had.  Again, thank you in advance for any help you can provide.  I am trying to group smaller subsets of a larger population that net (or sum, if you will) down to $0.00.  My data has limited available fields, so I can only use the numeric field to complete this.  In my earlier post, it looked like a "brute force" option was the best choice for me.  I am hoping the community can help me improve upon the logic that was provided to me.  Big thanks to all the earlier contributors; my knowledge of arrays, while still very limited, has grown over the past few days.  A few notes...


1.) Yes, the overall population does net to $0.00, but I need to group it into smaller subsets for the purpose of my analysis (new data attached)

2.) I DO NOT have SAS/OR, so I cannot use the OPTMODEL procedure.  I am running on SAS EG.  I can provide a list of the licenses I do have if that helps.  

3.) One thing I noticed when testing the script below on a new data set is that if a value repeats sequentially, the code may add an element to itself when looping.  If possible, I'd like to avoid scenarios like this.  For example, with {-3, -3, 6} the code adds the first element twice and then sums to zero with the 6, instead of taking elements 1 and 2.

4.) For some additional background, the populations I'm dealing with can run up to 6000 lines.  I tested one of these larger populations for sets of 3.  It worked but took 5 hours to process.  I'm OK with the processing time, but am curious whether there are efficiencies that can be applied, especially since I could have multiple larger datasets.

5.) Example code I've been playing with is below.  Special thanks to @ballardw for writing this originally.  


My questions to the community...


a.) Can the logic below be updated so that the array containing the values "shrinks" after subsets have been made?  I think this may reduce the number of iterations.

b.) Along the same idea as question a., is it possible for the array not to sum the same element against itself?  This may reduce the number of iterations as well (though not by much).  Simply put, each element can only be used once when summing.

c.) Does this code need to be run sequentially the way it is currently set up?  Say I want to search for subsets of up to 50 lines: do I need 50 steps going sequentially from 2 to 50, or can it be set up to run the 50-line matches first, with the subsets of 2, 3, 4...n contained within the results?


Thank you all again for your help.  This has been very insightful so far and I've been learning a lot.  If there are any other questions I can answer, or background I can provide to add additional clarity, please let me know.



proc transpose data=have out=work.trans;
   var base_amt;
run;

data work.pairs (keep=iout jout ival jval)
     work.leftover (keep=ba:);
   set work.trans end=last;
   array b ba:;
   do i = 1 to (dim(b)-1);
      do j = 2 to dim(b);
         if sum(b[i], b[j]) = 0 then do;
            if not missing(b[i]) then iout = i;
            if not missing(b[j]) then jout = j;
            ival = b[i];
            jval = b[j];
            output work.pairs;
            call missing(b[i], b[j], iout, jout);
         end;
      end;
   end;
   if last then output work.leftover;
run;

data work.trios (keep=iout jout kout ival jval kval)
     work.leftover2 (keep=ba:);
   set work.leftover end=last;
   array b ba:;
   do i = 1 to (dim(b)-2);
      do j = 2 to (dim(b)-1);
         do k = 3 to dim(b);
            if sum(b[i], b[j], b[k]) = 0 then do;
               if not missing(b[i]) then iout = i;
               if not missing(b[j]) then jout = j;
               if not missing(b[k]) then kout = k;
               ival = b[i];
               jval = b[j];
               kval = b[k];
               output work.trios;
               call missing(b[i], b[j], b[k], iout, jout, kout);
            end;
         end;
      end;
   end;
   if last then output work.leftover2;
run;



For #3, I think the j=2 should instead be j=i+1 in both places.  Similarly, k=3 should be k=j+1.
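To see why the bounds matter, here is a quick illustration (in Python rather than SAS, with 0-based indices; a sketch of the index logic only, not the DATA step itself):

```python
vals = [-3, -3, 6]
n = len(vals)

# Original-style bounds: j and k are not forced past i, so an element
# can be summed with itself (index 1 appears twice in the second tuple).
loose = [(i, j, k)
         for i in range(n) for j in range(1, n) for k in range(2, n)
         if vals[i] + vals[j] + vals[k] == 0]

# Corrected bounds (j = i+1, k = j+1): strictly increasing indices,
# so each element is used at most once per candidate subset.
strict = [(i, j, k)
          for i in range(n) for j in range(i + 1, n) for k in range(j + 1, n)
          if vals[i] + vals[j] + vals[k] == 0]
```

With the loose bounds, the spurious tuple that reuses the second -3 shows up alongside the legitimate one; with strictly increasing indices, only the legitimate triple remains.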


Can you please post the 6000-observation data set you mentioned in #4?

Obsidian | Level 7

Rob - thanks again for looking at this.  Attached is another example with just under 6000 lines.  I'm testing your notes for #3 right now.  

Jade | Level 19

Hi @adornodj,


Thanks for providing the "full-size" dataset.


I think the increase from about 300 to almost 6000 observations is bad news with regard to processing time, especially for any kind of brute-force approach. Please note that, for example, the number of triples from a set of 6000 elements (approx. 3.6E10) is more than 8000 times the number of triples from a set of 300 elements (approx. 4.5E6). For quadruples the factor is already >160000, for quintuples 3.3 million, ...
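The growth factors above can be verified directly with binomial coefficients, for example (Python's `math.comb`):

```python
from math import comb

# number of candidate subsets: comb(n, k) for n observations, subset size k
triples_ratio = comb(6000, 3) / comb(300, 3)  # factor for triples, > 8000
quads_ratio   = comb(6000, 4) / comb(300, 4)  # factor for quadruples, > 160000
quints_ratio  = comb(6000, 5) / comb(300, 5)  # factor for quintuples, ~3.3 million
```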


So, without sophisticated algorithms like those implemented in SAS/OR, it seems challenging to me to get much further than triples.


As to your question c: I think it's wise to stick to the sequential processing (singletons, pairs, triples, ...) and to remove any zero-sum subsets found. If you were to start with 50-element subsets, just consider comb(300, 50)=3.1E57.


SAS hash objects combined with some basic mathematical considerations might be useful to reduce processing time. Unlike arrays, hash objects can grow and shrink as needed (cf. your question a.).


After removing singletons, i.e. zeros, and pairs {x, -x} (which is easy), triples could perhaps more efficiently be searched as follows (rough, untested ideas, I will try to refine and implement them over the weekend):

  1. Split positive amounts and the absolute values of negative amounts into two hash objects POS and NEG (both with options multidata: 'y' [because of duplicates], ordered: 'y'). Note that any triple summing to zero must consist of either two positive elements and one negative element or vice versa.
  2. Traverse POS (using a hash iterator) and for each element a>0 traverse NEG as well, but only as long as the values b (>0!) there satisfy b<=a/2 (DO-WHILE loop).
  3. In each step search NEG for a-b using the CHECK and HAS_NEXT methods (special case: a-b=b).
  4. If found, {a, -b, -(a-b)} form a zero-sum triple to be output. Delete the three items from the hash objects using REMOVE/REMOVEDUP (possible conflict with iterator).
  5. Once finished, POS and NEG switch their roles.
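The steps above can be sketched language-agnostically before committing to a SAS implementation. Here is a rough Python version (my own illustration, not SAS code) using `Counter` multisets in place of the POS/NEG hash objects; it assumes exact (e.g. integer-cent) amounts so that equality tests are safe:

```python
from collections import Counter

def extract_triples(side1, side2):
    """Steps 2-4: take one magnitude a from side1 and two magnitudes
    b and a-b from side2, traversing side2 only while b <= a/2."""
    found = []
    for a in sorted(side1):
        while side1[a] > 0:
            hit = None
            for b in sorted(side2):
                if b > a / 2:
                    break  # no need to look further (step 2)
                c = a - b
                # special case c == b needs two copies of b (step 3)
                if (side2[b] >= 2 if c == b else
                        side2[b] >= 1 and side2[c] >= 1):
                    hit = (b, c)
                    break
            if hit is None:
                break
            b, c = hit
            side1[a] -= 1   # step 4: remove all three items
            side2[b] -= 1
            side2[c] -= 1
            found.append((a, b, c))
    return found

def zero_sum_triples(amounts):
    """Step 1: split magnitudes into POS and NEG multisets, then run the
    search in both directions (step 5)."""
    pos = Counter(x for x in amounts if x > 0)
    neg = Counter(-x for x in amounts if x < 0)
    triples  = [(a, -b, -c) for a, b, c in extract_triples(pos, neg)]
    triples += [(-a, b, c) for a, b, c in extract_triples(neg, pos)]
    return triples
```

Note that this handles the {-3, -3, 6} case from point 3 correctly: 6 pairs with b = 3, the complement 6-3 = 3 equals b, and the duplicate check requires a second copy of 3 on the negative side.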

Jade | Level 19

Hi @adornodj,


As promised, here is the implementation of the 5-step plan I outlined in my previous post. For simplicity, it assumes that the variable containing the amounts in dataset HAVE is named X (rather than BASE_AMT or TR_AMT). First I implemented the search for pairs using a similar technique (as a preliminary practice). The algorithm is greedy in the sense that zero-sum groups are extracted right after detection, although (in the case of triples) this might sacrifice maximization of the number of groups in the end.


/* Macro for extracting zero-sum pairs
   (parameters: hash objects with absolute values of pos./neg. amounts) */

%macro pairs(h1, h2);
do while(&h1.i.next()=0); /* traverse hash object 1 */
  if &h2..check()=0 then do; /* search hash object 2 */
    y=x; /* save value x before PREV() overwrites it */
    &h2..removedup(key: y); /* delete the partner from hash object 2 */
    &h1..check(); /* set pointer for the other deletion */
    rc=&h1.i.prev(); /* move iterator aside to prepare deletion */
    &h1..removedup(key: y);
  end;
end;
%mend pairs;
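The pair step is conceptually simple: any magnitude present on both sides yields min(count_pos, count_neg) pairs {x, -x}. A hedged Python sketch of just that idea (again with `Counter` multisets standing in for the hash objects, not the macro above):

```python
from collections import Counter

def extract_pairs(pos, neg):
    """Greedy zero-sum pair extraction from two multisets of magnitudes."""
    pairs = []
    for v in sorted(set(pos) & set(neg)):
        k = min(pos[v], neg[v])  # how many {v, -v} pairs can be formed
        pairs += [(v, -v)] * k
        pos[v] -= k              # remove the used items from both sides
        neg[v] -= k
    return pairs
```

For example, magnitudes {3, 3, 5} on the positive side and {3, 7} on the negative side yield exactly one pair {3, -3}, leaving one 3 and the 5 on the positive side.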

/* Macro for extracting zero-sum triples
   (parameters: hash objects with absolute values of pos./neg. amounts) */

%macro triples(h1, h2);
%local m;
%if %upcase(&h1)=POS %then %let m=-;
rc1=&h1.i.first();
do while(rc1=0); /* traverse hash object 1 */
  a=x; /* current value from hash object 1 */
  rc2=&h2.i.first();
  do while(rc2=0 & x<=a/2); /* partially traverse hash object 2 */
    b=x; /* current value from hash object 2 */
    d=a-b; /* third value needed to complete the triple */
    if &h2..check(key: d)=0 then do; /* search hash object 2 */
      &h2..has_next(result: r); /* take care of special case d=b */
      if d ne b | r then do;
        if d=b then rc=&h2.i.prev(); /* move iterator aside if necessary */
        &h2..removedup(key: d);
        &h1..check(key: a); /* set pointer for 2nd deletion */
        rc=&h1.i.prev(); /* move iterator aside */
        &h1..removedup(key: a);
        &h2..check(key: b); /* set pointer for 3rd deletion */
        if d ne b then rc=&h2.i.prev(); /* move iterator aside if necessary */
        &h2..removedup(key: b);
        leave; /* stop traversing hash object 2 */
      end;
    end;
    rc2=&h2.i.next();
  end;
  rc1=&h1.i.next();
end;
%mend triples;

/* Create dataset with variables GRP_SIZE (1, 2, 3), GRP (group no.) and X
   (formerly TR_AMT or BASE_AMT) containing singletons, pairs and triples
   of X values summing to zero */

data groups;
dcl hash pos(multidata: 'y', ordered: 'y');
pos.definekey('x');
pos.definedata('x');
pos.definedone();
dcl hiter posi('pos');

dcl hash neg(multidata: 'y', ordered: 'y');
neg.definekey('x');
neg.definedata('x');
neg.definedone();
dcl hiter negi('neg');

do until(last);
  set have end=last;

  /* Write singletons {0} to output dataset and absolute values of
     non-zero elements to hash objects */
  if x=0 then do;
    grp_size=1;
    grp+1;
    output;
  end;
  else if x>0 then pos.add();
  else if x<0 then neg.add(key: abs(x), data: abs(x));
end;

/* Write pairs {x, -x} to output dataset and remove their components
   from hash objects */
if pos.num_items<neg.num_items /* IF condition for efficiency */
  then %pairs(pos, neg)
  else %pairs(neg, pos)

/* Write zero-sum triples to output dataset and remove their components
   from hash objects */
%triples(pos, neg) /* one positive and two negative components */
%triples(neg, pos) /* one negative and two positive components */

keep x grp:;
run;

/* Create input dataset for future search for zero-sum quadruples etc. */

proc sql;
create table leftover as
(select x from have)
except all
(select x from groups);
quit;
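EXCEPT ALL is a multiset difference, so duplicate amounts in HAVE that were not consumed by GROUPS survive with the right multiplicity. For intuition, the same operation in Python (illustrative made-up amounts):

```python
from collections import Counter

have   = [2, -2, 7, 7, -7]   # made-up amounts
groups = [2, -2, 7, -7]      # components of zero-sum subsets already found
# Counter subtraction keeps one copy of 7, just like EXCEPT ALL
leftover = sorted((Counter(have) - Counter(groups)).elements())
```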

With the "large" input file (5398 obs.) you provided I obtained 296 pairs and 17 triples (as did @RobPratt -- thanks for sharing the summarized results!), resulting in 296*2+17*3=643 observations in my output dataset GROUPS. (Results with your first input file [335 obs.] match as well.)


NOTE: There were 5398 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.GROUPS has 643 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.06 seconds

As you can see, the code runs so quickly that going a step further (to quadruples) using similar techniques seems realistic. The code amendment to achieve this is not quite straightforward, though.



For comparison, here's what the greedy approach from PROC OPTMODEL yields:

group_size group_size_freq
2 296
3 17
4 9
5 2
8 1
9 1
13 1
30 1
63 1
373 1
4213 1

Because only 335 observations have nonnegative values and each group must contain at least one such observation, 335 is an upper bound on the number of groups.  So you cannot improve much beyond the greedy 331-group solution for this instance.
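As a sanity check on the table above: the group frequencies sum to 331 groups, and the group sizes weighted by frequency account for all 5398 observations. In Python:

```python
# group_size -> number of groups, from the PROC OPTMODEL greedy output above
greedy = {2: 296, 3: 17, 4: 9, 5: 2, 8: 1, 9: 1,
          13: 1, 30: 1, 63: 1, 373: 1, 4213: 1}

n_groups = sum(greedy.values())                            # 331 groups
n_obs = sum(size * freq for size, freq in greedy.items())  # 5398 observations
```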
