SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
adornodj
Obsidian | Level 7

Hello - thanks in advance for any help.  I have a data table with 300+ records.  I'm trying to identify combinations of lines that net to $0.00 automatically.  Is there a way to do this within SAS?  Thank you again for any input you may have.  A simplified example is below...

 

1/ -5

2/ 2

3/ 3

4/ -10

5/ 4

6/ 6

 

In the above example, rows 1-3 and 4-6 would get grouped together since they net to $0.00 each.  (-5 + 2 + 3) and (-10 + 4 + 6).  Is there a way to identify these combinations automatically within SAS? 

1 ACCEPTED SOLUTION

Accepted Solutions
RobPratt
SAS Super FREQ

I'm not sure yet whether it is optimal, but the following greedy heuristic (which tries to partition the observations into a maximum number of groups) yields 86 groups:

 

proc optmodel;
   set OBS_ALL;
   num x {OBS_ALL};
   read data have into OBS_ALL=[_N_] x=BASE_AMT;
   set OBS init OBS_ALL;

   /* Select[i] = 1 if observation i is selected to be in the current group; 0 otherwise */
   var Select {OBS} binary;

   /* selected observations sum to 0 */
   con SumToZero:
      sum {i in OBS} x[i] * Select[i] = 0;

   /* select at least one observation */
   con Nonempty:
      sum {i in OBS} Select[i] >= 1;

   /* minimize number selected */
   min NumSelected = sum {i in OBS} Select[i];

   num group {OBS_ALL};
   set SELECTED = {i in OBS: Select[i].sol > 0.5};
   num g init 0;

   /* greedy heuristic */
   do until (_solution_status_ = 'INFEASIBLE' or card(OBS) = 0);
      /* call MILP solver */
      solve;
      put SELECTED=;
      g = g + 1;
      for {i in SELECTED} group[i] = g;
      OBS = OBS diff SELECTED;
   end;

   create data want(drop=i) from [i] x group;
quit;

 

The resulting group size frequencies are:

group_size group_size_freq
1 6
2 50
3 11
4 12
5 5
8 1
115 1

 

That is, 6 singletons (the ones with BASE_AMT = 0), 50 pairs, 11 triples, ..., and one big group of 115. 

View solution in original post

17 REPLIES 17
ChrisNZ
Tourmaline | Level 20

1. Please change the title of your question to something meaningful such as "Combination of numbers that sum to a target value" to increase the chances of getting an answer.

 

2. I am not aware of any such feature in SAS, though SAS/IML can probably find such combinations.

 

3. I found these solutions using Excel and C++ in case it helps.

http://www.k2e.com/tech-update/tips/147-using-excel-to-identify-entries-that-add-to-a-specific-value

https://leetcode.com/problems/combination-sum/description/

 

 

 
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Why doesn't all 6 get grouped together, all in they all sum up to 0?  Need to define your logic a bit better.  Anyways, a retained value should be enough - not not tested, post test data in the form of a datastep and what the output should look like!

data want;
  set have;
  retain grp sm;
  if _n_=1 then do;
    grp=1;
    sm=val;
  end;
  else do;
    sm=sum(sm,val);
    if sm=0 then grp=grp+1;
  end;
run;
  
s_lassen
Meteorite | Level 14

Interesting...

I assume that you only want a series of subsequent lines that add to 0?

But how to group? there can still be many possibilities. I did an example program like this:

data have;
  do i=1 to 1000;
    x=ceil(rand('UNIFORM')*21)-11;
    output;
    end;
run;

data groups;
  set have nobs=nobs;
  sum=0;
  do _P_=_N_ to nobs;
    set have(keep=x rename=(x=x2)) point=_P_;
    sum+x2;
    if sum=0 then do;
      shortest_group=_P_-_N_+1;
      leave;
      end;
    end;
  keep i x shortest_group;
run;

So that for each line, you get the length of the shortest group that can start here. It is easy to see that a lot of them overlap; which ones would you choose for creating your group variable? Do you want the largest number of (non-overlapping) groups possible, or do you want to have as many lines as possible inside groups, or something else?

FreelanceReinh
Jade | Level 19

I fully agree with RW9 and s_lassen that the specifications should be more precise. Restricting the groups to consecutive records would indeed avoid substantial complexity which otherwise might require SAS/OR capabilities.

 

The possibility of overlapping groups has already been mentioned (like if in your own example 4 was replaced by 7).

 

If this is real-world data, aren't there other variables indicating likely or impossible groupings?

 

What if a group of values happen to sum to zero by chance?

 

In any case, make sure you avoid numeric representation issues if you are dealing with non-integer values.

 

Example:

data caveat;
input amt;
s+amt;
if s=0 then put 'This might NOT occur!';
if round(s,.001)=0 then put 'Rounding helps!';
cards;
4.90
1.90
-6.80
;
adornodj
Obsidian | Level 7

Hello All - First off, thank you for the responses. I've updated to subject line to better reflect my question. I was working on two items late last night and apparently grouped both thoughts into this thread by mistake. A little more information on my question...

 

-Yes, the file nets down to $0.00 naturally as a whole. That said, we are trying to get it down into smaller subsets.

-Unfortunately, there are no additional character fields to utilize in order to group down the data smaller. We are left only with the amount field.  

-Yes, it is plausible that some lines may group together to form a bigger cluster (like my example, 6 lines, versus splitting into 2 groups of 3).  This is somewhat unlikely based on how the amounts are created though.  

Ksharp
Super User

I am afraid you have to resort to SAS/OR .

Post it at OR forum and calling @RobPratt

ballardw
Super User

Do the values have to be grouped in sequence?

 

How do you expect the result to look for your given example?

adornodj
Obsidian | Level 7

No, the data does not (and often will not be) grouped in sequence.  

 

Honestly, the results will vary.  I've manually gone through a few iterations (I have several files to do this on with 300+ records).  I had one  table that nearly all rows paired off and an equal net value (ex: +100 / -100) and another where I was able to find 5/300 rows that netted and the remainder had to be made into one big group. 

 

I am working on Enterprise Guide, should have mentioned that earlier.

RobPratt
SAS Super FREQ

Can you please share a representative data set, or maybe the most difficult one?

adornodj
Obsidian | Level 7

All -as requested, here's a dataset. It's just a single column of amounts.  

ballardw
Super User

@adornodj wrote:

All -as requested, here's a dataset. It's just a single column of amounts.  


Your data just threw another curve: What do you want with 0 values? Since 0 doesn't actually affect sums very much do they require special treatment? Such as all zero values in one "group" or a maximum of one zero per "group" or any other rules?

RobPratt
SAS Super FREQ

I'm not sure yet whether it is optimal, but the following greedy heuristic (which tries to partition the observations into a maximum number of groups) yields 86 groups:

 

proc optmodel;
   set OBS_ALL;
   num x {OBS_ALL};
   read data have into OBS_ALL=[_N_] x=BASE_AMT;
   set OBS init OBS_ALL;

   /* Select[i] = 1 if observation i is selected to be in the current group; 0 otherwise */
   var Select {OBS} binary;

   /* selected observations sum to 0 */
   con SumToZero:
      sum {i in OBS} x[i] * Select[i] = 0;

   /* select at least one observation */
   con Nonempty:
      sum {i in OBS} Select[i] >= 1;

   /* minimize number selected */
   min NumSelected = sum {i in OBS} Select[i];

   num group {OBS_ALL};
   set SELECTED = {i in OBS: Select[i].sol > 0.5};
   num g init 0;

   /* greedy heuristic */
   do until (_solution_status_ = 'INFEASIBLE' or card(OBS) = 0);
      /* call MILP solver */
      solve;
      put SELECTED=;
      g = g + 1;
      for {i in SELECTED} group[i] = g;
      OBS = OBS diff SELECTED;
   end;

   create data want(drop=i) from [i] x group;
quit;

 

The resulting group size frequencies are:

group_size group_size_freq
1 6
2 50
3 11
4 12
5 5
8 1
115 1

 

That is, 6 singletons (the ones with BASE_AMT = 0), 50 pairs, 11 triples, ..., and one big group of 115. 

RobPratt
SAS Super FREQ

Your problem is to partition the observations into zero-sum groups.  If the objective is to maximize the number of groups, it is clear that you want to treat each singleton (BASE_AMT = 0) as its own group, because any group that contains more than one observation and includes a 0 can be split into two groups, each with zero sum.  Less obvious is the fact that you can treat each pair x and -x as their own group, without loss of optimality.  There are two cases to consider, given any partition of the observations into zero-sum groups:

1. If a group contains more than two observations and includes a pair x and -x, you can split off the pair into its own group {x,-x}, preserving zero sums, and thereby increase the number of groups by one.

2. If x and -x appear in two different groups, say G1 and G2, you can reorganize as two groups {x,-x} and (G1 diff {x}) union (G2 diff {-x}), each with zero sum, and thereby preserve the number of groups.  Explicitly, G1 diff {x} sums to -x (because G1 sums to 0), and G2 diff {-x} sums to x, so their union sums to -x + x = 0.

 

This means that the first two rounds of the greedy algorithm are optimal in the sense that removing singletons and pairs does not ruin your chances of maximizing the number of groups.

ballardw
Super User

Here is an example of a brute force not terribly elegant using only data step code approach and has the potential of not actually working depending on your actual data. This is an example of find the pairs that total 0, then find the groups of 3 that total 0. The pattern could be continued. This captures both the index (original row number) as well as the values on those rows with one record per group. I did two levels of this.

data have;
   input BASE_AMT;
datalines;
-758086.38
-715706.92
-308866.51
-65557.44
-39680
-22150.98
-20967.45
-18659
-13091.9
-8400
-7800
-5902.99
-4900
-3800
-3567.64
-3472
-3201.42
-3108.97
-2462
-2400
-1975
-1975
-1964
-1425.95
-1256.03
-1200
-7492.68
7492.68
-7301.35
7301.35
-1200
-1200
-3346.65
3346.65
-1026.9
-912.5
-912.5
-2862.83
2862.83
-849.57
-2499
2499
-822.87
-781.76
-2357.55
2357.55
-781.76
-2272.4
2272.4
-2060.21
2060.21
-750
-750
-750
-750
-1893.63
1893.63
-1739.47
1739.47
-690.63
-1509.29
1509.29
-1492.8
1492.8
-673.44
-673.44
-1399
1399
-673.44
-673.44
-1226.26
1226.26
-663.34
-663.33
-663.33
-658.53
-971.51
971.51
-945.45
945.45
-600.4
-600
-884.25
884.25
-593.74
-576.89
-566.91
-799.59
799.59
-554.47
-544.28
-526.75
-519.78
-515.62
-510
-510
-500
-692.2
692.2
-500
-469.08
-468.45
-394
-372.26
-369.08
-361.42
-361.42
-347.89
-347.89
-325.4
-318.6
-300
-581.25
581.25
-300
-296.86
-560
560
-288.43
-279.55
-540.45
540.45
-261.84
-524.6
524.6
-255.09
-251.25
-250
-247.08
-226.67
-217
-189.1
-497.2
497.2
-181.82
-485.35
485.35
-181.53
-180.08
-455.1
455.1
-154
-446.77
446.77
-418.2
418.2
-147.16
-384.7
384.7
-146
-137.84
-132.43
-130.62
-129.96
-129
-116.1
-313.01
313.01
-302.95
302.95
-113.75
-104.34
-101.32
-99
-94.73
-280.75
280.75
-90
-87.84
-83.28
-81
-79.77
-79.24
-78.84
-77.71
-73.26
-219.96
219.96
-68.11
-68.09
-216.86
216.86
-64.28
-64.12
-185.6
185.6
-62.91
-62.64
-62.37
174.14
-174.14
-173.85
173.85
-62.35
-59.76
-149.85
149.85
-59.04
-59.01
-56.43
-55.68
-54.5
-54.04
-52.8
-126
126
-120.55
120.55
-52.57
-52.52
-50
-49.36
-47.25
-47.22
-44.56
-43.21
-42
-40.46
-36.3
-88.3
88.3
-35.4
-34.61
-33.42
-31.66
-30.96
-30.67
-29.12
-73.85
73.85
-28.9
-68.72
68.72
-28.24
-28.09
-27.56
-26.23
-25.65
-25.1
-24.44
-24.22
-59.55
59.55
-23.42
-20.74
-57.9
57.9
-19.41
-56.24
56.24
-18.65
-18.41
-17.12
-16.65
-15.72
-14.75
-13.54
-13.38
-12.16
-11.8
-11.7
-11.66
-11.53
-42.28
42.28
-11.45
-11.22
-10.62
-10
-8.9
-34.35
34.35
-8.77
-8.64
-8.44
-7.89
-6.24
-6
-5.64
-5.35
-4.76
-4.52
-4.48
-3.51
-3.41
-2.99
-2.85
-2.65
-2.53
-18.9
18.9
-2.46
-2.37
-1.96
-1.76
-1.63
46.23
61.35
88.78
89.77
92
107.49
167.52
212.34
218.12
229.44
267.52
296.38
451.3
485.73
510
629.13
790.82
799.37
835.03
1329.06
1443.56
1581.69
1928.62
2274.62
2726.63
3000.4
4093.26
5426.75
9014.1
25414.75
43152
44807.36
1906992.59
0
0
0
0
0
0
;
run;

proc transpose data=have out=work.trans
  prefix=ba
;
var base_amt;
run;

data work.pairs ( keep=iout jout ival jval)
     work.leftover (keep=ba:)
;
   set work.trans end=last;
   array b ba:;
   do i= 1 to (dim(b)-1);
      do j= 2 to dim(b);
         if sum(b[i],b[j])= 0 then do;
            if not missing(b[i]) then iout=i;
            if not missing(b[j]) then jout=j;
            ival = b[i];
            jval = b[j];
            output work.pairs;
            call missing(b[i],b[j],iout,jout);
            leave;
         end;
      end;
   end;
  
   if last then output work.leftover;
run;

data work.trios ( keep=iout jout kout ival jval kval)
     work.leftover2 (keep=ba:)
;
   set work.leftover end=last;
   array b ba:;
   do i= 1 to (dim(b)-2);
      do j= 2 to (dim(b)-1);
         do k= 3 to dim(b);
            if sum(b[i],b[j],b[k])= 0 then do;
               if not missing(b[i]) then iout=i;
               if not missing(b[j]) then jout=j;
               if not missing(b[k]) then kout=k;
               ival = b[i];
               jval = b[j];
               kval = b[k];
               output work.trios;
               call missing(b[i],b[j],b[k],iout,jout,kout);
               leave;
            end;
         end;
      end;
   end;
   if last then output work.leftover2;
run;

Actually one data step with "large" number of nested do loops should work but the actual number of nested levels will depend on the number of rows of data and the actual values.

 

The potential of not getting everything assigned to a group comes about when a value that is needed to complete a "complex" group of maybe 6 or 7 values gets used in a smaller set.

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 2788 views
  • 4 likes
  • 8 in conversation