Solved: Re: Net Groups of Lines in a Table to $0.00

adornodj · Posted 06-12-2018 12:40 AM

Hello - thanks in advance for any help. I have a data table with 300+ records. I'm trying to identify combinations of lines that net to $0.00 automatically. Is there a way to do this within SAS? Thank you again for any input you may have. A simplified example is below...

1/ -5

2/ 2

3/ 3

4/ -10

5/ 4

6/ 6

In the above example, rows 1-3 and 4-6 would get grouped together since they net to $0.00 each. (-5 + 2 + 3) and (-10 + 4 + 6). Is there a way to identify these combinations automatically within SAS?

RobPratt · Posted 06-12-2018 03:11 PM

I'm not sure yet whether it is optimal, but the following greedy heuristic (which tries to partition the observations into a maximum number of groups) yields 86 groups:

proc optmodel;
   set OBS_ALL;
   num x {OBS_ALL};
   read data have into OBS_ALL=[_N_] x=BASE_AMT;
   set OBS init OBS_ALL;

   /* Select[i] = 1 if observation i is selected to be in the current group; 0 otherwise */
   var Select {OBS} binary;

   /* selected observations sum to 0 */
   con SumToZero:
      sum {i in OBS} x[i] * Select[i] = 0;

   /* select at least one observation */
   con Nonempty:
      sum {i in OBS} Select[i] >= 1;

   /* minimize number selected */
   min NumSelected = sum {i in OBS} Select[i];

   num group {OBS_ALL};
   set SELECTED = {i in OBS: Select[i].sol > 0.5};
   num g init 0;

   /* greedy heuristic */
   do until (_solution_status_ = 'INFEASIBLE' or card(OBS) = 0);
      /* call MILP solver */
      solve;
      put SELECTED=;
      g = g + 1;
      for {i in SELECTED} group[i] = g;
      OBS = OBS diff SELECTED;
   end;

   create data want(drop=i) from [i] x group;
quit;

The resulting group size frequencies are:

group_size	group_size_freq
1	6
2	50
3	11
4	12
5	5
8	1
115	1

That is, 6 singletons (the ones with BASE_AMT = 0), 50 pairs, 11 triples, ..., and one big group of 115.

View solution in original post

ChrisNZ · Posted 06-12-2018 01:28 AM

1. Please change the title of your question to something meaningful such as "Combination of numbers that sum to a target value" to increase the chances of getting an answer.

2. I am not aware of any such feature in SAS, though SAS/IML can probably find such combinations.

3. I found these solutions using Excel and C++ in case it helps.

http://www.k2e.com/tech-update/tips/147-using-excel-to-identify-entries-that-add-to-a-specific-value

https://leetcode.com/problems/combination-sum/description/

High-Performance SAS Coding - Third Edition

RW9 · Posted 06-12-2018 04:03 AM

Why doesn't all 6 get grouped together, all in they all sum up to 0? Need to define your logic a bit better. Anyways, a retained value should be enough - not not tested, post test data in the form of a datastep and what the output should look like!

data want;
  set have;
  retain grp sm;
  if _n_=1 then do;
    grp=1;
    sm=val;
  end;
  else do;
    sm=sum(sm,val);
    if sm=0 then grp=grp+1;
  end;
run;

s_lassen · Posted 06-12-2018 07:41 AM

Interesting...

I assume that you only want a series of subsequent lines that add to 0?

But how to group? there can still be many possibilities. I did an example program like this:

data have;
  do i=1 to 1000;
    x=ceil(rand('UNIFORM')*21)-11;
    output;
    end;
run;

data groups;
  set have nobs=nobs;
  sum=0;
  do _P_=_N_ to nobs;
    set have(keep=x rename=(x=x2)) point=_P_;
    sum+x2;
    if sum=0 then do;
      shortest_group=_P_-_N_+1;
      leave;
      end;
    end;
  keep i x shortest_group;
run;

So that for each line, you get the length of the shortest group that can start here. It is easy to see that a lot of them overlap; which ones would you choose for creating your group variable? Do you want the largest number of (non-overlapping) groups possible, or do you want to have as many lines as possible inside groups, or something else?

FreelanceReinh · Posted 06-12-2018 07:57 AM

I fully agree with RW9 and s_lassen that the specifications should be more precise. Restricting the groups to consecutive records would indeed avoid substantial complexity which otherwise might require SAS/OR capabilities.

The possibility of overlapping groups has already been mentioned (like if in your own example 4 was replaced by 7).

If this is real-world data, aren't there other variables indicating likely or impossible groupings?

What if a group of values happen to sum to zero by chance?

In any case, make sure you avoid numeric representation issues if you are dealing with non-integer values.

Example:

data caveat;
input amt;
s+amt;
if s=0 then put 'This might NOT occur!';
if round(s,.001)=0 then put 'Rounding helps!';
cards;
4.90
1.90
-6.80
;

adornodj · Posted 06-12-2018 08:30 AM

Hello All - First off, thank you for the responses. I've updated to subject line to better reflect my question. I was working on two items late last night and apparently grouped both thoughts into this thread by mistake. A little more information on my question...

-Yes, the file nets down to $0.00 naturally as a whole. That said, we are trying to get it down into smaller subsets.

-Unfortunately, there are no additional character fields to utilize in order to group down the data smaller. We are left only with the amount field.

-Yes, it is plausible that some lines may group together to form a bigger cluster (like my example, 6 lines, versus splitting into 2 groups of 3). This is somewhat unlikely based on how the amounts are created though.

Ksharp · Posted 06-12-2018 09:18 AM

I am afraid you have to resort to SAS/OR .

Post it at OR forum and calling @RobPratt

ballardw · Posted 06-12-2018 10:33 AM

Do the values have to be grouped in sequence?

How do you expect the result to look for your given example?

adornodj · Posted 06-12-2018 11:14 AM

No, the data does not (and often will not be) grouped in sequence.

Honestly, the results will vary. I've manually gone through a few iterations (I have several files to do this on with 300+ records). I had one table that nearly all rows paired off and an equal net value (ex: +100 / -100) and another where I was able to find 5/300 rows that netted and the remainder had to be made into one big group.

I am working on Enterprise Guide, should have mentioned that earlier.

RobPratt · Posted 06-12-2018 11:49 AM

Can you please share a representative data set, or maybe the most difficult one?

adornodj · Posted 06-12-2018 12:05 PM

All -as requested, here's a dataset. It's just a single column of amounts.

ballardw · Posted 06-12-2018 03:08 PM

@adornodj wrote:

All -as requested, here's a dataset. It's just a single column of amounts.

Your data just threw another curve: What do you want with 0 values? Since 0 doesn't actually affect sums very much do they require special treatment? Such as all zero values in one "group" or a maximum of one zero per "group" or any other rules?

RobPratt · Posted 06-12-2018 03:11 PM

I'm not sure yet whether it is optimal, but the following greedy heuristic (which tries to partition the observations into a maximum number of groups) yields 86 groups:

proc optmodel;
   set OBS_ALL;
   num x {OBS_ALL};
   read data have into OBS_ALL=[_N_] x=BASE_AMT;
   set OBS init OBS_ALL;

   /* Select[i] = 1 if observation i is selected to be in the current group; 0 otherwise */
   var Select {OBS} binary;

   /* selected observations sum to 0 */
   con SumToZero:
      sum {i in OBS} x[i] * Select[i] = 0;

   /* select at least one observation */
   con Nonempty:
      sum {i in OBS} Select[i] >= 1;

   /* minimize number selected */
   min NumSelected = sum {i in OBS} Select[i];

   num group {OBS_ALL};
   set SELECTED = {i in OBS: Select[i].sol > 0.5};
   num g init 0;

   /* greedy heuristic */
   do until (_solution_status_ = 'INFEASIBLE' or card(OBS) = 0);
      /* call MILP solver */
      solve;
      put SELECTED=;
      g = g + 1;
      for {i in SELECTED} group[i] = g;
      OBS = OBS diff SELECTED;
   end;

   create data want(drop=i) from [i] x group;
quit;

The resulting group size frequencies are:

group_size	group_size_freq
1	6
2	50
3	11
4	12
5	5
8	1
115	1

That is, 6 singletons (the ones with BASE_AMT = 0), 50 pairs, 11 triples, ..., and one big group of 115.

RobPratt · Posted 06-13-2018 11:01 PM

Your problem is to partition the observations into zero-sum groups. If the objective is to maximize the number of groups, it is clear that you want to treat each singleton (BASE_AMT = 0) as its own group, because any group that contains more than one observation and includes a 0 can be split into two groups, each with zero sum. Less obvious is the fact that you can treat each pair x and -x as their own group, without loss of optimality. There are two cases to consider, given any partition of the observations into zero-sum groups:

1. If a group contains more than two observations and includes a pair x and -x, you can split off the pair into its own group {x,-x}, preserving zero sums, and thereby increase the number of groups by one.

2. If x and -x appear in two different groups, say G1 and G2, you can reorganize as two groups {x,-x} and (G1 diff {x}) union (G2 diff {-x}), each with zero sum, and thereby preserve the number of groups. Explicitly, G1 diff {x} sums to -x (because G1 sums to 0), and G2 diff {-x} sums to x, so their union sums to -x + x = 0.

This means that the first two rounds of the greedy algorithm are optimal in the sense that removing singletons and pairs does not ruin your chances of maximizing the number of groups.

ballardw · Posted 06-12-2018 03:44 PM

Here is an example of a brute force not terribly elegant using only data step code approach and has the potential of not actually working depending on your actual data. This is an example of find the pairs that total 0, then find the groups of 3 that total 0. The pattern could be continued. This captures both the index (original row number) as well as the values on those rows with one record per group. I did two levels of this.

data have;
   input BASE_AMT;
datalines;
-758086.38
-715706.92
-308866.51
-65557.44
-39680
-22150.98
-20967.45
-18659
-13091.9
-8400
-7800
-5902.99
-4900
-3800
-3567.64
-3472
-3201.42
-3108.97
-2462
-2400
-1975
-1975
-1964
-1425.95
-1256.03
-1200
-7492.68
7492.68
-7301.35
7301.35
-1200
-1200
-3346.65
3346.65
-1026.9
-912.5
-912.5
-2862.83
2862.83
-849.57
-2499
2499
-822.87
-781.76
-2357.55
2357.55
-781.76
-2272.4
2272.4
-2060.21
2060.21
-750
-750
-750
-750
-1893.63
1893.63
-1739.47
1739.47
-690.63
-1509.29
1509.29
-1492.8
1492.8
-673.44
-673.44
-1399
1399
-673.44
-673.44
-1226.26
1226.26
-663.34
-663.33
-663.33
-658.53
-971.51
971.51
-945.45
945.45
-600.4
-600
-884.25
884.25
-593.74
-576.89
-566.91
-799.59
799.59
-554.47
-544.28
-526.75
-519.78
-515.62
-510
-510
-500
-692.2
692.2
-500
-469.08
-468.45
-394
-372.26
-369.08
-361.42
-361.42
-347.89
-347.89
-325.4
-318.6
-300
-581.25
581.25
-300
-296.86
-560
560
-288.43
-279.55
-540.45
540.45
-261.84
-524.6
524.6
-255.09
-251.25
-250
-247.08
-226.67
-217
-189.1
-497.2
497.2
-181.82
-485.35
485.35
-181.53
-180.08
-455.1
455.1
-154
-446.77
446.77
-418.2
418.2
-147.16
-384.7
384.7
-146
-137.84
-132.43
-130.62
-129.96
-129
-116.1
-313.01
313.01
-302.95
302.95
-113.75
-104.34
-101.32
-99
-94.73
-280.75
280.75
-90
-87.84
-83.28
-81
-79.77
-79.24
-78.84
-77.71
-73.26
-219.96
219.96
-68.11
-68.09
-216.86
216.86
-64.28
-64.12
-185.6
185.6
-62.91
-62.64
-62.37
174.14
-174.14
-173.85
173.85
-62.35
-59.76
-149.85
149.85
-59.04
-59.01
-56.43
-55.68
-54.5
-54.04
-52.8
-126
126
-120.55
120.55
-52.57
-52.52
-50
-49.36
-47.25
-47.22
-44.56
-43.21
-42
-40.46
-36.3
-88.3
88.3
-35.4
-34.61
-33.42
-31.66
-30.96
-30.67
-29.12
-73.85
73.85
-28.9
-68.72
68.72
-28.24
-28.09
-27.56
-26.23
-25.65
-25.1
-24.44
-24.22
-59.55
59.55
-23.42
-20.74
-57.9
57.9
-19.41
-56.24
56.24
-18.65
-18.41
-17.12
-16.65
-15.72
-14.75
-13.54
-13.38
-12.16
-11.8
-11.7
-11.66
-11.53
-42.28
42.28
-11.45
-11.22
-10.62
-10
-8.9
-34.35
34.35
-8.77
-8.64
-8.44
-7.89
-6.24
-6
-5.64
-5.35
-4.76
-4.52
-4.48
-3.51
-3.41
-2.99
-2.85
-2.65
-2.53
-18.9
18.9
-2.46
-2.37
-1.96
-1.76
-1.63
46.23
61.35
88.78
89.77
92
107.49
167.52
212.34
218.12
229.44
267.52
296.38
451.3
485.73
510
629.13
790.82
799.37
835.03
1329.06
1443.56
1581.69
1928.62
2274.62
2726.63
3000.4
4093.26
5426.75
9014.1
25414.75
43152
44807.36
1906992.59
0
0
0
0
0
0
;
run;

proc transpose data=have out=work.trans
  prefix=ba
;
var base_amt;
run;

data work.pairs ( keep=iout jout ival jval)
     work.leftover (keep=ba:)
;
   set work.trans end=last;
   array b ba:;
   do i= 1 to (dim(b)-1);
      do j= 2 to dim(b);
         if sum(b[i],b[j])= 0 then do;
            if not missing(b[i]) then iout=i;
            if not missing(b[j]) then jout=j;
            ival = b[i];
            jval = b[j];
            output work.pairs;
            call missing(b[i],b[j],iout,jout);
            leave;
         end;
      end;
   end;
  
   if last then output work.leftover;
run;

data work.trios ( keep=iout jout kout ival jval kval)
     work.leftover2 (keep=ba:)
;
   set work.leftover end=last;
   array b ba:;
   do i= 1 to (dim(b)-2);
      do j= 2 to (dim(b)-1);
         do k= 3 to dim(b);
            if sum(b[i],b[j],b[k])= 0 then do;
               if not missing(b[i]) then iout=i;
               if not missing(b[j]) then jout=j;
               if not missing(b[k]) then kout=k;
               ival = b[i];
               jval = b[j];
               kval = b[k];
               output work.trios;
               call missing(b[i],b[j],b[k],iout,jout,kout);
               leave;
            end;
         end;
      end;
   end;
   if last then output work.leftover2;
run;

Actually one data step with "large" number of nested do loops should work but the actual number of nested levels will depend on the number of rows of data and the actual values.

The potential of not getting everything assigned to a group comes about when a value that is needed to complete a "complex" group of maybe 6 or 7 values gets used in a smaller set.

Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Unique Iterations Combination of numbers that sum or match a target value

Re: Unique Iterations

Re: Unique Iterations

Re: Unique Iterations

Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Re: Net Groups of Lines in a Table to $0.00

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!