Solved: cumulative sum by group and under some criteria

jojo · Posted 09-23-2012 05:06 PM

I need to create a new variable 'cumulative'.

Within each by group, which has same n1, m1,I need to look at the p2, p12, and term, if there is any obs that p2 <= current p2, p12> current p12, then sum term. for example, the first obs, p2= 0.75,p12 = 0.1,so for all other sets in the by group (n1 = 3 m1 =16),if p2 <=0.75 and p12 > 0.1, then sum the 'term',in this example, only the 2nd obs satisfies the criteria,so cumulative= 0.3;for second obs,p2= 0.75,p12= 0.2, because no obs satisfy the criteria (p2<=0.75 and p12 >0.2) in the by group , so the cumulative = 0.

Old data:

Obs n1 m1 x P2 p12 Term

1 3 16 4 0.75 0.1 5

2 3 16 5 0.75 0.2 3

3 3 16 6 0.76 0.5 6

4 3 16 7 0.76 0.1 1

5 3 16 9 0.76 0.6 8

6 3 16 4 0.78 0.3 5

7 3 16 7 0.78 0.2 8

8 3 17 4 0.25 0.5 4

9 3 17 3 0.25 0.2 3

10 3 17 2 0.35 0.3 9

11 3 17 1 0.35 0.5 9

12 3 17 0 0.46 0.6 5

13 3 17 7 0.46 0.5 7

New data:

Obs n1 m1 x P2 p12l Term cumulative

1 3 16 4 0.75 0.1 5 3

2 3 16 5 0.75 0.2 3 0

3 3 16 6 0.76 0.5 6 8

4 3 16 7 0.76 0.1 1 17

5 3 16 9 0.76 0.6 8 0

6 3 16 4 0.78 0.3 5 14

7 3 16 7 0.78 0.2 8 19

8 3 17 4 0.25 0.5 4 0

9 3 17 3 0.25 0.2 3 4

10 3 17 2 0.35 0.3 9 13

11 3 17 1 0.35 0.5 9 0

12 3 17 0 0.46 0.6 5 0

13 3 17 7 0.46 0.5 7 5

Haikuo · Posted 09-23-2012 07:01 PM

And you feel SQL has a performance issue, then try following hash():

data want;

if _n_=1 then do;

if 0 then set have(rename=(term=bterm p2=bp2 p12=bp12));

dcl hash h(dataset:'have(rename=(term=bterm p2=bp2 p12=bp12))', multidata:'y');

h.definekey('n1','m1');

h.definedata('bp2','bp12','bterm') ;

h.definedone();

end;

set have;

cumulative=0;

rc=h.find();

do rc=0 by 0 while (rc=0);

if bp2<=p2 and bp12>p12 then cumulative+bterm;

rc=h.find_next();

end;

drop b: rc;

run;

Haikuo

View solution in original post

Haikuo · Posted 09-23-2012 06:33 PM

Here is one Proc SQL approach:

data have;

input Obs n1 m1 x P2 p12 Term;

cards;

1 3 16 4 0.75 0.1 5

2 3 16 5 0.75 0.2 3

3 3 16 6 0.76 0.5 6

4 3 16 7 0.76 0.1 1

5 3 16 9 0.76 0.6 8

6 3 16 4 0.78 0.3 5

7 3 16 7 0.78 0.2 8

8 3 17 4 0.25 0.5 4

9 3 17 3 0.25 0.2 3

10 3 17 2 0.35 0.3 9

11 3 17 1 0.35 0.5 9

12 3 17 0 0.46 0.6 5

13 3 17 7 0.46 0.5 7

;

proc sql noprint;

create table want as

select distinct Obs ,n1, m1, x, P2, p12, case when sum(bterm)>0 then sum(bterm) else 0 end as cumulative from

( select distinct a.*, b.term as bterm from

have a

left join

have b

on a.n1=b.n1 and a.m1=b.m1 and b.p2<=a.p2 and b.p12>a.p12 )

group by obs;

quit;

proc print;run;

Haikuo

Haikuo · Posted 09-23-2012 07:01 PM

And you feel SQL has a performance issue, then try following hash():

data want;

if _n_=1 then do;

if 0 then set have(rename=(term=bterm p2=bp2 p12=bp12));

dcl hash h(dataset:'have(rename=(term=bterm p2=bp2 p12=bp12))', multidata:'y');

h.definekey('n1','m1');

h.definedata('bp2','bp12','bterm') ;

h.definedone();

end;

set have;

cumulative=0;

rc=h.find();

do rc=0 by 0 while (rc=0);

if bp2<=p2 and bp12>p12 then cumulative+bterm;

rc=h.find_next();

end;

drop b: rc;

run;

Haikuo

jojo · Posted 09-24-2012 02:01 PM

Thank you!

TomKari · Posted 09-23-2012 07:33 PM

I don't really understand how you're deriving your "cumulative", but here's some code that I think does some of the things you want.

Tom

DATA HAVE;
INPUT n1 m1 x P2 p12 Term;
DATALINES;
3 16 4 0.75 0.1 5
3 16 5 0.75 0.2 3
3 16 6 0.76 0.5 6
3 16 7 0.76 0.1 1
3 16 9 0.76 0.6 8
3 16 4 0.78 0.3 5
3 16 7 0.78 0.2 8
3 17 4 0.25 0.5 4
3 17 3 0.25 0.2 3
3 17 2 0.35 0.3 9
3 17 1 0.35 0.5 9
3 17 0 0.46 0.6 5
3 17 7 0.46 0.5 7
RUN;

DATA WANT;
RETAIN P2_REFERENCE P12_REFERENCE CUMULATIVE;
SET HAVE;
BY n1 m1;
IF FIRST.m1 THEN
DO;
P2_REFERENCE = P2;
P12_REFERENCE = P12;
CUMULATIVE = 0;
END;
ELSE IF P2 <= P2_REFERENCE & P12 > P12_REFERENCE THEN CUMULATIVE = CUMULATIVE + TERM;
RUN;

jojo · Posted 09-24-2012 11:54 AM

Thank you. The reference value should be the value of p2 and p12 for the obs itself, not the first obs in the by group.

jojo · Posted 10-23-2012 05:03 PM

this probably is the most efficient method, it works very well with a part of my data, with the whole data, I got a message "computer is out of resources'. So I want to separate the data to several subsets of data(for example, every 5 blocks in one dataset) and try to create a macro using the code you provide, I am not familar with the hash table code, for the following statement : dcl hash h(dataset:'have(rename=(term=bterm p2=bp2 p12=bp12))', multidata:'y'); , can the 'have' be a macro variable?

Haikuo · Posted 10-23-2012 05:12 PM

It sure can. And then you need to replace the single quote to double quote so that macro variable can resolve.

dcl hash h(dataset:"&var.(rename=(term=bterm p2=bp2 p12=bp12))", multidata:'y');

Haikuo

jojo · Posted 10-25-2012 12:40 PM

is it possible to include the hash table while reading each observation and find the cumulative sum for each observation? I am trying to use more efficient code, I need to run this on a giant dataset that includes over 500 million obs. Specifically, can the second data step be included in the first step?

%let userN = 10;

%let p01 = 0.05;

%let p02= 0.05;

%let p11 = 0.25;

%let p12 = 0.25;

data s1_A;

do n1 = 2 to &userN-2;

do m1 = 2 to &userN-2 while ((m1 + n1) < &userN-2);

do y1 = 0 to n1-1 ;

do x1 = 0 to m1-1 ;

p11 = x1/m1;

p12 = y1/n1;

if p11 > p12 then do;

p11_final = (m1 * x1/m1 + n1 * y1/n1)/(m1+n1);

p12_final = p11_final;

end;

else do;

P11_final = p11;

P12_final = p12;

end;

term1_p1 = pdf('BINOMIAL', x1, &p11, m1)* pdf('BINOMIAL', y1, &p12, n1);

term1_p0 = pdf('BINOMIAL', x1, &p01, m1)* pdf('BINOMIAL', y1, &p02, n1);

term1_p0p1 = pdf('BINOMIAL', x1, &p01, m1)* pdf('BINOMIAL', y1, &p12, n1);

output;

end;

run;

data s1_B;

if _n_=1 then do;

if 0 then set s1_A(rename=(term1_p0=bt1p0 term1_p1=bt1p1 term1_p0p1=bt1p0p1 p12_final=bp12 p11_final=bp11));

dcl hash h(dataset:'s1_A(rename=(term1_p0=bt1p0 term1_p1=bt1p1 term1_p0p1=bt1p0p1 p12_final=bp12 p11_final=bp11))', multidata:'y');

h.definekey('n1','m1');

h.definedata('bp12','bp11', 'bt1p0', 'bt1p1', 'bt1p0p1') ;

h.definedone();

end;

set s1_A;

ct1p0H=0;

ct1p1H=0;

ct1p0p1H=0;

ct1p0L=0;

ct1p1L=0;

ct1p0p1L=0;

rc=h.find();

do rc=0 by 0 while (rc=0);

if bp12<=p12_final then do;

ct1p1H + bt1p1;

ct1p0H + bt1p0;

ct1p0p1H + bt1p0p1;

end;

if bp11<=p11_final then do;

ct1p1L + bt1p1;

ct1p0L + bt1p0;

ct1p0p1L + bt1p0p1;

end;

rc=h.find_next();

end;

drop b: rc;

if ct1p1H <=0.15 or ct1p1L <= 0.15;

run;

cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Re: cumulative sum by group and under some criteria

Catch up on SAS Innovate 2026