I need to create a new variable 'cumulative'.
Within each by group, which has same n1, m1,I need to look at the p2, p12, and term, if there is any obs that p2 <= current p2, p12> current p12, then sum term. for example, the first obs, p2= 0.75,p12 = 0.1,so for all other sets in the by group (n1 = 3 m1 =16),if p2 <=0.75 and p12 > 0.1, then sum the 'term',in this example, only the 2nd obs satisfies the criteria,so cumulative= 0.3;for second obs,p2= 0.75,p12= 0.2, because no obs satisfy the criteria (p2<=0.75 and p12 >0.2) in the by group , so the cumulative = 0.
Old data:
Obs n1 m1 x P2 p12 Term
1 3 16 4 0.75 0.1 5
2 3 16 5 0.75 0.2 3
3 3 16 6 0.76 0.5 6
4 3 16 7 0.76 0.1 1
5 3 16 9 0.76 0.6 8
6 3 16 4 0.78 0.3 5
7 3 16 7 0.78 0.2 8
8 3 17 4 0.25 0.5 4
9 3 17 3 0.25 0.2 3
10 3 17 2 0.35 0.3 9
11 3 17 1 0.35 0.5 9
12 3 17 0 0.46 0.6 5
13 3 17 7 0.46 0.5 7
New data:
Obs n1 m1 x P2 p12l Term cumulative
1 3 16 4 0.75 0.1 5 3
2 3 16 5 0.75 0.2 3 0
3 3 16 6 0.76 0.5 6 8
4 3 16 7 0.76 0.1 1 17
5 3 16 9 0.76 0.6 8 0
6 3 16 4 0.78 0.3 5 14
7 3 16 7 0.78 0.2 8 19
8 3 17 4 0.25 0.5 4 0
9 3 17 3 0.25 0.2 3 4
10 3 17 2 0.35 0.3 9 13
11 3 17 1 0.35 0.5 9 0
12 3 17 0 0.46 0.6 5 0
13 3 17 7 0.46 0.5 7 5
And you feel SQL has a performance issue, then try following hash():
data want;
if _n_=1 then do;
if 0 then set have(rename=(term=bterm p2=bp2 p12=bp12));
dcl hash h(dataset:'have(rename=(term=bterm p2=bp2 p12=bp12))', multidata:'y');
h.definekey('n1','m1');
h.definedata('bp2','bp12','bterm') ;
h.definedone();
end;
set have;
cumulative=0;
rc=h.find();
do rc=0 by 0 while (rc=0);
if bp2<=p2 and bp12>p12 then cumulative+bterm;
rc=h.find_next();
end;
drop b: rc;
run;
Haikuo
Here is one Proc SQL approach:
data have;
input Obs n1 m1 x P2 p12 Term;
cards;
1 3 16 4 0.75 0.1 5
2 3 16 5 0.75 0.2 3
3 3 16 6 0.76 0.5 6
4 3 16 7 0.76 0.1 1
5 3 16 9 0.76 0.6 8
6 3 16 4 0.78 0.3 5
7 3 16 7 0.78 0.2 8
8 3 17 4 0.25 0.5 4
9 3 17 3 0.25 0.2 3
10 3 17 2 0.35 0.3 9
11 3 17 1 0.35 0.5 9
12 3 17 0 0.46 0.6 5
13 3 17 7 0.46 0.5 7
;
proc sql noprint;
create table want as
select distinct Obs ,n1, m1, x, P2, p12, case when sum(bterm)>0 then sum(bterm) else 0 end as cumulative from
( select distinct a.*, b.term as bterm from
have a
left join
have b
on a.n1=b.n1 and a.m1=b.m1 and b.p2<=a.p2 and b.p12>a.p12 )
group by obs;
quit;
proc print;run;
Haikuo
And you feel SQL has a performance issue, then try following hash():
data want;
if _n_=1 then do;
if 0 then set have(rename=(term=bterm p2=bp2 p12=bp12));
dcl hash h(dataset:'have(rename=(term=bterm p2=bp2 p12=bp12))', multidata:'y');
h.definekey('n1','m1');
h.definedata('bp2','bp12','bterm') ;
h.definedone();
end;
set have;
cumulative=0;
rc=h.find();
do rc=0 by 0 while (rc=0);
if bp2<=p2 and bp12>p12 then cumulative+bterm;
rc=h.find_next();
end;
drop b: rc;
run;
Haikuo
Thank you!
I don't really understand how you're deriving your "cumulative", but here's some code that I think does some of the things you want.
Tom
DATA HAVE;
INPUT n1 m1 x P2 p12 Term;
DATALINES;
3 16 4 0.75 0.1 5
3 16 5 0.75 0.2 3
3 16 6 0.76 0.5 6
3 16 7 0.76 0.1 1
3 16 9 0.76 0.6 8
3 16 4 0.78 0.3 5
3 16 7 0.78 0.2 8
3 17 4 0.25 0.5 4
3 17 3 0.25 0.2 3
3 17 2 0.35 0.3 9
3 17 1 0.35 0.5 9
3 17 0 0.46 0.6 5
3 17 7 0.46 0.5 7
RUN;
DATA WANT;
RETAIN P2_REFERENCE P12_REFERENCE CUMULATIVE;
SET HAVE;
BY n1 m1;
IF FIRST.m1 THEN
DO;
P2_REFERENCE = P2;
P12_REFERENCE = P12;
CUMULATIVE = 0;
END;
ELSE IF P2 <= P2_REFERENCE & P12 > P12_REFERENCE THEN CUMULATIVE = CUMULATIVE + TERM;
RUN;
Thank you. The reference value should be the value of p2 and p12 for the obs itself, not the first obs in the by group.
this probably is the most efficient method, it works very well with a part of my data, with the whole data, I got a message "computer is out of resources'. So I want to separate the data to several subsets of data(for example, every 5 blocks in one dataset) and try to create a macro using the code you provide, I am not familar with the hash table code, for the following statement : dcl hash h(dataset:'have(rename=(term=bterm p2=bp2 p12=bp12))', multidata:'y'); , can the 'have' be a macro variable?
It sure can. And then you need to replace the single quote to double quote so that macro variable can resolve.
dcl hash h(dataset:"&var.(rename=(term=bterm p2=bp2 p12=bp12))", multidata:'y');
Haikuo
is it possible to include the hash table while reading each observation and find the cumulative sum for each observation? I am trying to use more efficient code, I need to run this on a giant dataset that includes over 500 million obs. Specifically, can the second data step be included in the first step?
%let userN = 10;
%let p01 = 0.05;
%let p02= 0.05;
%let p11 = 0.25;
%let p12 = 0.25;
data s1_A;
do n1 = 2 to &userN-2;
do m1 = 2 to &userN-2 while ((m1 + n1) < &userN-2);
do y1 = 0 to n1-1 ;
do x1 = 0 to m1-1 ;
p11 = x1/m1;
p12 = y1/n1;
if p11 > p12 then do;
p11_final = (m1 * x1/m1 + n1 * y1/n1)/(m1+n1);
p12_final = p11_final;
end;
else do;
P11_final = p11;
P12_final = p12;
end;
term1_p1 = pdf('BINOMIAL', x1, &p11, m1)* pdf('BINOMIAL', y1, &p12, n1);
term1_p0 = pdf('BINOMIAL', x1, &p01, m1)* pdf('BINOMIAL', y1, &p02, n1);
term1_p0p1 = pdf('BINOMIAL', x1, &p01, m1)* pdf('BINOMIAL', y1, &p12, n1);
output;
end;
end;
end;
end;
run;
data s1_B;
if _n_=1 then do;
if 0 then set s1_A(rename=(term1_p0=bt1p0 term1_p1=bt1p1 term1_p0p1=bt1p0p1 p12_final=bp12 p11_final=bp11));
dcl hash h(dataset:'s1_A(rename=(term1_p0=bt1p0 term1_p1=bt1p1 term1_p0p1=bt1p0p1 p12_final=bp12 p11_final=bp11))', multidata:'y');
h.definekey('n1','m1');
h.definedata('bp12','bp11', 'bt1p0', 'bt1p1', 'bt1p0p1') ;
h.definedone();
end;
set s1_A;
ct1p0H=0;
ct1p1H=0;
ct1p0p1H=0;
ct1p0L=0;
ct1p1L=0;
ct1p0p1L=0;
rc=h.find();
do rc=0 by 0 while (rc=0);
if bp12<=p12_final then do;
ct1p1H + bt1p1;
ct1p0H + bt1p0;
ct1p0p1H + bt1p0p1;
end;
if bp11<=p11_final then do;
ct1p1L + bt1p1;
ct1p0L + bt1p0;
ct1p0p1L + bt1p0p1;
end;
rc=h.find_next();
end;
drop b: rc;
if ct1p1H <=0.15 or ct1p1L <= 0.15;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.
Find more tutorials on the SAS Users YouTube channel.