I have
data have;
input ID Date :date9. SIn Date30 :date9. ;
format date: date9.;
cards;
1 12Jul2017 318 11Aug2017
1 11Aug2017 197 10Sep2017
1 10Sep2017 700 10Oct2017
1 10Oct2017 524 09Nov2017
1 09Nov2017 702 09Dec2017
2 29Sep2017 642 29Oct2017
2 09Dec2017 446 08Jan2018
3 04Dec2010 182 03Jan2011
3 02Feb2011 0 04Mar2011
3 03Apr2011 10 20May2011
4 05Dec2010 317 04Jan2011
4 06Dec2010 272 05Jan2011
4 07Dec2010 648 06Jan2011
4 08Dec2010 778 07Jan2011
4 03Mar2011 839 02Apr2011
4 27May2011 784 26Jun2011
;
Date30 is just the variable date + 30 days
I want to vertically sum the SIN variable By ID where the new variable SIN30 is the cumulative sum of all SIN values for the 30 days after the current date. the output should look like this
ID | Date | SIn | Date30 | SIN30 |
1 | 12Jul2017 | 318 | 11Aug2017 | 515 |
1 | 11Aug2017 | 197 | 10Sep2017 | 897 |
1 | 10Sep2017 | 700 | 10Oct2017 | 1224 |
1 | 10Oct2017 | 524 | 09Nov2017 | 1226 |
1 | 09Nov2017 | 702 | 09Dec2017 | 702 |
2 | 29Sep2017 | 642 | 29Oct2017 | 642 |
2 | 09Dec2017 | 446 | 08Jan2018 | 446 |
3 | 04Dec2010 | 182 | 03Jan2011 | 182 |
3 | 02Feb2011 | 0 | 04Mar2011 | 0 |
3 | 03Apr2011 | 10 | 20May2011 | 10 |
4 | 05Dec2010 | 317 | 04Jan2011 | 2015 |
4 | 06Dec2010 | 272 | 05Jan2011 | 1698 |
4 | 07Dec2010 | 648 | 06Jan2011 | 1426 |
4 | 08Dec2010 | 778 | 07Jan2011 | 778 |
4 | 03Mar2011 | 839 | 02Apr2011 | 839 |
4 | 27May2011 | 784 | 26Jun2011 | 784 |
In earlier post https://communities.sas.com/t5/Base-SAS-Programming/Conditional-Vertical-Summation/m-p/478242#M12331... , I was able to find a solution to this issue using the following code.
data want;
if _N_ = 1 then do;
if 0 then set have(rename=(sin=_sin));
declare hash h(dataset:'have(rename=(sin=_sin)');
h.defineKey('id','date');
h.defineData('_sin');
h.defineDone();
end;
set have;
by id;
sin30=0;
do n=date to date30;
if h.find(key:id,key:n)=0 then sin30+_sin;
end;
drop n _sin;
run;
While the code works perfectly, I can not use it because my computer memory is insufficient to execute the hash based code (i'm also creating 36 variables similar to SIN30). I was wondering if there is any other way to solve this problem without using the hash function.
Thanks for all your help in advance.
If the dates are a sparsely spaced as in your example, then for each record you're looking for 30 dates just to harvest 2 or 3 for SIN30.
If it weren't for memory problems, you could just make the hash an ordered hash (by id date), and associate a hash iterator with it. Then you could use the SETCUR method to get the current date and SIN. Then a sequence of NEXT methods to get successive observations until you bypass DATE30. This way there is no excess of unsatisfied FIND methods.
As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id. In that case you could just use DATE as the hash key.
However, the program below is almost certainly faster, as it uses an array (less memory overhead than hash), and since the array is accessed for N=1, 2, ..., it is a key-indexed array providing faster lookup than hash:
data want;
array _dat{100} ;
array _sin{100} ;
do _n=1 by 1 until (last.id);
set have;
by id;
_dat{_n}=date;
_sin{_n}=sin;
end;
do _n2=1 to _n;
set have;
sin30=0;
do _n3=_n2 to _n while (_dat{_n3}<=date30);
if _dat{_n3}>date30 then leave;
else sin30=sin30 + _sin{_n3};
end;
output;
end;
drop _: ;
run;
This program assumes no ID has more than 100 dates. If that's not true, just set the dimension of arrays _dat and _sin to exceed the maximum expected date count.
The program reads a single ID once to populate the arrays. Then it re-reads the same ID, and retrieves data from the array as needed. This type of re-read is probably relatively efficient, since the re-reading probably comes from memory buffers rather than from the disk.
Like your program, this reads the dataset twice. but unlike your program it doesn't read the entire dataset once, and then restarts at the beginning to re-read the entire dataset. Much less efficient use of disk channel resources.
Edited improvement. You can replace
do _n3=_n2 to _n while (_dat{_n3}<=date30);
if _dat{_n3}>date30 then leave;
else sin30=sin30 + _sin{_n3};
end;
with:
do _n3=_n2 to _n while (_dat{_n3}<=date30);
sin30=sin30 + _sin{_n3};
end;
The "if test" was needed in an earlier version, but no longer.
In most cases, splitting your data set up into smaller data sets is not recommended, but in this case, where you are running out of memory on the whole data set, I believe this is a good thing to do.
Split your data into smaller data sets by ID, or perhaps several IDs into one data set.
Thank you for your response. I tried doing that, however, it is not practical, as I have to split the data into 800 smaller datasets.
Regards,
@melsaid04 wrote:
Thank you for your response. I tried doing that, however, it is not practical, as I have to split the data into 800 smaller datasets.
A macro would get the job done, and append the results back together as well.
If the dates are a sparsely spaced as in your example, then for each record you're looking for 30 dates just to harvest 2 or 3 for SIN30.
If it weren't for memory problems, you could just make the hash an ordered hash (by id date), and associate a hash iterator with it. Then you could use the SETCUR method to get the current date and SIN. Then a sequence of NEXT methods to get successive observations until you bypass DATE30. This way there is no excess of unsatisfied FIND methods.
As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id. In that case you could just use DATE as the hash key.
However, the program below is almost certainly faster, as it uses an array (less memory overhead than hash), and since the array is accessed for N=1, 2, ..., it is a key-indexed array providing faster lookup than hash:
data want;
array _dat{100} ;
array _sin{100} ;
do _n=1 by 1 until (last.id);
set have;
by id;
_dat{_n}=date;
_sin{_n}=sin;
end;
do _n2=1 to _n;
set have;
sin30=0;
do _n3=_n2 to _n while (_dat{_n3}<=date30);
if _dat{_n3}>date30 then leave;
else sin30=sin30 + _sin{_n3};
end;
output;
end;
drop _: ;
run;
This program assumes no ID has more than 100 dates. If that's not true, just set the dimension of arrays _dat and _sin to exceed the maximum expected date count.
The program reads a single ID once to populate the arrays. Then it re-reads the same ID, and retrieves data from the array as needed. This type of re-read is probably relatively efficient, since the re-reading probably comes from memory buffers rather than from the disk.
Like your program, this reads the dataset twice. but unlike your program it doesn't read the entire dataset once, and then restarts at the beginning to re-read the entire dataset. Much less efficient use of disk channel resources.
Edited improvement. You can replace
do _n3=_n2 to _n while (_dat{_n3}<=date30);
if _dat{_n3}>date30 then leave;
else sin30=sin30 + _sin{_n3};
end;
with:
do _n3=_n2 to _n while (_dat{_n3}<=date30);
sin30=sin30 + _sin{_n3};
end;
The "if test" was needed in an earlier version, but no longer.
Thank you so much the code you provided worked perfectly, I just have to replace it to create 30 some similar variables.
In your response you mentioned:
"As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id. In that case you could just use DATE as the hash key."
for my learning purposes, how would the hash code provided above change to reflect on the approach you suggested.
Best Regards,
Melsaid
Here's how you would use your original logic and hash programming one ID at a time. Of course this assumes that the data are sorted by ID (although they don't have to be sorted by date within id):
data want2;
if _N_ = 1 then do;
if 0 then set have(rename=(sin=_sin));
declare hash h(dataset:'have(obs=0 rename=(sin=_sin)');
h.defineKey('date');
h.defineData('_sin');
h.defineDone();
end;
do until (last.id);
set have;
by id;
h.add(key:date,data:sin);
end;
do until (last.id);
set have;
by id;
sin30=0;
do n=date to date30;
if h.find(key:n)=0 then sin30+_sin;
end;
output;
end;
h.clear();
drop n _sin;
run;
Of course, this program does 30 FIND methods even if only 2 or 3 actually get data. As I mentioned earlier, you could eliminate that wastage by introducing a hash iterator, and use of the SETCUR and NEXT methods instead of FIND over 30 dates:
data want3;
if _N_ = 1 then do;
if 0 then set have(rename=(sin=_sin date=_date));
declare hash h(dataset:'have(obs=0 rename=(sin=_sin date=_date)',ordered:'a');
h.defineKey('_date');
h.defineData('_date','_sin');
h.defineDone();
declare hiter hi ('h');
end;
do until (last.id);
set have (rename=(sin=_sin date=_date));
by id;
h.add();
end;
do until (last.id);
set have;
by id;
sin30=0;
do rc=hi.setcur(key:date) by 0 while(_date<=date30);
sin30=sin30+_sin;
if hi.next()^=0 then leave;
end;
output;
end;
h.clear();
drop rc _: ;
run;
@mkeintz thank you so much for taking the time to draft the code for the edited hash function. I studied the code and it makes so much sense.
I really appreciate all your help.
Best Regards,
Melsaid
data have;
input ID Date :date9. SIN Date30 :date9. ;
format date: date9.;
cards;
1 12Jul2017 318 11Aug2017
1 11Aug2017 197 10Sep2017
1 10Sep2017 700 10Oct2017
1 10Oct2017 524 09Nov2017
1 09Nov2017 702 09Dec2017
2 29Sep2017 642 29Oct2017
2 09Dec2017 446 08Jan2018
3 04Dec2010 182 03Jan2011
3 02Feb2011 0 04Mar2011
3 03Apr2011 0 03May2011
4 05Dec2010 317 04Jan2011
4 06Dec2010 272 05Jan2011
4 07Dec2010 648 06Jan2011
4 08Dec2010 778 07Jan2011
4 03Mar2011 839 02Apr2011
4 27May2011 784 26Jun2011
;
data w;
set have;
by id;
sin30=0;
do n=date to date30;
output;
end;
format n date9.;
run;
proc sql;
create table want as
select a.*,b.sin30
from have a left join
(select a.id,a.date as date,a.sin, sum(b.sin) as sin30
from w a left join have b
on a.id=b.id
where a.n=b.date
group by a.id, a.date,a.sin) b
on a.id=b.id and a.date=b.date
order by a.id,a.date;
quit;
Depending on how your data is indexed, this simple approach can be quite efficient:
data have;
input ID Date :date9. SIN;
format date: date9.;
cards;
1 12Jul2017 318
1 11Aug2017 197
1 10Sep2017 700
1 10Oct2017 524
1 09Nov2017 702
2 29Sep2017 642
2 09Dec2017 446
3 04Dec2010 182
3 02Feb2011 0
3 03Apr2011 0
4 05Dec2010 317
4 06Dec2010 272
4 07Dec2010 648
4 08Dec2010 778
4 03Mar2011 839
4 27May2011 784
;
proc sql;
create table want as
select
a.ID, a.date, a.SIN,
sum(b.SIN) as SIN30
from
have as a inner join
have as b
on a.ID=b.ID and
b.date between a.date and a.date+30
group by a.ID, a.date, a.SIN;
quit;
@PGStats Sir, I tried a slight tweak to your code taking OP's dataset as is with date30 var from source and tweak in between and adding date30 to group by . I don't know why I didn't think of that earlier
data have;
input ID Date :date9. SIN Date30 :date9. ;
format date: date9.;
cards;
1 12Jul2017 318 11Aug2017
1 11Aug2017 197 10Sep2017
1 10Sep2017 700 10Oct2017
1 10Oct2017 524 09Nov2017
1 09Nov2017 702 09Dec2017
2 29Sep2017 642 29Oct2017
2 09Dec2017 446 08Jan2018
3 04Dec2010 182 03Jan2011
3 02Feb2011 0 04Mar2011
3 03Apr2011 0 03May2011
4 05Dec2010 317 04Jan2011
4 06Dec2010 272 05Jan2011
4 07Dec2010 648 06Jan2011
4 08Dec2010 778 07Jan2011
4 03Mar2011 839 02Apr2011
4 27May2011 784 26Jun2011
;
proc sql;
create table want as
select
a.ID, a.date, a.SIN,
a.date30 ,sum(b.SIN) as SIN30
from
have as a inner join
have as b
on a.ID=b.ID and b.date between a.date and a.date30
group by a.ID, a.date, a.SIN,a.date30;
quit;
@novinosrin the code you provided worked like a charm.
Thank you for taking the time to help out.
Best Regards,
Melsaid
I think this will work too (not tested), but honestly, if memory is not an issue, the SQL solution above is simplest and I've used that technique for very large data.
proc sort data=have; by id date; run;
proc sql noprint;
select max(n) into :maxn from
(select id, count(*) as n from have
group by id);
quit;
data want;
set have;
by id;
array stack {&maxn,2} _temporary_;
length n 3 sln30 d d30 5;
format d d30 date9.;
if first.id then do;
do i=1 to dim(stack,1);
stack[i,1]=.;
stack[i,2]=.;
end;
n=0;
end;
n+1;
stack[n,1]=date;
stack[n,2]=sln;
if last.id then do;
do i=1 to n;
sln30=0;
d=stack[i,1];
d30=d+30;
do j=i to n;
if stack[j,1]>d30 then do;
output;
leave;
end;
sln30+stack[j,2];
if j=n then output;
end;
end;
end;
keep d d30 sln30;
run;
I'm a little confused by the purpose of this, though, as it seems like it double counts stuff, e.g. the '197' in line 2 is counted both in the 12Jul - 11Aug range AND in the 11Aug - 10Sep range. But I'm probably just not following.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.