Solved: Re: Vertical Summation with a Condition

melsaid04 · Posted 07-26-2018 02:59 PM

I have

data have;
input ID Date :date9. SIn Date30 :date9. ;
format date: date9.;
cards;
1 12Jul2017 318 11Aug2017
1 11Aug2017 197 10Sep2017
1 10Sep2017 700 10Oct2017
1 10Oct2017 524 09Nov2017
1 09Nov2017 702 09Dec2017
2 29Sep2017 642 29Oct2017
2 09Dec2017 446 08Jan2018
3 04Dec2010 182 03Jan2011
3 02Feb2011 0 04Mar2011
3 03Apr2011 10 20May2011
4 05Dec2010 317 04Jan2011
4 06Dec2010 272 05Jan2011
4 07Dec2010 648 06Jan2011
4 08Dec2010 778 07Jan2011
4 03Mar2011 839 02Apr2011
4 27May2011 784 26Jun2011

;

Date30 is just the variable date + 30 days

I want to vertically sum the SIN variable By ID where the new variable SIN30 is the cumulative sum of all SIN values for the 30 days after the current date. the output should look like this

ID	Date	SIn	Date30	SIN30
1	12Jul2017	318	11Aug2017	515
1	11Aug2017	197	10Sep2017	897
1	10Sep2017	700	10Oct2017	1224
1	10Oct2017	524	09Nov2017	1226
1	09Nov2017	702	09Dec2017	702
2	29Sep2017	642	29Oct2017	642
2	09Dec2017	446	08Jan2018	446
3	04Dec2010	182	03Jan2011	182
3	02Feb2011	0	04Mar2011	0
3	03Apr2011	10	20May2011	10
4	05Dec2010	317	04Jan2011	2015
4	06Dec2010	272	05Jan2011	1698
4	07Dec2010	648	06Jan2011	1426
4	08Dec2010	778	07Jan2011	778
4	03Mar2011	839	02Apr2011	839
4	27May2011	784	26Jun2011	784

In earlier post https://communities.sas.com/t5/Base-SAS-Programming/Conditional-Vertical-Summation/m-p/478242#M12331... , I was able to find a solution to this issue using the following code.

data want;
if _N_ = 1 then do;
if 0 then set have(rename=(sin=_sin));
   declare hash h(dataset:'have(rename=(sin=_sin)');
  h.defineKey('id','date');
  h.defineData('_sin');
  h.defineDone();
end;
set have;
by id;
sin30=0;
do n=date to date30;
if h.find(key:id,key:n)=0 then sin30+_sin;
end;
drop n _sin;
run;

While the code works perfectly, I can not use it because my computer memory is insufficient to execute the hash based code (i'm also creating 36 variables similar to SIN30). I was wondering if there is any other way to solve this problem without using the hash function.

Thanks for all your help in advance.

mkeintz · Posted 07-26-2018 04:41 PM

If the dates are a sparsely spaced as in your example, then for each record you're looking for 30 dates just to harvest 2 or 3 for SIN30.

If it weren't for memory problems, you could just make the hash an ordered hash (by id date), and associate a hash iterator with it. Then you could use the SETCUR method to get the current date and SIN. Then a sequence of NEXT methods to get successive observations until you bypass DATE30. This way there is no excess of unsatisfied FIND methods.

As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id. In that case you could just use DATE as the hash key.

However, the program below is almost certainly faster, as it uses an array (less memory overhead than hash), and since the array is accessed for N=1, 2, ..., it is a key-indexed array providing faster lookup than hash:


data want;

  array _dat{100} ;
  array _sin{100} ;

  do _n=1 by 1 until (last.id);
    set have;
    by id;
    _dat{_n}=date;
    _sin{_n}=sin;
  end;

  do _n2=1 to _n;
    set have;
    sin30=0;
    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      if _dat{_n3}>date30 then leave;
      else sin30=sin30 + _sin{_n3};
    end;
    output;
  end;

  drop _: ;
run;

This program assumes no ID has more than 100 dates. If that's not true, just set the dimension of arrays _dat and _sin to exceed the maximum expected date count.

The program reads a single ID once to populate the arrays. Then it re-reads the same ID, and retrieves data from the array as needed. This type of re-read is probably relatively efficient, since the re-reading probably comes from memory buffers rather than from the disk.

Like your program, this reads the dataset twice. but unlike your program it doesn't read the entire dataset once, and then restarts at the beginning to re-read the entire dataset. Much less efficient use of disk channel resources.

Edited improvement. You can replace

    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      if _dat{_n3}>date30 then leave;
      else sin30=sin30 + _sin{_n3};
    end;

with:

    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      sin30=sin30 + _sin{_n3};
    end;

The "if test" was needed in an earlier version, but no longer.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PaigeMiller · Posted 07-26-2018 04:14 PM

In most cases, splitting your data set up into smaller data sets is not recommended, but in this case, where you are running out of memory on the whole data set, I believe this is a good thing to do.

Split your data into smaller data sets by ID, or perhaps several IDs into one data set.

--
Paige Miller

melsaid04 · Posted 07-26-2018 04:16 PM

Thank you for your response. I tried doing that, however, it is not practical, as I have to split the data into 800 smaller datasets.

Regards,

PaigeMiller · Posted 07-26-2018 05:28 PM

@melsaid04 wrote:

Thank you for your response. I tried doing that, however, it is not practical, as I have to split the data into 800 smaller datasets.

A macro would get the job done, and append the results back together as well.

--
Paige Miller

mkeintz · Posted 07-26-2018 04:41 PM

If the dates are a sparsely spaced as in your example, then for each record you're looking for 30 dates just to harvest 2 or 3 for SIN30.

If it weren't for memory problems, you could just make the hash an ordered hash (by id date), and associate a hash iterator with it. Then you could use the SETCUR method to get the current date and SIN. Then a sequence of NEXT methods to get successive observations until you bypass DATE30. This way there is no excess of unsatisfied FIND methods.

As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id. In that case you could just use DATE as the hash key.

However, the program below is almost certainly faster, as it uses an array (less memory overhead than hash), and since the array is accessed for N=1, 2, ..., it is a key-indexed array providing faster lookup than hash:


data want;

  array _dat{100} ;
  array _sin{100} ;

  do _n=1 by 1 until (last.id);
    set have;
    by id;
    _dat{_n}=date;
    _sin{_n}=sin;
  end;

  do _n2=1 to _n;
    set have;
    sin30=0;
    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      if _dat{_n3}>date30 then leave;
      else sin30=sin30 + _sin{_n3};
    end;
    output;
  end;

  drop _: ;
run;

This program assumes no ID has more than 100 dates. If that's not true, just set the dimension of arrays _dat and _sin to exceed the maximum expected date count.

The program reads a single ID once to populate the arrays. Then it re-reads the same ID, and retrieves data from the array as needed. This type of re-read is probably relatively efficient, since the re-reading probably comes from memory buffers rather than from the disk.

Like your program, this reads the dataset twice. but unlike your program it doesn't read the entire dataset once, and then restarts at the beginning to re-read the entire dataset. Much less efficient use of disk channel resources.

Edited improvement. You can replace

    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      if _dat{_n3}>date30 then leave;
      else sin30=sin30 + _sin{_n3};
    end;

with:

    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      sin30=sin30 + _sin{_n3};
    end;

The "if test" was needed in an earlier version, but no longer.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

melsaid04 · Posted 07-26-2018 05:15 PM

Thank you so much the code you provided worked perfectly, I just have to replace it to create 30 some similar variables.

In your response you mentioned:

"As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id. In that case you could just use DATE as the hash key."

for my learning purposes, how would the hash code provided above change to reflect on the approach you suggested.

Best Regards,

Melsaid

mkeintz · Posted 07-26-2018 10:38 PM

Here's how you would use your original logic and hash programming one ID at a time. Of course this assumes that the data are sorted by ID (although they don't have to be sorted by date within id):

data want2;
  if _N_ = 1 then do;
    if 0 then set have(rename=(sin=_sin));
    declare hash h(dataset:'have(obs=0 rename=(sin=_sin)');
      h.defineKey('date');
      h.defineData('_sin');
      h.defineDone();
  end;
  do until (last.id);
    set have;
    by id;
    h.add(key:date,data:sin);
  end;

  do until (last.id);
    set have;
    by id;
    sin30=0;
    do n=date to date30;
      if h.find(key:n)=0 then sin30+_sin;
    end;
    output;
  end;
  h.clear();
  drop n _sin;
run;

Of course, this program does 30 FIND methods even if only 2 or 3 actually get data. As I mentioned earlier, you could eliminate that wastage by introducing a hash iterator, and use of the SETCUR and NEXT methods instead of FIND over 30 dates:

data want3;
  if _N_ = 1 then do;
    if 0 then set have(rename=(sin=_sin date=_date));
    declare hash h(dataset:'have(obs=0 rename=(sin=_sin date=_date)',ordered:'a');
      h.defineKey('_date');
      h.defineData('_date','_sin');
      h.defineDone();
    declare hiter hi ('h');
  end;
  do until (last.id);
    set have (rename=(sin=_sin date=_date));
    by id;
    h.add();
  end;

  do until (last.id);
    set have;
    by id;
    sin30=0;
    do rc=hi.setcur(key:date) by 0 while(_date<=date30);
      sin30=sin30+_sin;
      if hi.next()^=0 then leave;
    end;
    output;
  end;
  h.clear();
  drop rc _: ;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

melsaid04 · Posted 07-29-2018 11:03 AM

@mkeintz thank you so much for taking the time to draft the code for the edited hash function. I studied the code and it makes so much sense.

I really appreciate all your help.

Best Regards,

Melsaid

novinosrin · Posted 07-26-2018 05:07 PM

data have;
input ID	Date :date9.	SIN	Date30 :date9. ;
format date: date9.;
cards;
1	12Jul2017	318	11Aug2017
1	11Aug2017	197	10Sep2017
1	10Sep2017	700	10Oct2017
1	10Oct2017	524	09Nov2017
1	09Nov2017	702	09Dec2017
2	29Sep2017	642	29Oct2017
2	09Dec2017	446	08Jan2018
3	04Dec2010	182	03Jan2011
3	02Feb2011	0	04Mar2011
3	03Apr2011	0	03May2011
4	05Dec2010	317	04Jan2011
4	06Dec2010	272	05Jan2011
4	07Dec2010	648	06Jan2011
4	08Dec2010	778	07Jan2011
4	03Mar2011	839	02Apr2011
4	27May2011	784	26Jun2011
;

data w;
set have;
by id;
sin30=0;
do n=date to date30;
output;
end;
format n date9.;
run;

proc sql;
create table want as
select a.*,b.sin30
from have a left join 
(select a.id,a.date as date,a.sin, sum(b.sin) as sin30
from w a left join have b
on a.id=b.id 
where a.n=b.date
group by a.id, a.date,a.sin) b
on a.id=b.id and a.date=b.date
order by a.id,a.date;
quit;

PGStats · Posted 07-26-2018 10:40 PM

Depending on how your data is indexed, this simple approach can be quite efficient:

data have;
input ID	Date :date9. SIN;
format date: date9.;
cards;
1	12Jul2017	318
1	11Aug2017	197
1	10Sep2017	700
1	10Oct2017	524
1	09Nov2017	702
2	29Sep2017	642
2	09Dec2017	446
3	04Dec2010	182
3	02Feb2011	0
3	03Apr2011	0
4	05Dec2010	317
4	06Dec2010	272
4	07Dec2010	648
4	08Dec2010	778
4	03Mar2011	839
4	27May2011	784
;

proc sql;
create table want as
select 
    a.ID, a.date, a.SIN, 
    sum(b.SIN) as SIN30
from
    have as a inner join
    have as b 
    on  a.ID=b.ID and 
        b.date between a.date and a.date+30
group by a.ID, a.date, a.SIN;
quit;

PG

novinosrin · Posted 07-26-2018 11:59 PM

@PGStats Sir, I tried a slight tweak to your code taking OP's dataset as is with date30 var from source and tweak in between and adding date30 to group by . I don't know why I didn't think of that earlier

data have;
input ID	Date :date9.	SIN	Date30 :date9. ;
format date: date9.;
cards;
1	12Jul2017	318	11Aug2017
1	11Aug2017	197	10Sep2017
1	10Sep2017	700	10Oct2017
1	10Oct2017	524	09Nov2017
1	09Nov2017	702	09Dec2017
2	29Sep2017	642	29Oct2017
2	09Dec2017	446	08Jan2018
3	04Dec2010	182	03Jan2011
3	02Feb2011	0	04Mar2011
3	03Apr2011	0	03May2011
4	05Dec2010	317	04Jan2011
4	06Dec2010	272	05Jan2011
4	07Dec2010	648	06Jan2011
4	08Dec2010	778	07Jan2011
4	03Mar2011	839	02Apr2011
4	27May2011	784	26Jun2011
;

proc sql;
create table want as
select 
    a.ID, a.date, a.SIN, 
  a.date30 ,sum(b.SIN) as SIN30
from
    have as a inner join
    have as b 
    on  a.ID=b.ID  and b.date between a.date and a.date30
group by a.ID, a.date, a.SIN,a.date30;
quit;

melsaid04 · Posted 07-29-2018 11:05 AM

@novinosrin the code you provided worked like a charm.

Thank you for taking the time to help out.

Best Regards,

Melsaid

quickbluefish · Posted 07-27-2018 10:13 PM

I think this will work too (not tested), but honestly, if memory is not an issue, the SQL solution above is simplest and I've used that technique for very large data.

proc sort data=have; by id date; run;

proc sql noprint; 
select max(n) into :maxn from 
	(select id, count(*) as n from have 
	group by id);
quit;

data want;
set have;
by id;
array stack {&maxn,2} _temporary_;
length n 3 sln30 d d30 5;
format d d30 date9.;
if first.id then do;
	do i=1 to dim(stack,1);
		stack[i,1]=.;
		stack[i,2]=.;
	end;
	n=0;
end;
n+1;
stack[n,1]=date;
stack[n,2]=sln;
if last.id then do;
	do i=1 to n;
		sln30=0;
		d=stack[i,1];
		d30=d+30;
		do j=i to n;
			if stack[j,1]>d30 then do;
				output;
				leave;
			end;
			sln30+stack[j,2];
			if j=n then output;
		end;
	end;
end;
keep d d30 sln30;
run;

I'm a little confused by the purpose of this, though, as it seems like it double counts stuff, e.g. the '197' in line 2 is counted both in the 12Jul - 11Aug range AND in the 11Aug - 10Sep range. But I'm probably just not following.

Registration is open

SAS Training: Just a Click Away