BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
melsaid04
Fluorite | Level 6

I have 


data have;
input ID Date :date9. SIn Date30 :date9. ;
format date: date9.;
cards;
1 12Jul2017 318 11Aug2017
1 11Aug2017 197 10Sep2017
1 10Sep2017 700 10Oct2017
1 10Oct2017 524 09Nov2017
1 09Nov2017 702 09Dec2017
2 29Sep2017 642 29Oct2017
2 09Dec2017 446 08Jan2018
3 04Dec2010 182 03Jan2011
3 02Feb2011 0 04Mar2011
3 03Apr2011 10 20May2011
4 05Dec2010 317 04Jan2011
4 06Dec2010 272 05Jan2011
4 07Dec2010 648 06Jan2011
4 08Dec2010 778 07Jan2011
4 03Mar2011 839 02Apr2011
4 27May2011 784 26Jun2011

;

 

Date30 is just the variable date + 30 days

 

I want to vertically sum the SIN variable By ID where the new variable SIN30 is the cumulative sum of all SIN values for the 30 days after the current date. the output should look like this 

 

 

IDDateSInDate30SIN30
112Jul201731811Aug2017515
111Aug201719710Sep2017897
110Sep201770010Oct20171224
110Oct201752409Nov20171226
109Nov201770209Dec2017702
229Sep201764229Oct2017642
209Dec201744608Jan2018446
304Dec201018203Jan2011182
302Feb2011004Mar20110
303Apr20111020May201110
405Dec201031704Jan20112015
406Dec201027205Jan20111698
407Dec201064806Jan20111426
408Dec201077807Jan2011778
403Mar201183902Apr2011839
427May201178426Jun2011784

 

In earlier post https://communities.sas.com/t5/Base-SAS-Programming/Conditional-Vertical-Summation/m-p/478242#M12331... , I was able to find a solution to this issue using the following code.

 

data want;
if _N_ = 1 then do;
if 0 then set have(rename=(sin=_sin));
   declare hash h(dataset:'have(rename=(sin=_sin)');
  h.defineKey('id','date');
  h.defineData('_sin');
  h.defineDone();
end;
set have;
by id;
sin30=0;
do n=date to date30;
if h.find(key:id,key:n)=0 then sin30+_sin;
end;
drop n _sin;
run;

While the code works perfectly, I can not use it because my computer memory is insufficient to execute the hash based code (i'm also creating 36 variables similar to SIN30). I was wondering if there is any other way to solve this problem without using the hash function.

 

Thanks for all your help in advance.

 

     

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

If the dates are a sparsely spaced as in your example, then for each record you're looking for 30 dates just to harvest 2 or 3 for SIN30. 

 

If it weren't for memory problems, you could just make the hash an ordered hash (by id date), and associate a hash iterator with it.  Then you could use the SETCUR method to get the current date and SIN. Then a sequence of NEXT methods to get successive observations until you bypass DATE30.  This way there is no excess of unsatisfied FIND methods.

 

As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id.  In that case you could just use DATE as the hash key.

 

However, the program below is almost certainly faster, as it uses an array (less memory overhead than hash), and since the array is accessed for N=1, 2, ..., it is a key-indexed array providing faster lookup than hash:

 


data want;

  array _dat{100} ;
  array _sin{100} ;

  do _n=1 by 1 until (last.id);
    set have;
    by id;
    _dat{_n}=date;
    _sin{_n}=sin;
  end;

  do _n2=1 to _n;
    set have;
    sin30=0;
    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      if _dat{_n3}>date30 then leave;
      else sin30=sin30 + _sin{_n3};
    end;
    output;
  end;

  drop _: ;
run;
 

 

This program assumes no ID has more than 100 dates.  If that's not true, just set the dimension of arrays _dat and _sin to exceed the maximum expected date count.

 

The program reads a single ID once to populate the arrays.  Then it re-reads the same ID, and retrieves data from the array as needed.  This type of re-read is probably relatively efficient, since the re-reading probably comes from memory buffers rather than from the disk.

 

Like your program, this reads the dataset twice. but unlike your program it doesn't read the entire dataset once, and then restarts at the beginning to re-read the entire dataset.   Much less efficient use of disk channel resources.

 

Edited improvement.  You can replace

 

    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      if _dat{_n3}>date30 then leave;
      else sin30=sin30 + _sin{_n3};
    end;

with:

 

    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      sin30=sin30 + _sin{_n3};
    end;

The "if test" was needed in an earlier version, but no longer.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

12 REPLIES 12
PaigeMiller
Diamond | Level 26

In most cases, splitting your data set up into smaller data sets is not recommended, but in this case, where you are running out of memory on the whole data set, I believe this is a good thing to do. 

 

Split your data into smaller data sets by ID, or perhaps several IDs into one data set.

--
Paige Miller
melsaid04
Fluorite | Level 6

Thank you for your response. I tried doing that, however, it is not practical, as I have to split the data into 800 smaller datasets.

 

Regards, 

PaigeMiller
Diamond | Level 26

@melsaid04 wrote:

Thank you for your response. I tried doing that, however, it is not practical, as I have to split the data into 800 smaller datasets.



A macro would get the job done, and append the results back together as well.

--
Paige Miller
mkeintz
PROC Star

If the dates are a sparsely spaced as in your example, then for each record you're looking for 30 dates just to harvest 2 or 3 for SIN30. 

 

If it weren't for memory problems, you could just make the hash an ordered hash (by id date), and associate a hash iterator with it.  Then you could use the SETCUR method to get the current date and SIN. Then a sequence of NEXT methods to get successive observations until you bypass DATE30.  This way there is no excess of unsatisfied FIND methods.

 

As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id.  In that case you could just use DATE as the hash key.

 

However, the program below is almost certainly faster, as it uses an array (less memory overhead than hash), and since the array is accessed for N=1, 2, ..., it is a key-indexed array providing faster lookup than hash:

 


data want;

  array _dat{100} ;
  array _sin{100} ;

  do _n=1 by 1 until (last.id);
    set have;
    by id;
    _dat{_n}=date;
    _sin{_n}=sin;
  end;

  do _n2=1 to _n;
    set have;
    sin30=0;
    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      if _dat{_n3}>date30 then leave;
      else sin30=sin30 + _sin{_n3};
    end;
    output;
  end;

  drop _: ;
run;
 

 

This program assumes no ID has more than 100 dates.  If that's not true, just set the dimension of arrays _dat and _sin to exceed the maximum expected date count.

 

The program reads a single ID once to populate the arrays.  Then it re-reads the same ID, and retrieves data from the array as needed.  This type of re-read is probably relatively efficient, since the re-reading probably comes from memory buffers rather than from the disk.

 

Like your program, this reads the dataset twice. but unlike your program it doesn't read the entire dataset once, and then restarts at the beginning to re-read the entire dataset.   Much less efficient use of disk channel resources.

 

Edited improvement.  You can replace

 

    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      if _dat{_n3}>date30 then leave;
      else sin30=sin30 + _sin{_n3};
    end;

with:

 

    do _n3=_n2 to _n while (_dat{_n3}<=date30);
      sin30=sin30 + _sin{_n3};
    end;

The "if test" was needed in an earlier version, but no longer.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
melsaid04
Fluorite | Level 6

Thank you so much the code you provided worked perfectly, I just have to replace it to create 30 some similar variables. 

 

In your response you mentioned:

"As to memory issues, you probably could populate the hash one ID at a time, process all the records for that id, and then clear the hash object in preparation for the next id.  In that case you could just use DATE as the hash key."

 

for my learning purposes, how would the hash code provided above change to reflect on the approach you suggested. 

 

Best Regards, 

 

Melsaid

mkeintz
PROC Star

Here's how you would use your original logic and hash programming one ID at a time.  Of course this assumes that the data are sorted by ID (although they don't have to be sorted by date within id):

 

data want2;
  if _N_ = 1 then do;
    if 0 then set have(rename=(sin=_sin));
    declare hash h(dataset:'have(obs=0 rename=(sin=_sin)');
      h.defineKey('date');
      h.defineData('_sin');
      h.defineDone();
  end;
  do until (last.id);
    set have;
    by id;
    h.add(key:date,data:sin);
  end;

  do until (last.id);
    set have;
    by id;
    sin30=0;
    do n=date to date30;
      if h.find(key:n)=0 then sin30+_sin;
    end;
    output;
  end;
  h.clear();
  drop n _sin;
run;

 

Of course, this program does 30 FIND methods even if only 2 or 3 actually get data.  As I mentioned earlier, you could eliminate that wastage by introducing a hash iterator, and use of the SETCUR and NEXT methods instead of FIND over 30 dates:

 

data want3;
  if _N_ = 1 then do;
    if 0 then set have(rename=(sin=_sin date=_date));
    declare hash h(dataset:'have(obs=0 rename=(sin=_sin date=_date)',ordered:'a');
      h.defineKey('_date');
      h.defineData('_date','_sin');
      h.defineDone();
    declare hiter hi ('h');
  end;
  do until (last.id);
    set have (rename=(sin=_sin date=_date));
    by id;
    h.add();
  end;

  do until (last.id);
    set have;
    by id;
    sin30=0;
    do rc=hi.setcur(key:date) by 0 while(_date<=date30);
      sin30=sin30+_sin;
      if hi.next()^=0 then leave;
    end;
    output;
  end;
  h.clear();
  drop rc _: ;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
melsaid04
Fluorite | Level 6

@mkeintz thank you so much for taking the time to draft the code for the edited hash function. I studied the code and it makes so much sense. 

 

I really appreciate all your help.

 

Best Regards, 

 

Melsaid 

 

novinosrin
Tourmaline | Level 20
data have;
input ID	Date :date9.	SIN	Date30 :date9. ;
format date: date9.;
cards;
1	12Jul2017	318	11Aug2017
1	11Aug2017	197	10Sep2017
1	10Sep2017	700	10Oct2017
1	10Oct2017	524	09Nov2017
1	09Nov2017	702	09Dec2017
2	29Sep2017	642	29Oct2017
2	09Dec2017	446	08Jan2018
3	04Dec2010	182	03Jan2011
3	02Feb2011	0	04Mar2011
3	03Apr2011	0	03May2011
4	05Dec2010	317	04Jan2011
4	06Dec2010	272	05Jan2011
4	07Dec2010	648	06Jan2011
4	08Dec2010	778	07Jan2011
4	03Mar2011	839	02Apr2011
4	27May2011	784	26Jun2011
;

data w;
set have;
by id;
sin30=0;
do n=date to date30;
output;
end;
format n date9.;
run;

proc sql;
create table want as
select a.*,b.sin30
from have a left join 
(select a.id,a.date as date,a.sin, sum(b.sin) as sin30
from w a left join have b
on a.id=b.id 
where a.n=b.date
group by a.id, a.date,a.sin) b
on a.id=b.id and a.date=b.date
order by a.id,a.date;
quit;
PGStats
Opal | Level 21

Depending on how your data is indexed, this simple approach can be quite efficient:

 

data have;
input ID	Date :date9. SIN;
format date: date9.;
cards;
1	12Jul2017	318
1	11Aug2017	197
1	10Sep2017	700
1	10Oct2017	524
1	09Nov2017	702
2	29Sep2017	642
2	09Dec2017	446
3	04Dec2010	182
3	02Feb2011	0
3	03Apr2011	0
4	05Dec2010	317
4	06Dec2010	272
4	07Dec2010	648
4	08Dec2010	778
4	03Mar2011	839
4	27May2011	784
;

proc sql;
create table want as
select 
    a.ID, a.date, a.SIN, 
    sum(b.SIN) as SIN30
from
    have as a inner join
    have as b 
    on  a.ID=b.ID and 
        b.date between a.date and a.date+30
group by a.ID, a.date, a.SIN;
quit;
PG
novinosrin
Tourmaline | Level 20

@PGStats Sir, I tried a slight tweak to your code taking OP's dataset as is with date30 var from source  and tweak in between and adding date30 to group by . I don't know why I didn't think of that earlier

data have;
input ID	Date :date9.	SIN	Date30 :date9. ;
format date: date9.;
cards;
1	12Jul2017	318	11Aug2017
1	11Aug2017	197	10Sep2017
1	10Sep2017	700	10Oct2017
1	10Oct2017	524	09Nov2017
1	09Nov2017	702	09Dec2017
2	29Sep2017	642	29Oct2017
2	09Dec2017	446	08Jan2018
3	04Dec2010	182	03Jan2011
3	02Feb2011	0	04Mar2011
3	03Apr2011	0	03May2011
4	05Dec2010	317	04Jan2011
4	06Dec2010	272	05Jan2011
4	07Dec2010	648	06Jan2011
4	08Dec2010	778	07Jan2011
4	03Mar2011	839	02Apr2011
4	27May2011	784	26Jun2011
;

proc sql;
create table want as
select 
    a.ID, a.date, a.SIN, 
  a.date30 ,sum(b.SIN) as SIN30
from
    have as a inner join
    have as b 
    on  a.ID=b.ID  and b.date between a.date and a.date30
group by a.ID, a.date, a.SIN,a.date30;
quit;
melsaid04
Fluorite | Level 6

@novinosrin the code you provided worked like a charm.

 

Thank you for taking the time to help out.

 

Best Regards, 

 

Melsaid 

quickbluefish
Obsidian | Level 7

I think this will work too (not tested), but honestly, if memory is not an issue, the SQL solution above is simplest and I've used that technique for very large data.

 

proc sort data=have; by id date; run;

proc sql noprint; 
select max(n) into :maxn from 
	(select id, count(*) as n from have 
	group by id);
quit;

data want;
set have;
by id;
array stack {&maxn,2} _temporary_;
length n 3 sln30 d d30 5;
format d d30 date9.;
if first.id then do;
	do i=1 to dim(stack,1);
		stack[i,1]=.;
		stack[i,2]=.;
	end;
	n=0;
end;
n+1;
stack[n,1]=date;
stack[n,2]=sln;
if last.id then do;
	do i=1 to n;
		sln30=0;
		d=stack[i,1];
		d30=d+30;
		do j=i to n;
			if stack[j,1]>d30 then do;
				output;
				leave;
			end;
			sln30+stack[j,2];
			if j=n then output;
		end;
	end;
end;
keep d d30 sln30;
run;

I'm a little confused by the purpose of this, though, as it seems like it double counts stuff, e.g. the '197' in line 2 is counted both in the 12Jul - 11Aug range AND in the 11Aug - 10Sep range.  But I'm probably just not following. 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 2070 views
  • 10 likes
  • 6 in conversation