Solved: Re: Aggregating dataset by date, including missing dates

MikeFranz · Posted 07-10-2017 10:44 AM

Hi,

I have a dataset, work.PIC, which contains transactions ($ amounts) per Address and Date. This dataset is sorted by Address and Date.

Sample Dataset:

  Address    Date      Amount
 Property 1 31/01/2014 -770 
 Property 1 31/01/2014 -550 
 Property 1 31/01/2014 -20000 
 Property 1 31/01/2014 -52329.6 
 Property 1 31/07/2014 -660 
 Property 1 31/10/2014 -1738 
 Property 1 30/06/2015 -845.32 
 Property 1 30/06/2015 845.32 
 Property 1 30/06/2015 -845.32 
 Property 1 31/12/2015 845.32 
 Property 1 31/12/2015 -845.32 
 Property 1 31/12/2015 845.32 
 Property 2 31/01/2014 -1320 
 Property 2 31/01/2014 -2895.99 
 Property 2 31/01/2014 -653.32 
 Property 2 31/01/2014 -40000 
 Property 2 31/01/2014 -121954.8 
 Property 2 30/09/2015 -3500 
 Property 3 31/01/2014 -1200 
 Property 3 31/01/2014 -28080 
 Property 3 31/01/2014 -130483 
 Property 3 30/11/2015 -885 
 Property 3 31/12/2015 885

For each address, I would like to create a running total per Address, with output at each Date. I have accomplished this through the code below:

Current code:

data work.PIC_agg (keep = Address Date Value);
	set work.PIC;
	by Address Date;
	if First.Address then Value = 0;
	value + Amount;
	if Last.Date then output;
run;

Problem:

This issue with the above is that it then only contains information for each property on a date where transactions occurred e.g. Property 1 will only have entries for Jan14, Jul14, Oct14, Jun15, and Dec15.

I would like to have a value for each date i.e. from Jan14 through to Dec15. If there is no transaction in a given month, then the value equals the previous month's total.

Proposed methodology:

Store each unique element of the Date variable in an array (DateArray)

Create a new variable for each element in DateArray

For each element in DataArray, sum over all transactions for a particular Address, where the Date<= DateArray

Store the resultant number in the correct variable for each Address

Any ideas on where the above methodology is sound, and, if so, how to implement. If not, could someone perhaps suggest a better method?

Please let me know if the above is unclear.

Many thanks!

AhmedAl_Attar · Posted 07-19-2017 01:03 PM

Hi,

This newly modified code, has a running total (value), as well as, original monthy total

data have;
length address $10 date 8 amount 8;
format date ddmmyy10.;
INPUT @1 address $10. @12 date :ddmmyy10. amount :best.;
datalines;
Property 1 31/01/2014 -770 
Property 1 31/01/2014 -550 
Property 1 31/01/2014 -20000 
Property 1 31/01/2014 -52329.6 
Property 1 31/07/2014 -660 
Property 1 31/10/2014 -1738 
Property 1 30/06/2015 -845.32 
Property 1 30/06/2015 845.32 
Property 1 30/06/2015 -845.32 
Property 1 31/12/2015 845.32 
Property 1 31/12/2015 -845.32 
Property 1 31/12/2015 845.32 
Property 2 31/01/2014 -1320 
Property 2 31/01/2014 -2895.99 
Property 2 31/01/2014 -653.32 
Property 2 31/01/2014 -40000 
Property 2 31/01/2014 -121954.8 
Property 2 30/09/2015 -3500 
Property 3 31/01/2014 -1200 
Property 3 31/01/2014 -28080 
Property 3 31/01/2014 -130483 
Property 3 30/11/2015 -885 
Property 3 31/12/2015 885
;
run;

/* Create aggregate by address,date */
proc summary data=have nway;
  class address date;
  var	amount;
  output out=have_sum(drop=_:) sum=;
run;

/* For every address, ensure there are no monthly gaps */
data want(KEEP=address date amount value);

	length curr_dt curr_amt curr_val prev_dt prev_amt value prev_val 8;
	retain curr_dt curr_amt curr_val prev_dt prev_amt value prev_val;
	format curr_dt prev_dt ddmmyy10.;

	do until (last.address);

		SET have_sum;
		by address date;

		if (first.address) then
		do;
			value = amount;
			output; /* ensure it doesn't get left out */
		end;
		else if (first.date) then
			value + amount;

		prev_dt= lag(date);
		prev_val= lag(value);
		prev_amt= lag(amount);
		put prev_dt= prev_val= prev_amt=;

		/* Find the monthly gap */
		diff= INTCK('month',prev_dt,date);
		*put diff=;

		if (diff > 0) then
		do; /* Fill-in missing months */

			/* Preserve current record values */
			curr_dt = date;
			curr_val = value;
			curr_amt = amount;

			/* Output missing months */
			do i=1 to (diff-1) by 1;
			  date=intnx('month',prev_dt,i,'E');
			  amount=prev_amt;
			  value=prev_val;
			  OUTPUT;
			end;

			/* Don't forget current record */
			date=curr_dt;
			amount=curr_amt;
			value=curr_val;
			OUTPUT;

		end; /* Fill-in missing months */
	end;
run;
Proc print data=want; run;

Hope if gives you what you are looking for,

Ahmed

View solution in original post

art297 · Posted 07-10-2017 12:21 PM

You say you have a variable PERIOD, but you haven't shown us anything about that variable. You'll likely get more responses if you show the forum your example have and want datasets .. each in the form of datasteps and posted using the {i} icon.

Art, CEO, AnalystFinder.com

MikeZdeb · Posted 07-10-2017 01:44 PM

Hi, this uses an array but not in the way (I think) that you proposed. There probably is a more efficient INTNX-related metod, but this fills in the missing months ...

* aggregate amounts by address and date;

proc summary data=x nway;
class address date;
var amount;
output out=xx (drop=_:) sum=;
run;

* fill in the missing months;

data y;
array _a(0:1000);
do j=1 by 1 until (last.address);
set xx;
by address;
if first.address then first = date;
_a(date - first) = amount;
end;
do j=0 to (date - first);
value = sum(value,_a(j));
date = first+j;
if day(date+1) eq 1 then output;
end;
keep address date value;
run;

Portion of output ...

DATA SET: y

 address           date       value

Property1    31/01/2014     -73649.60
Property1    28/02/2014     -73649.60
Property1    31/03/2014     -73649.60
Property1    30/04/2014     -73649.60
Property1    31/05/2014     -73649.60
Property1    30/06/2014     -73649.60
Property1    31/07/2014     -74309.60
Property1    31/08/2014     -74309.60
Property1    30/09/2014     -74309.60
Property1    31/10/2014     -76047.60
Property1    30/11/2014     -76047.60
Property1    31/12/2014     -76047.60
Property1    31/01/2015     -76047.60
Property1    28/02/2015     -76047.60
Property1    31/03/2015     -76047.60
Property1    30/04/2015     -76047.60
Property1    31/05/2015     -76047.60
Property1    30/06/2015     -76892.92
Property1    31/07/2015     -76892.92
Property1    31/08/2015     -76892.92
Property1    30/09/2015     -76892.92
Property1    31/10/2015     -76892.92
Property1    30/11/2015     -76892.92
Property1    31/12/2015     -76047.60

MikeFranz · Posted 07-10-2017 01:59 PM

Hi,
Thank you for the response, very helpful!
Couple of follow up questions:

1) Is there any way I could get the resultant dataset to contain each month up to a given period? For example, using the code you kindly provided, Property 1 data goes up to 31Dec15, while Property 2 data only goes up to 30Sep15. Ideally I would like all the properties to generate up to the same month, if that makes sense?

2) Would you be able to walk me through what your code is doing? It clearly works, but I'm quite new to SAS and am not able to follow it.

Thanks again for your help!

MikeZdeb · Posted 07-10-2017 02:05 PM

For maximum date in the data set for all PROPERTIES ...

data y;
retain max_date;
array _a(0:1000);
do j=1 by 1 until (last.address);
set xx;
by address;
if first.address then first = date;
_a(date - first) = amount;
max_date = max(max_date, date);
end;
do j=0 to max_date-first;
value = sum(value,_a(j));
date = first+j;
if day(date+1) eq 1 then output;
end;
keep address date value;
run;

Send me an email ... msz03albany.edu ... and I'll do the walk through for you.

MikeFranz · Posted 07-10-2017 02:29 PM

Many thanks, the above works perfectly.

Tried sending you an email by it can't send, as it can't resolve the domain name in msz03albany.edu?

MikeZdeb · Posted 07-10-2017 06:23 PM

Hi, still needs a couple fixes, only worked since the max date is in the first group of addresses.

Will fix later and send to you. My email is msz03@albany.edu

MikeFranz · Posted 07-19-2017 11:30 AM

Hi,

I'm looking at this again today, and just noticed I'm getting some strange results.

If I use the dataset below:

Address                 Date            Amount
1000 Sumac Drive	31/05/2014	250
1000 Sumac Drive	30/06/2014	184252
1000 Sumac Drive	31/07/2014	4100
1000 Sumac Drive	31/08/2014	2572
1000 Sumac Drive	28/02/2015	402.98
1000 Sumac Drive	31/08/2015	-933.87
1000 Sumac Drive	30/09/2015	6520.72
1000 Sumac Drive	31/12/2015	-2632.98

I'm getting the following output:

1000 Sumac Drive	31/05/2014	-2382.98	
1000 Sumac Drive	30/06/2014	181869.02	
1000 Sumac Drive	31/07/2014	185969.02	
1000 Sumac Drive	31/08/2014	188541.02	
1000 Sumac Drive	30/09/2014	188541.02	
1000 Sumac Drive	31/10/2014	188541.02	
1000 Sumac Drive	30/11/2014	188541.02	
1000 Sumac Drive	31/12/2014	188541.02	
1000 Sumac Drive	31/01/2015	188541.02	
1000 Sumac Drive	28/02/2015	188944	
1000 Sumac Drive	31/03/2015	188944	
1000 Sumac Drive	30/04/2015	188944	
1000 Sumac Drive	31/05/2015	188944	
1000 Sumac Drive	30/06/2015	188944	
1000 Sumac Drive	31/07/2015	188944	
1000 Sumac Drive	31/08/2015	188010.13	
1000 Sumac Drive	30/09/2015	194530.85	
1000 Sumac Drive	31/10/2015	194530.85	
1000 Sumac Drive	30/11/2015	194530.85	
1000 Sumac Drive	31/12/2015	191897.87	
1000 Sumac Drive	31/01/2016	191897.87	
1000 Sumac Drive	29/02/2016	191897.87	
1000 Sumac Drive	31/03/2016	191897.87	
1000 Sumac Drive	30/04/2016	191897.87	
1000 Sumac Drive	31/05/2016	191897.87	
1000 Sumac Drive	30/06/2016	191897.87	
1000 Sumac Drive	31/07/2016	191897.87	
1000 Sumac Drive	31/08/2016	191897.87	
1000 Sumac Drive	30/09/2016	191897.87	
1000 Sumac Drive	31/10/2016	191897.87	
1000 Sumac Drive	30/11/2016	191897.87	
1000 Sumac Drive	31/12/2016	191897.87	
1000 Sumac Drive	31/01/2017	191897.87	
1000 Sumac Drive	28/02/2017	191897.87	
1000 Sumac Drive	31/03/2017	191897.87	
1000 Sumac Drive	30/04/2017	191897.87	
1000 Sumac Drive	31/05/2017	191897.87	
1000 Sumac Drive	30/06/2017	191897.87	
1000 Sumac Drive	31/07/2017	191897.87

I can't work out why it is misbehaving?

Any insight would be amazing!

Thanks!

MikeFranz · Posted 07-19-2017 11:34 AM

If it helps at all, it seems to be aggregating the first and last lines of the first dataset into the first line of second dataset, I just don't know why

AhmedAl_Attar · Posted 07-19-2017 12:04 PM

Mike,
Check my 2nd answer (07/10/17 5:xx PM) to your original post. I had included a sample output, that may just match what you are looking for?

Ahmed

AhmedAl_Attar · Posted 07-10-2017 01:46 PM

Hi Mike,

Would the following work for you?

data have;
length address $10 date 8 amount 8;
format date ddmmyy10.;
INPUT @1 address $10. @12 date :ddmmyy10. amount :best.;
datalines;
Property 1 31/01/2014 -770 
Property 1 31/01/2014 -550 
Property 1 31/01/2014 -20000 
Property 1 31/01/2014 -52329.6 
Property 1 31/07/2014 -660 
Property 1 31/10/2014 -1738 
Property 1 30/06/2015 -845.32 
Property 1 30/06/2015 845.32 
Property 1 30/06/2015 -845.32 
Property 1 31/12/2015 845.32 
Property 1 31/12/2015 -845.32 
Property 1 31/12/2015 845.32 
Property 2 31/01/2014 -1320 
Property 2 31/01/2014 -2895.99 
Property 2 31/01/2014 -653.32 
Property 2 31/01/2014 -40000 
Property 2 31/01/2014 -121954.8 
Property 2 30/09/2015 -3500 
Property 3 31/01/2014 -1200 
Property 3 31/01/2014 -28080 
Property 3 31/01/2014 -130483 
Property 3 30/11/2015 -885 
Property 3 31/12/2015 885
;
run;

/* Using the /SPARSE option to fill-in the missing dates for the properties */
proc freq data=have;
  table address*date / SPARSE out=all list noprint;
run;

/* Combine the original data with only the filler records (count=0) */
data final_v;
  SET have all(where=(count=0));
run;

/* Produce the final aggregation table */
proc summary data=final_v nway;
  class address date;
  var	amount;
  output out=want(drop=_:) sum=;
run;

Hope this helps,

Ahmed

MikeFranz · Posted 07-10-2017 01:48 PM

Hey, thanks for the response. It says "Proc freq not found". Is it part of Base SAS?

Reeza · Posted 07-10-2017 01:51 PM

@MikeFranz wrote:
Hey, thanks for the response. It says "Proc freq not found". Is it part of Base SAS?

Yes it is. Check your code for other errors.

MikeFranz · Posted 07-10-2017 01:57 PM

Hi,

My code is:

proc freq data=work.pic;
  table address*Date / SPARSE out=all list noprint;
run;

which is copy paste from the solution, with only the dataset name changed.

I'm getting the following errors:

 ERROR: The FREQ procedure cannot be run.
 ERROR: Procedure FREQ not found.

I have been not been able to run PROC FREQ in the past. I had just assumed it was because it wasn't part of Base SAS.

If it helps, running PROC SETINIT yields the following:

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 55         
 56         PROC SETINIT;
 57         
 58         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 Original site validation data
 Current version: 9.04.01M3P062415
 System birthday:   19JAN2017.
 Operating System:   WX64_SV .
 Product expiration dates:
 ---Base SAS Software                                                                                    14JAN2018  
 ---SAS/GRAPH                                                                                            14JAN2018  
 ---SAS Integration Technologies                                                                         14JAN2018  
 ---SAS/Secure 168-bit                                                                                   14JAN2018  
 ---SAS/Secure Windows                                                                                   14JAN2018  
 ---SAS/ACCESS Interface to PC Files                                                                     14JAN2018  
 ---SAS/ACCESS Interface to ODBC                                                                         14JAN2018  
 ---SAS Workspace Server for Local Access                                                                14JAN2018  
 ---High Performance Suite                                                                               14JAN2018  
 ---SAS Search and Indexing Server                                                                       14JAN2018  
 ---SAS Web Crawler Server                                                                               14JAN2018  
 ---SAS LASR Analytic Server                                                                             14JAN2018  
 ---SAS Visual Analytics Hub                                                                             14JAN2018  
 ---SAS Visual Analytics Services                                                                        14JAN2018  
 ---SAS Visual Analytics Base                                                                            13JAN2018  
 ---Advanced Programming for LASR Analytic Server                                                        14JAN2018  
 ---SAS Visual Analytics Server Components                                                               14JAN2018  
 ---Visual Analytics Explorer                                                                            14JAN2018

Thanks!

art297 · Posted 07-10-2017 02:23 PM

Contact SAS tech support. They'll probably just walk you through re-installing SAS, but you definitey should have access to proc freq. I have to wonder what else you might be missing.

Art, CEO, AnalystFinder.com

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away