Aggregating variables (sum or mean) between two specific dates

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 17
Accepted Solution

Aggregating variables (sum or mean) between two specific dates

Hi Everybody!

 

I'm trying to aggregate a variable, let's call it Image, between two specific dates for each company Id. So Imagine I have a database with three columns: Company ID, Date, Image. I want to calculate that for each company ID what is the value of image variable (sum or average) between for instance January 17th of 2008 and February 16th of 2008. It's a panel data. I have multiple observations for each ID daily from 2007 to 2015. I want to aggregate each variable's value between specific dates for each company ID. 

If that helps, I hope I'm clear enough, for company 83976, I want to have a new variable summing the values between 1/1/2007 and 1/30/2007 for image. 

 

I read it should work with proc SQL but I didn't find a good example. 

 Company Date Image 

83976  1/1/2007 12

83976 1/2/2007   9

83976  1/3/2007  3 

 

Thank you!

 

 

 

 


Accepted Solutions
Solution
‎01-19-2016 10:39 AM
Trusted Advisor
Posts: 1,116

Re: Aggregating variables (sum or mean) between two specific dates

[ Edited ]

Hello @HannahPhD,

 

Thanks for providing sample data. I think the code could look like this:

 

/* Create hypothetical event data */

data events;
input CompanyID startdt :mmddyy. enddt :mmddyy.;
format startdt enddt mmddyy10.;
cards;
10107 01/01/2007 01/03/2007
10107 01/11/2007 01/22/2007
11081 03/06/2008 03/07/2008
11081 05/11/2008 05/13/2008
12060 03/14/2007 03/17/2007
;

/* Create the desired summary dataset */

proc sql;
create table want as
select a.CompanyID,
       cat(put(b.startdt,mmddyy10.),' - ',put(b.enddt,mmddyy10.)) as Date_range length=23,
       sum(Images)         as Sum_Images,
       sum(ImagesUniverse) as Sum_ImagesUniverse,
       sum(Microblogs)     as Sum_Microblogs
from have a, events b
where a.CompanyID=b.CompanyID & b.startdt<=a.date<=b.enddt
group by a.CompanyID, Date_range;
quit;

I'll write explanations in a separate post.

 

Edit: Added length specification for variable Date_range to save disk space.
If dataset WANT was to be used for further processing (rather than reporting), startdt and enddt should be selected as separate variables, not combined to a character variable. In this case, Date_range would be replaced by startdt, enddt in the GROUP BY clause.

View solution in original post


All Replies
Super User
Posts: 11,128

Re: Aggregating variables (sum or mean) between two specific dates

Do you have the start/end date pairs for each company in a data set (or if all the date intervals are the same just the date pairs)? What would you like the data to look like in the output? And are the daes in you data SAS date value variables or character?

 

One example that would not be easily extensible but would work for a single date pair across all companies:

proc sql;
   create table want as
   select company, sum(image) as ImageSum
   from have 
   where date between '01JAN2007'd and '30JAN2007'd
   group by company;
run;

If the date pairs are in a data set that could be joined with your existing data to create the groups keeping the start and end dates so you would know which interval the data represents.

 

Occasional Contributor
Posts: 17

Re: Aggregating variables (sum or mean) between two specific dates

Thank you for replying! Your syntax works and very easy for me t understand! However, now I want to change yours to fit my multiple companies with different time frames:

proc sql;
create table want as
select permno, sum(images) as ImagesSum
from sasuser.infegysubset(where=(companyid='83976'))
where date between '30dec2007'd and '30JAN2008'd

where date between '27dec2008'd and '27JAN2009'd

from sasuser.infegysubset(where=(companyid='10107'))

where date between '30June2008'd and '30July2008'd

where date between '27NoV2008'd and '27DEC2009'd

group by permno;
quit;

 

So my question is now how I can do it for each company with different dates(the one I pasted above gives me error because of second where). And how efficiently do it for multiple companies. The codes I pasted above doesnt work. I just wanted to show you what I need. So I have multiple companies each require different dates to be specified to acquire the sum variable.

Trusted Advisor
Posts: 1,116

Re: Aggregating variables (sum or mean) between two specific dates

You're welcome. I'm sure, we'll find a solution. We only need to know the structure of the input tables and the desired output table.

 

Your new PROC SQL step suggests that the WANT table will have only two variables: permno (the meaning of which is not yet clear to me) and ImagesSum. So, no companyid and no dates? In particular, WANT would contain only summary data, not the original records, enriched with summary data?

 

And we have two date ranges per company? That's good to know. The code will then be flexible enough to handle varying numbers of date ranges per company.

 

Would the start and end dates come from a dataset like this?

companyid  startdt      enddt
10107      06/30/2008   07/30/2008
10107      11/27/2008   12/27/2009
83976      12/30/2007   01/30/2008
83976      12/27/2008   01/27/2009

Does SASUSER.INFEGYSUBSET look like the table in your initial post, but with an additional grouping variable permno?

 

Sorry for asking so many questions, but complete specifications facilitate program development.

 

I'll be back tomorrow (CET), but I hope the American and Asian "colleagues" will provide support in the meantime, if necessary.

 

Occasional Contributor
Posts: 17

Re: Aggregating variables (sum or mean) between two specific dates

Hello!

I have uploaded a sample of my data. I want the new data set that SQL creates have company Id and sum variable for each company ID. However, as you can see in my data, for each company ID I have a range from 2007 to 2015 and daily value of each variable(images). Now, I am trying to make another data set based on some events, those events are date ranges for each company like from Dec 31st of 2007 to 30th january 2008 for company 10107. Each company may have different date ranges requiring sum variable for each of these specified ranges. So, the new  table should allow me to see a company ID and sum variable attributable to that date range which I will speicify in the command.

Also, Can I see the date range in the output too?

I was wonderring if I could do multiple dates and multiple companies all in one syntax.

 

Thanks Again!

 

 

Solution
‎01-19-2016 10:39 AM
Trusted Advisor
Posts: 1,116

Re: Aggregating variables (sum or mean) between two specific dates

[ Edited ]

Hello @HannahPhD,

 

Thanks for providing sample data. I think the code could look like this:

 

/* Create hypothetical event data */

data events;
input CompanyID startdt :mmddyy. enddt :mmddyy.;
format startdt enddt mmddyy10.;
cards;
10107 01/01/2007 01/03/2007
10107 01/11/2007 01/22/2007
11081 03/06/2008 03/07/2008
11081 05/11/2008 05/13/2008
12060 03/14/2007 03/17/2007
;

/* Create the desired summary dataset */

proc sql;
create table want as
select a.CompanyID,
       cat(put(b.startdt,mmddyy10.),' - ',put(b.enddt,mmddyy10.)) as Date_range length=23,
       sum(Images)         as Sum_Images,
       sum(ImagesUniverse) as Sum_ImagesUniverse,
       sum(Microblogs)     as Sum_Microblogs
from have a, events b
where a.CompanyID=b.CompanyID & b.startdt<=a.date<=b.enddt
group by a.CompanyID, Date_range;
quit;

I'll write explanations in a separate post.

 

Edit: Added length specification for variable Date_range to save disk space.
If dataset WANT was to be used for further processing (rather than reporting), startdt and enddt should be selected as separate variables, not combined to a character variable. In this case, Date_range would be replaced by startdt, enddt in the GROUP BY clause.

Trusted Advisor
Posts: 1,116

Re: Aggregating variables (sum or mean) between two specific dates

Hello again,

 

You wrote that you would specify the date range "in the command." I'd suggest you rather create a dataset EVENTS containing the date ranges as shown in the code. Thus you have the dates in a dataset where they belong and not in your source code. PROC SQL can then pull them from this dataset and assign them to the records in your large HAVE dataset according to the WHERE condition.

 

As you can see in the hypothetical event data, the number of date ranges can vary from company to company and there may be companies (such as that with CompanyID 11308 in the example) which do not have event data at all. Likewise, it would be no problem if there were event data for companies available without records in your HAVE dataset. Currently, the WANT dataset with summary data would cover only companies with records in both input datasets (but this could be changed, if necessary).

 

I've created three sum variables, one for each original count variable. Or did you want a vertically structured dataset with only one sum variable and an additional variable indicating what the sum variable represents ("Images", "ImagesUniverse" or "Microblogs") in the respective observation?

 

Obviously, the format of the Date_range variable can be adapted easily to your needs. You could even add an additional variable containing a short title or keyword describing the event, if that made sense. (This information should be stored in the EVENTS dataset.)

Occasional Contributor
Posts: 17

Re: Aggregating variables (sum or mean) between two specific dates

Thank you! It worked perfectly!
Trusted Advisor
Posts: 1,116

Re: Aggregating variables (sum or mean) between two specific dates

Hi @HannahPhD,

 

Yes, this could be done with PROC SQL, as shown below:

/* Create test data */

data have;
input Company $ Date :mmddyy10. Image;
format date mmddyy10.;
cards;
83976 1/1/2007 12
83976 1/2/2007 9
83976 1/3/2007 3 
83977 1/1/2007 7
83977 1/2/2007 6
83977 2/1/2007 3 
;

/* Define date range */

%let startdt = '01JAN2007'd;
%let enddt   = '30JAN2007'd;

/* Create dataset with the desired additional variables */

proc sql;
create table want as
select a.*, sum_image, avg_image from
have a left join
(select company, sum(image) as sum_image,
                 avg(image) as avg_image
 from have
 where &startdt<=date<=&enddt
 group by company) b
on a.company=b.company & &startdt<=date<=&enddt
order by company, date;
quit;

proc print data=want;
run;

It's possible, however, that other techniques would be more suitable. This depends, for example, on the number of variables to be summarized. Also, it would be important to know more details on how the date ranges will be defined.

 

The code above assigns the calculated summary statistics only to observations whose dates fall into the date range (see the last observation of the second company), but this could be changed easily.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 1866 views
  • 5 likes
  • 3 in conversation