BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
HannahPhD
Obsidian | Level 7

Hi Everybody!

 

I'm trying to aggregate a variable, let's call it Image, between two specific dates for each company Id. So Imagine I have a database with three columns: Company ID, Date, Image. I want to calculate that for each company ID what is the value of image variable (sum or average) between for instance January 17th of 2008 and February 16th of 2008. It's a panel data. I have multiple observations for each ID daily from 2007 to 2015. I want to aggregate each variable's value between specific dates for each company ID. 

If that helps, I hope I'm clear enough, for company 83976, I want to have a new variable summing the values between 1/1/2007 and 1/30/2007 for image. 

 

I read it should work with proc SQL but I didn't find a good example. 

 Company Date Image 

83976  1/1/2007 12

83976 1/2/2007   9

83976  1/3/2007  3 

 

Thank you!

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @HannahPhD,

 

Thanks for providing sample data. I think the code could look like this:

 

/* Create hypothetical event data */

data events;
input CompanyID startdt :mmddyy. enddt :mmddyy.;
format startdt enddt mmddyy10.;
cards;
10107 01/01/2007 01/03/2007
10107 01/11/2007 01/22/2007
11081 03/06/2008 03/07/2008
11081 05/11/2008 05/13/2008
12060 03/14/2007 03/17/2007
;

/* Create the desired summary dataset */

proc sql;
create table want as
select a.CompanyID,
       cat(put(b.startdt,mmddyy10.),' - ',put(b.enddt,mmddyy10.)) as Date_range length=23,
       sum(Images)         as Sum_Images,
       sum(ImagesUniverse) as Sum_ImagesUniverse,
       sum(Microblogs)     as Sum_Microblogs
from have a, events b
where a.CompanyID=b.CompanyID & b.startdt<=a.date<=b.enddt
group by a.CompanyID, Date_range;
quit;

I'll write explanations in a separate post.

 

Edit: Added length specification for variable Date_range to save disk space.
If dataset WANT was to be used for further processing (rather than reporting), startdt and enddt should be selected as separate variables, not combined to a character variable. In this case, Date_range would be replaced by startdt, enddt in the GROUP BY clause.

View solution in original post

12 REPLIES 12
ballardw
Super User

Do you have the start/end date pairs for each company in a data set (or if all the date intervals are the same just the date pairs)? What would you like the data to look like in the output? And are the daes in you data SAS date value variables or character?

 

One example that would not be easily extensible but would work for a single date pair across all companies:

proc sql;
   create table want as
   select company, sum(image) as ImageSum
   from have 
   where date between '01JAN2007'd and '30JAN2007'd
   group by company;
run;

If the date pairs are in a data set that could be joined with your existing data to create the groups keeping the start and end dates so you would know which interval the data represents.

 

HannahPhD
Obsidian | Level 7

Thank you for replying! Your syntax works and very easy for me t understand! However, now I want to change yours to fit my multiple companies with different time frames:

proc sql;
create table want as
select permno, sum(images) as ImagesSum
from sasuser.infegysubset(where=(companyid='83976'))
where date between '30dec2007'd and '30JAN2008'd

where date between '27dec2008'd and '27JAN2009'd

from sasuser.infegysubset(where=(companyid='10107'))

where date between '30June2008'd and '30July2008'd

where date between '27NoV2008'd and '27DEC2009'd

group by permno;
quit;

 

So my question is now how I can do it for each company with different dates(the one I pasted above gives me error because of second where). And how efficiently do it for multiple companies. The codes I pasted above doesnt work. I just wanted to show you what I need. So I have multiple companies each require different dates to be specified to acquire the sum variable.

FreelanceReinh
Jade | Level 19

You're welcome. I'm sure, we'll find a solution. We only need to know the structure of the input tables and the desired output table.

 

Your new PROC SQL step suggests that the WANT table will have only two variables: permno (the meaning of which is not yet clear to me) and ImagesSum. So, no companyid and no dates? In particular, WANT would contain only summary data, not the original records, enriched with summary data?

 

And we have two date ranges per company? That's good to know. The code will then be flexible enough to handle varying numbers of date ranges per company.

 

Would the start and end dates come from a dataset like this?

companyid  startdt      enddt
10107      06/30/2008   07/30/2008
10107      11/27/2008   12/27/2009
83976      12/30/2007   01/30/2008
83976      12/27/2008   01/27/2009

Does SASUSER.INFEGYSUBSET look like the table in your initial post, but with an additional grouping variable permno?

 

Sorry for asking so many questions, but complete specifications facilitate program development.

 

I'll be back tomorrow (CET), but I hope the American and Asian "colleagues" will provide support in the meantime, if necessary.

 

HannahPhD
Obsidian | Level 7

Hello!

I have uploaded a sample of my data. I want the new data set that SQL creates have company Id and sum variable for each company ID. However, as you can see in my data, for each company ID I have a range from 2007 to 2015 and daily value of each variable(images). Now, I am trying to make another data set based on some events, those events are date ranges for each company like from Dec 31st of 2007 to 30th january 2008 for company 10107. Each company may have different date ranges requiring sum variable for each of these specified ranges. So, the new  table should allow me to see a company ID and sum variable attributable to that date range which I will speicify in the command.

Also, Can I see the date range in the output too?

I was wonderring if I could do multiple dates and multiple companies all in one syntax.

 

Thanks Again!

 

 

FreelanceReinh
Jade | Level 19

Hello @HannahPhD,

 

Thanks for providing sample data. I think the code could look like this:

 

/* Create hypothetical event data */

data events;
input CompanyID startdt :mmddyy. enddt :mmddyy.;
format startdt enddt mmddyy10.;
cards;
10107 01/01/2007 01/03/2007
10107 01/11/2007 01/22/2007
11081 03/06/2008 03/07/2008
11081 05/11/2008 05/13/2008
12060 03/14/2007 03/17/2007
;

/* Create the desired summary dataset */

proc sql;
create table want as
select a.CompanyID,
       cat(put(b.startdt,mmddyy10.),' - ',put(b.enddt,mmddyy10.)) as Date_range length=23,
       sum(Images)         as Sum_Images,
       sum(ImagesUniverse) as Sum_ImagesUniverse,
       sum(Microblogs)     as Sum_Microblogs
from have a, events b
where a.CompanyID=b.CompanyID & b.startdt<=a.date<=b.enddt
group by a.CompanyID, Date_range;
quit;

I'll write explanations in a separate post.

 

Edit: Added length specification for variable Date_range to save disk space.
If dataset WANT was to be used for further processing (rather than reporting), startdt and enddt should be selected as separate variables, not combined to a character variable. In this case, Date_range would be replaced by startdt, enddt in the GROUP BY clause.

FreelanceReinh
Jade | Level 19

Hello again,

 

You wrote that you would specify the date range "in the command." I'd suggest you rather create a dataset EVENTS containing the date ranges as shown in the code. Thus you have the dates in a dataset where they belong and not in your source code. PROC SQL can then pull them from this dataset and assign them to the records in your large HAVE dataset according to the WHERE condition.

 

As you can see in the hypothetical event data, the number of date ranges can vary from company to company and there may be companies (such as that with CompanyID 11308 in the example) which do not have event data at all. Likewise, it would be no problem if there were event data for companies available without records in your HAVE dataset. Currently, the WANT dataset with summary data would cover only companies with records in both input datasets (but this could be changed, if necessary).

 

I've created three sum variables, one for each original count variable. Or did you want a vertically structured dataset with only one sum variable and an additional variable indicating what the sum variable represents ("Images", "ImagesUniverse" or "Microblogs") in the respective observation?

 

Obviously, the format of the Date_range variable can be adapted easily to your needs. You could even add an additional variable containing a short title or keyword describing the event, if that made sense. (This information should be stored in the EVENTS dataset.)

mrtitch
Calcite | Level 5

How would you modify this code to reference Sum_Images in follow-on processing as if it is  variable in itself? 

I'm calculating Exposure At Default (EAD) for 5 mortgages for which I have create variables that representing the Unpaid Balance (UPB) for the next 3 months. Idea is to sum all 5 UPB values next month, and save it in EAD1. Do the same for the month thereafter, and save it into variable EAD2. Lastly, do the same for the 3rd month, and save it into EAD3. 

 

I've tried to incorporate the retain statement into the SQL code, but SAS does not like that.

 

Thank you for any advice you may provide.

 

Respectfully,

 

Marshall

FreelanceReinh
Jade | Level 19

Hello Marshall,

 

Welcome to SAS Support Communities. Next time please open a new thread rather than posting in old topics. I hope some administrator or super user will be so kind as to move your question to a new place. (You can easily refer to old topics by providing links.)

 

If you had posted your sample data in the form of a data step (as shown below), I wouldn't have used simplified data instead:

data upb;
length loan_id $12;
input loan_id UPB_Pred1-UPB_Pred3;
cards;
1 1067 1065 1063
2 1370 1367 1365
3  816  813  810
4  384  383  382
5 2372 2356 2340
;

 

It seems to me that the structure of your data differs from that in the old 2016 thread you refer to. To calculate the totals of UPB_Pred1, UPB_Pred2 and UPB_Pred3 and to make them available in variables EAD1, EAD2 and EAD3, respectively, you can use PROC SUMMARY:

proc summary data=upb;
var UPB_Pred1-UPB_Pred3;
output out=ead sum=EAD1-EAD3;
run;

PROC SQL is suitable as well (and does not automatically add variables _TYPE_ and _FREQ_ to the output dataset):

proc sql;
create table ead as
select sum(UPB_Pred1) as EAD1,
       sum(UPB_Pred2) as EAD2,
       sum(UPB_Pred3) as EAD3
from upb;
quit;

 

mrtitch
Calcite | Level 5

Thank you so much for your code recommendations. I tried the PROC SQL and it worked perfectly. It is very simple to understand and utilize for future applications. In the future I will be sure to create a new threat and copy the link from the original/similar thread. 

 

Thank you again!

HannahPhD
Obsidian | Level 7
Thank you! It worked perfectly!
FreelanceReinh
Jade | Level 19

Hi @HannahPhD,

 

Yes, this could be done with PROC SQL, as shown below:

/* Create test data */

data have;
input Company $ Date :mmddyy10. Image;
format date mmddyy10.;
cards;
83976 1/1/2007 12
83976 1/2/2007 9
83976 1/3/2007 3 
83977 1/1/2007 7
83977 1/2/2007 6
83977 2/1/2007 3 
;

/* Define date range */

%let startdt = '01JAN2007'd;
%let enddt   = '30JAN2007'd;

/* Create dataset with the desired additional variables */

proc sql;
create table want as
select a.*, sum_image, avg_image from
have a left join
(select company, sum(image) as sum_image,
                 avg(image) as avg_image
 from have
 where &startdt<=date<=&enddt
 group by company) b
on a.company=b.company & &startdt<=date<=&enddt
order by company, date;
quit;

proc print data=want;
run;

It's possible, however, that other techniques would be more suitable. This depends, for example, on the number of variables to be summarized. Also, it would be important to know more details on how the date ranges will be defined.

 

The code above assigns the calculated summary statistics only to observations whose dates fall into the date range (see the last observation of the second company), but this could be changed easily.

PavanKumar2
Fluorite | Level 6

I want the same thing in SAS VA. is it possible?

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 10228 views
  • 5 likes
  • 5 in conversation