DATA Step, Macro, Functions and more

Summing up values for lagged years

Accepted Solution Solved
Reply
Contributor
Posts: 40
Accepted Solution

Summing up values for lagged years

[ Edited ]

I have a dataset where each oberservation is identified with a unique ID and year (Data Set1). I have another data set also with ID and years and values for several other variables. For each observation in Data Set1, I want to look at Data Set 2 and if that same ID exists then I want to sum up the values for the previous 3 years in Data Set 2.

 

For example I'm looking at something like this:

 

 Dataset1

YearID
2014456
2014678
2014709
2015346
2015456
2015567
2016678
2016709
2016891

 

Dataset2

IDYearMathScienceMusicLatin
300201156325445
456201185878995
678201176875676
456201288899093
567201250545657
456201386879394
300201313345435
567201345485659
456201485878895
200201456585456
678201470767578
567201458585055
678201574767870
300201534321426

 

The output I'm looking at:

IDYearMathScienceMusicLatin
4562014259263272282
678201476875676
70920140000
34620150000
4562015259263271282
5672015153160162171
6782016144152153148
70920160000
89120160000

 

Thank you!


Accepted Solutions
Solution
‎06-29-2016 02:17 PM
Contributor
Posts: 39

Re: Summing up values for lagged years

Hi aaou

 

The merge can be done using proc sql

 

data ds1;
input year id;
cards;
2014 456
2014 678
2014 709
2015 346
2015 456
2015 567
2016 678
2016 709
2016 891
;

data ds2;
input id year Math Science Music Latin;
cards;
300 2011 56 32 54 45
456 2011 85 87 89 95
678 2011 76 87 56 76
456 2012 88 89 90 93
567 2012 50 54 56 57
456 2013 86 87 93 94
300 2013 13 34 54 35
567 2013 45 48 56 59
456 2014 85 87 88 95
200 2014 56 58 54 56
678 2014 70 76 75 78
567 2014 58 58 50 55
678 2015 74 76 78 70
300 2015 34 32 14 26
;
run;

proc sql ;
create table ds3 as
select ds1.year, ds1.id,Math,Science,Music,Latin from ds1 left join ds2
on ds1.id=ds2.id and ds1.year-ds2.year between 1 and 3;

 

proc means data=ds3 noprint ;
class year id;
types year*id;
output out=ds4 sum=;
run;

 

yearid_TYPE__FREQ_MathScienceMusicLatin
201445633259263272282
20146783176875676
201470931....
201534631....
201545633259263271282
201556733153160162171
201667832144152153148
201670931....
201689131....

 

View solution in original post


All Replies
Solution
‎06-29-2016 02:17 PM
Contributor
Posts: 39

Re: Summing up values for lagged years

Hi aaou

 

The merge can be done using proc sql

 

data ds1;
input year id;
cards;
2014 456
2014 678
2014 709
2015 346
2015 456
2015 567
2016 678
2016 709
2016 891
;

data ds2;
input id year Math Science Music Latin;
cards;
300 2011 56 32 54 45
456 2011 85 87 89 95
678 2011 76 87 56 76
456 2012 88 89 90 93
567 2012 50 54 56 57
456 2013 86 87 93 94
300 2013 13 34 54 35
567 2013 45 48 56 59
456 2014 85 87 88 95
200 2014 56 58 54 56
678 2014 70 76 75 78
567 2014 58 58 50 55
678 2015 74 76 78 70
300 2015 34 32 14 26
;
run;

proc sql ;
create table ds3 as
select ds1.year, ds1.id,Math,Science,Music,Latin from ds1 left join ds2
on ds1.id=ds2.id and ds1.year-ds2.year between 1 and 3;

 

proc means data=ds3 noprint ;
class year id;
types year*id;
output out=ds4 sum=;
run;

 

yearid_TYPE__FREQ_MathScienceMusicLatin
201445633259263272282
20146783176875676
201470931....
201534631....
201545633259263271282
201556733153160162171
201667832144152153148
201670931....
201689131....

 

Contributor
Posts: 40

Re: Summing up values for lagged years

Hi John,

Thank you very much! much apreciated. :-)

Contributor
Posts: 40

Re: Summing up values for lagged years

I'm new to programming,so only if you have the time, would you mind dissecting your code a bit; so that I understand each statement better? Only if you don't mind. 

 

Also in the original dataset I'm dealing with, I have a large number of variables that I need to add up. So instead of listing each variable when I'm creating the table using sql,is there an easier way of referring to all the variables I need to add in this manner?

 

Thank you!

Contributor
Posts: 39

Re: Summing up values for lagged years

Hi.

 

For the first question, if you google "SQL Joins" then you'll find a wealth of info (don't include "sas" in your search).

Most simple SQL code can then be used within a SAS Proc SQL procedure step, but it's not exactly the same.

"Left join" (a.k.a left outer join) keeps all of the observations from the first named dataset (ds1) and only those observations from the 2nd datastep that match the join conditions.

The join conditions are defined in the On clause .

Classic SQL wouldn't allow the 'between' comparison to be used ; that is unique to SAS.

 

 

To answer your second question, replace the proc sql step with the one below.

This uses a wildcard (*) in the select clause to keep all variables in both datasets.

But to allow that I had to use sas dataset options to rename the year and id variables in ds2.

This is possible because sql joins do not need the matching colums (in the ON clause) to have the same names.

Compare that to data step merge using By , where the By variables have to have the same name in both datasets.

 


proc sql ;
create table ds3(drop=id2 year2) as
select * from ds1 left join ds2 (rename=(year=year2 id=id2))
on ds1.id=ds2.id2 and ds1.year-ds2.year2 between 1 and 3;
quit;

 

 

 

Contributor
Posts: 40

Re: Summing up values for lagged years

Thank you very much.

 

Your code works perfectly for the example I have given. However, when I run it with my actual datasets it gives an error. I have copy pasted the code and the error. Could you show me what I'm doing wrong?

 

proc sql ;
create table ipo_lag3(drop=CUSIP2 Year2) as
select * from ma.sample_firms left join ma.ipo_tranposed_new (rename=(Year=Year2 CUSIP=CUSIP2))
on ma.sample_firms.CUSIP=ma.ipo_tranposed_new.CUSIP2 and ma.sample_firms.Year-ma.ipo_tranposed_new.Year2 between 1 and 3;
quit;

 

*ds1 is ma.sample_firms, ds2 is ma.ipo_transposed_new, ds3 is ipo_lag3 and ID is CUSIP;

 

I have attached a screenshot of the error I get.

 

 

Thanks!


error.png
Contributor
Posts: 39

Re: Summing up values for lagged years

You will have to use a table alias for each of the data tables, using the keyword AS. For an example see example 32.27 at https://onlinecourses.science.psu.edu/stat482/node/46

Contributor
Posts: 40

Re: Summing up values for lagged years

Thanks!

I changed it as follows:

 

proc sql ;
create table ipo_lag3(drop=CUSIP2 Year2) as
select * from ma.sample_firms as ds1 left join ma.ipo_tranposed_new as ds2(rename=(Year=Year2 CUSIP=CUSIP2))on
ds1.CUSIP=ds2.CUSIP2 and ds1.Year-ds2.Year2 between 1 and 3;
quit;

 

But I get the attached error. 

 

 


error.png
Contributor
Posts: 39

Re: Summing up values for lagged years

I expect the dataset option has to be moved to be immediately after the dataset name and not after the alias name.

That is:

left join ma.ipo_tranposed_new (rename=(Year=Year2 CUSIP=CUSIP2)) as ds2

Contributor
Posts: 40

Re: Summing up values for lagged years

[ Edited ]

Thanks! Sorry to keep on troubling you,

Then I get the attached error. The variables in ds2 that I want to add up are numerical. And the year and ID are character. I double checked. So I'm not sure why I'm getting this error.

 

I'm assuming it's reffering to this portion of the code

 

ds1.Year-ds2.Year2 between 1 and 3;


error.png
Contributor
Posts: 39

Re: Summing up values for lagged years

[ Edited ]

You're right. You can't do an arithmetic expression on a character variable.

 

In certain cases, like in a DATA step, SAS will recognise this. It will do automatic data conversion and write a comment to the log "Character values have been converted to numeric values at the places given by: (Line) : (Column)."

 

Proc SQL doesn't do this automatic data conversion.

 

See http://support.sas.com/kb/24/590.html on using input function to convert character to numeric ( and vice-versa using the put function).

 

You could make the change at an earlier step , or you can do it in the Proc SQL step as below.

 

proc sql ;

create table ds3(drop=id2 year2) as

select * from ds1 left join ds2 (rename=(year=year2 id=id2))

on ds1.id=ds2.id2 and input(ds1.year,4.)-input(ds2.year2,4.) between 1 and 3;

quit;

 

Super User
Posts: 9,687

Re: Summing up values for lagged years

data ds1;
input year id;
cards;
2014 456
2014 678
2014 709
2015 346
2015 456
2015 567
2016 678
2016 709
2016 891
;
data ds2;
input id year Math Science Music Latin;
cards;
300 2011 56 32 54 45
456 2011 85 87 89 95
678 2011 76 87 56 76
456 2012 88 89 90 93
567 2012 50 54 56 57
456 2013 86 87 93 94
300 2013 13 34 54 35
567 2013 45 48 56 59
456 2014 85 87 88 95
200 2014 56 58 54 56
678 2014 70 76 75 78
567 2014 58 58 50 55
678 2015 74 76 78 70
300 2015 34 32 14 26
;
run;
proc sql;
create table want as
 select *,
  coalesce((select sum(Math) from ds2 where id=a.id and year between a.year-3 and a.year-1),0) as Math,
  coalesce((select sum(Science) from ds2 where id=a.id and year between a.year-3 and a.year-1),0) as Science,
  coalesce((select sum(Music) from ds2 where id=a.id and year between a.year-3 and a.year)-1,0) as Music,
  coalesce((select sum(Latin) from ds2 where id=a.id and year between a.year-3 and a.year)-1,0) as Latin
   from ds1 as a;
quit;
Contributor
Posts: 40

Re: Summing up values for lagged years

Hi thanks for your response. Suppose instead of Math, Science, Music and Latin I had a large number of other variables...then how would you alter the code...to have it in a way that I don't have to type each and every variable?

Super User
Posts: 9,687

Re: Summing up values for lagged years

OK. you can make a macro variable to hold it or use call execute():

data ds1;
input year id;
cards;
2014 456
2014 678
2014 709
2015 346
2015 456
2015 567
2016 678
2016 709
2016 891
;
data ds2;
input id year Math Science Music Latin;
cards;
300 2011 56 32 54 45
456 2011 85 87 89 95
678 2011 76 87 56 76
456 2012 88 89 90 93
567 2012 50 54 56 57
456 2013 86 87 93 94
300 2013 13 34 54 35
567 2013 45 48 56 59
456 2014 85 87 88 95
200 2014 56 58 54 56
678 2014 70 76 75 78
567 2014 58 58 50 55
678 2015 74 76 78 70
300 2015 34 32 14 26
;
run;
proc transpose data=ds2(obs=0) out=temp(where=(upcase(_name_) not in ('ID' 'YEAR')));
 var _all_;
run;
data _null_;
 set temp end=last;
 if _n_=1 then call execute('proc sql;create table want as select *');
 call execute(cat(',coalesce((select sum(',_name_,') from ds2 where id=a.id and year between a.year-3 and a.year-1),0) as ',_name_));
 if last then call execute('from ds1 as a;quit;');
run;


☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 483 views
  • 3 likes
  • 3 in conversation