I have a dataset where each oberservation is identified with a unique ID and year (Data Set1). I have another data set also with ID and years and values for several other variables. For each observation in Data Set1, I want to look at Data Set 2 and if that same ID exists then I want to sum up the values for the previous 3 years in Data Set 2.
For example I'm looking at something like this:
Dataset1
Year | ID |
2014 | 456 |
2014 | 678 |
2014 | 709 |
2015 | 346 |
2015 | 456 |
2015 | 567 |
2016 | 678 |
2016 | 709 |
2016 | 891 |
Dataset2
ID | Year | Math | Science | Music | Latin |
300 | 2011 | 56 | 32 | 54 | 45 |
456 | 2011 | 85 | 87 | 89 | 95 |
678 | 2011 | 76 | 87 | 56 | 76 |
456 | 2012 | 88 | 89 | 90 | 93 |
567 | 2012 | 50 | 54 | 56 | 57 |
456 | 2013 | 86 | 87 | 93 | 94 |
300 | 2013 | 13 | 34 | 54 | 35 |
567 | 2013 | 45 | 48 | 56 | 59 |
456 | 2014 | 85 | 87 | 88 | 95 |
200 | 2014 | 56 | 58 | 54 | 56 |
678 | 2014 | 70 | 76 | 75 | 78 |
567 | 2014 | 58 | 58 | 50 | 55 |
678 | 2015 | 74 | 76 | 78 | 70 |
300 | 2015 | 34 | 32 | 14 | 26 |
The output I'm looking at:
ID | Year | Math | Science | Music | Latin |
456 | 2014 | 259 | 263 | 272 | 282 |
678 | 2014 | 76 | 87 | 56 | 76 |
709 | 2014 | 0 | 0 | 0 | 0 |
346 | 2015 | 0 | 0 | 0 | 0 |
456 | 2015 | 259 | 263 | 271 | 282 |
567 | 2015 | 153 | 160 | 162 | 171 |
678 | 2016 | 144 | 152 | 153 | 148 |
709 | 2016 | 0 | 0 | 0 | 0 |
891 | 2016 | 0 | 0 | 0 | 0 |
Thank you!
Hi aaou
The merge can be done using proc sql
data ds1;
input year id;
cards;
2014 456
2014 678
2014 709
2015 346
2015 456
2015 567
2016 678
2016 709
2016 891
;
data ds2;
input id year Math Science Music Latin;
cards;
300 2011 56 32 54 45
456 2011 85 87 89 95
678 2011 76 87 56 76
456 2012 88 89 90 93
567 2012 50 54 56 57
456 2013 86 87 93 94
300 2013 13 34 54 35
567 2013 45 48 56 59
456 2014 85 87 88 95
200 2014 56 58 54 56
678 2014 70 76 75 78
567 2014 58 58 50 55
678 2015 74 76 78 70
300 2015 34 32 14 26
;
run;
proc sql ;
create table ds3 as
select ds1.year, ds1.id,Math,Science,Music,Latin from ds1 left join ds2
on ds1.id=ds2.id and ds1.year-ds2.year between 1 and 3;
proc means data=ds3 noprint ;
class year id;
types year*id;
output out=ds4 sum=;
run;
year | id | _TYPE_ | _FREQ_ | Math | Science | Music | Latin |
2014 | 456 | 3 | 3 | 259 | 263 | 272 | 282 |
2014 | 678 | 3 | 1 | 76 | 87 | 56 | 76 |
2014 | 709 | 3 | 1 | . | . | . | . |
2015 | 346 | 3 | 1 | . | . | . | . |
2015 | 456 | 3 | 3 | 259 | 263 | 271 | 282 |
2015 | 567 | 3 | 3 | 153 | 160 | 162 | 171 |
2016 | 678 | 3 | 2 | 144 | 152 | 153 | 148 |
2016 | 709 | 3 | 1 | . | . | . | . |
2016 | 891 | 3 | 1 | . | . | . | . |
Hi aaou
The merge can be done using proc sql
data ds1;
input year id;
cards;
2014 456
2014 678
2014 709
2015 346
2015 456
2015 567
2016 678
2016 709
2016 891
;
data ds2;
input id year Math Science Music Latin;
cards;
300 2011 56 32 54 45
456 2011 85 87 89 95
678 2011 76 87 56 76
456 2012 88 89 90 93
567 2012 50 54 56 57
456 2013 86 87 93 94
300 2013 13 34 54 35
567 2013 45 48 56 59
456 2014 85 87 88 95
200 2014 56 58 54 56
678 2014 70 76 75 78
567 2014 58 58 50 55
678 2015 74 76 78 70
300 2015 34 32 14 26
;
run;
proc sql ;
create table ds3 as
select ds1.year, ds1.id,Math,Science,Music,Latin from ds1 left join ds2
on ds1.id=ds2.id and ds1.year-ds2.year between 1 and 3;
proc means data=ds3 noprint ;
class year id;
types year*id;
output out=ds4 sum=;
run;
year | id | _TYPE_ | _FREQ_ | Math | Science | Music | Latin |
2014 | 456 | 3 | 3 | 259 | 263 | 272 | 282 |
2014 | 678 | 3 | 1 | 76 | 87 | 56 | 76 |
2014 | 709 | 3 | 1 | . | . | . | . |
2015 | 346 | 3 | 1 | . | . | . | . |
2015 | 456 | 3 | 3 | 259 | 263 | 271 | 282 |
2015 | 567 | 3 | 3 | 153 | 160 | 162 | 171 |
2016 | 678 | 3 | 2 | 144 | 152 | 153 | 148 |
2016 | 709 | 3 | 1 | . | . | . | . |
2016 | 891 | 3 | 1 | . | . | . | . |
Hi John,
Thank you very much! much apreciated. 🙂
I'm new to programming,so only if you have the time, would you mind dissecting your code a bit; so that I understand each statement better? Only if you don't mind.
Also in the original dataset I'm dealing with, I have a large number of variables that I need to add up. So instead of listing each variable when I'm creating the table using sql,is there an easier way of referring to all the variables I need to add in this manner?
Thank you!
Hi.
For the first question, if you google "SQL Joins" then you'll find a wealth of info (don't include "sas" in your search).
Most simple SQL code can then be used within a SAS Proc SQL procedure step, but it's not exactly the same.
"Left join" (a.k.a left outer join) keeps all of the observations from the first named dataset (ds1) and only those observations from the 2nd datastep that match the join conditions.
The join conditions are defined in the On clause .
Classic SQL wouldn't allow the 'between' comparison to be used ; that is unique to SAS.
To answer your second question, replace the proc sql step with the one below.
This uses a wildcard (*) in the select clause to keep all variables in both datasets.
But to allow that I had to use sas dataset options to rename the year and id variables in ds2.
This is possible because sql joins do not need the matching colums (in the ON clause) to have the same names.
Compare that to data step merge using By , where the By variables have to have the same name in both datasets.
proc sql ;
create table ds3(drop=id2 year2) as
select * from ds1 left join ds2 (rename=(year=year2 id=id2))
on ds1.id=ds2.id2 and ds1.year-ds2.year2 between 1 and 3;
quit;
Thank you very much.
Your code works perfectly for the example I have given. However, when I run it with my actual datasets it gives an error. I have copy pasted the code and the error. Could you show me what I'm doing wrong?
proc sql ;
create table ipo_lag3(drop=CUSIP2 Year2) as
select * from ma.sample_firms left join ma.ipo_tranposed_new (rename=(Year=Year2 CUSIP=CUSIP2))
on ma.sample_firms.CUSIP=ma.ipo_tranposed_new.CUSIP2 and ma.sample_firms.Year-ma.ipo_tranposed_new.Year2 between 1 and 3;
quit;
*ds1 is ma.sample_firms, ds2 is ma.ipo_transposed_new, ds3 is ipo_lag3 and ID is CUSIP;
I have attached a screenshot of the error I get.
Thanks!
You will have to use a table alias for each of the data tables, using the keyword AS. For an example see example 32.27 at https://onlinecourses.science.psu.edu/stat482/node/46
Thanks!
I changed it as follows:
proc sql ;
create table ipo_lag3(drop=CUSIP2 Year2) as
select * from ma.sample_firms as ds1 left join ma.ipo_tranposed_new as ds2(rename=(Year=Year2 CUSIP=CUSIP2))on
ds1.CUSIP=ds2.CUSIP2 and ds1.Year-ds2.Year2 between 1 and 3;
quit;
But I get the attached error.
I expect the dataset option has to be moved to be immediately after the dataset name and not after the alias name.
That is:
left join ma.ipo_tranposed_new (rename=(Year=Year2 CUSIP=CUSIP2)) as ds2
Thanks! Sorry to keep on troubling you,
Then I get the attached error. The variables in ds2 that I want to add up are numerical. And the year and ID are character. I double checked. So I'm not sure why I'm getting this error.
I'm assuming it's reffering to this portion of the code
ds1.Year-ds2.Year2 between 1 and 3;
You're right. You can't do an arithmetic expression on a character variable.
In certain cases, like in a DATA step, SAS will recognise this. It will do automatic data conversion and write a comment to the log "Character values have been converted to numeric values at the places given by: (Line) : (Column)."
Proc SQL doesn't do this automatic data conversion.
See http://support.sas.com/kb/24/590.html on using input function to convert character to numeric ( and vice-versa using the put function).
You could make the change at an earlier step , or you can do it in the Proc SQL step as below.
proc sql ;
create table ds3(drop=id2 year2) as
select * from ds1 left join ds2 (rename=(year=year2 id=id2))
on ds1.id=ds2.id2 and input(ds1.year,4.)-input(ds2.year2,4.) between 1 and 3;
quit;
data ds1;
input year id;
cards;
2014 456
2014 678
2014 709
2015 346
2015 456
2015 567
2016 678
2016 709
2016 891
;
data ds2;
input id year Math Science Music Latin;
cards;
300 2011 56 32 54 45
456 2011 85 87 89 95
678 2011 76 87 56 76
456 2012 88 89 90 93
567 2012 50 54 56 57
456 2013 86 87 93 94
300 2013 13 34 54 35
567 2013 45 48 56 59
456 2014 85 87 88 95
200 2014 56 58 54 56
678 2014 70 76 75 78
567 2014 58 58 50 55
678 2015 74 76 78 70
300 2015 34 32 14 26
;
run;
proc sql;
create table want as
select *,
coalesce((select sum(Math) from ds2 where id=a.id and year between a.year-3 and a.year-1),0) as Math,
coalesce((select sum(Science) from ds2 where id=a.id and year between a.year-3 and a.year-1),0) as Science,
coalesce((select sum(Music) from ds2 where id=a.id and year between a.year-3 and a.year)-1,0) as Music,
coalesce((select sum(Latin) from ds2 where id=a.id and year between a.year-3 and a.year)-1,0) as Latin
from ds1 as a;
quit;
Hi thanks for your response. Suppose instead of Math, Science, Music and Latin I had a large number of other variables...then how would you alter the code...to have it in a way that I don't have to type each and every variable?
OK. you can make a macro variable to hold it or use call execute(): data ds1; input year id; cards; 2014 456 2014 678 2014 709 2015 346 2015 456 2015 567 2016 678 2016 709 2016 891 ; data ds2; input id year Math Science Music Latin; cards; 300 2011 56 32 54 45 456 2011 85 87 89 95 678 2011 76 87 56 76 456 2012 88 89 90 93 567 2012 50 54 56 57 456 2013 86 87 93 94 300 2013 13 34 54 35 567 2013 45 48 56 59 456 2014 85 87 88 95 200 2014 56 58 54 56 678 2014 70 76 75 78 567 2014 58 58 50 55 678 2015 74 76 78 70 300 2015 34 32 14 26 ; run; proc transpose data=ds2(obs=0) out=temp(where=(upcase(_name_) not in ('ID' 'YEAR'))); var _all_; run; data _null_; set temp end=last; if _n_=1 then call execute('proc sql;create table want as select *'); call execute(cat(',coalesce((select sum(',_name_,') from ds2 where id=a.id and year between a.year-3 and a.year-1),0) as ',_name_)); if last then call execute('from ds1 as a;quit;'); run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.