Hi All,
Here are what I have in my dataset. The year identifier -- fyear, the company identifier -- gvkey and the variable interested -- txt.
I want to have the sum of txt for each firm over a five-year period (year t to year t-4) and require at least three consecutive years of non-missing data. I am using SAS 9.4.
I got suggested to use PROC SUMMARY and NMISS output, so I referred to Syntax for them, but I cannot work it out. Then I tried to look for any similar examples to use those, which I could not find either. Could anyone please let me know what codes I need to use or give me some examples I can refer to?
Thanks a lot.
Jiaxin
Here I have simulated some annual data for three companies. The third company has missing values for some years. The data is re-arranged so that we can compute moving sums of 5 observations, then we compute the number of consecutive non-missing data points. Finally PROC SUMMARY computes the moving sums of any 5 year time period. If you don't want years where the number of consecutive non-missings is less than three, then you can delete these from the output.
/* Simulate some data */
data simul;
do company_id=1 to 3;
do year=2000 to 2019;
txt=rand('uniform');
/* Add in some missing values at random for company_id=3 */
if company_id=3 then do;
is_missing=rand('uniform');
if is_missing>0.5 then txt=.;
end;
output;
end;
end;
drop is_missing;
run;
data rearrange;
set simul;
do i=1 to 5;
group=i+_n_-1;
output;
end;
run;
proc sort data=rearrange;
by company_id group year;
run;
data rearrange1;
set rearrange;
by company_id group;
if first.group then consec=0;
if missing(txt) then consec=0;
else consec+1;
run;
proc summary nway data=rearrange1;
class company_id group;
var txt year consec;
output out=moving_sums sum(txt)=moving_sum min(year)=first_year_of_5
max(year)=last_year_of_5 max(consec)=max_consec;
run;
I want to have the sum of txt for each firm over a five-year period (year t to year t-4) and require at least three consecutive years of non-missing data.
Is your data daily, or monthly, or quarterly, or annual?
When you say you want three consecutive years with no missing data, is that no missing data from even a single day/month/quarter? Or is it even one good data point in a year?
My data is annual data, so what I mean by three consecutive years with no missing data is that no missing data for a year.
/* UNTESTED CODE */
data have2;
set have;
by company_id;
if first.company_id then consec=0;
/* Compute consecutive years */
if not missing(txt) then consec+1;
else consec=0;
run;
proc summary data=have2 nway;
class company_id;
var txt consec;
output out=want sum(txt)=sum_txt max(consec)=max_consec;
run;
Assumes the data is sorted by company_id and year.
The output data set WANT will have rows where max_consec is less than three, you would eliminate these.
@jiaxinyang wrote:
Hi All,
Here are what I have in my dataset. The year identifier -- fyear, the company identifier -- gvkey and the variable interested -- txt.
I want to have the sum of txt for each firm over a five-year period (year t to year t-4) and require at least three consecutive years of non-missing data. I am using SAS 9.4.
I got suggested to use PROC SUMMARY and NMISS output, so I referred to Syntax for them, but I cannot work it out. Then I tried to look for any similar examples to use those, which I could not find either. Could anyone please let me know what codes I need to use or give me some examples I can refer to?
Thanks a lot.
Jiaxin
You might provide some example data and the Proc summary code that you tried and explain what you needed that you did not get. It would also help to provide what you expect for the results given the example data that you provide.
I do say that when I see something like "the sum of txt" that I really wonder what values you have for TXT as that would often look much more like a character variable. And character variables seldom "sum" nicely.
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.
Thanks for your reply. txt is numeric and it is the total income taxes for each given year.
I don't see how proc summary can help you with rolling 5 year sums. Assuming your data are (1) sorted by gvkey fyear, and (2) there are no "holes" in the fyear sequence (although that could be managed), then either proc expand (if you have SAS/ETS) or a DATA step would do well. Given the simplicity of yoru request, probably the DATA step is the most efficacious.
Strategy:
Edit note: Now that you've provided actual data, it's easier to test my program, which is corrected and re-submitted below. It's also been tested with resetting of a couple of years with missing TXT values, to ensure you only get 5-year sums when the 5-year span include 3 consecutive valid TXT values:
data want (drop=_:);
set have;
by gvkey;
length _strng $5; /*character string of 5 Y's or N's */
retain _strng _5yearsum;
_5yearsum=sum(_5yearsum,TXT,-lag5(TXT)); /*Update temporary rolling sum*/
if missing(TXT) then _strng=substr(_strng,2,4)||'N'; *concatenate 2nd-5th chars of STRNG with an N*;
else _strng=substr(_strng,2,4)||'Y';
if lag4(gvkey)=gvkey and find(_strng,'YYY') then sum_5years=_5yearsum;
run;
@mkeintz wrote:
I don't see how proc summary can help you with rolling 5 year sums. Assuming your data are (1) sorted by gvkey fyear, and (2) there are no "holes" in the fyear sequence (although that could be managed), then either proc expand (if you have SAS/ETS) or a DATA step would do well. Given the simplicity of yoru request, probably the DATA step is the most efficacious.
You are correct, PROC SUMMARY doesn't fit here. 😞
Hi Mkeintz,
Thanks for your respond. I used your codes in my datafile and there is zero observations generated.
I attached some excel files herewith to illustrate what I have and what I want.
1. I extract a small sample of my data file -- input
2. The output file I got after running your codes -- output
3. My expected output file -- rep.output
In rep.output file, I manually calculated the sum of five years' txt; however, the extracted input file is not representative since there is no missing values in the sample. Therefore, I made up a even smaller sample in the same file (rep.output) with some missing values and explain the reason why I exclude some sums.
I hope you can help me with this. Thanks in advance.
Best regards,
Jiaxin
Here I have simulated some annual data for three companies. The third company has missing values for some years. The data is re-arranged so that we can compute moving sums of 5 observations, then we compute the number of consecutive non-missing data points. Finally PROC SUMMARY computes the moving sums of any 5 year time period. If you don't want years where the number of consecutive non-missings is less than three, then you can delete these from the output.
/* Simulate some data */
data simul;
do company_id=1 to 3;
do year=2000 to 2019;
txt=rand('uniform');
/* Add in some missing values at random for company_id=3 */
if company_id=3 then do;
is_missing=rand('uniform');
if is_missing>0.5 then txt=.;
end;
output;
end;
end;
drop is_missing;
run;
data rearrange;
set simul;
do i=1 to 5;
group=i+_n_-1;
output;
end;
run;
proc sort data=rearrange;
by company_id group year;
run;
data rearrange1;
set rearrange;
by company_id group;
if first.group then consec=0;
if missing(txt) then consec=0;
else consec+1;
run;
proc summary nway data=rearrange1;
class company_id group;
var txt year consec;
output out=moving_sums sum(txt)=moving_sum min(year)=first_year_of_5
max(year)=last_year_of_5 max(consec)=max_consec;
run;
Nice recovery for use of proc summary. Write out each record 5 times, such that each year belongs to five distinct groups, which can then be used as a class variable for proc summary. I have done something like this for rolling windows regressions. I output each monthly data record once per window it was part of, making a window_id using the earliest date of each window. Submitting the greatly expanded dataset to PROC REG worked well, but I got significant performance penalties as the window size went from 6 to 12 to 18 to 24 (months). I reverted back to a data step approach and generated rolling sums of cross-products and squares for each window (analogous to generating rolling 5-year sums). And then I just submitted the rolling SSCP to PROC REG.
Actually, this is what I had in mind all along, but my first explanation must have come at 4am because it wasn't really close. But yes, you will get performance hits as the window increases.
The codes worked. Thanks a lot.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.