BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jiaxinyang
Calcite | Level 5

Hi All,


Here are what I have in my dataset. The year identifier -- fyear, the company identifier -- gvkey and the variable interested -- txt.

I want to have the sum of txt for each firm over a five-year period (year t to year t-4) and require at least three consecutive years of non-missing data. I am using SAS 9.4.

I got suggested to use PROC SUMMARY and NMISS output, so I referred to Syntax for them, but I cannot work it out. Then I tried to look for any similar examples to use those, which I could not find either. Could anyone please let me know what codes I need to use or give me some examples I can refer to?

Thanks a lot.

Jiaxin

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

Here I have simulated some annual data for three companies. The third company has missing values for some years. The data is re-arranged so that we can compute moving sums of 5 observations, then we compute the number of consecutive non-missing data points. Finally PROC SUMMARY computes the moving sums of any 5 year time period. If you don't want years where the number of consecutive non-missings is less than three, then you can delete these from the output.

 

/* Simulate some data */
data simul;
    do company_id=1 to 3;
        do year=2000 to 2019;
           txt=rand('uniform');
            /* Add in some missing values at random for company_id=3 */
           if company_id=3 then do;
               is_missing=rand('uniform');
               if is_missing>0.5 then txt=.;
           end;
           output;
       end;
   end;
   drop is_missing;
run;
data rearrange;
    set simul;
    do i=1 to 5;
        group=i+_n_-1;
        output;
    end;
run;
proc sort data=rearrange;
    by company_id group year;
run;
data rearrange1;
	set rearrange;
	by company_id group;
	if first.group then consec=0;
	if missing(txt) then consec=0;
	else consec+1;
run;
proc summary nway data=rearrange1;
    class company_id group;
    var txt year consec;
    output out=moving_sums sum(txt)=moving_sum min(year)=first_year_of_5
    	max(year)=last_year_of_5 max(consec)=max_consec;
run;
--
Paige Miller

View solution in original post

12 REPLIES 12
PaigeMiller
Diamond | Level 26

I want to have the sum of txt for each firm over a five-year period (year t to year t-4) and require at least three consecutive years of non-missing data.

 

Is your data daily, or monthly, or quarterly, or annual?


When you say you want three consecutive years with no missing data, is that no missing data from even a single day/month/quarter? Or is it even one good data point in a year?

--
Paige Miller
jiaxinyang
Calcite | Level 5

My data is annual data, so what I mean by three consecutive years with no missing data is that no missing data for a year.

PaigeMiller
Diamond | Level 26
/* UNTESTED CODE */
data have2;
    set have;
    by company_id;
    if first.company_id then consec=0;
    /* Compute consecutive years */
    if not missing(txt) then consec+1;
    else consec=0;
run;
proc summary data=have2 nway;
    class company_id;
    var txt consec;
    output out=want sum(txt)=sum_txt max(consec)=max_consec;
run;

Assumes the data is sorted by company_id and year.

 

The output data set WANT will have rows where max_consec is less than three, you would eliminate these.

--
Paige Miller
ballardw
Super User

@jiaxinyang wrote:

Hi All,


Here are what I have in my dataset. The year identifier -- fyear, the company identifier -- gvkey and the variable interested -- txt.

I want to have the sum of txt for each firm over a five-year period (year t to year t-4) and require at least three consecutive years of non-missing data. I am using SAS 9.4.

I got suggested to use PROC SUMMARY and NMISS output, so I referred to Syntax for them, but I cannot work it out. Then I tried to look for any similar examples to use those, which I could not find either. Could anyone please let me know what codes I need to use or give me some examples I can refer to?

Thanks a lot.

Jiaxin


You might provide some example data and the Proc summary code that you tried and explain what you needed that you did not get. It would also help to provide what you expect for the results given the example data that you provide.

 

I do say that when I see something like "the sum of txt" that I really wonder what values you have for TXT as that would often look much more like a character variable. And character variables seldom "sum" nicely.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

jiaxinyang
Calcite | Level 5

Thanks for your reply. txt is numeric and it is the total income taxes for each given year.

mkeintz
PROC Star

I don't see how proc summary can help you with rolling 5 year sums.   Assuming your data are (1) sorted by gvkey fyear, and (2) there are no "holes" in the fyear sequence (although that could be managed), then either proc expand (if you have SAS/ETS) or a DATA step would do well.  Given the simplicity of yoru request, probably the DATA step is the most efficacious.

 

Strategy:

  1. Generate a rolling 5-record (which in your case is 5-year) sum of variable TXT by adding the current value, and subtracting the 5-record old value.
  2. Use a five-character string of "Y" and "N" where "YYYYY" means the last 4 years plus the current year all have non-missing TXT.   Or something like "YNYYN' means the 4-year-old, 2 year old, and 1 year old are valid but the current and 3-year old are missing.  This means all you have to do is search for a string of "YYY" to confirm 3-consecutive years somewhere in the 5-year span.

 

Edit note:  Now that you've provided actual data, it's easier to test my program, which is corrected and re-submitted below.  It's also been tested with resetting of a couple of years with missing TXT values, to ensure you only get 5-year sums when the 5-year span include 3 consecutive valid TXT values:

 

data want (drop=_:);
  set have;
  by gvkey;

  length _strng $5;  /*character string of 5 Y's or N's */
  retain _strng  _5yearsum;

  _5yearsum=sum(_5yearsum,TXT,-lag5(TXT));   /*Update temporary rolling sum*/

  if missing(TXT) then _strng=substr(_strng,2,4)||'N';  *concatenate 2nd-5th chars of STRNG with an N*;
  else                 _strng=substr(_strng,2,4)||'Y';

  if lag4(gvkey)=gvkey and find(_strng,'YYY') then sum_5years=_5yearsum;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PaigeMiller
Diamond | Level 26

@mkeintz wrote:

I don't see how proc summary can help you with rolling 5 year sums.   Assuming your data are (1) sorted by gvkey fyear, and (2) there are no "holes" in the fyear sequence (although that could be managed), then either proc expand (if you have SAS/ETS) or a DATA step would do well.  Given the simplicity of yoru request, probably the DATA step is the most efficacious.


You are correct, PROC SUMMARY doesn't fit here. 😞

--
Paige Miller
jiaxinyang
Calcite | Level 5

Hi Mkeintz, 

 

Thanks for your respond. I used your codes in my datafile and there is zero observations generated.

 

I attached some excel files herewith to illustrate what I have and what I want. 

1. I extract a small sample of my data file -- input 

2. The output file I got after running your codes -- output

3. My expected output file -- rep.output

     In rep.output file, I manually calculated the sum of five years' txt; however, the extracted input file is not representative since there is no missing values in the sample. Therefore, I made up a even smaller sample in the same file (rep.output) with some missing values and explain the reason why I exclude some sums.

 

I hope you can help me with this. Thanks in advance.

 

Best regards,

 

Jiaxin

 

 

 

 

PaigeMiller
Diamond | Level 26

Here I have simulated some annual data for three companies. The third company has missing values for some years. The data is re-arranged so that we can compute moving sums of 5 observations, then we compute the number of consecutive non-missing data points. Finally PROC SUMMARY computes the moving sums of any 5 year time period. If you don't want years where the number of consecutive non-missings is less than three, then you can delete these from the output.

 

/* Simulate some data */
data simul;
    do company_id=1 to 3;
        do year=2000 to 2019;
           txt=rand('uniform');
            /* Add in some missing values at random for company_id=3 */
           if company_id=3 then do;
               is_missing=rand('uniform');
               if is_missing>0.5 then txt=.;
           end;
           output;
       end;
   end;
   drop is_missing;
run;
data rearrange;
    set simul;
    do i=1 to 5;
        group=i+_n_-1;
        output;
    end;
run;
proc sort data=rearrange;
    by company_id group year;
run;
data rearrange1;
	set rearrange;
	by company_id group;
	if first.group then consec=0;
	if missing(txt) then consec=0;
	else consec+1;
run;
proc summary nway data=rearrange1;
    class company_id group;
    var txt year consec;
    output out=moving_sums sum(txt)=moving_sum min(year)=first_year_of_5
    	max(year)=last_year_of_5 max(consec)=max_consec;
run;
--
Paige Miller
mkeintz
PROC Star

@PaigeMiller 

 

Nice recovery for use of proc summary.  Write out each record 5 times, such that each year belongs to five distinct groups, which can then be used as a class variable for proc summary.   I have done something like this for rolling windows regressions.  I output each monthly data record once per window it was part of, making a window_id using the earliest date of each window.  Submitting the greatly expanded dataset to PROC REG worked well, but I got significant performance penalties as the window size went from 6 to 12 to 18 to 24 (months).  I reverted back to a data step approach and generated rolling sums of cross-products and squares for each window (analogous to generating rolling 5-year sums).  And then I just submitted the rolling SSCP to PROC REG.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PaigeMiller
Diamond | Level 26

@mkeintz 

 

Actually, this is what I had in mind all along, but my first explanation must have come at 4am because it wasn't really close. But yes, you will get performance hits as the window increases.

--
Paige Miller
jiaxinyang
Calcite | Level 5

The codes worked. Thanks a lot.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 2230 views
  • 0 likes
  • 4 in conversation