SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

making dummy variables for checking whether three row information out of four previous rows exist

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 19
Accepted Solution

making dummy variables for checking whether three row information out of four previous rows exist

 

data example;
	input firmname $ year revenue;
	cards;
MSMFT 2000 323
MSMFT 2001 231
MSMFT 2004 232
MSMFT 2005 152
MSMFT 2006 345
MSMFT 2007 232
MSMFT 2009 231
MSMFT 2010 341
MSMFT 2011 325
NOOTA 1997 232
NOOTA 1998 212
NOOTA 1999 231
NOOTA 2000 405
NOOTA 2001 525
NOOTA 2002 251
NOOTA 2004 231
NOOTA 2005 550
NOOTA 2006 124
NOOTA 2007 235
NOOTA 2013 952
NOOTA 2014 532
run;

 

Hi. I posted a question regarding how to assign a dummy variable for the existence of row information in three in a row, and related to that question, I am also wondering how to assign a dummy variable for checking whether three row information out of four previous rows exist or not.

 

To be specific, as shown in the example data above, if the MSMFT's revenue information for 2006, 2007, 2009 exists, but missing for 2008, so the dummy variable for the existence of revenue information in three-year in a row should be zero, but the dummy variable that I would like to generate should be one, in that, for 2009, two out of three previous years' revenue information (e.g. 2006 and 2007) exist along with the revenue information for 2009. Likewise, for 2010, two out of three previous years' revenue information (e.g. 2007 and 2009) exist along with the revenue information for 2010, so the dummy variable that I would like to generate should be given one. The final output from the example case should look like below. Thank you in advance!

 

 

  firmname year revenue three_row three_four
MSMFT200032300
MSMFT200123100
MSMFT200423200
MSMFT200515200
MSMFT200634511
MSMFT200723211
MSMFT200923101
MSMFT201034101
MSMFT201132511
NOOTA199723200
NOOTA199821200
NOOTA199923111
NOOTA200040511
NOOTA200152511
NOOTA200225111
NOOTA200423101
NOOTA200555001
NOOTA200612411
NOOTA200723511
NOOTA201395200
NOOTA201453200

 


Accepted Solutions
Solution
‎04-01-2017 09:13 PM
Super User
Posts: 5,091

Re: making dummy variables for checking whether three row information out of four previous rows exis

The simplest way I can think of would be to create an extra-wide temporary array.  Suppose, for example, that the span of 1950 to 2050 would be more than enough years to cover every possible year you might encounter.  You could code:

 

data want;

set have;

by firmname;

array years {1950:2050} _temporary_;

if first.firmname then do _n_=1950 to 2050;

   years{_n_}=0;

end;

years{year}=1;

if sum(years{year-3}, years{year-2}, years{year-1}) >= 2 then three_four=1;

else three_four=0;

run;

 

Note a couple of features here.  The SUM function doesn't have to add in years{year}.  It's a given that years{year} must be 1.  If you do add that in, then you have to compare to a total of 3 instead of a total of 2.  And it may be possible to abbreviate the code (I couldn't test it, so I didn't abbreviate it):

 

if sum(of years{year-3} - years{year-1}) >= 2 then three_four=1;

 

Finally, the code assumes you don't have any years as small as 1952.  If you do, you have to extend the array backwards so that there is always an element that matches year-3.

View solution in original post


All Replies
Solution
‎04-01-2017 09:13 PM
Super User
Posts: 5,091

Re: making dummy variables for checking whether three row information out of four previous rows exis

The simplest way I can think of would be to create an extra-wide temporary array.  Suppose, for example, that the span of 1950 to 2050 would be more than enough years to cover every possible year you might encounter.  You could code:

 

data want;

set have;

by firmname;

array years {1950:2050} _temporary_;

if first.firmname then do _n_=1950 to 2050;

   years{_n_}=0;

end;

years{year}=1;

if sum(years{year-3}, years{year-2}, years{year-1}) >= 2 then three_four=1;

else three_four=0;

run;

 

Note a couple of features here.  The SUM function doesn't have to add in years{year}.  It's a given that years{year} must be 1.  If you do add that in, then you have to compare to a total of 3 instead of a total of 2.  And it may be possible to abbreviate the code (I couldn't test it, so I didn't abbreviate it):

 

if sum(of years{year-3} - years{year-1}) >= 2 then three_four=1;

 

Finally, the code assumes you don't have any years as small as 1952.  If you do, you have to extend the array backwards so that there is always an element that matches year-3.

Occasional Contributor
Posts: 19

Re: making dummy variables for checking whether three row information out of four previous rows exis

[ Edited ]

Thank you so much for such a prompt response and answer to my question. But I have one additional question regarding the array in that I am not that familar to using array function.

 

data want;

set have;

by firmname;

array years {1950:2050} _temporary_;

if first.firmname then do _n_=1950 to 2050;

   years{_n_}=0;

end;

years{year}=1;

if sum(years{year-3}, years{year-2}, years{year-1}) >= 2 then three_four=1;

else three_four=0;

run;

 

I think I could get a basic sense of what your code imply for producing the dummy variable that I would like to generate. First, you generate a temporary array with length of 101 (e.g. 2050-1950+1). Second, if the first firmname observation row, from years1 (1950) to years101 (2050), assign 0 to each cell of the first firmname observation row. Third, assign 1 if the corresponding column years (1950 to 2050) matches with the year value from 'year' column. Finally, calculate the sum of three previous rows and compare its value with 2, and if it is greater than 2 then, assign 1 for 'three_four' column.

 

I am wondering if my own description of your SAS code is correct or not, and also wondering why the result is different if I just delete '_temporary_' from 'array years {1950:2050} _temporary_', and the result without '_temporary_' is shown below. Thank you!

 

results.jpg

Super User
Posts: 5,091

Re: making dummy variables for checking whether three row information out of four previous rows exis

Your description is exactly right.

 

When you remove _temporary_, the array is populated by variables.  The difference is that elements of a _temporary_ array are automatically retained, while variables are not.  You could get the same result if you were to add the RETAIN YEARS1 - YEARS101; statement for the variables comprising the array.  You might then want to drop those variables as well.

Respected Advisor
Posts: 4,654

Re: making dummy variables for checking whether three row information out of four previous rows exis

This code gives the desired result, assuming the data is sorted by year (within a firm) and there are no duplicate years (within a firm):

 

data want;
array y_{3};
do until(last.firmname);
    set example; by firmname notsorted;
    i + 1;
    y_{mod(i,3)+1} = year;
    three_row = cmiss(of y_{*})=0 and range(of y_{*})=dim(y_)-1;
    three_four = cmiss(of y_{*})=0 and range(of y_{*})<=dim(y_);
    output;
    end;
drop i y_: ;
run;
PG
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 278 views
  • 4 likes
  • 3 in conversation