turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- Data Management
- /
- Forum
- /
- making dummy variables for checking whether three ...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-01-2017 07:30 PM

```
data example;
input firmname $ year revenue;
cards;
MSMFT 2000 323
MSMFT 2001 231
MSMFT 2004 232
MSMFT 2005 152
MSMFT 2006 345
MSMFT 2007 232
MSMFT 2009 231
MSMFT 2010 341
MSMFT 2011 325
NOOTA 1997 232
NOOTA 1998 212
NOOTA 1999 231
NOOTA 2000 405
NOOTA 2001 525
NOOTA 2002 251
NOOTA 2004 231
NOOTA 2005 550
NOOTA 2006 124
NOOTA 2007 235
NOOTA 2013 952
NOOTA 2014 532
run;
```

Hi. I posted a question regarding how to assign a dummy variable for the existence of row information in three in a row, and related to that question, I am also wondering how to assign a dummy variable for checking whether three row information out of four previous rows exist or not.

To be specific, as shown in the example data above, if the MSMFT's revenue information for 2006, 2007, 2009 exists, but missing for 2008, so the dummy variable for the existence of revenue information in three-year in a row should be zero, but the dummy variable that I would like to generate should be one, in that, for 2009, two out of three previous years' revenue information (e.g. 2006 and 2007) exist along with the revenue information for 2009. Likewise, for 2010, two out of three previous years' revenue information (e.g. 2007 and 2009) exist along with the revenue information for 2010, so the dummy variable that I would like to generate should be given one. The final output from the example case should look like below. Thank you in advance!

firmname year revenue three_row three_four

MSMFT | 2000 | 323 | 0 | 0 |

MSMFT | 2001 | 231 | 0 | 0 |

MSMFT | 2004 | 232 | 0 | 0 |

MSMFT | 2005 | 152 | 0 | 0 |

MSMFT | 2006 | 345 | 1 | 1 |

MSMFT | 2007 | 232 | 1 | 1 |

MSMFT | 2009 | 231 | 0 | 1 |

MSMFT | 2010 | 341 | 0 | 1 |

MSMFT | 2011 | 325 | 1 | 1 |

NOOTA | 1997 | 232 | 0 | 0 |

NOOTA | 1998 | 212 | 0 | 0 |

NOOTA | 1999 | 231 | 1 | 1 |

NOOTA | 2000 | 405 | 1 | 1 |

NOOTA | 2001 | 525 | 1 | 1 |

NOOTA | 2002 | 251 | 1 | 1 |

NOOTA | 2004 | 231 | 0 | 1 |

NOOTA | 2005 | 550 | 0 | 1 |

NOOTA | 2006 | 124 | 1 | 1 |

NOOTA | 2007 | 235 | 1 | 1 |

NOOTA | 2013 | 952 | 0 | 0 |

NOOTA | 2014 | 532 | 0 | 0 |

Accepted Solutions

Solution

04-01-2017
09:13 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-01-2017 07:55 PM

The simplest way I can think of would be to create an extra-wide temporary array. Suppose, for example, that the span of 1950 to 2050 would be more than enough years to cover every possible year you might encounter. You could code:

data want;

set have;

by firmname;

array years {1950:2050} _temporary_;

if first.firmname then do _n_=1950 to 2050;

years{_n_}=0;

end;

years{year}=1;

if sum(years{year-3}, years{year-2}, years{year-1}) >= 2 then three_four=1;

else three_four=0;

run;

Note a couple of features here. The SUM function doesn't have to add in years{year}. It's a given that years{year} must be 1. If you do add that in, then you have to compare to a total of 3 instead of a total of 2. And it may be possible to abbreviate the code (I couldn't test it, so I didn't abbreviate it):

if sum(of years{year-3} - years{year-1}) >= 2 then three_four=1;

Finally, the code assumes you don't have any years as small as 1952. If you do, you have to extend the array backwards so that there is always an element that matches year-3.

All Replies

Solution

04-01-2017
09:13 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-01-2017 07:55 PM

The simplest way I can think of would be to create an extra-wide temporary array. Suppose, for example, that the span of 1950 to 2050 would be more than enough years to cover every possible year you might encounter. You could code:

data want;

set have;

by firmname;

array years {1950:2050} _temporary_;

if first.firmname then do _n_=1950 to 2050;

years{_n_}=0;

end;

years{year}=1;

if sum(years{year-3}, years{year-2}, years{year-1}) >= 2 then three_four=1;

else three_four=0;

run;

Note a couple of features here. The SUM function doesn't have to add in years{year}. It's a given that years{year} must be 1. If you do add that in, then you have to compare to a total of 3 instead of a total of 2. And it may be possible to abbreviate the code (I couldn't test it, so I didn't abbreviate it):

if sum(of years{year-3} - years{year-1}) >= 2 then three_four=1;

Finally, the code assumes you don't have any years as small as 1952. If you do, you have to extend the array backwards so that there is always an element that matches year-3.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-01-2017 08:56 PM - edited 04-01-2017 09:09 PM

Thank you so much for such a prompt response and answer to my question. But I have one additional question regarding the array in that I am not that familar to using array function.

data want;

set have;

by firmname;

array years {1950:2050} _temporary_;

if first.firmname then do _n_=1950 to 2050;

years{_n_}=0;

end;

years{year}=1;

if sum(years{year-3}, years{year-2}, years{year-1}) >= 2 then three_four=1;

else three_four=0;

run;

I think I could get a basic sense of what your code imply for producing the dummy variable that I would like to generate. First, you generate a temporary array with length of 101 (e.g. 2050-1950+1). Second, if the first firmname observation row, from years1 (1950) to years101 (2050), assign 0 to each cell of the first firmname observation row. Third, assign 1 if the corresponding column years (1950 to 2050) matches with the year value from 'year' column. Finally, calculate the sum of three previous rows and compare its value with 2, and if it is greater than 2 then, assign 1 for 'three_four' column.

I am wondering if my own description of your SAS code is correct or not, and also wondering why the result is different if I just delete '_temporary_' from 'array years {1950:2050} _temporary_', and the result without '_temporary_' is shown below. Thank you!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-01-2017 09:55 PM

Your description is exactly right.

When you remove _temporary_, the array is populated by variables. The difference is that elements of a _temporary_ array are automatically retained, while variables are not. You could get the same result if you were to add the RETAIN YEARS1 - YEARS101; statement for the variables comprising the array. You might then want to drop those variables as well.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-01-2017 11:04 PM

This code gives the desired result, assuming the data is sorted by year (within a firm) and there are no duplicate years (within a firm):

```
data want;
array y_{3};
do until(last.firmname);
set example; by firmname notsorted;
i + 1;
y_{mod(i,3)+1} = year;
three_row = cmiss(of y_{*})=0 and range(of y_{*})=dim(y_)-1;
three_four = cmiss(of y_{*})=0 and range(of y_{*})<=dim(y_);
output;
end;
drop i y_: ;
run;
```

PG