SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

First and Last occurrence in a dataset

Reply
Occasional Contributor
Posts: 10

First and Last occurrence in a dataset

Hello

 

I have data on employment in each firm in each quarter from 1990 to 2010. The data is organized by individual SAS files for each state in the US in each quarter. So for instance the file AK19901 contains employment data (variable EMP) on all firms (each firm has a FIRMID) in Alaska (variable STATE takes value AK) in 1990 in quarter1 (variable YR_QTR is 19901). Each of the file names  has the following format - 2 letter state code, year and quarternumber (e.g. AK19901 AK19902 AK19903,....)

 

From the above, I want to create a dataset with just one observation per firm with the following 3 variables:  FIRMID ENTRYDATE EXITDATE where

- ENTRYDATE is the first YR_QTR of positive employment (EMP >0) and

- EXITDATE is the last YR_QTR of positive employment (EMP >0) followed by four quarters where EMP is 0. 

 

How do I go about doing this without pooling all the years of data for a firm? (There are over a million firms so I am running short of memory). For instance, if I could pool only two years first and calculate ENTRY and EXIT and then do a rolling two year window to get all the firms?

 

I would appreciate any programming tips you could provide me.

 

Thanks

 

Dana

 

Respected Advisor
Posts: 3,894

Re: First and Last occurrence in a dataset

@dshills

Please provide data steps creating sample data and also show us the desired output.

 

Questions:

1. You want employment data so wouldn't you need 4 variables? FIRMID EMPID ENTRYDATE EXITDATE

2. Can the same FIRMID be in more than one state?

3. If FIRMID can be in more than one state: Is it possible that an employee changes the state and appears in the same quarter in two state SAS files for the same firm?

 

....and such questions are the reason why you have to provide representative sample data and desired output.

Occasional Contributor
Posts: 10

Re: First and Last occurrence in a dataset

Thank you. Please see below:

 

Questions:

1. You want employment data so wouldn't you need 4 variables? FIRMID EMPID ENTRYDATE EXITDATE

****No I am just using EMP to determine ENTRYDATE and EXITDATE. So I could do without EMP in the final dataset.

 

2. Can the same FIRMID be in more than one state?

**** No

 

3. If FIRMID can be in more than one state: Is it possible that an employee changes the state and appears in the same quarter in two state SAS files for the same firm?

 **** N/A

....and such questions are the reason why you have to provide representative sample data and desired output.

 

Representative Sample Data - Assume I have data from 19901-19905. I am showing pooled data so I dont have to show  data in individual files. The variable SOURCEFILE below is just for illustration and indicates which file the observation comes from.

SOURCEFILE STATE FIRMID YR_QTR EMP
AK19901 AK 1 19901 0
AK19901 AK 2 19901 12
AK19901 AK 3 19901 450
AK19901 AK 4 19901 0
AK19901 AK 5 19901 60
AK19902 AK 1 19902 10
AK19902 AK 2 19902 0
AK19902 AK 3 19902 450
AK19902 AK 4 19902 0
AK19902 AK 5 19902 65
AK19903 AK 1 19903 12
AK19903 AK 2 19903 0
AK19903 AK 3 19903 0
AK19903 AK 4 19903 85
AK19904 AK 1 19904 12
AK19904 AK 2 19904 0
AK19904 AK 3 19904 0
AK19904 AK 4 19904 855
AK19905 AK 1 19905 13
AK19905 AK 2 19905 0
AK19905 AK 3 19905 0
AK19905 AK 4 19905 85

 

 

DESIRED OUTPUT

FIRMID ENTRYDATE EXITDATE COMMENT (For reference only)
1 19902 0 First date of positive employment is 19902. Last date is not known so EXIT takes 0
2 19901 19901 Start of sample period in the data is 19901 so entry date takes that value. (Even though entry could have occurred prior to 19901). Thereafter there are 4 quarters of 0 employment so EXIT is also 19901
3 19901 0 Start of sample period in the data is 19901 so entry date takes that value. (Even though entry could have occurred prior to 19901). While the firm seems to have exited in 19902, there are only 3 quarters of 0 employment after that so EXITDATE is 0.
4 19903 0 First date of positive employment is 19903. Last date is not known so EXIT takes 0
5 19901 0 First date of positive employment is 19902. Firm is no longer there after 19902 but we don’t have the 4 quarters of 0 employment so EXIT takes 0

 

Hope this clarifies the question. Thanks

Super User
Posts: 17,840

Re: First and Last occurrence in a dataset

Memory shouldn't be an issue here. How big are your starting datasets and where are you running out of memory? In a sort? 

Occasional Contributor
Posts: 10

Re: First and Last occurrence in a dataset

The individual files range from 2 MB to 250MB. Sorry I misuse the term "memory". I am operating on a shared server space and the administrator is complaining that I am taking up too much space. So I was trying to see if there was some way of doing this without creating a pooled dataset.

Super User
Posts: 17,840

Re: First and Last occurrence in a dataset

I'm not sure if this would be more efficient but it would probably work and be easy to understand. 

 

If you don't cross states I would process each state individually and append all of those, sort them, identify first/last per firm as your rules and then loop....I don't know if this would be efficient in terms of processing time but I suspect it's the easiest to program. 

Use Call EXECUTE to call the macro 50 times, once for each State. 

 

Here's a sketch of the idea. I'm sure others will have more efficient solutions.

 

%macro firm_start(state_abbr=);

data have; 
set &state_abbr.:;
run;

*sort by firm;
proc sort data=have;
by firmID <date variable>;
run;

*rules to identify first;
data processed;
set have;

run;

*append to master;
proc append base=master data=processed;
run;

*clean up;
proc datasets lib=work nodetails nolist;
delete processed;
run; quit;

%mend;





Occasional Contributor
Posts: 10

Re: First and Last occurrence in a dataset

Thank you Reeza. This works well. 

Could someone help me with figuring out how to code the EXITDATE. EXITDATE is the last YR_QTR of positive employment (EMP >0) followed by four quarters where EMP is 0. 

Ask a Question
Discussion stats
  • 6 replies
  • 128 views
  • 0 likes
  • 3 in conversation