How to pick necessary observations according to some rules

SeanZ · Posted 09-04-2014 01:53 AM

Hi, I have a data set of over 10 million observations, which is huge. I only want to choose useful information from it. The rules are fairly complicated. First let me show how the data set looks like

Dataset1

manager quarter firm assets type s1 s2 s3 LINE

100 5 1 10 1 1 1 1 1

100 6 2 5 1 1 1 2 2

100 7 1 12 1 1 2 1 3

101 5 1 10 2 1 1 1 4

101 6 2 10 2 1 1 1 5

101 6 2 10 2 1 1 2 6

101 7 1 10 2 1 1 1 7

102 5 1 10 1 1 1 1 8

102 6 1 10 1 1 1 1 9

Dataset2

firm qtr1 qtr2

1 5 6

Variables:

manager: ID number of a manager, I have three manager with ID 100, 101 and 102

quarter: I converted date(quarter) into numbers. 5 means the fifth quarter

firm: ID number for a firm that manager buys.

assets: some variable I want to use to calculate something.

type: type of manager. manager 100 is a type 1 manager. 101 is a type 2 manager. Each manager's type cannot change. Different manager can be the same type.

s1, s2, and s3: Three different firm's type. They are specific to firms and different date(quarter).

qtr1: start quarter for the firm.

qtr2: end quarter for the firm.

Ok. Now I would like to show what I want to accomplish.

The firms in dataset2 are a subset of firms in dataset1.

First, I want to keep all observations that are related to dataset2. By that I mean I want to keep any observations for firm #1 quarter 5 and quarter 6, and firm#2 quarter 6 and quarter 7. AND any observations that have the SAME firms' type s1, s2 and s3 (must match exactly) for the each quarter.

For example. firm#1 in quarter 5 has s1=1, s2=1 and s3=1. In that specific quarter (5), and same firm type s1=s2=s3. Line 5 and 8 have the same value. Therefore, I need to keep line 5 and line 8 besides line 1.

In that sense, for the second observation firm1 with qtr2 (quarter=6). There is only one observation (line5), that meets the criterion. (quarter=6 and s1=s2=s3) besides line 9(original line).

Finally, I will have observations line 1, 5, 8, 9

With these observations, I will do the following job.

For the first observation, firm 1 at quarter 5, I will calculate the mean of asset variable of line 1 and 8, which is (10+10)/2.

For the second observation, firm1 at quarter 6, I will calculate the mean of asset variable of line 5, which is just 5.

Anyone help?

RW9 · Posted 09-04-2014 05:10 AM

Hi,

It looks like you have a reasonable spec above, and have not pointed out any specific questions. Could I suggest you start coding the problem you described above and come back with any specific questions. I suggest coding a datastep to each row you have written above.

Ksharp · Posted 09-04-2014 08:27 AM

If I understood what you mean .Obs4 is also what you need ?

data Dataset1;
input manager    quarter     firm          assets          type          s1          s2          s3                         LINE ;
cards;
100               5          1                 10               1               1          1            1                              1
100               6          2                 5                 1               1          1            2                              2
100               7          1                 12              1               1          2            1                               3
101               5          1                 10               2               1          1            1                              4
101               6          2                10               2               1          1            1                               5
101               6          2                 10               2               1          1            2                              6
101               7          1                 10               2               1          1            1                              7
102               5          1                 10               1               1          1            1                              8
102               6          1                 10               1               1          1            1                              9
;
run;
data Dataset2 ;
input firm          qtr1               qtr2     ;
cards;
1               5                    6
2 6 7
;
run;
data key(drop=qtr1 qtr2);
 set dataset2;
 quarter=qtr1;output;
 quarter=qtr2;output;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set key;
  declare hash ha(dataset:'key');
   ha.definekey(all:'y');
   ha.definedone();
end;
set dataset1;
obs+1;
if ha.check()=0 and range(of s:)=0;
run;

Xia Keshan

How to pick necessary observations according to some rules

Re: How to pick necessary observations according to some rules

Re: How to pick necessary observations according to some rules

How to pick necessary observations according to some rules

Re: How to pick necessary observations according to some rules

Re: How to pick necessary observations according to some rules

Click image to register for webinar

Classroom Training Available!