Hi, I have a data set of over 10 million observations, which is huge. I only want to choose useful information from it. The rules are fairly complicated. First let me show how the data set looks like
Dataset1
manager quarter firm assets type s1 s2 s3 LINE
100 5 1 10 1 1 1 1 1
100 6 2 5 1 1 1 2 2
100 7 1 12 1 1 2 1 3
101 5 1 10 2 1 1 1 4
101 6 2 10 2 1 1 1 5
101 6 2 10 2 1 1 2 6
101 7 1 10 2 1 1 1 7
102 5 1 10 1 1 1 1 8
102 6 1 10 1 1 1 1 9
Dataset2
firm qtr1 qtr2
1 5 6
Variables:
manager: ID number of a manager, I have three manager with ID 100, 101 and 102
quarter: I converted date(quarter) into numbers. 5 means the fifth quarter
firm: ID number for a firm that manager buys.
assets: some variable I want to use to calculate something.
type: type of manager. manager 100 is a type 1 manager. 101 is a type 2 manager. Each manager's type cannot change. Different manager can be the same type.
s1, s2, and s3: Three different firm's type. They are specific to firms and different date(quarter).
qtr1: start quarter for the firm.
qtr2: end quarter for the firm.
Ok. Now I would like to show what I want to accomplish.
The firms in dataset2 are a subset of firms in dataset1.
First, I want to keep all observations that are related to dataset2. By that I mean I want to keep any observations for firm #1 quarter 5 and quarter 6, and firm#2 quarter 6 and quarter 7. AND any observations that have the SAME firms' type s1, s2 and s3 (must match exactly) for the each quarter.
For example. firm#1 in quarter 5 has s1=1, s2=1 and s3=1. In that specific quarter (5), and same firm type s1=s2=s3. Line 5 and 8 have the same value. Therefore, I need to keep line 5 and line 8 besides line 1.
In that sense, for the second observation firm1 with qtr2 (quarter=6). There is only one observation (line5), that meets the criterion. (quarter=6 and s1=s2=s3) besides line 9(original line).
Finally, I will have observations line 1, 5, 8, 9
With these observations, I will do the following job.
For the first observation, firm 1 at quarter 5, I will calculate the mean of asset variable of line 1 and 8, which is (10+10)/2.
For the second observation, firm1 at quarter 6, I will calculate the mean of asset variable of line 5, which is just 5.
Anyone help?
Hi,
It looks like you have a reasonable spec above, and have not pointed out any specific questions. Could I suggest you start coding the problem you described above and come back with any specific questions. I suggest coding a datastep to each row you have written above.
If I understood what you mean .Obs4 is also what you need ?
data Dataset1; input manager quarter firm assets type s1 s2 s3 LINE ; cards; 100 5 1 10 1 1 1 1 1 100 6 2 5 1 1 1 2 2 100 7 1 12 1 1 2 1 3 101 5 1 10 2 1 1 1 4 101 6 2 10 2 1 1 1 5 101 6 2 10 2 1 1 2 6 101 7 1 10 2 1 1 1 7 102 5 1 10 1 1 1 1 8 102 6 1 10 1 1 1 1 9 ; run; data Dataset2 ; input firm qtr1 qtr2 ; cards; 1 5 6 2 6 7 ; run; data key(drop=qtr1 qtr2); set dataset2; quarter=qtr1;output; quarter=qtr2;output; run; data want; if _n_ eq 1 then do; if 0 then set key; declare hash ha(dataset:'key'); ha.definekey(all:'y'); ha.definedone(); end; set dataset1; obs+1; if ha.check()=0 and range(of s:)=0; run;
Xia Keshan
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.