Hello SAS community,
@Shmuel offered me the following code so that I could bring a variable into the data set I am running a regression on from another data set in my program:
*bring variable in from other data set;
proc sql;
create table paper.ceo_firm as
select a.* , b.VOL
from paper.ceo_firm as a
left join paper.CSRP_Monthly_Stock_char4 as b
on a.ticker = b.ticker and
a.year = b.year;
quit;
I ran into an error and @Shmuel identified it, saying that:
"The difference in number of observations means that the combination of ticker and year are not unique.
SQL join makes a Cartesian join thus multiplies output.
If you know what other variables make it unique - add them to the conditions"
But I had trouble implementing this because, for one reason, I'm not sure how it would look in the code, and, secondly, I can't imagine how else to make this variable unique since basically my whole program has been revolving around the ticker*year combination.
Let me know if you can help take my data set to the next level! If this works to call VOL into my ceo_firm dataset, then I could use it for other variables as well, which could be very useful to me.
Thanks for your time!
-SAStuck
So what do you want as an output dataset? If it needs to be unique by ticker / year, then what do you want VOL to be? The sum over each year, the average, the minimum or something else?
If you want the sum of vol, something like this should work:
proc sql;
create table ceo_firm as
select a.* , b.VOL
from paper.ceo_firm as a
left join
(select ticker
,year
,sum(VOL) as VOL
from paper.CSRP_Monthly_Stock_char4
group by ticker
,year
) as b
on a.ticker = b.ticker and
a.year = b.year;
quit;
I think it would help to explain why both tables have duplicates by ticker plus year combination. At a guess your monthly stock table has monthly rows for each ticker/year. However why does your CEO firm table have multiple rows per ticker/year?
Would it help the situation if I drop rows with multiple ticker/years?
Posting a few rows showing the duplicates would help us understand the data better otherwise we are just guessing what the best approach should be.
I ran the following code:
*find duplicates: ticker/year combo;
proc freq data=paper.CSRP_Monthly_Stock_char4;
tables ticker * year / noprint
out=paper.multiple_tyCSRP (where=(count > 1));
run;
proc freq data=paper.ceo_firm;
tables ticker * year / noprint
out=paper.multiple_tyfirm (where=(count > 1));
run;
And below I am posting examples of the output from the respective tables:
Row | Ticker | Year | Count | Percent |
2 | A | 2011 | 12 | 0.00198938 |
3 | A | 2012 | 12 | 0.00198938 |
4 | A | 2013 | 12 | 0.00198938 |
5 | A | 2014 | 12 | 0.00198938 |
6 | A | 2015 | 12 | 0.00198938 |
7 | A | 2016 | 12 | 0.00198938 |
8 | A | 2017 | 12 | 0.00198938 |
9 | AA | 2010 | 12 | 0.00198938 |
10 | AA | 2011 | 12 | 0.00198938 |
11 | AA | 2012 | 12 | 0.00198938 |
12 | AA | 2013 | 12 | 0.00198938 |
13 | AA | 2014 | 12 | 0.00198938 |
14 | AA | 2015 | 12 | 0.00198938 |
15 | AA | 2016 | 11 | 0.0018235984 |
16 | AA | 2017 | 12 | 0.00198938 |
17 | AAAP | 2016 | 12 | 0.00198938 |
18 | AAAP | 2017 | 12 | 0.00198938 |
19 | AABA | 2017 | 6 | 0.00099469 |
That was paper.CSRP_Monthly_Stock_char4.
paper.ceo_firm is empty (no duplicates).
So what do you want as an output dataset? If it needs to be unique by ticker / year, then what do you want VOL to be? The sum over each year, the average, the minimum or something else?
If you want the sum of vol, something like this should work:
proc sql;
create table ceo_firm as
select a.* , b.VOL
from paper.ceo_firm as a
left join
(select ticker
,year
,sum(VOL) as VOL
from paper.CSRP_Monthly_Stock_char4
group by ticker
,year
) as b
on a.ticker = b.ticker and
a.year = b.year;
quit;
Your code worked. I will accept it as the solution. I did run into an error though. I know I am just making a simple mistake:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 70 71 proc reg data=paper.ceo_firm plots(maxpoints=none); 72 model salary=annual_return bonus stock_awards age male vol; ERROR: Variable VOL not found. NOTE: The previous statement has been deleted. 73 run; WARNING: No variables specified for an SSCP matrix. Execution terminating. NOTE: PROCEDURE REG used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 3408.59k OS Memory 33216.00k Timestamp 04/15/2018 01:24:17 PM Step Count 89 Switch Count 0 Page Faults 0 Page Reclaims 481 Page Swaps 0 Voluntary Context Switches 3 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 48 74 75 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 87
Should I adapt your code to name the table paper.ceo_firm? Or adapt my model to pull from the table your code created, work.ceo_firm? Let me know! Thanks!
I deliberately created a new dataset rather than overwrite the existing one as it is not good practice to do so and you get a warning in your SAS log. I suggest you go with the new name. If you want a permanent dataset try something like paper.ceo_firm2.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.