Dear all,
I have SAS data including the following variables:
Ticker (identification Code for company)
IDDIR (directors' Identification Number)
Postcode (directors' residential postal code)
Year (Financial accounting year)
miss_code (dummy variable=1 if there is missing Postcode; else=0)
I wish to conduct an analysis on the reliability of the data. I might have to drop my analysis (or part of them) if the following conditions fulfilled:
1. Too many missing values on "postcode" for a given firm-year - MISSING VALUES).
2. Postal code are the similar for all/most the directors in the same firms-year - UNREASONABLE VALUES.
3. Number of directors in given year for a firm is significantly reduced compared to later years. For example, a firm migth have 10 directors in year 2010, but it is reported as 1 director in year 2003). This indicates that the dataset is not complete.
Therefore, I would like to generate an output that contains the information in table as follows:
Ticker Year total_miss total_dir total_same percent_miss percent_same change_dir percent_change change_following
total_miss = total number of directors with missing postcode for a given firm-year
total_dir = total number of directors for a given firm-year
total_same = total number of directors with the same postcode (with non-missing postcode) for a given firm-year
percent_miss = total_miss/total_dir for a given firm-year
percent_same= total_same/total_dir for a given firm-year
change_dir = increase or decrease of total_dir compared to prior year (t-1) = total_dir(t) - total_dir(t-1); t=time period
percent_change (t)= change_dir/total_dir in a given year (t) = total_dir(t) - total_dir(t-1)/total_dir(t-1) for a given firm-year
change_following (t) = percent_change (t+1) for the same ticker in the following one year
For example, If the total directors in 2005=6 and directors in 2007 = 10. Then change_dir for year 2007 = 4 (or +4/6= +0.75). "Changes in following year" for 2006 means 4 (or 0.75) increases.
By the end, I would wish to generate a report/table that indicate number/percent of observations (firm-years) by years:
For each year (from 2001 - 2010):
1. Number (as well as in percent = number for each decile/total firms for given year) of firm with percent_miss ranging from 0% to 100% in 10% interval (in deciles)
2. Number (as well as in percent = number for each decile/total firms for given year) of firms with percent_same ranging from 0% to 100% in 10% interval (in deciles)
3. Number (as well as in percent) of firms with change_following which is ranging from 0% to 100% in 10% interval (in deciles) as well as an additonal category, which is named as "missing" (if there is no prior or following year data to facilitate calculations).
I understand it is indeed difficult and seeking for any advises and helps.
Thank you very much in advance.
Regards,
mspak
I did this very quickly so I cannot speak to accuracy but it is a start for now. This meets the first portion of your problem, I am not sure I clearly understand your expectation for the second portion with the decile calculations so I left it out for now.
proc sql;
create table foo as
select ticker,
year,
sum(missing(postcode)) as total_miss,
count(iddir) as total_dir,
count(distinct iddir) as total_same,
(calculated total_miss/calculated total_dir) as percent_miss
from in.misingcode
group by ticker,year;
create table bar as
select a.ticker,
a.year,
a.total_miss,
a.total_dir,
a.total_same,
(a.total_miss/a.total_dir) as percent_miss,
a.total_dir-b.total_dir as change_dir,
(calculated change_dir/a.total_dir) as percent_change,
((a.total_dir-c.total_dir)/a.total_dir) as change_following
from foo a
left join foo b on a.ticker=b.ticker and a.year=b.year-1
left join foo c on a.ticker=c.ticker and a.year=c.year+1
order by a.ticker,a.year;
drop table foo;
quit;
ticker | Year | total_miss | total_dir | total_same | percent_miss | change_dir | percent_change | change_following |
---|---|---|---|---|---|---|---|---|
A | 2001 | 1 | 1 | 1 | 1 | 0 | 0.0000 | . |
A | 2002 | 1 | 1 | 1 | 1 | 1 | 1.0000 | 0.00000 |
A | 2003 | 0 | 0 | 0 | . | -1 | . | . |
A | 2004 | 1 | 1 | 1 | 1 | 0 | 0.0000 | 1.00000 |
A | 2005 | 1 | 1 | 1 | 1 | 0 | 0.0000 | 0.00000 |
A | 2006 | 1 | 1 | 1 | 1 | 1 | 1.0000 | 0.00000 |
A | 2007 | 1 | 0 | 0 | . | -1 | . | . |
A | 2008 | 0 | 1 | 1 | 0 | -15 | -15.0000 | 1.00000 |
A | 2009 | 16 | 16 | 13 | 1 | -3 | -0.1875 | 0.93750 |
A | 2010 | 0 | 19 | 14 | 0 | . | . | 0.15789 |
AA | 2001 | 1 | 1 | 1 | 1 | 0 | 0.0000 | . |
AA | 2002 | 1 | 1 | 1 | 1 | 1 | 1.0000 | 0.00000 |
AA | 2003 | 0 | 0 | 0 | . | -1 | . | . |
AA | 2004 | 0 | 1 | 1 | 0 | 0 | 0.0000 | 1.00000 |
AA | 2005 | 0 | 1 | 1 | 0 | 0 | 0.0000 | 0.00000 |
AA | 2006 | 1 | 1 | 1 | 1 | 1 | 1.0000 | 0.00000 |
AA | 2007 | 0 | 0 | 0 | . | -1 | . | . |
AA | 2008 | 0 | 1 | 1 | 0 | -23 | -23.0000 | 1.00000 |
AA | 2009 | 24 | 24 | 20 | 1 | -3 | -0.1250 | 0.95833 |
AA | 2010 | 0 | 27 | 21 | 0 | . | . | 0.11111 |
Thanks FriedEggs again,
From your first part of SQL:
proc sql;
create table foo as
select ticker,
year,
sum(missing(postcode)) as total_miss,
count(iddir) as total_dir,
count(distinct iddir) as total_same,
(calculated total_miss/calculated total_dir) as percent_miss
from in.misingcode
group by ticker,year;
The count(iddir) as total_dir includes those duplicate IDDIR; but the count(distinct iddir) as total_same, eliminate the duplicates and provides the total of directors. I think your total_same should be the total_dir by my definition. I wish to eliminate duplicates IDDIR.
My total_same definition is that total directors which have the same zipcode for a given firm-year, but this should excludes those have missing zipcodes. In other words, directors who have missing postcodes do not counted as same postcodes.
Change_following is seems incorect: change_dir for year 2007 = Change_following for year 2006. I would like to see the number of changes for the current year compared to the following years. Perhaps for this information, I can present them as time-series data for years 2001 - 2010 in terms of total number of unique directors (by IDDIR) reported. Then I should detect the significant changes for a given year. The year with most significant change should be identified and then the number of changes should be reported only for the year with the highest changes.
Is there any quick way or special function that deal with missing values by grouping them according to the category, such as firm-year as in my case?
The dataset that I provided - the incomplete should be more in earlier years than later years. Therefore, my purpose to detect significant change of total number of directors - to imply that in same firms, the incomplete data might be large in earlier years before the significant change of number of directors.
I am digesting your second part of the SQL in which 3 tables was used as source tables. I think in SQL, the lagging and leading figures can be written as year+1 and year-1. In DATA step, I used PROC EXPAND to do this. I am imagining the tables after left joins by the criteria that you set.
SQL do need creativity in designing the program.
Thank you for your helps.
Note: How long did you take to master this skill? I am just 3 months old in terms of SAS program, and having attended SAS Prog1 and prog2 trainings.
Regadrs,
mspak
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.