Missing Value, Unreasonable Value, Incomplete Data Analysis

mspak · Posted 03-16-2012 04:37 AM

Dear all,

I have SAS data including the following variables:

Ticker (identification Code for company)

IDDIR (directors' Identification Number)

Postcode (directors' residential postal code)

Year (Financial accounting year)

miss_code (dummy variable=1 if there is missing Postcode; else=0)

I wish to conduct an analysis on the reliability of the data. I might have to drop my analysis (or part of them) if the following conditions fulfilled:

1. Too many missing values on "postcode" for a given firm-year - MISSING VALUES).

2. Postal code are the similar for all/most the directors in the same firms-year - UNREASONABLE VALUES.

3. Number of directors in given year for a firm is significantly reduced compared to later years. For example, a firm migth have 10 directors in year 2010, but it is reported as 1 director in year 2003). This indicates that the dataset is not complete.

Therefore, I would like to generate an output that contains the information in table as follows:

Ticker Year total_miss total_dir total_same percent_miss percent_same change_dir percent_change change_following

total_miss = total number of directors with missing postcode for a given firm-year

total_dir = total number of directors for a given firm-year

total_same = total number of directors with the same postcode (with non-missing postcode) for a given firm-year

percent_miss = total_miss/total_dir for a given firm-year

percent_same= total_same/total_dir for a given firm-year

change_dir = increase or decrease of total_dir compared to prior year (t-1) = total_dir(t) - total_dir(t-1); t=time period

percent_change (t)= change_dir/total_dir in a given year (t) = total_dir(t) - total_dir(t-1)/total_dir(t-1) for a given firm-year

change_following (t) = percent_change (t+1) for the same ticker in the following one year

For example, If the total directors in 2005=6 and directors in 2007 = 10. Then change_dir for year 2007 = 4 (or +4/6= +0.75). "Changes in following year" for 2006 means 4 (or 0.75) increases.

By the end, I would wish to generate a report/table that indicate number/percent of observations (firm-years) by years:

For each year (from 2001 - 2010):

1. Number (as well as in percent = number for each decile/total firms for given year) of firm with percent_miss ranging from 0% to 100% in 10% interval (in deciles)

2. Number (as well as in percent = number for each decile/total firms for given year) of firms with percent_same ranging from 0% to 100% in 10% interval (in deciles)

3. Number (as well as in percent) of firms with change_following which is ranging from 0% to 100% in 10% interval (in deciles) as well as an additonal category, which is named as "missing" (if there is no prior or following year data to facilitate calculations).

I understand it is indeed difficult and seeking for any advises and helps.

Thank you very much in advance.

Regards,

mspak

FriedEgg · Posted 03-16-2012 05:52 AM

I did this very quickly so I cannot speak to accuracy but it is a start for now. This meets the first portion of your problem, I am not sure I clearly understand your expectation for the second portion with the decile calculations so I left it out for now.

proc sql;

create table foo as

select ticker,

year,

sum(missing(postcode)) as total_miss,

count(iddir) as total_dir,

count(distinct iddir) as total_same,

(calculated total_miss/calculated total_dir) as percent_miss

from in.misingcode

group by ticker,year;

create table bar as

select a.ticker,

a.year,

a.total_miss,

a.total_dir,

a.total_same,

(a.total_miss/a.total_dir) as percent_miss,

a.total_dir-b.total_dir as change_dir,

(calculated change_dir/a.total_dir) as percent_change,

((a.total_dir-c.total_dir)/a.total_dir) as change_following

from foo a

left join foo b on a.ticker=b.ticker and a.year=b.year-1

left join foo c on a.ticker=c.ticker and a.year=c.year+1

order by a.ticker,a.year;

drop table foo;

quit;

ticker	Year	total_miss	total_dir	total_same	percent_miss	change_dir	percent_change	change_following
A	2001	1	1	1	1	0	0.0000	.
A	2002	1	1	1	1	1	1.0000	0.00000
A	2003	0	0	0	.	-1	.	.
A	2004	1	1	1	1	0	0.0000	1.00000
A	2005	1	1	1	1	0	0.0000	0.00000
A	2006	1	1	1	1	1	1.0000	0.00000
A	2007	1	0	0	.	-1	.	.
A	2008	0	1	1	0	-15	-15.0000	1.00000
A	2009	16	16	13	1	-3	-0.1875	0.93750
A	2010	0	19	14	0	.	.	0.15789
AA	2001	1	1	1	1	0	0.0000	.
AA	2002	1	1	1	1	1	1.0000	0.00000
AA	2003	0	0	0	.	-1	.	.
AA	2004	0	1	1	0	0	0.0000	1.00000
AA	2005	0	1	1	0	0	0.0000	0.00000
AA	2006	1	1	1	1	1	1.0000	0.00000
AA	2007	0	0	0	.	-1	.	.
AA	2008	0	1	1	0	-23	-23.0000	1.00000
AA	2009	24	24	20	1	-3	-0.1250	0.95833
AA	2010	0	27	21	0	.	.	0.11111

mspak · Posted 03-16-2012 07:02 AM

Thanks FriedEggs again,

From your first part of SQL:

proc sql;

create table foo as

select ticker,

year,

sum(missing(postcode)) as total_miss,

count(iddir) as total_dir,

count(distinct iddir) as total_same,

(calculated total_miss/calculated total_dir) as percent_miss

from in.misingcode

group by ticker,year;

The count(iddir) as total_dir includes those duplicate IDDIR; but the count(distinct iddir) as total_same, eliminate the duplicates and provides the total of directors. I think your total_same should be the total_dir by my definition. I wish to eliminate duplicates IDDIR.

My total_same definition is that total directors which have the same zipcode for a given firm-year, but this should excludes those have missing zipcodes. In other words, directors who have missing postcodes do not counted as same postcodes.

Change_following is seems incorect: change_dir for year 2007 = Change_following for year 2006. I would like to see the number of changes for the current year compared to the following years. Perhaps for this information, I can present them as time-series data for years 2001 - 2010 in terms of total number of unique directors (by IDDIR) reported. Then I should detect the significant changes for a given year. The year with most significant change should be identified and then the number of changes should be reported only for the year with the highest changes.

Is there any quick way or special function that deal with missing values by grouping them according to the category, such as firm-year as in my case?

The dataset that I provided - the incomplete should be more in earlier years than later years. Therefore, my purpose to detect significant change of total number of directors - to imply that in same firms, the incomplete data might be large in earlier years before the significant change of number of directors.

I am digesting your second part of the SQL in which 3 tables was used as source tables. I think in SQL, the lagging and leading figures can be written as year+1 and year-1. In DATA step, I used PROC EXPAND to do this. I am imagining the tables after left joins by the criteria that you set.

SQL do need creativity in designing the program.

Thank you for your helps.

Note: How long did you take to master this skill? I am just 3 months old in terms of SAS program, and having attended SAS Prog1 and prog2 trainings.

Regadrs,

mspak

Missing Value, Unreasonable Value, Incomplete Data Analysis

Missing Value, Unreasonable Value, Incomplete Data Analysis

Re: Missing Value, Unreasonable Value, Incomplete Data Analysis

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!