Hello Everyone,
I have a dataset (20,000 + observations) that consists of firm identifier, year, and audit committee director names. I want to create a variable (Dir_change) if a new director joins or an existing member leaves the company (it is possible that in some cases the company may not hire a new director to replace the leaving director or hire an additional director to increase the number of members). So to give you an idea, this is how I want my dir_change variable to look.
| Firm Identifier | Year | Director | Dir_change | 
| A | 2007 | Jack | 0 | 
| A | 2007 | Tom | 0 | 
| A | 2007 | Linda | 0 | 
| A | 2008 | Jack | 1 | 
| A | 2008 | Tom | 1 | 
| A | 2008 | Mary | 1 | 
| B | 2007 | Jenny | 0 | 
| B | 2007 | Jessy | 0 | 
| B | 2007 | Joe | 0 | 
| B | 2008 | Jenny | 1 | 
| B | 2008 | Jessy | 1 | 
| C | 2007 | Ally | 0 | 
| C | 2007 | April | 0 | 
| C | 2008 | Ally | 0 | 
| C | 2008 | April | 0 | 
| D | 2007 | Mark | 0 | 
| D | 2007 | Frank | 0 | 
| D | 2008 | Mark | 1 | 
| D | 2008 | Frank | 1 | 
| D | 2008 | Mike | 1 | 
In the above example, in company A, Linda is replacing Mary in 2008 and hence I want dir_Change to be 1 for that year. For company B, Joe left the board and hence I want the dir_Change to be 1. There was no change in the directors for company C and hence it is zero. For company D, although no one left the company, Mike joined the company in 2008 and hence I want dir_change to be 1.
I am a newbie and know only basic coding so far. I would greatly appreciate it if someone shares the code that will help me get the desired output.
Thank you for your time.
Best regards,
GokulK
Here is a SQL solution involving a full join (always tricky):
data have;
input firm $ year name $;
datalines;
A   2007    Jack    0
A   2007    Tom 0
A   2007    Linda   0
A   2008    Jack    1
A   2008    Tom 1
A   2008    Mary    1
B   2007    Jenny   0
B   2007    Jessy   0
B   2007    Joe 0
B   2008    Jenny   1
B   2008    Jessy   1
C   2007    Ally    0
C   2007    April   0
C   2008    Ally    0
C   2008    April   0
D   2007    Mark    0
D   2007    Frank   0
D   2008    Mark    1
D   2008    Frank   1
D   2008    Mike    1
;
proc sql;
create table want as
select 
    h.*,
    coalesce(c.change, 0) as change
from
    have as h left join
    (
    select 
        coalesce(a.firm, b.firm) as firm,
        coalesce(a.year, b.year+1) as year,
        nmiss(a.firm) + nmiss(b.firm) > 0 as change
    from 
        have as a full join
        have as b on 
            a.firm=b.firm and 
            a.year=b.year+1 and 
            a.name=b.name
    group by calculated firm, calculated year
    having count(a.firm) > 0 and count(b.firm) > 0 ) as c
    on h.firm=c.firm and h.year=c.year;
select * from want;
quit;
Is it always two years for a firm? Or could the data contain more than two years, and then the solution ought to compare each year to the previous year? Or do you compare to the first year?
What about spelling errors and capitalization differences? Are these all cleaned up beforehand?
It is a 10-year data starting 2007. I would like to see if the committee composition changed compared to the previous year. That is, for instance, are the directors in the committee the same in 2019 as they were in 2018. Also, there is a possibility that a company may (1) lose more than one director in a year or (2) lose one and add two directors to their committee. I have capitalized all the names in my actual dataset.
Thank you.
Best regards,
GokulK
You can read each firmid/year of data twice. In the first pass, make an array of names, sort the array, then copy the names, separated by commas, to a single character variable (_this_year_names). Then set the DIR_CHANGE dummy according to whether _this_year_names matches its own lagged value (except for the first year of the id, when DIR_CHANGE is set to zero):
data have;
  input firm_id $ year dirname $;
datalines;
A	2007	Jack	0
A	2007	Tom	0
A	2007	Linda	0
A	2008	Jack	1
A	2008	Tom	1
A	2008	Mary	1
B	2007	Jenny	0
B	2007	Jessy	0
B	2007	Joe	0
B	2008	Jenny	1
B	2008	Jessy	1
C	2007	Ally	0
C	2007	April	0
C	2008	Ally	0
C	2008	April	0
D	2007	Mark	0
D	2007	Frank	0
D	2008	Mark	1
D	2008	Frank	1
D	2008	Mike	1
run;
data want (drop=_:);
  array _dnames {20} $20;      /*One year of names in an array */
  length _this_year_names _last_year_names $400; /*One year as csv long character*/
  /* Read one year of names and populate the array*/
  do _n=1 by 1 until (last.year);
    set have;
    by firm_id year;
    if first.firm_id then dir_change=0; /*First first year only*/
    _dnames{_n}=dirname;
  end;
  /* Sort the array and copy a csv list into a character variable*/
  call sortc(of _dnames{*});
  _this_year_names=catx(',',of _dnames{*});
  _last_year_names=lag(_this_year_names);
  /* Compare to last year value*/
  if dir_change=. then dir_change=(_this_year_names ^= _last_year_names);
  /* Re-read the year and output */
  do until (last.year);
    set have;
    by firm_id year;
    output;
  end;
run;Just make sure the array size (20 above) accommodates all the names for a single year. And make sure the length of the _this_year_names & _last_year_names ($400 above) is enough to contain all the names, separated by commas.
Note the sorting of the names will put blank names (i.e. if your only have 15 names in a year, youll have 5 blanks) first, followed by the actual names. This won't matter for the task as you describe it.
And of course make sure the each distinct name is consistent in its use of uppercase/lowercase.
Here is a SQL solution involving a full join (always tricky):
data have;
input firm $ year name $;
datalines;
A   2007    Jack    0
A   2007    Tom 0
A   2007    Linda   0
A   2008    Jack    1
A   2008    Tom 1
A   2008    Mary    1
B   2007    Jenny   0
B   2007    Jessy   0
B   2007    Joe 0
B   2008    Jenny   1
B   2008    Jessy   1
C   2007    Ally    0
C   2007    April   0
C   2008    Ally    0
C   2008    April   0
D   2007    Mark    0
D   2007    Frank   0
D   2008    Mark    1
D   2008    Frank   1
D   2008    Mike    1
;
proc sql;
create table want as
select 
    h.*,
    coalesce(c.change, 0) as change
from
    have as h left join
    (
    select 
        coalesce(a.firm, b.firm) as firm,
        coalesce(a.year, b.year+1) as year,
        nmiss(a.firm) + nmiss(b.firm) > 0 as change
    from 
        have as a full join
        have as b on 
            a.firm=b.firm and 
            a.year=b.year+1 and 
            a.name=b.name
    group by calculated firm, calculated year
    having count(a.firm) > 0 and count(b.firm) > 0 ) as c
    on h.firm=c.firm and h.year=c.year;
select * from want;
quit;
Thank you everyone for providing me the codes.
Was there just two year ?
data have; input firm $ year name $; datalines; A 2007 Jack 0 A 2007 Tom 0 A 2007 Linda 0 A 2008 Jack 1 A 2008 Tom 1 A 2008 Mary 1 B 2007 Jenny 0 B 2007 Jessy 0 B 2007 Joe 0 B 2008 Jenny 1 B 2008 Jessy 1 C 2007 Ally 0 C 2007 April 0 C 2008 Ally 0 C 2008 April 0 D 2007 Mark 0 D 2007 Frank 0 D 2008 Mark 1 D 2008 Frank 1 D 2008 Mike 1 ; proc sql; create table temp as select distinct firm from ( select firm,name,count(distinct year) as count from have group by firm,name ) group by firm having sum(count ne 2); create table want as select *,case when year=max(year) and firm in (select firm from temp) then 1 else 0 end as flag from have as a group by firm order by 1,2; quit;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
