Re: Pair Matching Code

jjadall1 · Posted 11-25-2015 12:01 AM

Hello,

I'm working on a project where I would like to see if there is a correlation between corporate social responsibility (CSR) issuance and stock return. I have 316 cases of firms issuing CSR reports and the year (i.e., company ABC in 1994 and company XYZ in 1998). I coded these cases as 1 in a separate CSR variable indicating that these firms initially issued a CSR report in whichever year the initial issuance was.

For these cases, I need to match them with firms that did not issue a CSR report in that year or any other year. I need to match each firm with another firm (control firm) in the same SIC code (same industry), and with the smallest differences in total assets (AT) and return percentage (RET) in the year that the CSR reporting firm issued its initial report.

For example, a company with CIK (my ID variable) 0000001800 issued its initial CSR report in 1999. I need to match this firm with another firm that never issued a CSR report (in 1999, before, or after), a firm that is in SIC code 2834 (which is company 0000001800's SIC code), and one that is closest to company 0000001800 in AT and RET during 1999. For these 2 firms (0000001800 and control firm), I need to keep the data for 1998, 1999, and 2000 (year before, year of, and year after initial CSR report issuance).

How do I code to get what I need above?

Thanks a lot and best regards,
Jadallah

RW9 · Posted 11-25-2015 05:33 AM

Hi,

Please provide an example few test rows of each dataset (as a datastep), and what the output should look. Bear in mind that we don't work in your company so all those anacronym's and such like mean nothing to me, so providing simple data and output helps visualise the problem.

jjadall1 · Posted 11-25-2015 06:53 AM

Hello,

Thanks for your help with this. An example is attached with 2 firms with data from the same 3 years (1998, 1999, and 2000) for each firm. Thus, there are 6 total observations. Obviously, in my real dataset, there are many more firms with the same SIC code than the 2 I included. I need help with 2 steps:

1. I need help creating the CSR1 column that I included in the example. To be more specific, I need the control firm (0000003116 in my example) to have not issued a CSR report in ALL years of my data (not just 1998, 1999, or 2000).

2. I need to match firms based on the way I described it in my initial post.

Hopefully this example will help. Please let me know if you have any other questions.

Thanks a lot for your help with this urgent matter!

Jadallah

RW9 · Posted 11-25-2015 07:03 AM

Please formulate test data as datasteps so I don't have to type everything in, for example:

data have;
  length cik $20;
  input cik $ fyear sic at ret csr;
datalines;
0000001800 1998 2834 13216.213 -0.251908397 0
0000001800 1999 2834 14471.044 -0.258928571 1
0000001800 2000 2834 15283.254 0.3339070568 0
0000003116 1998 2834 61.416 0.3448275862 0
0000003116 1999 2834 76.098 0 0
0000003116 2000 2834 96.518 0.3461538462 0
;
run;

With regards to this, I do not follow in your data given how 1800 is related to 3116. Why does 1800 check 3116 but not the reverse? What I am trying to get at is what the logical expression would be to pull "control" data from the dataset to merge back to the original data, i.e. what linking pairs there are, e.g.

1800 = 3116

3116 = nothing

jjadall1 · Posted 11-25-2015 08:04 AM

Hi,

You don't need to type everything in since I have the data saved as a SAS dataset. So you don't need to put datalines in the code.

0000001800 and 0000003116 are both company ID numbers. The actual values don't matter mathematically. In this example, company 0000001800 is the company with a CSR report and company 0000003116 is a company in the same SIC code who has never issued a CSR report. Thus, I want to match company 0000001800 with company 0000003116 in this example.

Does that make sense?

Thanks!

Jadallah

RW9 · Posted 11-25-2015 08:16 AM

Sorry, you misundertand. The datastep is not for your benefit, it is for my benefit. You have the data, I don't, but in order to provide code I need some data to work with and I don't want to type things in from a spreadsheet.

If that is all your logic is, or that you will give me then:

proc sql;
  create table WANT as
  select  *,
          case when exists(select distinct CIK from HAVE where CIK="0000003116" and CSR=1) then 1
               else 0 end as CSR1
  from    HAVE;
quit;

However I am pretty sure that isn't the solution you are after, as you will need to write a where clause for every possible combination in your data. This is why I ask for clarification on what CIK matches to what other CIK, as that is the key point. How do you know that 3116 matches 1800, what is the logical link between the two.

jjadall1 · Posted 11-25-2015 08:24 AM

You're right - that's not what I'm looking for. The CIKs are matched because they have the same SIC code. I need CIKs to match based on SIC code and I need the smallest difference between total assets (AT) and return (RET) for the match. In my example, if a different firm (who has never issued a CSR report) in SIC code 2834 is closer to firm 0000001800 (with regards to AT and RET) than firm 0000003116 is, then that different firm should be matched with 0000001800.

RW9 · Posted 11-25-2015 08:44 AM

Ah, ok we are getting somewhere. I have split the task up into two in the below example. The first part assigns a baseline flag to the row with min at ret (in that order, or did you mean the min of those two?). This is then merged back to the data and SIC becomes the link variable, and then where its baseline and 1, or any other.

data have;
  length cik $20;
  input cik $ fyear sic at ret csr;
datalines;
0000001800 1998 2834 13216.213 -0.251908397 0
0000001800 1999 2834 14471.044 -0.258928571 1
0000001800 2000 2834 15283.254 0.3339070568 0
0000003116 1998 2834 61.416 0.3448275862 0
0000003116 1999 2834 76.098 0 0
0000003116 2000 2834 96.518 0.3461538462 0
;
run;

proc sort data=have out=additional;
  by sic cik at ret;
run;
data inter;
  set additional;
  by sic;
  flag="Y";
  if first.sic then output;
run;

proc sql;
  create table INTER2 as
  select  A.*,
          B.FLAG
  from    WORK.HAVE A
  left join WORK.INTER B
  on      A.SIC=B.SIC
  and     A.CIK=B.CIK;
quit;

proc sql;
  create table WANT as
  select  A.*,
          case when exists(select distinct SIC from INTER2 where A.SIC=SIC and A.CIK=CIK and CSR=1) or
                    exists(select distinct SIC from INTER2 where A.SIC=SIC and FLAG ne "Y" and CSR=1) then 1
               else 0 end as CSR1
  from    HAVE A;
quit;

jjadall1 · Posted 11-25-2015 10:48 AM

Thanks for the help.

2 questions about the last step:

1. Should the second last line be have or additional next to from?

2. Should it take a long time to run? The beginning steps ran very quickly. The last step took a very long time and I stopped it.

Thanks,

Jadallah

RW9 · Posted 11-25-2015 11:07 AM

Try this in replace of the last step, what is happening is the exists() is running for each observation, so taking a long time:

proc sql;
  create table WANT as
  select  A.*,
          case when B.CIK is not null then 1
               else 0 end as CSR1
  from    HAVE A
  left join (select distinct CIK,SIC from INTER2 where CSR=1) B
  on      A.SIC=B.SIC
  and     A.CIK=B.CIK;
quit;

jjadall1 · Posted 11-25-2015 11:37 AM

Great - the CSR1 column is now correct. Then, what do I do to actually get a dataset with only the matched cases? I want a dataset with the matched cases (1 year before, year of CSR initial report, and 1 year after). For example, if there are 200 cases with the necessary data (SIC, AT, RET, etc.), then I should have 1,200 observations, calculated as follows:

200 cases times 3 years for CSR issuing firm = 600

200 cases times 3 years for non-CSR issuing firm = 600

Total = 1,200

To give an example, if a firm issued a CSR report in 1998, then I need the following 6 observations:

1997 data for CSR-issuing firm

1998 data for CSR-issuing firm

1999 data for CSR-issuing firm (as you see, 1 year before, year of issuance, and 1 year after)

1997 data for matched non-CSR issuing firm

1998 data for matched non-CSR issuing firm

1999 data for matched non-CSR issuing firm

Does that make sense? Thanks again for your help with this. We're making great progress!

Jadallah

ks5 · Posted 01-06-2017 06:10 PM

Were you able to create code for creating those matched pairs including AT and RET? I am stuck on a similar issue?

Thanks!

jjadall1 · Posted 01-06-2017 08:45 PM

Hello,

I did not end up using SAS for this match before. However, I used SAS for a different matching project in 2016. Here's some code for you (you can change what you want to fit your needs):

proc sql;
create table audit5.temp as
select a.listCount, a.obs, b.obs as obs2, count(*) as matchCount
from audit5.grp1Talla as a, audit5.grp2Talla as b
where a.value = b.value
and a.listCount LE b.listCount
and a.Effect = b.Effect
group by 1,2,3
having listCount = matchCount;
quit;

/* Finally, join the two */
proc sql;
create table audit5.final as
select a.*, b.obs2, c.cik as cik2, c.accounting as accounting2, c.fraud as fraud2, c.errors as errors2, c.otherissues as otherissues2, c.lnassets as lnassets2, c.roa as roa2,
c.size as size2, c.mtb as mtb2, c.lev as lev2, c.fcf as fcf2, c.loss as loss2, c.foreign as foreign2, c.segments as segments2,
c.comnam as comnan2, c.abnbhar as abnbhar2, c.big4 as big42, c.cityex as cityex2, c.natex as natex2, c.ic as ic2,
c.auditorofficesize as auditorofficesize2, c.clientimportance as clientimportance2, c.longtenure as longtenure2,
c.ye as ye2, c.big_r as big_r2, c.multiplerestatements as multiplerestatements2, c.sic as sic2, c.fyear as fyear2
from audit5.grp1 a, audit5.temp b, audit5.grp2 c
where a.obs=b.obs
and c.obs=b.obs2;
quit;

data audit5.final2;
set audit5.final;
if lnassets=. then delete;
if lnassets2=. then delete;
distance=lnassets2-lnassets;
run;

proc sort data=audit5.final2;
by obs distance;
run;

data audit5.final3;
set audit5.final2;
by obs distance;
if first.obs;
run;

data audit5.bigr audit5.littler;
set audit5.final3;
if big_r=1 then output audit5.bigr;
if big_r=1 then output audit5.littler;
run;

data audit5.bigr1;
set audit5.bigr;
keep cik fyear sic size mtb lev fcf roa loss foreign segments comnam abnbhar big4 cityex natex ic auditorofficesize
clientimportance longtenure ye effect accounting fraud errors otherissues big_r multiplerestatements;
run;

data audit5.littler1;
set audit5.littler;
keep cik2 fyear2 sic2 size2 mtb2 lev2 fcf2 roa2 loss2 foreign2 segments2 comnan2 abnbhar2 big42 cityex2 natex2 ic2 auditorofficesize2
clientimportance2 longtenure2 ye2 effect accounting2 fraud2 errors2 otherissues2 big_r2 multiplerestatements2;
run;

data audit5.littler2;
set audit5.littler1;
rename cik2=cik;
rename sic2=sic;
rename size2=size;
rename mtb2=mtb;
rename lev2=lev;
rename fcf2=fcf;
rename roa2=roa;
rename loss2=loss;
rename foreign2=foreign;
rename segments2=segments;
rename comnan2=comnam;
rename abnbhar2=abnbhar;
rename big42=big4;
rename cityex2=cityex;
rename natex2=natex;
rename ic2=ic;
rename auditorofficesize2=auditorofficesize;
rename clientimportance2=clientimportance;
rename longtenure2=longtenure;
rename ye2=ye;
rename big_r2=big_r;
rename multiplerestatements2=multiplerestatements;
rename fyear2=fyear;
run;

proc sort data=audit5.bigr1 out=audit5.bigr2; by cik fyear;
run;

proc sort data=audit5.littler2 out=audit5.littler3; by cik fyear;
run;

data audit5.combinedfinal;
set audit5.bigr2 audit5.littler3;
run;

data audit5.combinedfinal1;
set audit5.combinedfinal;
if Big_R=1 then Little_R=0;
else Little_R=1;
run;

proc means data=audit5.combinedfinal1;
run;

proc logistic data= audit5.combinedfinal1 descending;
class fyear SIC;
class Loss / param=ref;
class Foreign / param=ref;
class Big4 / param=ref;
class CityEx / param=ref;
class NatEx / param=ref;
class IC / param=ref;
class LongTenure / param=ref;
class YE / param=ref;
class Little_R / param=ref;
class MultipleRestatements / param=ref;
model Little_R = LongTenure ClientImportance AuditorOfficeSize CityEx NatEx Size MTB Lev FCF Loss ROA
Segments Abnbhar Big4 IC Foreign YE MultipleRestatements fyear SIC /link=probit RSQ;
output out= audit5.combinedfinal2 p=Probability;
run;

Best regards,

Jadallah

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away