11-25-2015 12:01 AM
11-25-2015 05:33 AM
Please provide an example few test rows of each dataset (as a datastep), and what the output should look. Bear in mind that we don't work in your company so all those anacronym's and such like mean nothing to me, so providing simple data and output helps visualise the problem.
11-25-2015 06:53 AM
Thanks for your help with this. An example is attached with 2 firms with data from the same 3 years (1998, 1999, and 2000) for each firm. Thus, there are 6 total observations. Obviously, in my real dataset, there are many more firms with the same SIC code than the 2 I included. I need help with 2 steps:
1. I need help creating the CSR1 column that I included in the example. To be more specific, I need the control firm (0000003116 in my example) to have not issued a CSR report in ALL years of my data (not just 1998, 1999, or 2000).
2. I need to match firms based on the way I described it in my initial post.
Hopefully this example will help. Please let me know if you have any other questions.
Thanks a lot for your help with this urgent matter!
11-25-2015 07:03 AM
Please formulate test data as datasteps so I don't have to type everything in, for example:
data have; length cik $20; input cik $ fyear sic at ret csr; datalines; 0000001800 1998 2834 13216.213 -0.251908397 0 0000001800 1999 2834 14471.044 -0.258928571 1 0000001800 2000 2834 15283.254 0.3339070568 0 0000003116 1998 2834 61.416 0.3448275862 0 0000003116 1999 2834 76.098 0 0 0000003116 2000 2834 96.518 0.3461538462 0 ; run;
With regards to this, I do not follow in your data given how 1800 is related to 3116. Why does 1800 check 3116 but not the reverse? What I am trying to get at is what the logical expression would be to pull "control" data from the dataset to merge back to the original data, i.e. what linking pairs there are, e.g.
1800 = 3116
3116 = nothing
11-25-2015 08:04 AM
You don't need to type everything in since I have the data saved as a SAS dataset. So you don't need to put datalines in the code.
0000001800 and 0000003116 are both company ID numbers. The actual values don't matter mathematically. In this example, company 0000001800 is the company with a CSR report and company 0000003116 is a company in the same SIC code who has never issued a CSR report. Thus, I want to match company 0000001800 with company 0000003116 in this example.
Does that make sense?
11-25-2015 08:16 AM
Sorry, you misundertand. The datastep is not for your benefit, it is for my benefit. You have the data, I don't, but in order to provide code I need some data to work with and I don't want to type things in from a spreadsheet.
If that is all your logic is, or that you will give me then:
proc sql; create table WANT as select *, case when exists(select distinct CIK from HAVE where CIK="0000003116" and CSR=1) then 1 else 0 end as CSR1 from HAVE; quit;
However I am pretty sure that isn't the solution you are after, as you will need to write a where clause for every possible combination in your data. This is why I ask for clarification on what CIK matches to what other CIK, as that is the key point. How do you know that 3116 matches 1800, what is the logical link between the two.
11-25-2015 08:24 AM
You're right - that's not what I'm looking for. The CIKs are matched because they have the same SIC code. I need CIKs to match based on SIC code and I need the smallest difference between total assets (AT) and return (RET) for the match. In my example, if a different firm (who has never issued a CSR report) in SIC code 2834 is closer to firm 0000001800 (with regards to AT and RET) than firm 0000003116 is, then that different firm should be matched with 0000001800.
11-25-2015 08:44 AM
Ah, ok we are getting somewhere. I have split the task up into two in the below example. The first part assigns a baseline flag to the row with min at ret (in that order, or did you mean the min of those two?). This is then merged back to the data and SIC becomes the link variable, and then where its baseline and 1, or any other.
data have; length cik $20; input cik $ fyear sic at ret csr; datalines; 0000001800 1998 2834 13216.213 -0.251908397 0 0000001800 1999 2834 14471.044 -0.258928571 1 0000001800 2000 2834 15283.254 0.3339070568 0 0000003116 1998 2834 61.416 0.3448275862 0 0000003116 1999 2834 76.098 0 0 0000003116 2000 2834 96.518 0.3461538462 0 ; run; proc sort data=have out=additional; by sic cik at ret; run; data inter; set additional; by sic; flag="Y"; if first.sic then output; run; proc sql; create table INTER2 as select A.*, B.FLAG from WORK.HAVE A left join WORK.INTER B on A.SIC=B.SIC and A.CIK=B.CIK; quit; proc sql; create table WANT as select A.*, case when exists(select distinct SIC from INTER2 where A.SIC=SIC and A.CIK=CIK and CSR=1) or exists(select distinct SIC from INTER2 where A.SIC=SIC and FLAG ne "Y" and CSR=1) then 1 else 0 end as CSR1 from HAVE A; quit;
11-25-2015 10:48 AM
Thanks for the help.
2 questions about the last step:
1. Should the second last line be have or additional next to from?
2. Should it take a long time to run? The beginning steps ran very quickly. The last step took a very long time and I stopped it.
11-25-2015 11:07 AM
Try this in replace of the last step, what is happening is the exists() is running for each observation, so taking a long time:
proc sql; create table WANT as select A.*, case when B.CIK is not null then 1 else 0 end as CSR1 from HAVE A left join (select distinct CIK,SIC from INTER2 where CSR=1) B on A.SIC=B.SIC and A.CIK=B.CIK; quit;
11-25-2015 11:37 AM
Great - the CSR1 column is now correct. Then, what do I do to actually get a dataset with only the matched cases? I want a dataset with the matched cases (1 year before, year of CSR initial report, and 1 year after). For example, if there are 200 cases with the necessary data (SIC, AT, RET, etc.), then I should have 1,200 observations, calculated as follows:
200 cases times 3 years for CSR issuing firm = 600
200 cases times 3 years for non-CSR issuing firm = 600
Total = 1,200
To give an example, if a firm issued a CSR report in 1998, then I need the following 6 observations:
1997 data for CSR-issuing firm
1998 data for CSR-issuing firm
1999 data for CSR-issuing firm (as you see, 1 year before, year of issuance, and 1 year after)
1997 data for matched non-CSR issuing firm
1998 data for matched non-CSR issuing firm
1999 data for matched non-CSR issuing firm
Does that make sense? Thanks again for your help with this. We're making great progress!
01-06-2017 08:45 PM
I did not end up using SAS for this match before. However, I used SAS for a different matching project in 2016. Here's some code for you (you can change what you want to fit your needs):
create table audit5.temp as
select a.listCount, a.obs, b.obs as obs2, count(*) as matchCount
from audit5.grp1Talla as a, audit5.grp2Talla as b
where a.value = b.value
and a.listCount LE b.listCount
and a.Effect = b.Effect
group by 1,2,3
having listCount = matchCount;
/* Finally, join the two */
create table audit5.final as
select a.*, b.obs2, c.cik as cik2, c.accounting as accounting2, c.fraud as fraud2, c.errors as errors2, c.otherissues as otherissues2, c.lnassets as lnassets2, c.roa as roa2,
c.size as size2, c.mtb as mtb2, c.lev as lev2, c.fcf as fcf2, c.loss as loss2, c.foreign as foreign2, c.segments as segments2,
c.comnam as comnan2, c.abnbhar as abnbhar2, c.big4 as big42, c.cityex as cityex2, c.natex as natex2, c.ic as ic2,
c.auditorofficesize as auditorofficesize2, c.clientimportance as clientimportance2, c.longtenure as longtenure2,
c.ye as ye2, c.big_r as big_r2, c.multiplerestatements as multiplerestatements2, c.sic as sic2, c.fyear as fyear2
from audit5.grp1 a, audit5.temp b, audit5.grp2 c
if lnassets=. then delete;
if lnassets2=. then delete;
proc sort data=audit5.final2;
by obs distance;
by obs distance;
data audit5.bigr audit5.littler;
if big_r=1 then output audit5.bigr;
if big_r=1 then output audit5.littler;
keep cik fyear sic size mtb lev fcf roa loss foreign segments comnam abnbhar big4 cityex natex ic auditorofficesize
clientimportance longtenure ye effect accounting fraud errors otherissues big_r multiplerestatements;
keep cik2 fyear2 sic2 size2 mtb2 lev2 fcf2 roa2 loss2 foreign2 segments2 comnan2 abnbhar2 big42 cityex2 natex2 ic2 auditorofficesize2
clientimportance2 longtenure2 ye2 effect accounting2 fraud2 errors2 otherissues2 big_r2 multiplerestatements2;
proc sort data=audit5.bigr1 out=audit5.bigr2; by cik fyear;
proc sort data=audit5.littler2 out=audit5.littler3; by cik fyear;
set audit5.bigr2 audit5.littler3;
if Big_R=1 then Little_R=0;
proc means data=audit5.combinedfinal1;
proc logistic data= audit5.combinedfinal1 descending;
class fyear SIC;
class Loss / param=ref;
class Foreign / param=ref;
class Big4 / param=ref;
class CityEx / param=ref;
class NatEx / param=ref;
class IC / param=ref;
class LongTenure / param=ref;
class YE / param=ref;
class Little_R / param=ref;
class MultipleRestatements / param=ref;
model Little_R = LongTenure ClientImportance AuditorOfficeSize CityEx NatEx Size MTB Lev FCF Loss ROA
Segments Abnbhar Big4 IC Foreign YE MultipleRestatements fyear SIC /link=probit RSQ;
output out= audit5.combinedfinal2 p=Probability;