08-14-2015 11:55 PM
Hi all. I'm trying to create a random sample with some conditions. I have two datasets: Dataset 1 is 450 merger observations with variables:
Acquirer ID, Target ID, Year of Merger, Target Industry
Dataset 2 is 24,000 firm observations with variables:
ID, Year, Industry
For each Acquirer in Dataset 1, I want to draw a random sample from Dataset 2 with the same year of merger and the same industry as the actual target. I need to repeat this random sampling 500 times so that I have 500 pairings of the actual acquirer with random targets.
Let me know if I need to clarify anything. Thank you for your help!
08-15-2015 08:17 AM
You should post some data which can explain your question better.
input AcquirerID TargetID Year Industry $;
1 1 2012 aaa
2 1 2014 bbb
3 2 2015 ccc
4 3 2010 ddd
input ID Year Industry $;
1 2012 aaa
2 2014 bbb
3 2015 ccc
4 2010 ddd
5 2012 aaa
6 2014 bbb
7 2015 ccc
8 2010 ddd
9 2012 aaa
10 2012 aaa
11 2015 ccc
12 2010 ddd
13 2012 aaa
14 2014 bbb
15 2015 ccc
16 2010 ddd
create table tempA as
from A as a left join
(select Year,Industry,count(*) as n
group by Year,Industry) as b
on a.Year=b.Year and a.Industry=b.Industry;
proc sort data=B out=tempB ;by Year Industry;run;
by Year Industry;
if first.Industry then k=0;
if _n_ eq 1 then do;
if 0 then set tempB;
declare hash ha(dataset:'tempB');
call missing(of _all_);
do i=1 to ×
drop i k rc n;
08-17-2015 12:06 PM
Have you verified that for each combination of year and industry in your first data set that there are 500 matching in the other? If there are not lots of duplicates of year and industry in dataset1 you may have an issue with even coming close to your targets of 500.
08-17-2015 08:28 PM
As ballardw suggested, you will need to re-use some matches. There will likely not be enough choices to make each match unique.
Here's another approach. The set-up takes most of the work, creating formats to identify which observations might match. Choosing one randomly then is relatively easy. Here's one way to start:
proc sort data=dataset2;
by industry year;
retain fmtname '$indyr';
by industry year;
length label $ 11;
if first.year then label = put (_n_, z5.);
start = industry || put(year, 4.);
substr(label, 7, 5) = put(_n_, z5.);
proc format cntlin=dataset2a;
This will give you a format that translates the concatenated Industry + year into the set of observations that match. For example, when LABEL=1001 1420, it means that the observations that match the given industry + year are observations 1001 through 1420. You might want to print a few observations from dataset2a to get a feel for what that looks like.
Using the format is (comparatively) easy. For example:
observations = put(industry || put(year, 4.), $indyr.);
firstone = input(scan(observations, 1), 5.);
lastone = input(scan(observations, 2, 5.);
do pairing=1 to 500;
obsno = firstone + floor(ranuni(12345) * (lastone - firstone + 1));
set dataset1 point=obsno;
The observations in dataset1 do not have to be sorted for this to work. If mismatches are possible (some Industry/year combinations in dataset1 but not in dataset2), this can still be done but becomes a little harder.
The code is untested, so may need to be debugged, but the approach should be fine.
All the observations end up in a single data set, which you can sort and process BY PAIRING if necessary.
... edited to add the second SET statement