Help using Base SAS procedures

Random Sample

Reply
Contributor
Posts: 23

Random Sample

Hi all.  I'm trying to create a random sample with some conditions.  I have two datasets: Dataset 1 is 450 merger observations with variables:

Acquirer ID, Target ID, Year of Merger, Target Industry

Dataset 2 is 24,000 firm observations with variables:

ID, Year, Industry

For each Acquirer in Dataset 1, I want to draw a random sample from Dataset 2 with the same year of merger and the same industry as the actual target.  I need to repeat this random sampling 500 times so that I have 500 pairings of the actual acquirer with random targets.

Let me know if I need to clarify anything.  Thank you for your help!

Super User
Posts: 10,023

Re: Random Sample

You should post some data which can explain your question better.

Code: Program

data A;
input AcquirerID TargetID Year Industry $;
cards;
1 1 2012 aaa
2 1 2014 bbb
3 2 2015 ccc
4 3 2010 ddd
;
run;
data B;
input ID Year Industry $;
cards;
1 2012 aaa
2 2014 bbb
3 2015 ccc
4 2010 ddd
5 2012 aaa
6 2014 bbb
7 2015 ccc
8 2010 ddd
9 2012 aaa
10 2012 aaa
11 2015 ccc
12 2010 ddd
13 2012 aaa
14 2014 bbb
15 2015 ccc
16 2010 ddd
;
run;

proc sql;
create table tempA as
  select a.*,n
   from A as a left join
   (select Year,Industry,count(*) as n
   from B
   group by Year,Industry) as b
   on a.Year=b.Year and a.Industry=b.Industry;
quit;
proc sort data=B out=tempB ;by Year Industry;run;
data tempB;
set tempB;
by Year Industry;
if first.Industry then k=0;
k+1;
run;



%let times=4;

%let dsid=%sysfunc(open(tempB));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
data want;
if _n_ eq 1 then do;
  if 0 then set tempB;
  declare hash ha(dataset:'tempB');
  ha.definekey('k','year','Industry');
  ha.definedata('id','year','Industry');
  ha.definedone();
end;
call missing(of _all_);
set tempA;
call streaminit(1234);
do i=1 to ×
k=ceil(n*rand('uniform'));
rc=ha.find();
output;
end;
drop i k rc n;
run;
Super User
Posts: 11,343

Re: Random Sample

Have you verified that for each combination of year and industry in your first data set that there are 500 matching in the other? If there are not lots of duplicates of year and industry in dataset1 you may have an issue with even coming close to your targets of 500.

Super User
Posts: 5,499

Re: Random Sample

As ballardw suggested, you will need to re-use some matches.  There will likely not be enough choices to make each match unique.

Here's another approach.  The set-up takes most of the work, creating formats to identify which observations might match.  Choosing one randomly then is relatively easy.  Here's one way to start:

proc sort data=dataset2;

  by industry year;

run;

data dataset2a;

  set dataset2;

  retain fmtname '$indyr';

  by industry year;

  length label $ 11;

  if first.year then label = put (_n_, z5.);

  retain label;

  if last.year;

  start = industry || put(year, 4.);

  substr(label, 7, 5) = put(_n_, z5.);

  output;

run;

proc format cntlin=dataset2a;

run;

This will give you a format that translates the concatenated Industry + year into the set of observations that match.  For example, when LABEL=1001 1420, it means that the observations that match the given industry + year are observations 1001 through 1420.  You might want to print a few observations from dataset2a to get a feel for what that looks like.

Using the format is (comparatively) easy.  For example:

data want;

  set dataset1;

  observations = put(industry || put(year, 4.), $indyr.);

  firstone = input(scan(observations, 1), 5.);

  lastone = input(scan(observations, 2, 5.);

  do pairing=1 to 500;

     obsno = firstone + floor(ranuni(12345) * (lastone - firstone + 1));

     set dataset1 point=obsno;

     output;

  end;

run;

The observations in dataset1 do not have to be sorted for this to work.  If mismatches are possible (some Industry/year combinations in dataset1 but not in dataset2), this can still be done but becomes a little harder.

The code is untested, so may need to be debugged, but the approach should be fine.

All the observations end up in a single data set, which you can sort and process BY PAIRING if necessary.

Good luck.

... edited to add the second SET statement

Ask a Question
Discussion stats
  • 3 replies
  • 344 views
  • 0 likes
  • 4 in conversation