## code to create a subset

Solved
Frequent Contributor
Posts: 96

# code to create a subset

Hi All,

I need to create a subsample out of my control sample to match my test sample. This is what I have done so far: I created two dataset: Test and Control where firms in Test sample are not in the control sample. My test sample has 200 observations while my control sample has 3000 observations. I need to create a subset of Control sample (Control_subset) where there are 200 observations in Control_subset and these observations match the test sample based on year and auditor name (AU). For instance,

Test Sample

firm year AU

A    2000 KPMG

B    2005  PwC

CONTROL Sample

firm year AU

e     2000 KPMG

f      2000 AA

g     2000 PwC

h      2000 AA

i    2000 AA

j     2005 KPMG

k    2005 PwC

l     2005 PwC

Desired Output (Control_subset)

firm year AU

e    2000 KPMG

k    2005  PwC

Can anyone please provide me with the code?

SS

Accepted Solutions
Solution
‎07-13-2014 08:42 PM
Posts: 5,539

## Re: code to create a subset

In the following code, the role of the 0.1*rand('UNIFORM') term is only to chose randomly among the matches that might be at the same size distance. The factor (0.1, in my example) must be smaller than the smallest positive size difference.

proc sql;

create table subCtrl as

select c.* , abs(t.size - c.size) + 0.1*rand('UNIFORM') as d

from test as t inner join control as c

on t.AU=c.AU and T.year=c.year;

create table match as

select firm, year, AU, size

from subCtrl

group by AU, year

having d=min(d);

drop table subCtrl;

quit;

PG

PG

All Replies
Posts: 1,270

## Re: code to create a subset

Hi,

You need to flag first observation in control data set in case there are duplicates based on year and au. Try this for the desired output.

proc sort data=control;
by year au;
run;

data control_new;
set control;
by year au;
if first.au then flag=1;
run;

proc sql;
select a.* from control_new a
inner join test b
on a.year=b.year
and a.au=b.au
and a.flag=1;
quit;

Posts: 5,539

## Re: code to create a subset

To select the match randomly among the possible candidates:

proc sql;

create table subCtrl as

select c.* , rand('UNIFORM') as r

from test as t inner join control as c

on t.AU=c.AU and T.year=c.year;

create table match as

select firm, year, AU

from subCtrl

group by AU, year

having r=min(r);

drop table subCtrl;

quit;

PG

PG
Frequent Contributor
Posts: 96

## Re: code to create a subset

Dear PG,

Thanks for the code. Can you please tell me how the code would look like if I want to add one more condition. I also want to make sure that the Control_subset is closet match to the test sample based on size? That is,

Test Sample

firm year AU     Size

A    2000 KPMG     10

B    2005  PwC     12

CONTROL Sample

firm year AU     Size

e     2000 KPMG     11

f      2000 AA     11

g     2000 PwC     12

h      2000 AA     15

i    2000 AA     9

j     2005 KPMG     5

k    2005 PwC     10

l     2005 PwC     18

Desired Output

firm year AU     Size

e    2000 KPMG     11

k    2005  PwC     10

SS

Super Contributor
Posts: 275

## Re: code to create a subset

proc sql;

select c.* from test as t, control as c

where  t.au=c.au and t.year=c.year group by c.year,c.au having abs(c.size-t.size)=min(abs(c.size-t.size)) ;

quit;

Solution
‎07-13-2014 08:42 PM
Posts: 5,539

## Re: code to create a subset

In the following code, the role of the 0.1*rand('UNIFORM') term is only to chose randomly among the matches that might be at the same size distance. The factor (0.1, in my example) must be smaller than the smallest positive size difference.

proc sql;

create table subCtrl as

select c.* , abs(t.size - c.size) + 0.1*rand('UNIFORM') as d

from test as t inner join control as c

on t.AU=c.AU and T.year=c.year;

create table match as

select firm, year, AU, size

from subCtrl

group by AU, year

having d=min(d);

drop table subCtrl;

quit;

PG

PG
Frequent Contributor
Posts: 96

## Re: code to create a subset

Thanks a lot.

🔒 This topic is solved and locked.