DATA Step, Macro, Functions and more

Constructing random pairs with constraint and within group

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 119
Accepted Solution

Constructing random pairs with constraint and within group

[ Edited ]

I have the following dataset: 

 

 

data work.have;
 infile cards expandtabs truncover;
 input manager manager_firm_pair $24. n_number;
 cards;
 63 63|1323A|45C6C		2
 63 63|1323A|789A 		2
 63 63|1323A|12SAS		2
 63 63|1323A|2DS3D		2
 63 63|789A|12SAS		2
 63 63|789A|2DS3D		2
 63 63|2DS3D|12SAS		2
 89 89|R34S2|SDS23		3
 89 89|R34S2|J234S		3
 89 89|R34S2|S2WAD3		3
 89 89|J234S|S2WAD3		3
;
run;

The variable "n_number" represents the number of random rows the code needs to select. For example, for Manager 63, the code needs to randomly select 2 non-identical rows (or manager_firm_pair) that have the Manager number 63.

 

I want to repeat this selection X times (in this example 10 times, but with my real sample 10,000 times). In other words, I want to construct X samples each time randomly selecting "n_number" of rows (or manager_firm_pair) for a each Manager.  

Here are a few sample outputs that I want based on the above data, work.have:

 

data work.want_1;
 infile cards expandtabs truncover;
 input manager_firm_pair $24. n_number;
 cards;
 63 63|1323A|789A		2
 63	63|1323A|2DS3D		2
 89 89|R34S2|J234S		3
 89 89|R34S2|S2WAD3		3
 89 89|J234S|S2WAD3		3
;
run;

data work.want_2;
 infile cards expandtabs truncover;
 input manager_firm_pair $24. n_number;
 cards;
 63	63|1323A|12SAS		2
 63	63|789A|12SAS		2
 89 89|R34S2|J234S		3
 89 89|R34S2|S2WAD3		3
 89 89|J234S|S2WAD3		3
;
run;


I thought of using the following code, but I cannot specify how to iterate this random selection X times and select "n_number" of rows for each manager:

PROC SURVEYSELECT DATA=work.have OUT=work.try METHOD=SRS
  SAMPSIZE=10 SEED=1234567;
RUN;

 


Accepted Solutions
Solution
‎02-12-2018 10:56 PM
Super User
Posts: 13,523

Re: Construct random pairs with constraint and within group

[ Edited ]

First I think you need to clarify. Are sample 2 records from manager 63?

Then you want a STRATA statement for MANAGER. If you have a large number of managers then you can indicate the number of records per manager you want with a data set that would look like

REPS option would tell the procedure to identify x number of identically sized samples:

 

If your actual data "have" has many identical values of n_number to select then summarize from work.have.

This may get you started.

 

proc summary data=work.have nway;
   class manager;
   var n_number;
   output out=work.stratsize (drop= _f: _t:) max=_nsize_;
run;

proc sort data=work.have;
   by manager;
run;

proc surveyselect data=work.have
      sampsize=work.stratsize 
      reps=10
      out=work.sampled
;
   stratum manager;
run;

Though unless you have an extremely large number of manager / manager_firm_pair combinations you'll likely have a lot of duplicates if you actually go with 10,000 reps. 

 

View solution in original post


All Replies
Solution
‎02-12-2018 10:56 PM
Super User
Posts: 13,523

Re: Construct random pairs with constraint and within group

[ Edited ]

First I think you need to clarify. Are sample 2 records from manager 63?

Then you want a STRATA statement for MANAGER. If you have a large number of managers then you can indicate the number of records per manager you want with a data set that would look like

REPS option would tell the procedure to identify x number of identically sized samples:

 

If your actual data "have" has many identical values of n_number to select then summarize from work.have.

This may get you started.

 

proc summary data=work.have nway;
   class manager;
   var n_number;
   output out=work.stratsize (drop= _f: _t:) max=_nsize_;
run;

proc sort data=work.have;
   by manager;
run;

proc surveyselect data=work.have
      sampsize=work.stratsize 
      reps=10
      out=work.sampled
;
   stratum manager;
run;

Though unless you have an extremely large number of manager / manager_firm_pair combinations you'll likely have a lot of duplicates if you actually go with 10,000 reps. 

 

Frequent Contributor
Posts: 119

Re: Construct random pairs with constraint and within group

[ Edited ]

Thanks for willing to help out, @ballardw. Exactly, each randomly constructed sample will have 2 records for manager 63. The idea of this program is actually very easy. The variable "n_number" is exogenously given. I need to construct "n_number" of records for each manager. For example, in the work.have dataset, manager 63 has n_number=2, indicating that 2 records need to be drawn from the work.have dataset for each randomly constructed sample. The two records will always be rows that have manager=63. I will iterate this exercise over all managers (i.e., draw "n_number" of raws that have the same manager number). 

I have tried to run the code you have posted, but the following error occurs:

 78         
 79         proc surveyselect data=work.have
 80               sampsize=work.stratsize
 81               reps=10
 82               out=work.sampled
 83         ;
 84            stratum manager;
 85         run;
 
 ERROR: No _NSIZE_ variable is found in the SAMPSIZE= input data set.
 NOTE: The SAS System stopped processing this step because of errors.
 WARNING: The data set WORK.SAMPLED may be incomplete.  When this step was 
          stopped there were 0 observations and 0 variables.
 WARNING: Data set WORK.SAMPLED was not replaced because this step was stopped.


I am not sure why you are calculating the maximum n_number for each manager. I think that one could do the same as follows:

proc sql;
	create table work.stratsize as
	select distinct manager, n_number as _nsize_
	from work.have;
quit;


proc surveyselect data=work.have
      sampsize=work.stratsize 
      reps=10
      out=work.sampled
;
   stratum manager;
run;


To answer one of the other questions related to the sample size, I do have a large number of observation for each manager (thousands of rows, in total there are about 20 million unique manager-firm pairs). Also, to answer the duplicates related question, each manager-firm pair is unique. So there are no duplicates.

I have a followup question though. I understand that the above code selects randomly "_nsize_" (or "n_number") of rows for each manager and construct 10 samples by doing so. However, I am not sure how to change the restrict the selection to with replacement of rows (SRS is a without replacement procedure as far as I recall). Is there a way to change these conditions in proc surveryselect?  

Super User
Posts: 13,523

Re: Construct random pairs with constraint and within group

I fixed the code is summary, typically I drop the _freq_ and _type_ variables with _: in the drop but since the variable we needed was _nsize_ it got dropped as well.

 

I was calculating "max" that was one way. Select distinct was another possibility, just depends on style choices.

 


Yegen wrote:


I have a followup question though. I understand that the above code selects randomly "_nsize_" (or "n_number") of rows for each manager and construct 10 samples by doing so. However, I am not sure how to change the restrict the selection to with replacement of rows (SRS is a without replacement procedure as far as I recall). Is there a way to change these conditions in proc surveryselect?  


Simple SRS is without replacement but each replication starts with all of the records. So you have multiple results with the same manager_firm_pair in different replicates.

My results from your sample data with 10 replicates;

 
manager_firm_pair Frequency Percent Cumulative
Frequency
Cumulative
Percent
63|1323A|12SAS 2 4.00 2 4.00
63|1323A|2DS3D 4 8.00 6 12.00
63|1323A|45C6C 5 10.00 11 22.00
63|1323A|789A 1 2.00 12 24.00
63|2DS3D|12SAS 3 6.00 15 30.00
63|789A|12SAS 3 6.00 18 36.00
63|789A|2DS3D 2 4.00 20 40.00
89|J234S|S2WAD3 8 16.00 28 56.00
89|R34S2|J234S 7 14.00 35 70.00
89|R34S2|S2WAD3 7 14.00 42 84.00
89|R34S2|SDS23 8 16.00 50 100.00

 

Frequent Contributor
Posts: 119

Re: Construct random pairs with constraint and within group

[ Edited ]

I see that makes sense, thanks for these helpful comments. Thanks for pointing that out regarding the SRS method. I was not sure that the selection of records for each new sample is drawn from the entire pool of records. That's exactly what I need to do. 

The code is running quite well except for one issue. I am getting the following error for some managers (i.e., IDs):

 527        
 528        
 529        proc surveyselect data=work.sample_have
 530              sampsize=work.stratsize
 531              method=srs
 532              reps=100
 533              out=work.sampled
 534        ;
 535           stratum id;
 536        run;
 
 ERROR: The sample size, 8, is greater than the number of sampling units, 3.
 NOTE: The above message was for the following stratum:
       id=7.
 ERROR: The sample size, 7, is greater than the number of sampling units, 3.
 NOTE: The above message was for the following stratum:
       id=8.
 ERROR: The sample size, 5, is greater than the number of sampling units, 3.
 NOTE: The above message was for the following stratum:
       id=9.

I assume that the stratsize is greater than the actual number of records for these managers, right? It looks like that these managers are never included in the sample. Would it make sense to replace the stratsize for these managers with the actual number of records (see as below)? As far as I understood it, stratsize is the number of records that need to be randomly drawn from the initial pool that contains all records. Is that fair to say that? 

 

data stratsize(drop=sample_size);
	set stratsize;
	if sample_size<_nsize_ then _nsize_=sample_size; else _nsize_=_nsize_;
run;
Super User
Posts: 13,523

Re: Construct random pairs with constraint and within group

The option SELECTALL will select all the records available when the specified sample size exceeds what is actually in the data.

Frequent Contributor
Posts: 119

Re: Construct random pairs with constraint and within group

Perfect, thanks for this helpful comment @ballardw.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 174 views
  • 2 likes
  • 2 in conversation