BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Yegen
Pyrite | Level 9

I have the following dataset: 

 

 

data work.have;
 infile cards expandtabs truncover;
 input manager manager_firm_pair $24. n_number;
 cards;
 63 63|1323A|45C6C		2
 63 63|1323A|789A 		2
 63 63|1323A|12SAS		2
 63 63|1323A|2DS3D		2
 63 63|789A|12SAS		2
 63 63|789A|2DS3D		2
 63 63|2DS3D|12SAS		2
 89 89|R34S2|SDS23		3
 89 89|R34S2|J234S		3
 89 89|R34S2|S2WAD3		3
 89 89|J234S|S2WAD3		3
;
run;

The variable "n_number" represents the number of random rows the code needs to select. For example, for Manager 63, the code needs to randomly select 2 non-identical rows (or manager_firm_pair) that have the Manager number 63.

 

I want to repeat this selection X times (in this example 10 times, but with my real sample 10,000 times). In other words, I want to construct X samples each time randomly selecting "n_number" of rows (or manager_firm_pair) for a each Manager.  

Here are a few sample outputs that I want based on the above data, work.have:

 

data work.want_1;
 infile cards expandtabs truncover;
 input manager_firm_pair $24. n_number;
 cards;
 63 63|1323A|789A		2
 63	63|1323A|2DS3D		2
 89 89|R34S2|J234S		3
 89 89|R34S2|S2WAD3		3
 89 89|J234S|S2WAD3		3
;
run;

data work.want_2;
 infile cards expandtabs truncover;
 input manager_firm_pair $24. n_number;
 cards;
 63	63|1323A|12SAS		2
 63	63|789A|12SAS		2
 89 89|R34S2|J234S		3
 89 89|R34S2|S2WAD3		3
 89 89|J234S|S2WAD3		3
;
run;


I thought of using the following code, but I cannot specify how to iterate this random selection X times and select "n_number" of rows for each manager:

PROC SURVEYSELECT DATA=work.have OUT=work.try METHOD=SRS
  SAMPSIZE=10 SEED=1234567;
RUN;

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

First I think you need to clarify. Are sample 2 records from manager 63?

Then you want a STRATA statement for MANAGER. If you have a large number of managers then you can indicate the number of records per manager you want with a data set that would look like

REPS option would tell the procedure to identify x number of identically sized samples:

 

If your actual data "have" has many identical values of n_number to select then summarize from work.have.

This may get you started.

 

proc summary data=work.have nway;
   class manager;
   var n_number;
   output out=work.stratsize (drop= _f: _t:) max=_nsize_;
run;

proc sort data=work.have;
   by manager;
run;

proc surveyselect data=work.have
      sampsize=work.stratsize 
      reps=10
      out=work.sampled
;
   stratum manager;
run;

Though unless you have an extremely large number of manager / manager_firm_pair combinations you'll likely have a lot of duplicates if you actually go with 10,000 reps. 

 

View solution in original post

6 REPLIES 6
ballardw
Super User

First I think you need to clarify. Are sample 2 records from manager 63?

Then you want a STRATA statement for MANAGER. If you have a large number of managers then you can indicate the number of records per manager you want with a data set that would look like

REPS option would tell the procedure to identify x number of identically sized samples:

 

If your actual data "have" has many identical values of n_number to select then summarize from work.have.

This may get you started.

 

proc summary data=work.have nway;
   class manager;
   var n_number;
   output out=work.stratsize (drop= _f: _t:) max=_nsize_;
run;

proc sort data=work.have;
   by manager;
run;

proc surveyselect data=work.have
      sampsize=work.stratsize 
      reps=10
      out=work.sampled
;
   stratum manager;
run;

Though unless you have an extremely large number of manager / manager_firm_pair combinations you'll likely have a lot of duplicates if you actually go with 10,000 reps. 

 

Yegen
Pyrite | Level 9

Thanks for willing to help out, @ballardw. Exactly, each randomly constructed sample will have 2 records for manager 63. The idea of this program is actually very easy. The variable "n_number" is exogenously given. I need to construct "n_number" of records for each manager. For example, in the work.have dataset, manager 63 has n_number=2, indicating that 2 records need to be drawn from the work.have dataset for each randomly constructed sample. The two records will always be rows that have manager=63. I will iterate this exercise over all managers (i.e., draw "n_number" of raws that have the same manager number). 

I have tried to run the code you have posted, but the following error occurs:

 78         
 79         proc surveyselect data=work.have
 80               sampsize=work.stratsize
 81               reps=10
 82               out=work.sampled
 83         ;
 84            stratum manager;
 85         run;
 
 ERROR: No _NSIZE_ variable is found in the SAMPSIZE= input data set.
 NOTE: The SAS System stopped processing this step because of errors.
 WARNING: The data set WORK.SAMPLED may be incomplete.  When this step was 
          stopped there were 0 observations and 0 variables.
 WARNING: Data set WORK.SAMPLED was not replaced because this step was stopped.


I am not sure why you are calculating the maximum n_number for each manager. I think that one could do the same as follows:

proc sql;
	create table work.stratsize as
	select distinct manager, n_number as _nsize_
	from work.have;
quit;


proc surveyselect data=work.have
      sampsize=work.stratsize 
      reps=10
      out=work.sampled
;
   stratum manager;
run;


To answer one of the other questions related to the sample size, I do have a large number of observation for each manager (thousands of rows, in total there are about 20 million unique manager-firm pairs). Also, to answer the duplicates related question, each manager-firm pair is unique. So there are no duplicates.

I have a followup question though. I understand that the above code selects randomly "_nsize_" (or "n_number") of rows for each manager and construct 10 samples by doing so. However, I am not sure how to change the restrict the selection to with replacement of rows (SRS is a without replacement procedure as far as I recall). Is there a way to change these conditions in proc surveryselect?  

ballardw
Super User

I fixed the code is summary, typically I drop the _freq_ and _type_ variables with _: in the drop but since the variable we needed was _nsize_ it got dropped as well.

 

I was calculating "max" that was one way. Select distinct was another possibility, just depends on style choices.

 


@Yegen wrote:


I have a followup question though. I understand that the above code selects randomly "_nsize_" (or "n_number") of rows for each manager and construct 10 samples by doing so. However, I am not sure how to change the restrict the selection to with replacement of rows (SRS is a without replacement procedure as far as I recall). Is there a way to change these conditions in proc surveryselect?  


Simple SRS is without replacement but each replication starts with all of the records. So you have multiple results with the same manager_firm_pair in different replicates.

My results from your sample data with 10 replicates;

 
manager_firm_pair Frequency Percent Cumulative
Frequency
Cumulative
Percent
63|1323A|12SAS 2 4.00 2 4.00
63|1323A|2DS3D 4 8.00 6 12.00
63|1323A|45C6C 5 10.00 11 22.00
63|1323A|789A 1 2.00 12 24.00
63|2DS3D|12SAS 3 6.00 15 30.00
63|789A|12SAS 3 6.00 18 36.00
63|789A|2DS3D 2 4.00 20 40.00
89|J234S|S2WAD3 8 16.00 28 56.00
89|R34S2|J234S 7 14.00 35 70.00
89|R34S2|S2WAD3 7 14.00 42 84.00
89|R34S2|SDS23 8 16.00 50 100.00

 

Yegen
Pyrite | Level 9

I see that makes sense, thanks for these helpful comments. Thanks for pointing that out regarding the SRS method. I was not sure that the selection of records for each new sample is drawn from the entire pool of records. That's exactly what I need to do. 

The code is running quite well except for one issue. I am getting the following error for some managers (i.e., IDs):

 527        
 528        
 529        proc surveyselect data=work.sample_have
 530              sampsize=work.stratsize
 531              method=srs
 532              reps=100
 533              out=work.sampled
 534        ;
 535           stratum id;
 536        run;
 
 ERROR: The sample size, 8, is greater than the number of sampling units, 3.
 NOTE: The above message was for the following stratum:
       id=7.
 ERROR: The sample size, 7, is greater than the number of sampling units, 3.
 NOTE: The above message was for the following stratum:
       id=8.
 ERROR: The sample size, 5, is greater than the number of sampling units, 3.
 NOTE: The above message was for the following stratum:
       id=9.

I assume that the stratsize is greater than the actual number of records for these managers, right? It looks like that these managers are never included in the sample. Would it make sense to replace the stratsize for these managers with the actual number of records (see as below)? As far as I understood it, stratsize is the number of records that need to be randomly drawn from the initial pool that contains all records. Is that fair to say that? 

 

data stratsize(drop=sample_size);
	set stratsize;
	if sample_size<_nsize_ then _nsize_=sample_size; else _nsize_=_nsize_;
run;
ballardw
Super User

The option SELECTALL will select all the records available when the specified sample size exceeds what is actually in the data.

Yegen
Pyrite | Level 9

Perfect, thanks for this helpful comment @ballardw.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1999 views
  • 2 likes
  • 2 in conversation