
Hi all,

 

I am a relatively new SAS Enterprise Guide user working with a random sample. I previously selected a simple random sample (no duplicates) of 25 line items. It turns out some line items are not applicable to our review. I want to expand my sample to select new replacement line items. In other software programs I've used, I was able to do this by running a new (larger) sample using the same seed number. I tried this with all 3 options (simple/restricted, duplicates allowed but removed, and all duplicates included) but SAS isn't selecting the first 25 line items as before to be included in the larger sample. Since we've already started our review and spent time accumulating documents, I don't want to start over with a brand new sample.

 

Help! Is there any saving this situation?

 

Thanks in advance...

Rick_SAS
SAS Super FREQ

Please post the code that EG generates. 

brookeewhite1
Quartz | Level 8

Here it is, except for my filepaths which I substituted <dummynames> for:

 

My original sample of 25:

 

TITLE; FOOTNOTE;

PROC SURVEYSELECT DATA=<mylibrary>.<mysourcedataset>()
OUT=WORK.<myoutputdataset>
METHOD=SRS
N=25
SEED=4262017;
RUN;

QUIT;

 

My attempt to expand the sample to 100:

 

TITLE; FOOTNOTE;

PROC SURVEYSELECT DATA=<mylibrary>.<mysourcedataset>()
OUT=WORK.<myoutputdataset>
METHOD=SRS
N=100
SEED=4262017;
RUN;

QUIT;

 

Thanks for your help!

Rick_SAS
SAS Super FREQ

I will briefly describe what you can do. Please study the code I include.

1. Add a row identifier to the original data.

2. Use PROC SQL to create a macro variable 'SelectedObs' that contains the ID values of the 25 rows that were previously selected.

3. Call SURVEYSELECT again, but use a WHERE clause:

  where ID not in (&SelectedObs);

4. Concatenate the two samples.

 

You will get a new sample that has the original sample as the first 25 rows, and 100 new observations as the remaining rows. Here is an example that uses the SasHelp.Cars data set:

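/* 1. Add a row identifier to the original data */
data Have;
   set sashelp.cars;
   ID = _N_;
run;

/* Reproduce the original sample of 25 */
proc surveyselect data=Have out=Sample
   method=SRS
   n=25 seed=4262017;
run;

/* 2. Store the IDs of the selected rows in a macro variable */
proc sql noprint;
   select ID into :SelectedObs separated by ','
   from Sample;
quit;

/* 3. Select 100 new rows from the rows that were not selected before */
proc surveyselect data=Have out=Sample2
   method=SRS
   n=100 seed=4262017;
   where ID not in (&SelectedObs);
run;

/* 4. Concatenate: the original 25 rows first, then the 100 new rows */
data All;
   set Sample Sample2;
run;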
brookeewhite1
Quartz | Level 8

Thank you for your quick reply! I think I understand... basically I remove the ones I already selected and select 100 more at random from the remainder, right? I'm not a statistician so I guess I didn't realize I could do that, but it makes sense... we are continuing to allow each line item an equal probability of being selected.

 

Here is a follow-up question... We may not need the full 75 or 100 more, so I was wondering if it is also possible to get SAS to output the sample in the order lines were randomly selected (unsorted)? Then we can proceed through the list until we get at least 25 applicable lines - and may not necessarily have to look all of them up.

 

Thanks again,

Rick_SAS
SAS Super FREQ

Yes, your statements in the first paragraph are correct.

 

It sounds like you think that PROC SURVEYSELECT generates 100 numbers between 1 and N and then outputs those rows. That is not what happens. It goes through the data set row by row. If you are selecting 100 obs, then the first row has a 100/N probability of being chosen.

 

Either the first row is selected (and written to the output data set) or it isn't.  If it is, then the next row has a 99/(N-1) chance of being chosen. If it isn't selected, then the next row has a 100/(N-1) probability. This process continues until 100 obs are selected.

 

This same algorithm explains why N=25 and N=100 yield different rows. For a DATA step version of the algorithm, see "Method 3" in this SAS article: http://support.sas.com/kb/24/722.html. If you want randomly sorted observations, you can use "Method 2" or the method presented at http://support.sas.com/kb/24/802.html
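
A minimal DATA step sketch of the row-by-row selection described above, for illustration only. It is not the code that PROC SURVEYSELECT runs internally and it will not reproduce the procedure's selections. The output name SeqSample and the counters k and n are hypothetical; the seed and the sample size of 100 are reused from the thread.

```sas
/* Sketch of sequential selection; SeqSample, k, and n are illustrative names */
data SeqSample(drop=k n);
   retain k 100 n;                 /* k = rows still needed, n = rows not yet examined */
   if _n_ = 1 then do;
      n = total;                   /* NOBS= value is available before SET executes */
      call streaminit(4262017);    /* seed reused from the thread for illustration */
   end;
   set sashelp.cars nobs=total;
   /* The current row is selected with probability k/n */
   if rand("Uniform") <= k/n then do;
      output;
      k = k - 1;
   end;
   n = n - 1;
   if k = 0 then stop;             /* stop once 100 rows have been written */
run;
```

Because each row is either written or skipped as it is read, the selected rows come out in their original data order. That is why the output looks sorted rather than listed in "order of selection."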

 

brookeewhite1
Quartz | Level 8

Thank you again! The code you originally posted did in fact get me 125 observations including my original 25 lickety-split. (Thanks!)

 

Since I promised my colleague (prematurely?) that we only had to go through the sample until we hit "25 applicable observations," I would like to press on and figure out how to make the SAMPLE2 list unsorted, so that we can draw a line when we get to "25 applicable" and ignore the rest. You mentioned 3 options: 1) the DATA step version (I wasn't sure what the pros and cons of this were), 2) randomly sorted via Method 2 at http://support.sas.com/kb/24/722.html, and 3) randomly sorted via "the method" at http://support.sas.com/kb/24/802.html (I saw 2 methods at this link and didn't know which to choose). Can you please help direct me a little further?

 

Thank you,

brookeewhite1
Quartz | Level 8

Thank you so much! This appears to have worked! Here is the code I used with the RANUNI function. Does this look right? I wasn't sure if I put the 4262017 seed number and the 7,996 line count in the right place. The 7,996 line count is what remains after the first 25 were removed from the original 8,021 using an Enterprise Guide join tables step, resulting in a dataset named "notselected".

 

 

data <mylibrary>.sampleb(drop=i);
choice=int(ranuni(4262017)*7996)+1;
set notselected point=choice nobs=n;
i+1;
/* Enter the desired sample size, 100 in this case */
if i>100 then stop;
run;

 

/* This combines the 2 samples to one data set with the original 25 lines on top.*/
data <mylibrary>.largercombinedsample;
set <myfirstsamplefile> <mylibrary>.sampleb;
run;
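
One caution about the POINT= step above: each iteration draws an independent random row number, so the same line item can be picked more than once (sampling with replacement). A hedged alternative that yields a randomly ordered list with no duplicates is to attach a random sort key to every row, sort by it, and keep the first 100. This is a sketch, assuming the "notselected" data set from the post is in WORK; the names Shuffled, SampleB, and u are hypothetical.

```sas
/* Sketch: random order without duplicates; Shuffled, SampleB, and u are illustrative names */
data Shuffled;
   set notselected;                          /* data with the first 25 rows already removed */
   if _n_ = 1 then call streaminit(4262017); /* seed reused from the thread */
   u = rand("Uniform");                      /* one random sort key per row */
run;

proc sort data=Shuffled;
   by u;                                     /* sorting by the key shuffles the rows */
run;

data SampleB(drop=u);
   set Shuffled(obs=100);                    /* first 100 rows of the shuffled data */
run;
```

Working down SampleB from the top until 25 applicable lines are found then amounts to reviewing the lines in a random order, and the unused remainder can be ignored.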
