random number


01-08-2008 09:39 AM

Hi

I have 10 data sets, each with 60,000 observations, and I need to get 100 random observations from each set. Can anyone please let me know how?

I had the idea of using RANUNI but don't know how.

thx


01-08-2008 09:47 AM

Try PROC SURVEYSELECT (with method=SRS) in order to select a simple random sample of size N.
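A minimal sketch of that suggestion applied to all ten data sets, assuming SAS/STAT is licensed. The data set names work.set1 through work.set10, the output names work.sample1 through work.sample10, the macro name sample_all, and the seed values are all hypothetical placeholders, not names from the original question.

```sas
/* Hypothetical names: inputs work.set1 ... work.set10, outputs work.sample1 ... work.sample10 */
%macro sample_all(n_sets=10, size=100);
    %local i;
    %do i = 1 %to &n_sets;
        proc surveyselect data=work.set&i
                          out=work.sample&i
                          method=srs               /* simple random sampling without replacement */
                          sampsize=&size           /* number of observations to draw */
                          seed=%eval(12345 + &i);  /* arbitrary fixed seed per data set, for repeatability */
        run;
    %end;
%mend sample_all;

%sample_all()
```

A different seed is used for each data set so that the ten samples do not all pick the same observation numbers.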


01-08-2008 10:40 AM

Proc SurveySelect is not part of Base SAS, so you may not have it available to you.

Within SAS EG, there is a "Random Sample" task under the Data menu.

If coding, here's an idea:

%macro select(inset,outset,size);
data &outset;
    set &inset nobs=N;
    retain criteria count fudge 0;
    /* base selection probability: the fraction of records wanted */
    if _n_ = 1 then criteria = &size/N;
    /* fudge raises the selection probability after each selected record */
    if ranuni(-1) < criteria + fudge then do;
        if count < &size then do;
            output;
            count+1;
            fudge+criteria;
        end;
    end;
    drop criteria count fudge;
run;
%mend;

By increasing fudge, the probability of selecting a record increases, so that there is a greater chance of selecting a particular record.

The downside to this method is that the actual probability distribution is not uniform. If fudge were not used, and "uniformity" maintained, then in a single pass through the dataset you might not get all "size = 100" records/observations.
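For comparison, here is a sketch of a different technique, usually called sequential selection sampling, which does return exactly the requested number of records in a single pass while keeping every record equally likely to be chosen. It is not the fudge approach above. Each record is taken with probability (records still needed) / (records still unread). The macro name select_seq and the underscore-prefixed variable names are hypothetical.

```sas
%macro select_seq(inset, outset, size);
data &outset;
    set &inset nobs=_n;              /* _n = total number of observations */
    retain _needed &size;            /* how many records are still to be selected */
    _remaining = _n - _n_ + 1;       /* unread records, including the current one */
    /* select the current record with probability _needed / _remaining */
    if _needed > 0 and ranuni(-1) * _remaining < _needed then do;
        output;
        _needed = _needed - 1;
    end;
    drop _needed _remaining;
run;
%mend select_seq;
```

When the number still needed equals the number still unread, the condition is always true, so the sample cannot come up short as long as the requested size does not exceed the number of observations.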

An alternative would be to use the POINT= set option

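data &outset;
    retain count 0;
    I = ranuni(-1) * N;   /* random position; not guaranteed to be an integer */
    set &inset NOBS=N POINT=I;
    output;               /* write the record before testing the stop condition */
    count+1;
    if count = &size then stop;
    drop count;
run;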

This is probably a better method, and can also be encased in the above macro.


01-08-2008 10:43 AM

Another idea, that may work better:

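%macro select(inset,outset,size);
data &outset;
    retain count 0;
    drop count;
    I = ranuni(-1) * N;   /* random position; not guaranteed to be an integer */
    set &inset NOBS=N POINT=I;
    output;               /* write the record before testing the stop condition */
    count+1;
    if count = &size then stop;
run;
%mend;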


01-08-2008 12:32 PM

Chuck, your approach doesn't prevent a row from being selected multiple times, does it?


01-08-2008 04:42 PM

Yes, you are correct for the POINT= method. There would need to be some way to check that that observation hadn't been used already.

Also, the calculation for I isn't quite right either since it doesn't guarantee an integer value.

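I = round(ranuni(-1) * N) is an easy solution to the integer problem, although ROUND can return 0, which is not a valid observation number; I = ceil(ranuni(-1) * N) always yields a value from 1 to N.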

Solving the other problem takes a bit more work.

One way would be to use an array to keep a list of consumed records, and then use a linear search through the array to determine if the observation has been read before or not.
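A rough sketch of that array idea, assuming the requested size is small enough that a linear search per draw is cheap. The macro name select_unique and the underscore-prefixed variable names are hypothetical, chosen to reduce the chance of clashing with variables already in the input data set.

```sas
%macro select_unique(inset, outset, size);
data &outset;
    array _picked[&size] _temporary_;    /* observation numbers already used */
    /* _n is filled in at compile time by NOBS=, so it can be tested before SET runs */
    if &size > _n then do;
        put "ERROR: requested sample size exceeds the number of observations.";
        stop;
    end;
    _count = 0;
    do while (_count < &size);
        _i = ceil(ranuni(-1) * _n);      /* integer from 1 to _n */
        _seen = 0;
        do _k = 1 to _count;             /* linear search through earlier picks */
            if _picked[_k] = _i then do;
                _seen = 1;
                leave;
            end;
        end;
        if not _seen then do;
            _count = _count + 1;
            _picked[_count] = _i;
            set &inset nobs=_n point=_i; /* direct read of observation _i */
            output;
        end;
    end;
    stop;                                /* POINT= never reaches end of file, so stop explicitly */
    drop _count _seen _k;
run;
%mend select_unique;
```

A call might look like %select_unique(work.set1, work.sample1, 100), where both data set names are placeholders. The selected records come out in the order drawn, not in their original order.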

Another way to get a random subset of observations would require multiple passes through the dataset.

data dummy;
    set &inset;
    selection_key = ranuni(-1);   /* random sort key for each record */
run;
proc sort data=dummy;
    by selection_key;
run;
data &outset;
    set dummy (obs=&size);        /* keep the first &size records in random order */
    drop selection_key;
run;

But this is still not perfectly generic, and neither are any of the other ideas, because each introduces at least one variable that may already be defined in the &inset dataset. So whatever approach is used, care must be taken, along with some creativity on the part of the programmer.
