DATA Step, Macro, Functions and more

Select Observations at Random for Data Checking

Accepted Solution Solved
Reply
Regular Contributor
Posts: 199
Accepted Solution

Select Observations at Random for Data Checking

Is there a simple way to select a number of observations at random when doing data checks?

 

I have been using PROC PRINT w/ the FIRSTOBS and OBS options, but there must be a way to select a number of observations at random.

 

Thanks!


Accepted Solutions
Solution
‎08-08-2016 06:21 PM
Super User
Posts: 10,550

Re: Select Observations at Random for Data Checking

If you want a random sample another way would be to use proc surveyselect. The parameter samprate is very easy to use to select a percentage of records or sampsize to select a specific number of records.

 

Depending on the number of records and what conditions may be involved for checking you might use code looking for specific things.

Suppose you have a variable that should never have a value greater than 10:

Proc sql;

   select count(*)

   from dataset

   where variablex > 10;

quit;

 

Would tell you how many records have an invalid value.

Conditions could be multiple such as variablex>5 and missing(variabley) if variabley should have a value whenever variablex is greater than 5.

 

If you have multiple variables that should have the same range of non-missing values then custom formats may help. Suppose I have some variables that should only have values of 1, 2, 3, 4, 5 ( 1 to 5 scale) or 9 to indicate no opinion in a typical survey question.

 

Proc format;

value validscale

1, 2, 3, 4, 5,9='Valid'

other='Invalid'

;

run;

 

proc freq data=have;

   tables <list the variables with that code>;

   format <same variables> validscale. ;

run;

would give you tables with counts of invalid values.

Unless the data is extremely large then you are validating all records not just a sample for those variables.

View solution in original post


All Replies
Trusted Advisor
Posts: 2,113

Re: Select Observations at Random for Data Checking

The approach that I have used is to create a new variable in the data set based on the RANUNI funciton and then use that to make a pseudo-random selection of a prorportion of the records for audit.  I'm sure you could set up something similar in PROC SQL so you didn't have to do multiple passes of the data.

Solution
‎08-08-2016 06:21 PM
Super User
Posts: 10,550

Re: Select Observations at Random for Data Checking

If you want a random sample another way would be to use proc surveyselect. The parameter samprate is very easy to use to select a percentage of records or sampsize to select a specific number of records.

 

Depending on the number of records and what conditions may be involved for checking you might use code looking for specific things.

Suppose you have a variable that should never have a value greater than 10:

Proc sql;

   select count(*)

   from dataset

   where variablex > 10;

quit;

 

Would tell you how many records have an invalid value.

Conditions could be multiple such as variablex>5 and missing(variabley) if variabley should have a value whenever variablex is greater than 5.

 

If you have multiple variables that should have the same range of non-missing values then custom formats may help. Suppose I have some variables that should only have values of 1, 2, 3, 4, 5 ( 1 to 5 scale) or 9 to indicate no opinion in a typical survey question.

 

Proc format;

value validscale

1, 2, 3, 4, 5,9='Valid'

other='Invalid'

;

run;

 

proc freq data=have;

   tables <list the variables with that code>;

   format <same variables> validscale. ;

run;

would give you tables with counts of invalid values.

Unless the data is extremely large then you are validating all records not just a sample for those variables.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 2 replies
  • 233 views
  • 2 likes
  • 3 in conversation