al165275
Fluorite | Level 6

I have a data set with 54 million observations, 10 categorical variables (coded 0 or 1, or S or N), and 2 numeric variables. Some variables have missing values.

I don't know how I can determine the sample size and which method to use in SAS.

 

(The goal is to do exploratory analysis and then cluster analysis.)

1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

Pardon my confusion, but if you have the whole population, why not summarize it, rather than sampling to estimate population parameters?  Think about that a bit. These days 54M records are not an extremely large dataset.

 

 Anyway, if you do want to work with samples:

 

As far as tools go, the SURVEY procs are likely the best: SURVEYSELECT to sample the data, SURVEYFREQ for the categorical variables, and SURVEYMEANS for the continuous variables. My recommendation is to work through all of the examples in the documentation, so that you get a feel for what the statements in each PROC enable you to do. Then try to use the code there to address your questions. If you run into trouble, come on back, but please don't just say "It didn't work." Provide the code, some sample data, and the log to show where and what isn't working.
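As a rough starting point, the three procedures might be chained as in the sketch below. It is not taken from the documentation examples. The data set name `work.population` and the variable names `cat1`-`cat10`, `num1`, and `num2` are placeholders for your own names, and the seed is arbitrary.

```sas
/* Sketch only: work.population, cat1-cat10, num1 and num2 are
   hypothetical names; substitute your own data set and variables. */

/* Draw a 1% simple random sample without replacement. */
proc surveyselect data=work.population
                  out=work.sample1pct
                  method=srs        /* simple random sampling           */
                  samprate=0.01     /* keep 1% of the observations      */
                  stats             /* add SelectionProb/SamplingWeight */
                  seed=12345;       /* arbitrary seed, reproducible     */
run;

/* One-way tables for the categorical variables. */
proc surveyfreq data=work.sample1pct total=54000000;
   tables cat1-cat10;
   weight SamplingWeight;
run;

/* Means, standard errors and confidence limits for the numeric variables. */
proc surveymeans data=work.sample1pct total=54000000 mean stderr clm;
   var num1 num2;
   weight SamplingWeight;
run;
```

The TOTAL= option tells the analysis procedures how large the population is, so the standard errors include a finite population correction. Replace 54000000 with the exact number of observations.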

 

SteveDenham


4 REPLIES
SteveDenham
Jade | Level 19

What do you mean by determining a sample size?  With 54 million observations you have all the sample size you need for any inferential statistics, and sample size is not a consideration for exploratory statistics.  If you mean what proportion to sample so that you can do the exploratory stats, I would say 1% would be adequate.  If you feel you need a better estimate, then try bootstrapping with replacement.  Generate a hundred or so 1% samples and then look at the distribution of the sampled parameters.
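One way to read the resampling suggestion is sketched below: draw 100 replicate 1% samples with replacement, compute the mean of each numeric variable within every replicate, and then look at how those means are distributed. The names `work.population`, `num1`, and `num2` are placeholders, and the seed is arbitrary.

```sas
/* Sketch only: work.population, num1 and num2 are hypothetical names. */

/* 100 replicate samples, each 1% of the data, drawn with replacement. */
proc surveyselect data=work.population
                  out=work.boot
                  method=urs        /* unrestricted = with replacement  */
                  samprate=0.01
                  reps=100          /* adds a Replicate variable to OUT= */
                  outhits           /* one row per selection             */
                  seed=12345;
run;

/* Mean of each numeric variable within each replicate. */
proc means data=work.boot noprint;
   by Replicate;
   var num1 num2;
   output out=work.boot_means mean=mean_num1 mean_num2;
run;

/* Distribution of the 100 replicate means. */
proc univariate data=work.boot_means;
   var mean_num1 mean_num2;
   histogram mean_num1 mean_num2;
run;
```

With 54 million observations, `work.boot` holds about as many rows as the original data set (100 samples of roughly 540,000 rows each), so allow for the disk space.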

 

SteveDenham

al165275
Fluorite | Level 6
 
I'm sorry, I guess I didn't specify that the 54 million observations is my entire population. 
To select 1%, which command will I use? SURVEYSELECT?
 
If I select 1%, how can I look at the distribution of the sampled parameters in SAS? Can you give me an example? 
 
(I'm new to SAS)
 
 
 
ballardw
Super User

@al165275 wrote:
 
If I select 1%, how can I look at the distribution of the sampled parameters in SAS? Can you give me an example? 
 

Parameters? I don't see parameters but data. If you want the distribution of values for variables then likely proc freq is a place to start.
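A minimal sketch of that starting point follows. It assumes the sample sits in a data set called `work.sample1pct` with categorical variables `cat1`-`cat10` and numeric variables `num1` and `num2`. All of these names are placeholders.

```sas
/* Sketch only: work.sample1pct, cat1-cat10, num1 and num2 are
   hypothetical names; substitute your own. */

/* Frequency tables; MISSING counts missing values as their own level. */
proc freq data=work.sample1pct;
   tables cat1-cat10 / missing;
run;

/* Basic summary of the numeric variables, including missing counts. */
proc means data=work.sample1pct n nmiss mean std min max;
   var num1 num2;
run;
```

The same two steps run unchanged on the full data set if you point DATA= at it instead of the sample.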
