I have a data set with 54 million observations, 10 categorical variables (coded 0 or 1, or S or N), and 2 numeric variables; some variables have missing values.
I don't know how to determine the sample size or which sampling method to use in SAS.
(The goal is to do exploratory analysis and clustering.)
Pardon my confusion, but if you have the whole population, why not summarize it, rather than sampling to estimate population parameters? Think about that a bit. These days 54M records are not an extremely large dataset.
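A minimal sketch of summarizing the full table directly might look like the following. The names are assumptions for illustration only: WORK.BIG stands for the 54M-row data set, cat1-cat10 for the categorical variables, and num1 and num2 for the numeric ones.

```sas
/* Hypothetical names: WORK.BIG, cat1-cat10, num1, num2 */

/* One-way frequency tables; MISSING treats missing values as a
   level so they show up in the counts and percentages */
proc freq data=work.big;
   tables cat1-cat10 / missing;
run;

/* Non-missing and missing counts plus basic descriptive
   statistics for the numeric variables */
proc means data=work.big n nmiss mean std min p25 median p75 max;
   var num1 num2;
run;
```

Both procedures read the data in a single pass, so this is usually feasible at this size; the missing-value counts also tell you how much data a later clustering step would lose.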
Anyway, if you do want to work with samples:
As for tools, the SURVEY procs are likely the best: SURVEYSELECT to sample the data, SURVEYFREQ for the categorical variables, and SURVEYMEANS for the continuous variables. My recommendation is to work through all of the examples in the documentation, so that you get a feel for what the statements in each PROC enable you to do. Then try to use the code there to address your questions. If you run into trouble, come on back, but please don't just say "It didn't work." Provide the code, some sample data, and the log to show where and what isn't working.
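As a rough starting point, the three procedures could be chained like this. It is only a sketch under assumptions: WORK.BIG, WORK.SAMP, cat1, cat2, num1, and num2 are hypothetical names, and the 1% simple random sample is one choice among many designs.

```sas
/* Hypothetical names: WORK.BIG (full data), WORK.SAMP (sample) */

/* Draw a 1% simple random sample without replacement.
   STATS adds SelectionProb and SamplingWeight to the output. */
proc surveyselect data=work.big out=work.samp
                  method=srs samprate=0.01 seed=20240101 stats;
run;

/* Weighted frequencies for categorical variables;
   RATE= supplies the sampling rate for the finite population
   correction. List the remaining variables the same way. */
proc surveyfreq data=work.samp rate=0.01;
   tables cat1 cat2;
   weight SamplingWeight;
run;

/* Means, standard errors, and confidence limits for the
   numeric variables */
proc surveymeans data=work.samp rate=0.01 mean stderr clm;
   var num1 num2;
   weight SamplingWeight;
run;
```

Fixing SEED= makes the sample reproducible, which helps if you need to post the code and log when something goes wrong.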
SteveDenham
What do you mean by determining a sample size? With 54 million observations you have all the sample size you need for any inferential statistics, and sample size is not a consideration for exploratory statistics. If you mean what proportion to sample so that you can do the exploratory stats, I would say 1% would be adequate. If you feel you need a better estimate, then try bootstrapping with replacement. Generate a hundred or so 1% samples and then look at the distribution of the sampled parameters.
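One way to sketch the repeated-sampling idea is shown below, assuming hypothetical names (WORK.BIG, WORK.REPS, WORK.REP_STATS, num1, num2) and using the mean as the statistic of interest. METHOD=URS samples with replacement, and OUTHITS writes one row per selection.

```sas
/* Hypothetical names throughout; adjust to your own data. */

/* 100 independent 1% samples drawn with replacement.
   The output contains a Replicate variable (1 to 100).
   Note: the output holds about as many rows as the input. */
proc surveyselect data=work.big out=work.reps
                  method=urs samprate=0.01 outhits
                  reps=100 seed=12345 noprint;
run;

/* One row of estimates per replicate */
proc means data=work.reps noprint;
   by Replicate;
   var num1 num2;
   output out=work.rep_stats mean=mean_num1 mean_num2;
run;

/* Distribution of the 100 sample means */
proc univariate data=work.rep_stats;
   var mean_num1 mean_num2;
   histogram mean_num1 mean_num2;
run;
```

If the histograms are tight around a single value, a 1% sample is probably adequate for that statistic; a wide spread suggests using a larger sampling rate.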
SteveDenham
@al165275 wrote:
If I select 1%, how can I look at the distribution of the sampled parameters in SAS? Can you give me an example?
Parameters? I don't see parameters, only data. If you want the distribution of values for the variables, then PROC FREQ is likely a good place to start.
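For instance, assuming WORK.SAMP is a 1% sample that has already been drawn and that cat1-cat10, num1, and num2 are placeholder variable names, a first look at the distributions might be:

```sas
/* Hypothetical names: WORK.SAMP, cat1-cat10, num1, num2 */
ods graphics on;

/* Frequency tables and bar charts for the categorical variables */
proc freq data=work.samp;
   tables cat1-cat10 / missing plots=freqplot;
run;

/* Moments, quantiles, extreme values, and histograms for the
   numeric variables */
proc univariate data=work.samp;
   var num1 num2;
   histogram num1 num2;
run;
```

PROC FREQ suits variables with few distinct values; for the two numeric variables PROC UNIVARIATE gives a more useful picture than a long frequency table.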