Hi,
Apologies if this is a silly question; I am a relatively new SAS user, currently running 9.3.
I have a dataset (CITIES) of around 20,000 cities, each with a population size classification, climate type, and national GDP classification. Some cities in the dataset also have data on transport, health, etc., which I have summarised into a single column called DataCoverage that counts the columns with known data for each city.
I'd like to do further analysis on a sub-sample of cities, and I would like to randomly select them in a manner which reflects the existing proportions of the data. I have done:
proc surveyselect data = CITIES out = samp1 method = srs sampsize=200 seed = 9876;
strata CLIMATE POPULATION_CLASS GDP_CLASS / alloc=proportional;
run;
What I would really like to do is select a subsample, which represents the proportions of the original dataset, but gives more weight to those with a larger DataCoverage (i.e. more known data, so I don't have to go find the data somewhere myself). Is such a thing possible?
Thanks,
Jon
If you can provide a numeric variable that represents data coverage, with larger meaning more coverage, you might be able to get this with a PPS selection using that variable for the SIZE.
Depending on how you are defining "reflects the existing proportions", you may need to look at setting sample sizes per stratum.
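One way to see how closely a sample tracks the full data set is to compare the share of each stratum in both. The following is only a sketch: it assumes the sample is in samp1 and the strata are CLIMATE, POPULATION_CLASS and GDP_CLASS as in the code above. The names share_compare, PopPct and SampPct are made up for illustration.

```sas
/* Sketch: share of each stratum in CITIES versus in the sample samp1. */
/* share_compare, PopPct and SampPct are illustrative names only.      */
proc sql;
    create table share_compare as
    select p.CLIMATE, p.POPULATION_CLASS, p.GDP_CLASS,
           p.PopPct,
           coalesce(s.SampPct, 0) as SampPct   /* 0 if the stratum was not sampled */
    from (select CLIMATE, POPULATION_CLASS, GDP_CLASS,
                 100 * count(*) / (select count(*) from CITIES) as PopPct
          from CITIES
          group by CLIMATE, POPULATION_CLASS, GDP_CLASS) as p
    left join
         (select CLIMATE, POPULATION_CLASS, GDP_CLASS,
                 100 * count(*) / (select count(*) from samp1) as SampPct
          from samp1
          group by CLIMATE, POPULATION_CLASS, GDP_CLASS) as s
    on  p.CLIMATE = s.CLIMATE
    and p.POPULATION_CLASS = s.POPULATION_CLASS
    and p.GDP_CLASS = s.GDP_CLASS
    order by p.CLIMATE, p.POPULATION_CLASS, p.GDP_CLASS;
quit;
```

With a total sample of 200, very small strata cannot match their population share exactly, so some gap between PopPct and SampPct is to be expected.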
Thanks, I think this gets me close.
I'm defining "reflects the existing proportions" as, for example:
If cities with a POPULATION_CLASS of <50K, a GDP_CLASS of "LowerMiddle GDP" and a CLIMATE of "Temperate Humid" comprise 5% of all cities in the world, then I want them to be 5% of my sampled dataset. The percentage of each stratum in the sample should reflect that in the original dataset.
I have changed to:
proc surveyselect data = CITIES out = samp1 method = pps sampsize=200 seed = 9876;
strata CLIMATE POPULATION_CLASS GDP_CLASS / alloc=proportional;
size DataCoverage;
run;
I've got a few problems, namely:
1) It's not giving me any cities with DataCoverage=0. It's OK to have some in order to maintain proportions; I just want to minimise them if possible.
2) Since DataCoverage isn't great, I am not getting a sample size of 200 (97, actually).
Thanks again for your help.
Jon
The SIZE value has to be non-zero in the basic use of SURVEYSELECT. Since we are using something that isn't really a population size count, I would suggest adding 1 to your DataCoverage value for every record so that all values are 1 or greater, and then subtracting the 1 at the end to get back to the original value.
Also, every record should have a non-missing DataCoverage value, or it will be excluded.
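As a rough sketch of the add-1 suggestion (not tested against your data): the names cities_pps and SizeMeasure are made up for illustration, and treating a missing DataCoverage as 0 is an assumption you may not want.

```sas
/* Sketch: shift the size measure so that no city has a size of 0.   */
/* cities_pps and SizeMeasure are illustrative names only.           */
data cities_pps;
    set CITIES;
    /* Assumption: a missing DataCoverage counts as 0 known columns. */
    SizeMeasure = coalesce(DataCoverage, 0) + 1;
run;

/* The input must be sorted by the STRATA variables. */
proc sort data=cities_pps;
    by CLIMATE POPULATION_CLASS GDP_CLASS;
run;

proc surveyselect data=cities_pps out=samp1 method=pps sampsize=200 seed=9876;
    strata CLIMATE POPULATION_CLASS GDP_CLASS / alloc=proportional;
    size SizeMeasure;   /* selection probability proportional to DataCoverage + 1 */
run;
```

DataCoverage itself is left unchanged in the output, so there is nothing to subtract afterwards if you keep the shifted value in a separate variable like this. In a small stratum where one city has a much larger size than the rest, PPS selection can still complain that a unit's relative size is too large, in which case that stratum may need separate handling.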
Very clever! That seems to have done what I wanted it to. Thanks for your help!
Jon