About AlexBeaver

AlexBeaver · ‎09-25-2024

SAS has made the decision to no longer deliver or support CData JDBC drivers for Facebook, Google Analytics, Google Drive, Microsoft OneDrive, and YouTube Analytics (the “Drivers”) in SAS Viya, effective immediately. This decision was made due to limited usage and technical issues that have caused restricted capabilities for the Drivers. SAS is removing these drivers from currently supported releases and all future releases of SAS Viya. For existing deployments that included the Drivers, SAS will continue to provide support in accordance with SAS Technical Support Policies. Please note that support documentation will remain accessible in accordance with SAS policies. In releases of the offering already deployed before 2024.07, customers will see no immediate change; the Drivers will be available with the software. However, if the software is re-downloaded and re-deployed – or updated to the 2024.07 or newer release – the Drivers will no longer be available. After this point any code using the Drivers will return an error stating that the Drivers do not exist. If you require the Drivers going forward, SAS recommends that you contact CData directly to obtain a license, or work with another vendor to obtain a license, to connect to these data sources. You can also learn how SAS connects with popular Microsoft 365 tools like Microsoft OneDrive, Teams and SharePoint. The SAS Viya 4 releases shown below are affected by the removal of the Drivers.

Current and 3 previous stable releases: 2024.06, 2024.07, 2024.08, 2024.09 (Current)

Current and 3 previous Long-Term Stable releases: 2022.09 LTS, 2023.03 LTS, 2023.10 LTS, 2024.03 LTS (Current)

AlexBeaver · ‎03-08-2024

Hi @ronan, we are migrating the Focus Areas pages! Scalability & Performance can now be found at https://support.sas.com/en/software/scalability-performance.html.

AlexBeaver · ‎03-07-2024

The SAS System provides the FULLSTIMER option to collect performance statistics on each SAS step and on the job as a whole, and to place them in the SAS log. It is important to note that the FULLSTIMER measures only give you a snapshot view of performance at the step and job level. Each SAS port yields different FULLSTIMER statistics based on the host operating system. See the SAS host-specific documentation for the exact statistics offered. FULLSTIMER is invoked as a SAS option and takes effect after the option invocation. If you would like to have the performance statistics written to a SAS data set, download the attached ZIP file, which contains the experimental %LOGPARSE macro.

Why start with the FULLSTIMER option for monitoring? The best reason is that it tells you what is happening with the SAS System specifically. The statistics it provides are at the job-step level and can help pinpoint performance problems down to the step. This is extremely helpful in narrowing down troublesome activity and relating it to what your code is telling the system to do. (Note: If the test execution is long, expensive, high impact to the environment, and not easily set up, the SAS session monitoring can be done simultaneously with server and system performance monitoring.) FULLSTIMER measures can be used to help determine whether more in-depth performance monitoring with host monitoring or third-party tools is indicated.

A sample of FULLSTIMER output on UNIX for a SAS DATA step is listed below:

NOTE: DATA statement used:
      real time                     0.06 seconds
      user cpu time                 0.02 seconds
      system cpu time               0.00 seconds
      Memory                        88k
      Page Faults                   10
      Page Reclaims                 0
      Page Swaps                    0
      Voluntary Context Switches    22
      Involuntary Context Switches  0
      Block Input Operations        10
      Block Output Operations       12

It is important to know how these numbers are defined and what can be derived from them.

FULLSTIMER Statistics Definition and Interpretation

Real Time - the Real Time represents the elapsed time or "wall clock" time.
This is the time spent to execute a job or step. This is the time the user experiences while waiting for the job/step to complete. Note: As host system resources are heavily utilized, the Real Time can go up significantly, representing a wait for various system resources to become available for the SAS job/step's usage.

User CPU Time - the time spent by the processor to execute user-written code. This is user-written from the perspective of the operating system and not the customer's language statements. That is, it is all SAS System code that is not operating system code.

System CPU Time - the time spent by the processor to execute operating system tasks that support user-written code (all CPU tasks that were not executing user-written code). The user CPU time and system CPU time are mutually exclusive.

Memory - Memory represents the amount of memory allocated to that job/step. This does not represent the entire amount of memory that the SAS session is consuming, as it does not reflect any SAS overhead activities (SAS manager, etc.).

Page Faults - Represents the number of virtual memory page faults that occurred during the job/step. Page Faults are pages that required an I/O to retrieve (a read was done to the I/O subsystem).

Page Reclaims - Represents the number of pages retrieved from the page list awaiting re-allocation (all done in memory). These pages did not require I/O activity to obtain.

Page Swaps - The number of times a process was swapped out of main memory.

Voluntary Context Switches - Represents the number of times a process releases its CPU time-slice voluntarily before its time-slice allocation has expired. This usually occurs when the process needs an external resource, like making an I/O call for more data.

Involuntary Context Switches - The number of times a process releases its CPU time-slice involuntarily. This usually happens when its CPU time-slice has expired before the task was finished, or a higher priority task takes its time-slice away.
Block Input Operations - The number of "bufsize" reads that occur. These are I/O operations to read the data into memory for usage. Not all reads have to utilize an I/O operation since the page being requested may still be cached in memory from previous reads.

Block Output Operations - This represents the number of "bufsize" writes that occur. These are the same as block input operations except that they pertain to the writes to files. As in the case of block input operations, not all block outputs will cause an I/O operation. Some files may still be cached in memory.

Performance problems usually involve one or more of the following physical areas:
- CPU activity
- Memory activity
- I/O subsystem activity (disk and file systems)
- Network activity (this will be discussed outside the context of the SAS system later).

By examining FULLSTIMER statistics, and interpreting what is happening with and between the factors producing the measures, we can get a quick idea of where the system is having problems. We can then resort to host-level and third-party measuring tools to obtain a very detailed picture of the problem.

If the host-level and third-party tools give such detail, why not use them first? Very simply, there are many tools to use, and each is fairly good at one or more specific areas of investigation, such as CPU, Memory, and I/O. Also, some require server root-level access to deploy. FULLSTIMER is quick and easy (incorporated in the SAS system), requires no special privileges, you can do it yourself, and it can help quickly narrow the field of things to test next.

The following is a general list of interpretations you can make using FULLSTIMER:

Real Time/CPU Time. The most valuable way to use FULLSTIMER is to compare timing information. By comparing the Real Time (elapsed time) with the total CPU time (system CPU time plus user CPU time), you can quickly determine if the problem is CPU related.
If the Real Time and total CPU time are within 15 percent of each other, this usually indicates that the system is moving data well (at least during the run time of that job/step processing). This means that the ratio of CPU process time is close to that of the total job. This indicates that the system memory, disk system, and file system are getting data to the CPU quickly enough to not be a problem. If you are experiencing bad task performance, and the real and CPU time are within 15 percent of each other, it most likely means that your task is CPU bound. The only way to improve the performance will be to get a faster CPU, split the process over more CPUs (multi-threading or parallel processing), or reengineer the code to be more efficient. If the Real Time and total CPU time are routinely very disparate (for example, if there is a 50 percent margin between them), then you very likely have a problem in your system getting information to the CPU fast enough. Make a closer examination of the Memory and I/O subsystems using the host or third-party tools mentioned in the next section.

Other valuable information from FULLSTIMER can be gained by looking at the other statistics:

Memory. If a sizeable quantity of memory is used and your elapsed time differs greatly from your total CPU time, you may also want to take a close look at your memory using host or third-party tools that are mentioned in the next section.

Involuntary Context Switches. If Involuntary Context Switches are consistently high across many steps and jobs over long time-periods, then your CPU system is under a heavy load, and you will want to examine that more closely with the tools mentioned in the next section.

Page Swaps. If Page Swaps are consistently high, then your memory system is being stressed and needs more examination.
Other statistics. Block Input and Output Operations, Page Faults and Reclaims, and Voluntary Context Switches can hint at issues, but require more corroboration from the measures previously discussed to make a case for narrowing down investigation. These measures could be high in and of themselves without being a symptom of performance problems.

Once FULLSTIMER statistics have been examined, they should help indicate which area(s) should be examined in more detail. It is often the case on overloaded systems that multiple areas present themselves for examination. The FULLSTIMER activity should help point to tools that could be used to get a more detailed picture of any hardware/file system issues. This comprises our next step, detecting performance issues at the host server system level.

Note: This content was originally published on support.sas.com.
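The Real Time/CPU Time comparison in this post can be sketched outside SAS. The Python snippet below is only an illustration and is not part of FULLSTIMER or %LOGPARSE: the function name `classify_step`, the choice to express the gap as a fraction of Real Time, and the use of the 15 percent and 50 percent figures as hard cut-offs are all assumptions made for the example.

```python
def classify_step(real, user_cpu, system_cpu):
    """Rough reading of one step's FULLSTIMER timings (all in seconds).

    Hypothetical helper: the thresholds mirror the 15% / 50% rules of
    thumb from the text and are treated as hard cut-offs for illustration.
    """
    total_cpu = user_cpu + system_cpu        # total CPU = user + system
    if real <= 0:
        return "too short to judge"
    gap = (real - total_cpu) / real          # share of elapsed time not spent on CPU
    if gap <= 0.15:
        return "likely CPU bound"
    if gap >= 0.50:
        return "likely waiting on memory or I/O"
    return "inconclusive"


# Two made-up steps: one dominated by CPU work, one mostly waiting.
print(classify_step(real=10.0, user_cpu=8.0, system_cpu=1.0))
print(classify_step(real=10.0, user_cpu=2.0, system_cpu=1.0))
```

A single step proves little on its own; as the post notes, the pattern matters when it shows up routinely across steps and jobs.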

AlexBeaver · ‎01-09-2024

SPD Engine or SPD Server?

The SPD Engine (SPDE) and the SPD Server product share a common heritage and, therefore, share a great number of features and performance benefits. However, there are some important differences, primarily in the execution environment. SPDE should be considered an entry-level scalable product. It runs as a libname engine in the SAS environment. SPD Server is a standalone client/server product. Applications initially developed on SPDE can be migrated to SPD Server with ease, as the need to move to a full client/server environment arises.

Compared to SPDE, SPD Server:
- requires its computer be a mostly dedicated server; the more dedicated, the better.
- requires more skills to set up and administer than SPDE.
- supports multi-user client/server access.
- supports an Access Control List–based security model.
- supports the SQL functions parallel BY-group (PBG) processing and implicit pass-through.
- is not available on Linux (LNX), OS/390 (MVS), HP/UX for the Itanium Processor Family (H6I), or OpenVMS Alpha (ALP).

For more information, visit the SPD Server Learn & Support page.

SPD Engine or Base Engine?

SPDE is optimized for the storage and sequential access of large and very large data sets (millions of rows, many GB of data). For medium to small data sets, the base engine is often a better performer.

Compared to the base engine, SPDE:
- supports more than 32K columns in SAS 9 and later. The base engine supports more than 32K columns in SAS 9.1 and later.
- is the only SAS engine that supports more than 2^31 - 1 (approximately 2 billion) rows on 32-bit hosts.
- supports the implicit sort for BY processing.
- supports optimization of the WHERE expression with multiple indexes.
- supports optimization of the WHERE expression containing OR.
- supports partitioned data sets.
- locks at the member level; the base engine locks at the record level.
- requires an index-reorganization utility to rebalance the index tree.
- does not support some of the base engine features: utility (byte) files, catalogs, views, MDDBs, integrity constraints, data set generations, CEDA, audit trail, and options for national language support.

Content originally published in 2003. If you're looking for a solution for advanced analytics and real-time data processing, see SAS with SingleStore.

AlexBeaver · ‎01-09-2024

The attached documents and samples provide the detail on how to configure high availability of your critical SAS services using SAS Grid Manager. Visit the SAS Grid Manager support page for more information about this product.

High Availability Services with SAS Grid Manager
This document addresses the requirements for implementing High Availability (HA) services running in a SAS grid environment using the EGO capabilities of Platform Suite for SAS, which is included with SAS Grid Manager. Configuration examples are included that provide details for configuring essential SAS services to be Highly Available in the grid.

server_wrap.sh
This is a UNIX shell script that can be used to wrap a service init script. It will keep execution in the foreground until the service daemons exit.

ego_server.sh
This is a UNIX shell script for interfacing with egosh.

Installing and Configuring SAS Environment Manager in a SAS Grid Environment
This document describes the additional configuration steps needed when deploying SAS Environment Manager in a SAS Grid with a shared configuration directory. It also documents the deploy-ev-agents.sh script that automates this process. This script is available at the link below.

deploy-ev-agents.sh
This is a UNIX shell script that automates the steps necessary to deploy SAS Environment Manager in a SAS Grid Environment.

AlexBeaver · ‎12-21-2023

Overview

Researchers often use sample survey methodology to obtain information about a large population by selecting and measuring a sample from that population. Researchers apply probability-based scientific designs to select the sample in order to reduce the risk of a distorted view of the population and to enable statistically valid inferences to be made from the sample.

The SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, and SURVEYPHREG procedures in SAS/STAT software properly analyze complex survey data by taking into account the sample design. You can use these procedures for multistage or single-stage designs, with or without stratification, and with or without unequal weighting. The survey analysis procedures provide a choice of variance estimation methods, which include Taylor series linearization, balanced repeated replication (BRR), and the jackknife.

When you use most other SAS/STAT procedures, statistical inference is based on the assumption that the sample is drawn from an infinite population by simple random sampling. If the sample is in fact selected from a finite population by using a complex survey design, these procedures usually do not calculate the estimates and their variances according to the design that is actually used. Using analyses that are not appropriate for your sample design might lead to incorrect statistical inferences.

However, there might be times when you want to analyze data that are sampled from a finite population by using a complex survey design, but the only SAS/STAT procedure capable of fitting the type of model that you need is not designed to account for sampling based on a complex survey design. In such cases, you can sometimes use a non-survey procedure to obtain valid point estimates of the model parameters, and use the SURVEYMEANS procedure and a little programming to obtain valid standard errors for the model parameter estimates.
Specifically, this example demonstrates how to combine the generalized linear modeling capabilities of the GENMOD procedure and the delete-1 jackknife (resampling) method of the SURVEYMEANS procedure to fit a Poisson model to count data that are sampled from a finite population by using a complex survey design.

Performing the delete-1 jackknife estimation of the standard errors of the model parameter estimates requires fitting a model to each of the jackknife replicates. As is typical in programming, there is more than one way to perform most tasks. This example demonstrates two different ways to accomplish the same task. "Step 3a: Fit a Model to Each Replicate Sample by Using BY-Group Processing" uses the GENMOD procedure’s BY-group processing capabilities to fit a model to each replicate; this is the most efficient method. "Step 3b: Fit a Model to Each Replicate Sample by Looping Through the Replicates" demonstrates how to perform the same task by using a SAS macro to loop through the replicates. Looping is less efficient than BY-group processing but requires less computer memory, which might become an issue if you have a very large sample.

Analysis

Obtaining Point Estimates of Model Parameters

Consider a finite population whose members are indexed by $U = \{1, 2, \ldots, N\}$, and let $F_N$ be the set of values for the population. Suppose you specify a population density function $f(y, \theta)$, where the parameter $\theta$ is of interest. If the entire population is observed, then this likelihood can be used to estimate $\theta$. Let $\theta_N$ be the desired estimator. $\theta_N$ is obtained by maximizing the log likelihood

$$ l_N(\theta) = \sum_{i \in U} \log f(y_i, \theta) $$

with respect to $\theta$. Assume that probability sample $A$ is selected from the finite population $U$ and $\pi_i$ is the selection probability for unit $i$.
An estimator of the finite population log likelihood is

$$ l_\pi(\theta) = \sum_{i \in A} \pi_i^{-1} \log f(y_i, \theta) $$

A sample-based estimator $\hat{\theta}$ for the finite population quantity $\theta_N$ can be obtained by maximizing the pseudo-log-likelihood $l_\pi(\theta)$ with respect to $\theta$. The design-based variance for $\hat{\theta}$ is obtained by assuming the set of finite population values $F_N$ to be fixed. For more information about maximum pseudo-likelihood estimators and other inferential approaches for survey data, see Kish and Frankel (1974); Godambe and Thompson (1986); Pfeffermann (1993); Korn and Graubard (1999, chapter 3); Chambers and Skinner (2003, chapter 2); and Fuller (2009, section 6.5).

The practical implication of the preceding analysis is that if a SAS/STAT procedure performs weighted maximum likelihood estimation and the weights are applied such that the weights can be factored out of the log likelihood, then that procedure can generate valid point estimates of model parameters when the data are sampled according to a complex survey design.

The WEIGHT statement in the GENMOD procedure identifies a variable in the input data set to be used as the exponential family dispersion parameter weight for each observation. The exponential family dispersion parameter is divided by the WEIGHT variable value for each observation. This is true regardless of whether the parameter is estimated by the procedure or specified in the MODEL statement by using the SCALE= option. It is also true for distributions such as the Poisson and binomial that are not usually defined to have a dispersion parameter. For these distributions, a WEIGHT variable weights the overdispersion parameter, which has the default value of 1.

Consider a Poisson regression model of the observed number of counts, $y_i$, on a set of covariates, $x_i$, for units $i \in A$. Assume that $y_i \sim \mathrm{Poisson}(\theta_i)$ and the mean $\theta_i$ of the response in the $i$th observation is related to a linear predictor through the link function $\log(\theta_i) = x_i'\beta$, where $\beta$ is a vector of unknown parameters.
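The log likelihood can be written as

$$ l(\beta) = \sum_{i \in A} w_i \left[ y_i \log(\theta_i) - \theta_i - \log(y_i!) \right] $$

Because the weight, $w_i$, can be factored out of the log likelihood, you can use PROC GENMOD with a WEIGHT statement to obtain valid point estimates of the model parameters.

Caution: The log likelihood for the negative binomial model, however, is

$$ l(\beta, k) = \sum_{i \in A} \left[ y_i \log\left(\frac{k\theta_i}{w_i}\right) - \left(y_i + \frac{w_i}{k}\right) \log\left(1 + \frac{k\theta_i}{w_i}\right) + \log\frac{\Gamma(y_i + w_i/k)}{\Gamma(y_i + 1)\,\Gamma(w_i/k)} \right] $$

where $k$ is the negative binomial dispersion parameter. The weight, $w_i$, cannot be factored out of the log likelihood, so you cannot use PROC GENMOD with a WEIGHT statement to obtain point estimates of the model parameters that account for the unequal weights.

Whereas the weighted maximum likelihood point estimates that PROC GENMOD generates appropriately account for the unequal weights for distributions such as the Poisson, the weighted maximum likelihood variances and standard errors that PROC GENMOD computes do not account for the complex survey design. You must compute the variances and standard errors by using a different method. One such method is the delete-1 jackknife (resampling) method.

Obtaining Variance Estimates by Using the Delete-1 Jackknife Method

The jackknife method of variance estimation deletes one primary sampling unit (PSU) at a time from the full sample to create replicates. This method is also known as the delete-1 jackknife method, because it deletes exactly one PSU in every replicate. The total number of replicates $R$ is the same as the total number of PSUs. In each replicate, the sampling weights of the remaining PSUs are modified by the jackknife coefficient $\alpha_r$. The modified weights are called replicate weights.

Let PSU $i$ in stratum $h_r$ be omitted from the $r$th replicate, and let $n_{h_r}$ be the number of PSUs in stratum $h_r$; then the jackknife coefficient and replicate weights are computed as

$$ \alpha_r = \begin{cases} \dfrac{n_{h_r} - 1}{n_{h_r}} & \text{for a stratified design} \\[1ex] \dfrac{R - 1}{R} & \text{for a design with no stratification} \end{cases} $$

and

$$ w_{hij}^{(r)} = \begin{cases} w_{hij} & \text{if observation unit } j \text{ is not in stratum } h_r \\ 0 & \text{if observation unit } j \text{ is in PSU } i \text{ of stratum } h_r \\ w_{hij} / \alpha_r & \text{if observation unit } j \text{ is in stratum } h_r \text{ but not in PSU } i \end{cases} $$

You can use the VARMETHOD=JACKKNIFE(OUTJKCOEFS=) method-option with any of the survey estimation procedures to store the jackknife coefficients in a SAS data set and use the VARMETHOD=JACKKNIFE(OUTWEIGHTS=) method-option to store the replicate weights in a SAS data set.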
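To make the construction of the replicates concrete, here is a small Python sketch. It is an illustration only and not what PROC SURVEYMEANS runs internally. It assumes a stratified design with at least two PSUs per stratum, a jackknife coefficient of (n_h - 1)/n_h for a replicate that drops a PSU from a stratum with n_h PSUs, and replicate weights that are zero for the dropped PSU, divided by the coefficient for the other PSUs in that stratum, and unchanged elsewhere. The function name `jackknife_replicates` and the toy data are hypothetical.

```python
from collections import defaultdict


def jackknife_replicates(units):
    """Build delete-1 jackknife coefficients and replicate weights.

    units: list of (stratum, psu, weight) tuples, one per observation.
    Returns (coefs, rep_weights), where coefs[r] is the coefficient of
    replicate r and rep_weights[r][k] is the replicate weight of unit k.
    """
    psus_by_stratum = defaultdict(set)
    for stratum, psu, _ in units:
        psus_by_stratum[stratum].add(psu)

    coefs, rep_weights = [], []
    for h in sorted(psus_by_stratum):
        n_h = len(psus_by_stratum[h])
        if n_h < 2:
            raise ValueError(f"stratum {h} needs at least two PSUs")
        alpha = (n_h - 1) / n_h
        # One replicate per PSU: drop that PSU, reweight the rest of its stratum.
        for dropped in sorted(psus_by_stratum[h]):
            weights = []
            for stratum, psu, w in units:
                if stratum != h:
                    weights.append(w)          # other strata are untouched
                elif psu == dropped:
                    weights.append(0.0)        # the deleted PSU
                else:
                    weights.append(w / alpha)  # remaining PSUs in the stratum
            coefs.append(alpha)
            rep_weights.append(weights)
    return coefs, rep_weights


# Toy design: stratum 1 has PSUs 1-2, stratum 2 has PSUs 1-3.
units = [(1, 1, 10.0), (1, 2, 10.0), (1, 2, 10.0),
         (2, 1, 5.0), (2, 2, 5.0), (2, 3, 5.0)]
coefs, rep_weights = jackknife_replicates(units)
print(len(coefs))        # one replicate per PSU
print(rep_weights[0])    # replicate that drops PSU 1 of stratum 1
```

The number of replicates equals the number of PSUs, which matches the statement above that R is the total number of PSUs.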
Let $\hat{\theta}$ be the estimated parameters from the full sample, and let $\hat{\theta}_r$ be the estimated parameters for the $r$th replicate. You can estimate the covariance matrix of $\hat{\theta}$ by

$$ \hat{V}(\hat{\theta}) = \sum_{r=1}^{R} \alpha_r (\hat{\theta}_r - \hat{\theta})(\hat{\theta}_r - \hat{\theta})' $$

It is common to assume that the distribution of $\hat{V}(\hat{\theta})$ can be approximated by using a $\chi^2$ distribution with $R - H$ degrees of freedom, where $R$ is the number of replicates and $H$ is the number of strata, or $R - 1$ degrees of freedom when there is no stratification.

If one or more components of $\hat{\theta}_r$ cannot be calculated for some replicates, then you use only the replicates for which the parameters can be estimated. Estimability and nonconvergence are two common reasons why $\hat{\theta}_r$ might not be available for a replicate sample even if $\hat{\theta}$ is defined for the full sample. Let $R_a$ be the number of replicates where $\hat{\theta}_r$ are available, and let $R - R_a$ be the number of replicates where $\hat{\theta}_r$ are not available. Without loss of generality, assume that $\hat{\theta}_r$ is available only for the first $R_a$ replicates; then the jackknife variance estimator is

$$ \hat{V}(\hat{\theta}) = \frac{R}{R_a} \sum_{r=1}^{R_a} \alpha_r (\hat{\theta}_r - \hat{\theta})(\hat{\theta}_r - \hat{\theta})' $$

with $R_a - H$ degrees of freedom, where $H$ is the number of strata.

Example

Consider a hypothetical regional survey that seeks to describe the number of visits to health professionals that are made annually by members of a population. The survey is conducted by using a stratified clustered sampling design. The following statements create the SAS data set Counts.
The variable Visits is a count variable that records the number of visits to a health professional; Sex is a binary variable that records the gender of the respondent; Race is a categorical variable that records each respondent’s race; Marital is a categorical variable that records each respondent’s marital status; Private is a categorical variable that records whether a respondent has private health insurance, and if so, what type; Education is a categorical variable that records each respondent’s highest attained level of education; Person is a respondent’s unique identifier; Strata identifies the stratum from which each observation is sampled; PSU identifies the primary sampling units; and SamplingWeight records the sampling weights.

data counts;
   input visits sex race marital private education person
         strata psu SamplingWeight @@;
   datalines;
5 1 1 2 1 5 71511 1 1 1002.59
1 2 1 4 2 3 307568 1 1 1002.59
2 1 1 4 4 3 457473 1 1 1002.59
9 1 1 3 1 5 849963 1 1 1002.59
3 2 1 3 2 5 892466 1 1 1002.59
0 2 1 2 3 3 249075 1 2 1002.59
3 1 1 2 4 1 835408 1 2 1002.59
1 2 1 4 2 4 159262 1 3 1002.59
... more lines ...
2 2 1 1 4 2 244599 5 40 998.26
1 2 1 3 4 4 738928 5 40 998.26
2 2 1 3 2 2 830211 5 40 998.26
3 1 1 3 2 3 920025 5 40 998.26
;
run;

Step 1: Generate the Jackknife Coefficients and Replicate Weights

In the first step in the process, you generate the jackknife coefficients and replicate weights by using the SURVEYMEANS procedure and save the number of replicates and the number of strata in macro variables.

The following statements analyze the variable Visits and save the jackknife coefficients and replicate weights in the data sets JKcoefs and JKweights, respectively. It does not matter which variable you choose to analyze; the jackknife coefficients and replicate weights are the same regardless of the variable that you choose. If the replicate weights are available to you, then you can skip the PROC SURVEYMEANS step.
However, you still need to create the macro variables &Replicates and &H, which are generated to contain the number of replicates and the number of strata, respectively.

ods select none;
ods output VarianceEstimation=VE Summary=Summary;
proc surveymeans data=counts plots=none
                 varmethod=jackknife(outweights=jkweights outjkcoefs=jkcoefs);
   cluster psu;
   strata strata;
   weight SamplingWeight;
   var visits;
run;

The first statement suppresses all ODS output. You can omit this statement if you want to see the output from each step. The ODS OUTPUT statement saves the variance estimation table in the data set VE and saves the sampling design summary information in the data set Summary. VE contains the number of jackknife replicates that are created, and Summary contains the number of strata. Both the number of jackknife replicates and the number of strata are later retrieved and saved in macro variables.

The VARMETHOD=JACKKNIFE option in the PROC SURVEYMEANS statement specifies the delete-1 jackknife variance estimation method. The OUTWEIGHTS= suboption saves the jackknife replicate weights in the data set JKweights. The OUTJKCOEFS= suboption saves the jackknife coefficients in the data set JKcoefs. The CLUSTER, STRATA, and WEIGHT statements specify the sampling design. The VAR statement names the variable to be analyzed.

The following statements retrieve the number of replicates from the VE data set and the number of strata from the Summary data set. These values are stored in the macro variables &Replicates and &H, respectively.

/* Number of jackknife replicates -> &Replicates */
data _null_;
   set VE(where=(Label1="Number of Replicates"));
   call symput('replicates',cValue1);
run;

/* Number of strata -> &H */
data _null_;
   set Summary;
   if Label1="Number of Strata" then do;
      call symput('H',cValue1);
   end;
run;

Step 2: Fit the Model by Using the Full Sample and the Original Sampling Weights

In the second step you fit a model by using the full sample and the original sampling weights.
You then compute the number of parameters that are estimated by using the full sample and save that value in a macro variable.

The following statements use the GENMOD procedure to fit a Poisson model by using the full sample and the original sampling weights:

ods output ParameterEstimates=FullSample(where=(Parameter ne "Scale")
                                         keep=Parameter Estimate Level1
                                         rename=(Estimate=Estimate0))
           ParameterEstimates=parms(keep=df);
proc genmod data=jkweights;
   class sex race marital private education;
   weight SamplingWeight;
   model visits = sex race marital private education / dist=poisson;
run;

The ODS OUTPUT statement saves the parameter estimates from the Poisson model to the data set FullSample; the scale parameter is excluded and the variable Estimate, which contains the parameter estimates, is renamed Estimate0. The same statement also saves the variable DF, which contains the number of regression parameters that are estimated by using the full sample, in the data set Parms.

The DATA= option in the PROC GENMOD statement specifies that the data set JKweights, which contains the original data as well as the replicate weights, be used. The CLASS statement names the classification variables to be used as explanatory variables in the analysis. The WEIGHT statement specifies that the variable SamplingWeight be used as the exponential family dispersion parameter weight for each observation. The MODEL statement specifies the response variable and the explanatory variables, and the DIST= option specifies the Poisson distribution.

The following statements compute the number of parameters that are estimated by using the full sample and save that value in the macro variable &P. This step is needed because the full model might not be defined in some replicate samples and you need to exclude replicate models that do not have the same number of parameters as the full model.
ods output Statistics=statistics;
proc surveymeans data=parms sum;
   var df;
run;

data _null_;
   set statistics;
   call symput('p',Sum);
run;

Step 3a: Fit a Model to Each Replicate Sample by Using BY-Group Processing

In the third step, you need to prepare the data set that contains the original data and the jackknife weights (JKweights) so that you can use the GENMOD procedure’s BY-group processing capabilities. You then use the GENMOD procedure’s BY-group processing capabilities to fit a Poisson model to each replicate.

The data set JKweights is in what is known as wide form. This means that there is one observation for each respondent and there are R variables that contain the replicate weights. To use BY-group processing, the data must be in what is known as long form. In long form, you have R observations for each respondent and a single variable that contains the jackknife replicate weights.

The following statements create and call the macro %STACK, which reshapes the JKweights data set from wide form to long form. It creates the variable Replicate, which indexes the R copies of the original data, and the variable Repweight, which contains the replicate weights, and it sorts the newly reshaped data set by the variable Replicate. The macro has one required argument, DATA=, which specifies the name of the data set that contains the original data as well as the replicate weights.
%macro stack(data=);
   data &data;
      set &data;
      %do i=1 %to &replicates;
         Replicate=&i; Repweight=RepWt_&i; output;
      %end;
   run;
   proc sort data=&data; by replicate; run;
%mend stack;
%stack(data=jkweights)

The following statements fit a Poisson model to each replicate:

ods output ParameterEstimates=jkparms(where=(Parameter ne "Scale") keep=Replicate Parameter Estimate Level1) ParameterEstimates=jkdf(where=(Parameter ne "Scale") keep= Replicate Parameter Level1 df) ConvergenceStatus=converge;
proc genmod data=jkweights;
   class sex race marital private education;
   weight repweight;
   model visits = sex race marital private education / dist=poisson;
   by replicate;
run;

The ODS OUTPUT statement saves the parameter estimates for all R models in the data set JKparms, the degrees of freedom for all the models in the data set JKDF, and the convergence status for all the models in the data set Converge. The WEIGHT statement specifies that the variable Repweight be used as the exponential family dispersion parameter weight for each observation. The BY statement requests separate analyses of observations in groups that are indexed by the variable Replicate.

Step 3b: Fit a Model to Each Replicate Sample by Looping Through the Replicates

Rather than fitting a model to each replicate sample by using the GENMOD procedure’s BY-group processing capabilities, you can write and execute the macro %JKLOOP. This method is less efficient but requires less computer memory, which might become an issue if you have a very large sample. The macro %JKLOOP has one required argument, REPLICATES=, which specifies the number of jackknife replicates. The macro loops through the R replicates and fits a Poisson model by using the appropriate replicate sample and jackknife replicate weights.
The parameter estimates from each model are saved in the temporary data set Temp, the degrees of freedom for each model are saved in the temporary data set Temp2, and the convergence status of the model is saved in the temporary data set Temp3. A series of DATA steps then adds the variable Replicate to Temp, Temp2, and Temp3. The data sets Temp, Temp2, and Temp3 are then appended to the data sets JKparms, JKDF, and Converge, respectively. Finally, the variable Estimate in the data set FullSample is renamed Estimate0. The following statements create the macro %JKLOOP:

%macro jkloop(replicates=);
   %local _nopt;
   %let _nopt = %sysfunc(getoption(notes));
   options nonotes;
   ods select none;
   %do i=1 %to &replicates;
      ods output ParameterEstimates=temp(where=(Parameter ne "Scale") keep=Parameter Estimate Level1) ParameterEstimates=temp2(where=(Parameter ne "Scale") keep=Parameter Level1 df) ConvergenceStatus=temp3;
      proc genmod data=jkweights;
         class sex race marital private education;
         weight RepWt_&i;
         model visits = sex race marital private education / dist=poisson;
      run;
      data temp; set temp; Replicate=&i; run;
      data temp2; set temp2; Replicate=&i; run;
      data temp3; set temp3; Replicate=&i; run;
      proc append base=jkparms data=temp; run;
      proc append base=jkdf data=temp2; run;
      proc append base=converge data=temp3; run;
   %end;
   data FullSample; set FullSample; rename estimate=estimate0; run;
   ods select all;
   options &_nopt;
%mend jkloop;
%jkloop(replicates=&replicates)

Step 4: Compute the Jackknife Variances and Print the Results

In the fourth step, you merge the full-sample parameter estimates, the parameter estimates from the R replicates, and the jackknife coefficients into a single data set; compute the jackknife variances of the parameter estimates; and print the results.
Because generalized linear models are not guaranteed to converge and because the full model might not be defined in some replicate samples, the following statements check to see how many of the replicate models both converged and have the same number of parameters as the full-sample model. This number is retrieved and saved in the macro variable &R. The number is used later to compute confidence intervals for the parameter estimates.

ods output Statistics=statistics(keep=replicate sum);
proc surveymeans data=jkdf sum; var df; by replicate; run;
data statistics; set statistics; full=ifn(sum=&p,0,1); run;
data converge; merge converge statistics; by replicate; run;
data converged; set converge(where=(Status=0 & full=0)); run;
data nobs;
   dsid=open("converged");
   converged_replicates=attrn(dsid, "nobs");
   call symput('R',converged_replicates);
run;

The following statements create the data set JK by sorting and merging the data sets JKparms, Converge, FullSample, and JKcoefs, which contain the parameter estimates from the replicate models, the convergence status of the replicate models, the parameter estimates that were obtained by using the full sample, and the jackknife coefficients, respectively:

proc sort data=jkparms; by parameter level1; run;
proc sort data=FullSample; by parameter level1; run;
data jk; merge jkparms FullSample; by parameter level1; run;
proc sort data=jk; by replicate parameter level1; run;
data jk; merge jk jkcoefs converge; by replicate; run;

The next statements create the data set JKconverged by subsetting the data set JK so that JKconverged contains only parameter estimates from the replicate models that converged and that have the same number of parameters as the full-sample model. The variable SqrDev is created by computing the weighted squared deviations of the parameter estimates; the jackknife coefficients are used as the weights. JKconverged is then sorted by the variables Parameter and Level1.
data jkconverged; set jk(where=(Status=0 & full=0)); sqrdev=JKCoefficient*(estimate-estimate0)**2; run;
data vce; set jkconverged(keep= replicate parameter level1 estimate estimate0); diff=estimate0-estimate; run;
proc sort data=jkconverged; by parameter Level1; run;

The following statements compute the sum of squared deviations of the parameter estimates by using PROC SURVEYMEANS. The computed sums are in fact the jackknife variances of the parameter estimates. The ODS OUTPUT statement saves the computed variances in the data set JKvariance.

ods output Statistics=jkVariance;
proc surveymeans data=jkconverged sum plots=none; var sqrdev; by parameter Level1; run;

The following DATA step merges the data set JKvariance, which contains the jackknife variances, with the data set FullSample, which contains the full-sample parameter estimates. The variable StdErr is created by computing the square roots of the variances; the full covariance matrix of the parameter estimates is computed later. The variables UL and LL are also created to contain the 95% confidence limits of the parameter estimates.

data jkVariance(drop=stddev varname);
   merge jkVariance fullsample(rename=(estimate0=Estimate));
   by parameter Level1;
   StdErr=sqrt(Sum);
   rename Sum=Variance;
   DF=&R - &H;
   t=quantile('T', .975, &R-&H);
   ul=estimate+t*stderr;
   ll=estimate-t*stderr;
   label ul="Upper 95% CL";
   label ll="Lower 95% CL";
run;

The following statements print the parameter estimates, the standard errors, the degrees of freedom, and the 95% confidence limits:

ods select all;
title "Survey Poisson Regression";
title2 "with Delete-1 Jackknife Variance Estimation";
proc print data=jkVariance noobs label; var Parameter Level1 Estimate StdErr DF ll ul; run;
title;title2;

Output 1 displays the parameter estimates, the jackknife standard errors, the degrees of freedom, and the 95% confidence limits. The table shows how the number of visits differs across groups.
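The computation in this step can be summarized in formula form. The notation below is introduced here for illustration only and does not come from the original text: \hat\beta_j is the full-sample estimate of the j-th parameter (the variable Estimate0), \hat\beta_j^{(r)} is the estimate from replicate r (the variable Estimate), \alpha_r is the jackknife coefficient (the variable JKCoefficient), R is the number of usable replicates (the macro variable &R), and H is the value of the macro variable &H.

```latex
\begin{aligned}
\hat{V}_{JK}(\hat\beta_j) &= \sum_{r=1}^{R} \alpha_r \left(\hat\beta_j^{(r)} - \hat\beta_j\right)^{2} \\
\mathrm{StdErr}(\hat\beta_j) &= \sqrt{\hat{V}_{JK}(\hat\beta_j)} \\
\mathrm{LL},\ \mathrm{UL} &= \hat\beta_j \mp t_{0.975,\,R-H}\;\mathrm{StdErr}(\hat\beta_j)
\end{aligned}
```

As a rough check against Output 1, the intercept has an estimate of 0.2854 and a standard error of 0.10767 with 195 degrees of freedom. The 0.975 quantile of the t distribution with 195 degrees of freedom is approximately 1.972, so the limits are about 0.2854 ∓ 1.972 × 0.10767, or roughly 0.073 and 0.498, which agrees with the reported limits.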
For example, the average number of visits made by females is exp(0.08) times the average number of visits made by males, after adjusting for race, education, marital status, and private insurance coverage in the study population. However, because the 95% confidence interval contains 0, the difference is not statistically significant at the 0.05 level.

Output 1: Parameter Estimates and Jackknife Confidence Intervals

Survey Poisson Regression
with Delete-1 Jackknife Variance Estimation

Parameter  Level1  Estimate  StdErr   DF   Lower 95% CL  Upper 95% CL
Intercept           0.2854   0.10767  195   0.07304      0.49772
education  1        0.0412   0.09177  195  -0.13976      0.22221
education  2        0.1228   0.06841  195  -0.01214      0.25770
education  3        0.0355   0.06498  195  -0.09266      0.16365
education  4       -0.0217   0.06656  195  -0.15294      0.10960
education  5        0.0000   0.00000  195   0.00000      0.00000
marital    1        0.0366   0.06832  195  -0.09816      0.17133
marital    2        0.0489   0.06579  195  -0.08083      0.17868
marital    3        0.2383   0.05623  195   0.12739      0.34917
marital    4        0.0000   0.00000  195   0.00000      0.00000
private    1        1.3705   0.06588  195   1.24061      1.50047
private    2       -0.0805   0.06181  195  -0.20245      0.04137
private    3       -0.1291   0.09129  195  -0.30919      0.05090
private    4        0.0000   0.00000  195   0.00000      0.00000
race       1        0.1219   0.08017  195  -0.03617      0.28006
race       2        0.2789   0.09114  195   0.09912      0.45863
race       3        0.0000   0.00000  195   0.00000      0.00000
sex        1        0.0848   0.04523  195  -0.00444      0.17398
sex        2        0.0000   0.00000  195   0.00000      0.00000

Step 5: Compute the Full Jackknife Covariance Matrix

In the fifth and final step, you use statements such as the following to generate the covariance matrix of the parameter estimates, which you need if you want to perform hypothesis tests that involve more than one parameter:

proc sort data=jkdf; by replicate parameter level1; run;
data temp; merge vce jkdf; by replicate parameter level1; run;
proc transpose data=temp(where=(df=1)) out=temp2(drop=_name_) prefix=parm; by replicate; var diff; run;
data temp3(drop=donorstratum);
   merge temp2 jkcoefs;
   by replicate;
   do i=1 to &p; row=i; output; end;
run;
data temp3(drop=parm: jkcoefficient i j);
   set temp3;
   array col[&p];
   array parm[*] parm:;
   do i=1 to &p;
      if row=i then do;
         do j = 1 to &p; col[j]=jkcoefficient*parm[i]*parm[j]; end;
      end;
   end;
run;
proc sort data=temp3; by row; run;
ods select none;
ods output Statistics=statistics(drop=StdDev);
proc surveymeans data=temp3 sum plots=none; var col1-col14; by row; run;
ods select all;
proc transpose data=statistics out=CovB(drop=_name_ row) prefix=parm; var sum; by row; run;
proc print data=covb noobs; run;

Output 2 displays the covariance matrix.

Output 2: Parameter Estimates Covariance Matrix

parm1 parm2 parm3 parm4 parm5 parm6 parm7 parm8 parm9 parm10 parm11 parm12 parm13 parm14
0.011592 -0.002057 -0.002977 -0.002101 -0.002247 -0.001635 -0.001023 -0.001872 -0.002992 -0.002539 -0.002066 -0.005925 -0.006169 -0.001456
-0.002057 0.008421 0.002328 0.001955 0.001429 -0.000579 -0.000520 -0.000238 0.000445 0.000262 -0.000020050 0.000185 0.000172 0.000033987
-0.002977 0.002328 0.004680 0.002394 0.002534 0.000059913 -0.000012181 0.000237 -0.000311 -0.000336 -0.000905 0.000299 0.000561 0.000466
-0.002101 0.001955 0.002394 0.004223 0.002354 0.000010428 0.000209 0.000257 -0.000263 -0.000492 -0.000900 0.000306 0.000008650 -0.000108
-0.002247 0.001429 0.002534 0.002354 0.004430 0.000297 0.000194 0.000095299 -0.000622 -0.000470 -0.000869 0.000372 0.000142 -0.000077882
-0.001635 -0.000579 0.000059913 0.000010428 0.000297 0.004668 0.001877 0.001714 0.000139 0.000295 0.001283 -0.000067801 0.000103 -0.000332
-0.001023 -0.000520 -0.000012181 0.000209 0.000194 0.001877 0.004328 0.001706 -0.000478 -0.000080900 0.000821 -0.000177 -0.000350 -0.000416
-0.001872 -0.000238 0.000237 0.000257 0.000095299 0.001714 0.001706 0.003161 0.000479 0.000203 0.000504 -0.000323 -0.000230 0.000121
-0.002992 0.000445 -0.000311 -0.000263 -0.000622 0.000139 -0.000478 0.000479 0.004340 0.002663 0.002782 0.000424 0.000725 0.000012122
-0.002539 0.000262 -0.000336 -0.000492 -0.000470 0.000295 -0.000080900 0.000203 0.002663 0.003821 0.003159 0.000039949 0.000069672 -0.000154
-0.002066 -0.000020050 -0.000905 -0.000900 -0.000869 0.001283 0.000821 0.000504 0.002782 0.003159 0.008334 -0.000706 -0.000586 0.000038905
-0.005925 0.000185 0.000299 0.000306 0.000372 -0.000067801 -0.000177 -0.000323 0.000424 0.000039949 -0.000706 0.006427 0.005783 0.000355
-0.006169 0.000172 0.000561 0.000008650 0.000142 0.000103 -0.000350 -0.000230 0.000725 0.000069672 -0.000586 0.005783 0.008307 0.000708
-0.001456 0.000033987 0.000466 -0.000108 -0.000077882 -0.000332 -0.000416 0.000121 0.000012122 -0.000154 0.000038905 0.000355 0.000708 0.002046

References

Chambers, R.L., and Skinner, C.J. (2003). Analysis of Survey Data. Chichester, UK: John Wiley & Sons.
Fuller, W.A. (2009). Sampling Statistics. Hoboken, NJ: John Wiley & Sons.
Godambe, V.P., and Thompson, M.E. (1986). “Parameters of Superpopulation and Survey Population: Their Relationships and Estimation.” International Statistical Review 54:127–138.
Kish, L., and Frankel, M.R. (1974). “Inference from Complex Samples.” Journal of the Royal Statistical Society, Series B 36:1–37.
Korn, E.L., and Graubard, B.I. (1999). Analysis of Health Surveys. New York: John Wiley & Sons.
Pfeffermann, D. (1993). “The Role of Sampling Weights When Modeling Survey Data.” International Statistical Review 61:317–337.

AlexBeaver · ‎12-21-2023

Overview

This example uses PROC SURVEYMEANS to obtain poststratified totals, means, and ratios. The data are sampled from county-level data sets that are publicly available from the USDA Economic Research Service website, at http://www.ers.usda.gov/data-products/county-level-data-sets.aspx. The sample consists of the county-level information about population size, the number of individuals in the labor force, and the number of unemployed persons in the 48 contiguous states of the United States of America in 2011. The sampling frame is stratified by state, and a simple random sample of two counties per state is selected. The analysis consists of a comparison between the non-poststratified estimates and the poststratified estimates of the total and average labor force size, number of unemployed, population size, and two ratios: the unemployment rate and the labor force participation rate. Table 1 describes the contents of the sample data set Unemployment, and Table 2 describes the interpretation of the six levels of the National Center for Health Statistics (NCHS) urban-rural classification for each county.
Table 1: Example Data Set Unemployment

Variable        Description
FIPS            Federal information processing standards (FIPS) code for counties
ST_FIPS         FIPS code for states
State           Abbreviation of state name
County          County name
Code2006        National Center for Health Statistics (NCHS) 2006 urban-rural classification code
Population      Resident total population estimate as of July 1, 2011
LaborForce      Number of individuals in the civilian labor force in 2011
Unemployed      Number of unemployed individuals in 2011
SamplingWeight  Sampling weight generated by the SURVEYSELECT procedure

Table 2: 2006 NCHS Urban-Rural Classification Scheme

Code  Urbanization Level    Classification Rules
1     Large metro, central  Counties in a metropolitan statistical area (MSA) with a population of 1 million or more that have the following characteristics: 1) contain the entire population of the largest principal city of the MSA, or 2) are completely contained within the largest principal city of the MSA, or 3) contain at least 250,000 residents of any principal city in the MSA
2     Large metro, fringe   Counties in an MSA with a population of 1 million or more that do not qualify as large central
3     Medium metro          Counties in an MSA with a population of 250,000–999,999
4     Small metro           Counties in an MSA with a population of 50,000–249,999
5     Micropolitan          Counties in a micropolitan statistical area
6     Noncore               Counties not in a micropolitan statistical area

The following SAS statements create the SAS data set Unemployment:

data unemployment;
   input FIPS 1-5 ST_FIPS 7-8 State $ 10-11 County $ 13-34 Code2006 35 Population 37-45 LaborForce 46-52 Unemployed 53-58 SamplingWeight 59-64;
   datalines;
1005 1 AL Barbour County 5 27313 9761 1110 33.5
1019 1 AL Cherokee County 6 26094 11696 1020 33.5
4021 4 AZ Pinal County 2 383553 139864 14466 7.5
4027 4 AZ Yuma County 4 200374 89500 24270 7.5
5105 5 AR Perry County 3 10384 4788 414 37.5
... more lines ...
55119 55 WI Taylor County 6 20759 10406 915 36.0
56025 56 WY Natrona County 4 76356 42907 2537 11.5
56037 56 WY Sweetwater County 5 44078 25138 1271 11.5
;
run;

You begin the comparative analysis by using PROC SURVEYMEANS as in the following statements to estimate the means, totals, and ratios of interest. The MEAN and SUM keywords in the PROC SURVEYMEANS statement request estimates of the population means and totals, respectively. The VAR statement requests estimates of the variables LaborForce, Unemployed, and Population. So, for example, if you specify the keyword MEAN in the PROC SURVEYMEANS statement and the variable Unemployed in the VAR statement, you are requesting an estimate of how many unemployed persons, on average, reside in a county. The first RATIO statement requests an estimate of the population’s unemployment rate, which is the ratio of the number of unemployed to the size of the labor force. The second RATIO statement requests an estimate of the labor force participation rate, which is the ratio of the size of the labor force to the size of the population of the county. The STRATA and WEIGHT statements identify the sampling design: the STRATA statement specifies that the strata are identified by the variable ST_FIPS, and the WEIGHT statement specifies that the sampling weights are contained in the variable SamplingWeight.

proc surveymeans data=unemployment mean sum;
   strata st_fips;
   weight SamplingWeight;
   var LaborForce Unemployed Population;
   ratio 'Unemployment Rate' Unemployed / LaborForce;
   ratio 'Labor Force Participation Rate' LaborForce / Population;
run;

Output 1 displays the estimated means, totals, ratios, and their standard errors. For example, on average there are 110,064 individuals in a county and 53,472 individuals in the labor force, and 4,925 individuals are unemployed. On average, the unemployment rate is 9.2%, and the labor force participation rate is 48.58%.
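Each ratio that a RATIO statement requests is an estimate of a ratio of two population totals rather than an average of county-level rates. As a sketch, in notation that is introduced here only for illustration (w_i is the sampling weight, y_i the numerator variable, and x_i the denominator variable for sampled county i):

```latex
\hat{R} \;=\; \frac{\hat{Y}}{\hat{X}} \;=\; \frac{\sum_{i} w_i\, y_i}{\sum_{i} w_i\, x_i}
```

Using the estimated totals in the Sum column of Output 1, the unemployment rate is 15306723 / 166190527 ≈ 0.0921 and the labor force participation rate is 166190527 / 342078597 ≈ 0.4858, which agree with the reported ratio estimates.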
Output 1: Stratified Design

The SURVEYMEANS Procedure

Data Summary
Number of Strata        48
Number of Observations  96
Sum of Weights          3108

Statistics
Variable    Mean         Std Error of Mean  Sum        Std Dev
LaborForce  53472        6488.570784        166190527  20166478
Unemployed  4924.943050  594.657745         15306723   1848196
Population  110064       13105              342078597  40729501

Ratio Analysis: Unemployment Rate
Numerator   Denominator  Ratio     Std Err
Unemployed  LaborForce   0.092103  0.003090

Ratio Analysis: Labor Force Participation Rate
Numerator   Denominator  Ratio     Std Err
LaborForce  Population   0.485826  0.004186

In addition to the sample, the NCHS urban-rural classification code (Ingram and Franco, 2012) for each county in the sample and the total number of counties in the population that have each of the six levels of the NCHS classification are known. If the totals, means, and ratios of the variables of interest are homogeneous for counties that have the same NCHS urban-rural classification, but there is significant heterogeneity between counties whose classifications differ, then poststratifying by the NCHS urban-rural classification can potentially yield more efficient estimates.

The following SAS statements create the poststratum totals data set Poststrata. This data set is to be used in the PSTOTAL= option of the SURVEYMEANS procedure’s POSTSTRATA statement. A poststratum total data set must contain all the poststratification variables that are listed in the POSTSTRATA statement, and it must have a variable named _PSTOTAL_ that contains the poststratum totals. In the Poststrata data set, the variable Code2006 contains the poststratum identification code, and the variable _PSTOTAL_ contains the total number of counties in that poststratum in 2011.

data poststrata;
   input Code2006 _PSTOTAL_ ;
   datalines;
1 62
2 354
3 329
4 340
5 688
6 1336
;
run;

Figure 1 compares the distributions of Code2006 in the population and the weighted sample.
Based on the weighted sample, counties that have values of 3 and 4 are overrepresented in the sample, and counties that have values of 5 and 6 are underrepresented in the sample. Poststratifying on Code2006 reweights the data such that the poststratified weighted sample distribution of Code2006 equals the population distribution.

Figure 1: Population Distribution versus Weighted Sample Distribution of Code2006

To perform a poststratified analysis, you simply add a POSTSTRATA statement to the SURVEYMEANS procedure, as in the following statements. Specifically, you designate Code2006 as the poststratification variable, and you specify the SAS data set Poststrata in the PSTOTAL= option. The OUT= option saves the poststratification weights to the SAS data set Pswgt.

proc surveymeans data=unemployment mean sum;
   strata st_fips;
   weight SamplingWeight;
   var LaborForce Unemployed Population;
   ratio 'Unemployment Rate' Unemployed / LaborForce;
   ratio 'Labor Force Participation Rate' LaborForce / Population;
   poststrata code2006 / pstotal=poststrata out=pswgt;
run;

Figure 2 shows the ratios of the poststratification weights to the original sampling weights for each category of Code2006. Poststratification reduces the weights for counties that have Code2006 values of 3 and 4 and increases the weights for counties that have Code2006 values of 5 and 6.

Figure 2: Ratio of Poststratification Weights to Sampling Weights

Figure 3 shows that, as expected, the poststratified weighted sample has the same distribution as the population.

Figure 3: Population Distribution versus Poststratified Weighted Sample Distribution of Code2006

Output 2 displays the poststratified estimates and their standard errors. All the poststratified estimates of the population means and totals are smaller than the non-poststratified estimates, but the two poststratified ratio estimates are larger.
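The reweighting that Figure 2 summarizes can be written compactly. The notation here is an assumption introduced for illustration: w_i is the original sampling weight of sampled county i, g(i) is its poststratum (its Code2006 value), N_g is the known poststratum total (the variable _PSTOTAL_), and \hat{N}_g is the sum of the sampling weights of the sampled counties in poststratum g.

```latex
\tilde{w}_i \;=\; w_i \cdot \frac{N_{g(i)}}{\hat{N}_{g(i)}},
\qquad
\hat{N}_g \;=\; \sum_{k:\, g(k) = g} w_k
```

Under this form, the ratio \tilde{w}_i / w_i = N_g / \hat{N}_g is the same for every sampled county in a poststratum. It is less than 1 when the weighted sample overrepresents the poststratum and greater than 1 when the weighted sample underrepresents it, and the adjusted weights in each poststratum sum to N_g.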
For example, the poststratified estimates indicate that on average there are 100,215 individuals in a county and 48,755 individuals in the labor force, and 4,518 individuals are unemployed. On average, the unemployment rate is 9.3%, and the labor force participation rate is 48.65%. Without exception, the variances of the estimates are smaller for the poststratified analysis, indicating that the poststratified estimates are more efficient for this sample.

Output 2: Poststratified Analysis

The SURVEYMEANS Procedure

Data Summary
Number of Strata        48
Number of Poststrata    6
Number of Observations  96
Sum of Weights          3108

Statistics
Variable    Mean         Std Error of Mean  Sum        Std Dev
LaborForce  48755        4808.671480        151579056  14950160
Unemployed  4517.976061  477.440072         14046388   1484361
Population  100215       9964.992605        311568502  30981162

Ratio Analysis: Unemployment Rate
Numerator   Denominator  Ratio     Std Err
Unemployed  LaborForce   0.092667  0.002727

Ratio Analysis: Labor Force Participation Rate
Numerator   Denominator  Ratio     Std Err
LaborForce  Population   0.486503  0.003853

Example: Age-Adjusted Mortality Rates

Suppose you want to compare the mortality rates of Florida and California. If you have samples from the two populations, computing the crude mortality rate for each population is straightforward. However, because many health outcomes vary by age and the two populations have different age distributions, a direct comparison of the crude mortality rates might be inappropriate. To make a relative comparison, you can use age-adjusted mortality rates. A common method of computing age-adjusted rates is called direct standardization; it is mathematically equivalent to poststratification.

The following SAS statements create the data sets Florida and California, which contain samples from a one-stage clustered sampling design that has a sampling rate of 0.5; the clusters consist of counties from the respective states, and the observations are age-specific groups.
Each observation records the variable FIPS, which identifies the clusters (counties); the categorical variable Age, which identifies the age group; the variable Population, which records the total number of individuals in an age-specific group in 1968; the variable Deaths, which records the total number of recorded deaths in an age-specific group in 1968; and the variable SamplingWeight, which is the inverse of the probability of selecting a county in the sample. The data are sampled from the Compressed Mortality File (CMF), which is publicly available from the Centers for Disease Control and Prevention website, at http://www.cdc.gov/nchs/data_access/cmf.htm#data_availability.

data Florida;
   input FIPS Age Population Deaths;
   SamplingWeight=1.9705882353;
   datalines;
12011 4 7730 177
12011 5 32956 44
12011 6 49587 22
12011 7 49407 23
12011 8 40175 46
12011 9 29425 52
... more lines ...
12133 11 1048 5
12133 12 1149 13
12133 13 1252 20
12133 14 896 33
12133 15 425 33
12133 16 92 27
;

data California;
   input FIPS Age Population Deaths;
   SamplingWeight=2;
   datalines;
6001 4 17412 348
6001 5 72709 58
6001 6 101367 41
6001 7 95572 33
6001 8 89730 87
6001 9 107173 124
... more lines ...
6115 11 5421 11
6115 12 3720 34
6115 13 2766 58
6115 14 1752 77
6115 15 796 74
6115 16 180 39
;

Table 3 describes the different levels of the categorical variable Age.

Table 3: Age Categories

Age Category  Description
4             Less than 1 year
5             1–4 years
6             5–9 years
7             10–14 years
8             15–19 years
9             20–24 years
10            25–34 years
11            35–44 years
12            45–54 years
13            55–64 years
14            65–74 years
15            75–84 years
16            85+ years

The following SAS statements use the SURVEYMEANS procedure to estimate the crude mortality rates for Florida and California. The RATE= option in the PROC SURVEYMEANS statement identifies the sampling rate. The SURVEYMEANS procedure uses the sampling rate to compute a finite population correction for the Taylor series variance estimates.
The RATIO and SUM keywords in the PROC SURVEYMEANS statement request estimates of the population ratios and totals, respectively. The VAR statement requests estimates of the variables Deaths and Population. The CLUSTER statement specifies that the variable FIPS identify the primary sampling units. The WEIGHT statement specifies that the variable SamplingWeight contain the sampling weights. The RATIO statement identifies the ratio of interest to be the number of deaths divided by the population size.

proc surveymeans data=Florida ratio sum rate=.5;
   cluster fips;
   weight SamplingWeight;
   var deaths population;
   ratio 'Florida Crude Mortality Rate' deaths/population;
run;

proc surveymeans data=California ratio sum rate=.5;
   cluster fips;
   weight SamplingWeight;
   var deaths population;
   ratio 'California Crude Mortality Rate' deaths/population;
run;

Output 3 and Output 4 show the estimation results.

Output 3: Crude Mortality Rate for Florida

The SURVEYMEANS Procedure

Data Summary
Number of Clusters      34
Number of Observations  442
Sum of Weights          871

Ratio Analysis: Florida Crude Mortality Rate
Numerator  Denominator  Ratio     Std Err
Deaths     Population   0.010774  0.000464

Output 4: Crude Mortality Rate for California

The SURVEYMEANS Procedure

Data Summary
Number of Clusters      29
Number of Observations  377
Sum of Weights          754

Ratio Analysis: California Crude Mortality Rate
Numerator  Denominator  Ratio     Std Err
Deaths     Population   0.007702  0.000595

The estimated crude mortality rates for Florida and California are 1.08% and 0.77%, respectively. The ratio of the crude mortality rates is 1.40. However, before you conclude that the mortality rate is higher in Florida than in California, consider the following two exhibits. Figure 4 shows that the age-specific mortality rates are decidedly a function of age in both states.

Figure 4: Age-Specific Crude Rates versus Age in Florida and California

Figure 5 shows that the populations in Florida and California exhibit different age distributions.
The percentage of residents in the age groups 13, 14, and 15 is higher in Florida than in California, whereas the percentage of residents in the age groups 5, 6, 7, 8, 9, 10, and 11 is lower in Florida than in California. Together these facts indicate that the crude mortality rates are not an appropriate measure for comparing differences between these two populations (Curtin and Klein, 1995). Figure 5: Estimated Age Distributions in Florida and California Note: The SAS statements that generate Figure 4 and Figure 5 are not shown here but are included in the downloadable SAS program that is available with this web example. Because the crude rate is not appropriate, and because age-specific mortality rates provide too much detail and require a large number of comparisons, you can use a summary measure that controls for a population’s age distribution. A commonly used measure is the age-adjusted mortality rate, which you can compute by performing direct standardization (Curtin and Klein, 1995). As mentioned earlier, direct standardization is mathematically equivalent to poststratification. The difference between poststratification for the purpose of performing direct standardization and other forms of poststratification is this: when you perform direct standardization, the poststratum totals or proportions represent a standard or reference population rather than the population from which your sample was drawn. To compute comparable age-adjusted rates for Florida and California by using poststratification, you need a data set that contains the age distribution proportions from a standard or reference population. 
The following SAS statements create the data set USbyAge, which contains the age-specific proportions for the US population in 1968:

data USbyAge;
   input Age _PSPCT_;
   datalines;
4 0.01755
5 0.07291
6 0.10231
7 0.10202
8 0.09116
9 0.07545
10 0.11879
11 0.11822
12 0.11391
13 0.09065
14 0.06103
15 0.02980
16 0.00621
;

You can then use PROC SURVEYMEANS to compute age-adjusted mortality rates for Florida and California. The procedure specification in the following SAS statements is the same as when you compute the crude rates, except that you add a POSTSTRATA statement, which specifies poststratification on the variable Age, and the PSPCT= option, which specifies that the population proportions be contained in the data set USbyAge.

proc surveymeans data=Florida ratio rate=.5;
   cluster fips;
   weight SamplingWeight;
   var deaths population;
   poststrata age / pspct=USbyAge;
   ratio 'Florida Standardized Mortality Rate' deaths/population;
run;

proc surveymeans data=California ratio rate=.5;
   cluster fips;
   weight SamplingWeight;
   var deaths population;
   poststrata age / pspct=USbyAge;
   ratio 'California Standardized Mortality Rate' deaths/population;
run;

Output 5 and Output 6 show the estimation results. The age-adjusted mortality rates for Florida and California are 0.70% and 0.48%, respectively. The ratio of the age-adjusted mortality rates is 1.45. Therefore, on an age-adjusted basis, the mortality rate in Florida in 1968 is almost 1.5 times the mortality rate in California in the same year.
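The two rate ratios quoted in this example can be checked with a short DATA step. This is only an arithmetic sketch: the four rates are copied by hand from the ratio estimates in Outputs 3 through 6, and the variable names are hypothetical names chosen for this illustration.

```sas
/* Arithmetic check of the Florida-to-California rate ratios.        */
/* The rates are copied from Outputs 3 through 6; the variable names */
/* are hypothetical and are used only in this sketch.                */
data _null_;
   fl_crude = 0.010774;   /* Florida crude rate (Output 3)           */
   ca_crude = 0.007702;   /* California crude rate (Output 4)        */
   fl_adj   = 0.006952;   /* Florida age-adjusted rate (Output 5)    */
   ca_adj   = 0.004791;   /* California age-adjusted rate (Output 6) */
   crude_ratio = fl_crude / ca_crude;
   adj_ratio   = fl_adj / ca_adj;
   /* Write both ratios to the log, rounded to two decimal places */
   put crude_ratio= 6.2 adj_ratio= 6.2;
run;
```

The step writes values of about 1.40 and 1.45 to the SAS log, which match the ratios stated in the text. It does not compute standard errors for the ratios of rates.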
Output 5: Standardized Mortality Rate for Florida

The SURVEYMEANS Procedure

Data Summary
Number of Clusters      34
Number of Poststrata    13
Number of Observations  442
Sum of Weights          871

Ratio Analysis: Florida Standardized Mortality Rate
Numerator  Denominator  Ratio     Std Err
Deaths     Population   0.006952  0.000248

Output 6: Standardized Mortality Rate for California

The SURVEYMEANS Procedure

Data Summary
Number of Clusters      29
Number of Poststrata    13
Number of Observations  377
Sum of Weights          754

Ratio Analysis: California Standardized Mortality Rate
Numerator  Denominator  Ratio     Std Err
Deaths     Population   0.004791  0.000385

References

Curtin, L. R. and Klein, R. J. (1995), “Direct Standardization (Age-Adjusted Death Rates),” Healthy People 2000: Statistical Notes, DHHS Publication No. (PHS) 95-1237.
Ingram, D. D. and Franco, S. J. (2012), “NCHS Urban-Rural Classification Scheme for Counties,” Vital and Health Statistics, Series 2: Data Evaluation and Methods Research no. 154, DHHS publication no. (PHS) 2012-1354.
Lehtonen, R. and Pahkinen, E. (2004), Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition, Chichester, UK: John Wiley & Sons.
Lohr, S. L. (2010), Sampling: Design and Analysis, 2nd Edition, Boston: Brooks/Cole.
Särndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

AlexBeaver · ‎12-21-2023

Overview Fractional hot-deck imputation (FHDI) (Kalton and Kish 1984; Fay 1996; Kim and Fuller 2004; Fuller and Kim 2005), also known as fractional imputation (FI), is a variation of hot-deck imputation in which one missing item for a recipient is imputed from multiple donors. Each donor donates a fraction of the original weight of the recipient such that the sum of the fractional weights from all the donors is equal to the original weight of the recipient. PROC SURVEYIMPUTE in SAS/STAT implements a FHDI method along with the fully efficient fractional imputation method, and some hot-deck imputation methods. For more information about the fractional hot-deck imputation method available in PROC SURVEYIMPUTE, see "The SURVEYIMPUTE Procedure". This example imputes missing values in both categorical and continuous variables by applying the FHDI method to a data set from the third National Health and Nutrition Examination Survey (NHANES III). The data set contains a set of BRR replicate weights. The REPWEIGHTS statement in PROC SURVEYIMPUTE accepts the BRR weights and creates imputation-adjusted replicate weights. The imputed data set and the imputation-adjusted replicate weights are then used in PROC SURVEYMEANS and PROC SURVEYREG to perform domain analysis and regression analysis, respectively. The objective of NHANES is to study the health and nutritional status of the US population. NHANES uses a multistage stratified area sample with typically two PSUs per stratum. Strata are created on the basis of geographic location, metropolitan statistical area (MSA), and other demographic information. MSAs or a group of counties are selected as PSUs from each stratum. Sampling weights are unequal because of different selection probabilities among different subgroups and for reasons such as nonresponse and undercoverage. For more information about NHANES, see http://www.cdc.gov/nchs/nhanes/about_nhanes.htm. NHANES III data contain missing values in many items. 
Multiple imputation was used to impute some of the missing items. Five multiply imputed data sets are available for public use. Because FHDI is used in this example to impute the missing values, you need the observed data, the missing (or imputation) flag for every item, and only one imputed data set. The data sets Core and IMP1 have been downloaded from the Centers for Disease Control and Prevention’s website (https://www.cdc.gov/). The Core data set contains the demographic variables, full sample weights, replicate weights, and imputation flags. The replicate weights are created by using Fay’s BRR method, with a Fay coefficient of 0.3. The IMP1 data set contains the first version of the five multiply imputed data sets. The data set HealthMiss is obtained by merging Core and IMP1 data sets by the observation sequence number SEQN . The HealthMiss data set contains observation units that are between 17 and 60 years of age. Missing values are added according to the imputation flag in the Core data set. The following items are available in the HealthMiss data set for each observation unit: SEQN : observation sequence number WTPFQX6 : observation weight, ranging from 220.18 to 140916.28 WTPQRP1 to WTPQRP52 : 52 replicate weights from the BRR method HSSEX : gender; 1 for male and 2 for female HFF1MI : anyone smokes cigarettes in the home; 1 for yes and 2 for no HAT28MI : activity level compared to others; –9 for not applicable, 1 for more active, 2 for less active, and 3 for about the same BMPHTMI : standing height (cm), ranging from 130.6 to 206.5 BMPWTMI : body weight (kg), ranging from 26.75 to 241.80 PEP6G3MI : K5 diastolic blood pressure (mmHg), ranging from 0 to 136 HSAGEIR : age in years, ranging from 17 to 60 HSHSIZER : household size; categories from 1 to 10 Married : marital status; 1 for married and 0 for not married There are no missing values in the variables SEQN , HSAGEIR , HSHSIZER , Married , and WTPFQX6 and in the replicate weight variables. 
The variables HFF1MI, HAT28MI, BMPHTMI, BMPWTMI, and PEP6G3MI contain missing values, which are imputed in this example. Although HFF1MI and HAT28MI have two and three observed levels, respectively, BMPHTMI, BMPWTMI, and PEP6G3MI have many observed levels. Because these last three variables have many observed levels, FHDI is applied to impute missing values jointly in all five variables.

Example: Imputation of Missing Values by Using FHDI

Before you apply the FHDI method to a data set, you should (1) incorporate auxiliary information in the imputation by creating imputation cells and (2) create bins for variables that have many levels.

Imputation cells divide the data into groups of similar units such that the recipient units have characteristics similar to those of the donor units in the same group. Characteristics of imputation cells might come from the same survey or from other sources, such as census data or previous surveys. The cell identification must be known for every unit in the sample. For a helpful review, see Brick and Kalton (1996).

For the purpose of this example, a cluster variable is created from two demographic variables, HSAGEIR and HSHSIZER, by using the FASTCLUS procedure in SAS/STAT. Both variables are available in the Core data set, and they do not contain missing values. The resulting clusters are identified by the variable Cluster in the HealthMiss data set. Levels of the variables Cluster and Married are used to create imputation cells.

If you request FHDI, then the variables that have many levels (that is, the variables that you specify in the VAR statement but not in the CLASS statement) are first levelized to create bins. You can use the CLEVVAR= option to specify the variable that contains the bins for a numeric variable. Alternatively, you can use the CLEVELS=k option to divide the observed range of the numeric variable into k equally spaced bins.
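To make the idea of k equally spaced bins concrete, the following DATA step is a rough sketch of how such bins could be built by hand for standing height. It is only an illustration of the concept and is not a description of how PROC SURVEYIMPUTE computes its bins internally. The output data set HeightBins, the variable HtBin, the choice k=3, and the convention that a value on a boundary falls into the upper bin are all assumptions made here; the range 130.6 to 206.5 is the observed range of BMPHTMI given in the variable list.

```sas
/* Sketch: divide the observed range of BMPHTMI into k equal-width bins. */
/* HeightBins, HtBin, and k=3 are hypothetical choices for illustration. */
%let k  = 3;        /* number of bins (assumed)                          */
%let lo = 130.6;    /* observed minimum of BMPHTMI                       */
%let hi = 206.5;    /* observed maximum of BMPHTMI                       */

data HeightBins;
   set HealthMiss(keep=seqn bmphtmi);
   width = (&hi - &lo) / &k;                  /* common bin width        */
   if missing(bmphtmi) then HtBin = .;        /* missing stays missing   */
   else HtBin = min(&k, floor((bmphtmi - &lo) / width) + 1);
run;
```

With these values, each bin is 25.3 cm wide. The MIN function keeps the observed maximum in the last bin instead of opening a bin k+1.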
First-stage fully efficient fractional imputation (FEFI) is applied to the CLEVVAR= variables and to the variables that you specify in the CLASS statement. In this example, three bins are created for each numeric variable by using the observed 33rd and 66th percentiles for that variable. The CLEVVAR= variable should contain a missing value for every observation unit in which the corresponding numeric variable has a missing value. The following statements create the CLEVVAR= variables:

    *---Create bins for continuous variables---;
    data HealthMiss;
       set HealthMiss;
       if bmphtmi = . then bmphtlev=.;
       else if bmphtmi <= 162.6 then bmphtlev=1;
       else if bmphtmi <= 171.5 then bmphtlev=2;
       else bmphtlev=3;
       if bmpwtmi = . then bmpwtlev=.;
       else if bmpwtmi <= 65.7 then bmpwtlev=1;
       else if bmpwtmi <= 80.2 then bmpwtlev=2;
       else bmpwtlev=3;
       if pep6g3mi = . then pep6g3lev=.;
       else if pep6g3mi <= 65.3 then pep6g3lev=1;
       else if pep6g3mi <= 75.8 then pep6g3lev=2;
       else pep6g3lev=3;
       label bmphtlev  = "Bin values for BMPHTMI"
             bmpwtlev  = "Bin values for BMPWTMI"
             pep6g3lev = "Bin values for PEP6G3MI";
    run;

The following statements use the FHDI method to impute the missing values:

    proc surveyimpute data=HealthMiss method=fhdi varmethod=brr
                      ndonors=5 seed=9388401;
       id seqn;
       class hff1mi hat28mi;
       weight wtpfqx6;
       repweights wtpqrp:;
       cells cluster married;
       var hff1mi hat28mi bmphtmi (clevvar=bmphtlev)
           bmpwtmi (clevvar=bmpwtlev) pep6g3mi (clevvar=pep6g3lev);
       output out=HealthFHDI;
    run;

The PROC SURVEYIMPUTE statement invokes the procedure, the DATA= option specifies the input data set HealthMiss, the METHOD= option requests the FHDI method, the VARMETHOD= option requests the imputation-adjusted BRR replicate weights, the NDONORS= option specifies the maximum number of second-stage donors, and the SEED= option specifies the random number generator seed. The variable SEQN in the ID statement identifies the observation units.
The WEIGHT statement identifies the weight variable, and the REPWEIGHTS statement identifies the variables that contain the unadjusted BRR replicate weights. The CELLS statement identifies the imputation cell variables Cluster and Married , and the OUT= option in the OUTPUT statement names the output data set HealthFHDI . The VAR statement specifies the variables in which the missing values are to be imputed. The CLASS statement identifies the categorical variables. Only first-stage FEFI will be applied to the CLASS variables. Second-stage FEFI and FHDI will be performed for the variables that are specified in the VAR statement but not in the CLASS statement. The CLEVVAR= option for these variables identifies the corresponding bin variables in which first-stage FEFI will be performed. In this example, you requested first-stage FEFI for the variables hff1mi , hat28mi , bmphtlev , bmpwtlev , and pep6g3lev ; and second-stage FEFI and FHDI for the variables bmphtmi , bmpwtmi , and pep6g3mi . You request that all five variables be imputed jointly and that the imputed data be saved in the HealthFHDI data set. The number of observations and the CLASS level information are shown in Figure 1. The Sum of Weights Read row shows that the 13,721 observation units in the sample represent over 149 million observation units in the population. The "Class Level Information" table displays the observed levels for the CLASS variables. Figure 1: Imputation Information The SURVEYIMPUTE Procedure Number of Observations Read 13721 Number of Observations Used 13721 Sum of Weights Read 149546400 Sum of Weights Used 149546400 Class Level Information Class Levels Values HFF1MI 2 1 2 HAT28MI 3 1 2 3 The "Missing Data Patterns" table shows an arbitrary missing pattern. There are 13 different missing pattern groups. An "X" denotes that the variable is observed in that group, and a "." denotes that the variable is missing. 
Almost 87.42% of the observation units have no missing values (Group 1), 7.17% of the observation units have missing values for the variables BMPHTMI , BMPWTMI , and PEP6G3MI (Group 6), and 0.03% of the observation units have missing values in all five variables (Group 13). Figure 2: Missing Data Patterns Missing Data Patterns Group HFF1MI HAT28MI BMPHTMI BMPWTMI PEP6G3MI Freq Sum of Weights Unweighted Percent Weighted Percent Group Means BMPHTMI BMPWTMI PEP6G3MI HFF1MI 1 HFF1MI 2 HAT28MI 1 HAT28MI 2 HAT28MI 3 1 X X X X X 11995 129773812 87.42 86.78 169.387674 75.258624 71.746316 0.403270 0.596730 0.310384 0.231793 0.457823 2 X X X X . 475 3493749.6 3.46 2.34 168.189385 79.263271 . 0.464648 0.535352 0.301674 0.252490 0.445836 3 X X X . X 10 16064.94 0.07 0.01 154.554452 . 53.348310 0.370184 0.629816 0 0.164085 0.835915 4 X X X . . 1 12796.75 0.01 0.01 156.500000 . . 0 1.000000 0 0 1.000000 5 X X . . X 6 49022.69 0.04 0.03 . . 87.096205 0.870574 0.129426 0 0.468602 0.531398 6 X X . . . 984 13080725.9 7.17 8.75 . . . 0.402061 0.597939 0.336804 0.214523 0.448674 7 X . X X X 167 2193493.99 1.22 1.47 168.361303 76.971926 72.511286 0.322006 0.677994 . . . 8 X . X X . 11 138241.25 0.08 0.09 168.749387 79.516017 . 0.470061 0.529939 . . . 9 X . . X X 1 2742.56 0.01 0.00 . 70.450000 84.000000 0 1.000000 . . . 10 X . . . . 20 307483.94 0.15 0.21 . . . 0.483659 0.516341 . . . 11 . X X X X 15 42007.38 0.11 0.03 166.962303 84.023762 75.389658 . . 0.158780 0.312686 0.528534 12 . X . . . 32 390724.57 0.23 0.26 . . . . . 0.255939 0.237827 0.506235 13 . . . . . 4 45534.74 0.03 0.03 . . . . . . . . The "Imputation Summary" table in Figure 3 displays the number of observation units without any missing items (11,995), the number of observation units that contain at least one missing item (1,726), and the number of units in which the missing values are imputed. Missing values in all 1,726 units are imputed. 
Figure 3: Imputation Summary

Imputation Summary
    Observation Status               Number of Observations    Sum of Weights
    Nonmissing                       11995                     129773812
    Missing                          1726                      19772588.3
      Missing, Imputed               1726                      19772588.3
      Missing, Not Imputed           0                         0
      Missing, Partially Imputed     0                         0

Because fractional imputation replaces each observed unit that has missing items with several observation rows that contain imputed values, the 13,721 observed units in the input data set HealthMiss generate 127,007 observation rows in the imputed data set HealthFHDI. The following note displays the number of observation rows (127,007) in the imputed data set HealthFHDI:

    NOTE: The data set WORK.HEALTHFHDI has 127007 observations and 134 variables.

Example: Analysis for Fractionally Imputed Data

You can use the imputed data set, the imputation-adjusted replicate weights, and the appropriate Fay coefficient to compute any estimators from your imputed data. However, you must use the REPWEIGHTS statement in SAS/STAT survey analysis procedures to specify the imputation-adjusted replicate weights. The following two examples describe a domain analysis and a regression analysis that use the imputed data.

The following PROC SURVEYMEANS statements estimate the mean diastolic blood pressure in the overall population and in the subpopulations of smokers and nonsmokers.

    ods graphics on;
    proc surveymeans data=HealthFHDI varmethod=brr(Fay=0.3) plots=domain;
       weight ImpWt;
       repweights ImpRepWt_:;
       var pep6g3mi;
       domain hff1mi;
    run;

The "Data Summary" table in Figure 4 displays the number of observation rows (127,007) and the sum of weights (149,546,400). Because fractional imputation is used, the number of observation rows is not equal to the number of observation units (13,721). However, the sum of weights from the observation rows, which is an estimate of the population size, is the same as the sum of weights from the observation units.
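One way to confirm this property for your own imputed data is to add up the fractional weights within each observation unit and compare the sum with the unit's original weight. The following statements are a minimal sketch under two assumptions that you should verify against your output data set: that the original weight variable WTPFQX6 is carried into HealthFHDI alongside ImpWt, and that SEQN identifies the unit on every imputed row. The data set names WeightCheck and WeightMismatch and the tolerance are hypothetical.

```sas
/* Sketch: compare the sum of fractional weights per unit with the  */
/* unit's original weight. Assumes wtpfqx6 is kept in HealthFHDI.   */
proc sql;
   create table WeightCheck as
   select seqn,
          count(*)     as NRows,      /* rows generated for this unit  */
          sum(ImpWt)   as SumImpWt,   /* total fractional weight       */
          max(wtpfqx6) as OrigWt      /* original weight, constant     */
   from HealthFHDI
   group by seqn;
quit;

/* Keep only units whose weights disagree beyond a small tolerance. */
data WeightMismatch;
   set WeightCheck;
   if abs(SumImpWt - OrigWt) > 1e-6 * max(1, abs(OrigWt));
run;
```

If the fractional weights behave as described, WeightCheck has one row per observation unit and WeightMismatch is empty.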
The "Variance Estimation" table in Figure 4 shows that Fay’s BRR method with 52 replicate weights and a Fay coefficient of 0.3 is used for variance estimation.

Figure 4: Data Summary

The SURVEYMEANS Procedure

Data Summary
    Number of Observations    127007
    Sum of Weights            149546400

Variance Estimation
    Method                    BRR
    Replicate Weights         HEALTHFHDI
    Number of Replicates      52
    Fay Coefficient           0.3

The "Mean Diastolic Blood Pressure" table in Figure 5 displays the mean diastolic blood pressure for the overall population as 71.79 with a standard error of 0.27. For smokers the mean diastolic blood pressure is 70.67 with a standard error of 0.33, and for nonsmokers it is 72.55 with a standard error of 0.29. The "N" column displays the number of observation rows, not the number of observation units.

Figure 5: Mean Diastolic Blood Pressure

Statistics
    Variable    Label                                N         Mean         Std Error of Mean    95% CL for Mean
    PEP6G3MI    K5, diastolic, for 1st BP (mmHg)     127007    71.788404    0.268699             71.2492203    72.3275871

The SURVEYMEANS Procedure

Statistics for HFF1MI Domains
    HFF1MI    Variable    Label                                N        Mean         Std Error of Mean    95% CL for Mean
    1         PEP6G3MI    K5, diastolic, for 1st BP (mmHg)     49876    70.665325    0.326577             70.0100015    71.3206493
    2         PEP6G3MI    K5, diastolic, for 1st BP (mmHg)     77131    72.548913    0.289446             71.9680973    73.1297289

A box plot of the weighted distribution of diastolic blood pressure is displayed in Figure 6. The first box is for the overall population, and the other two boxes are for the two domains defined by smoking habits.

Figure 6: Diastolic Blood Pressure

The following PROC SURVEYREG statements estimate the regression coefficients for regressing diastolic blood pressure on smoking status, gender, height, weight, and age. Imputation-adjusted weights and imputation-adjusted replicate weights are used for point estimation and variance estimation, respectively. The SOLUTION option in the MODEL statement displays the parameter estimates.
The OUT= option in the OUTPUT statement saves the residuals and the fitted values in the SAS data set Resid.

    proc surveyreg data=HealthFHDI varmethod=brr(Fay=0.3);
       weight ImpWt;
       repweights ImpRepWt_:;
       class hff1mi hssex;
       model pep6g3mi = hff1mi hssex bmphtmi bmpwtmi hsageir / solution;
       output out=Resid residual=Residuals predicted=Fitted;
    run;

Estimated regression parameters and their standard errors are displayed in Figure 7. All covariates except height (BMPHTMI) have small standard errors compared to their estimated values. Thus, they are all important in describing the regression relationship in the NHANES III population between 17 and 60 years of age. The degrees of freedom for the t tests is 52, which is equal to the number of BRR replicates.

Figure 7: Parameter Estimates

The SURVEYREG Procedure

Regression Analysis for Dependent Variable PEP6G3MI

Estimated Regression Coefficients
    Parameter    Estimate      Standard Error    t Value    Pr > |t|
    Intercept    45.6902005    4.00157249        11.42      <.0001
    HFF1MI 1     -1.3855994    0.33509056        -4.13      0.0001
    HFF1MI 2     0.0000000     0.00000000        .          .
    HSSEX 1      3.2312257     0.28893940        11.18      <.0001
    HSSEX 2      0.0000000     0.00000000        .          .
    BMPHTMI      0.0038192     0.02220508        0.17       0.8641
    BMPWTMI      0.1792811     0.00940559        19.06      <.0001
    HSAGEIR      0.3021222     0.02373061        12.73      <.0001

Note: The degrees of freedom for the t tests is 52. Matrix X'WX is singular and a generalized inverse was used to solve the normal equations. Estimates are not unique.

The following PROC SURVEYREG statements request a residual plot by using the residuals and fitted values from the Resid data set. Imputation-adjusted weights are displayed by using a heat map, as shown in Figure 8. The SHAPE=HEXAGONAL option requests hexagonal bins, and the NBINS=60 option specifies that 60 bins be used. For more information about how to create customized graphs by using ODS Graphics, see Kuhfeld (2016).
The weighted residual plot does not reveal any major violations of the model assumptions, namely linearity and equal variance. Because fractional imputation increases the number of observation rows, you must use imputation-adjusted weights in all analyses that involve fractionally imputed data sets.

    ods graphics on;
    proc surveyreg data=resid plots(nbins=60)=fit(shape=hexagonal);
       model Residuals=Fitted;
       weight impwt;
    run;

Figure 8: Plot of Residuals versus Fitted Values

References

Brick, J. M., and Kalton, G. (1996). “Handling Missing Data in Survey Research.” Statistical Methods in Medical Research 5:215–238.
Fay, R. E. (1996). “Alternative Paradigms for the Analysis of Imputed Survey Data.” Journal of the American Statistical Association 91:490–498.
Fuller, W. A., and Kim, J. K. (2005). “Hot Deck Imputation for the Response Model.” Survey Methodology 31:139–149.
Kalton, G., and Kish, L. (1984). “Some Efficient Random Imputation Methods.” Communications in Statistics - Theory and Methods 13:1919–1939.
Kim, J. K., and Fuller, W. A. (2004). “Fractional Hot Deck Imputation.” Biometrika 91:559–578.
Kuhfeld, W. F. (2016). “Highly Customized Graphs Using ODS Graphics.” In Proceedings of the SAS Global Forum 2016 Conference. Cary, NC: SAS Institute Inc. http://support.sas.com/resources/papers/proceedings16/SAS1800-2016.pdf.

AlexBeaver · ‎12-20-2023

Overview

The finite population standard deviation of a variable provides a measure of the amount of variation in the corresponding attribute of the study population’s members, thus helping to describe the distribution of a study variable. Whether your survey is measuring crop yields, adult alcohol consumption, or the body mass index (BMI) of school children, a small population standard deviation is indicative of uniformity in the population, while a large standard deviation is indicative of a more diverse population.

Suppose you have data that were sampled according to some complex survey design. The SURVEYMEANS procedure enables you to estimate population totals, means, and ratios, as well as the design-based variances of the estimated quantities, but it does not directly compute the standard deviation of a variable. However, because a standard deviation can be expressed mathematically as a function of a total, you can estimate the finite population standard deviation $S$ of a variable by using PROC SURVEYMEANS plus a little SAS programming.

Whenever you estimate a population parameter such as a mean or a standard deviation, you should also report the precision of the estimate. The most commonly reported measure of precision is the variance (or its square root, the standard error). The survey analysis procedures in SAS/STAT software currently provide three different variance estimation methods for complex survey designs: the Taylor series linearization method, the delete-one jackknife method, and the balanced repeated replication (BRR) method. This example demonstrates how to use all three methods to estimate the variance of $\hat{S}$, the estimator of $S$.

Analysis

Suppose you want to estimate the standard deviation of a variable $y$ from a finite population by using data that were collected using some complex survey design.
The finite population standard deviation of $y$ is

$$S = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - \bar{Y}\right)^2} \tag{1}$$

where $N$ is the total number of elements in the population, $y_i$ is the $i$th observation of the variable $y$, and $\bar{Y}$ is the population mean of $y$. A sample-based statistic of $S$ is

$$\hat{S} = \sqrt{\frac{1}{\hat{N}-1}\sum_{k=1}^{n}\frac{\left(y_k - \hat{\bar{Y}}\right)^2}{\pi_k}} \tag{2}$$

where $\hat{N} = \sum_{k=1}^{n} 1/\pi_k$ is an estimator of the population total $N$, $\hat{\bar{Y}} = \hat{N}^{-1}\sum_{k=1}^{n} y_k/\pi_k$ is an estimator of the population mean, $n$ is the number of elements in the sample, and $\pi_k$ is the probability that element $k$ is observed in the sample.

To estimate $S$, you first estimate both $\bar{Y}$ and $N$ with PROC SURVEYMEANS. Next, you generate a variable (call it $z$) such that each observation $z_k$ is equal to

$$z_k = \frac{\left(y_k - \hat{\bar{Y}}\right)^2}{\hat{N}-1} \tag{3}$$

Now you use PROC SURVEYMEANS to estimate the total of $z$. The square root of the estimated weighted total of $z$ is equal to $\hat{S}$. Estimating $V(\hat{S})$, the variance of $\hat{S}$, requires some additional SAS programming.

Using the Taylor Series Linearization Method to Estimate $V(\hat{S})$

To estimate $V(\hat{S})$ by using the Taylor series linearization method, construct a variable $u$ such that

$$u_k = \frac{\left(y_k - \hat{\bar{Y}}\right)^2 - \hat{S}^2}{2\,\hat{S}\,(\hat{N}-1)} \tag{4}$$

where $\hat{S}$ is computed as in equation (2). Use PROC SURVEYMEANS to estimate the total (and the variance of the total) of $u$. The total that is computed by PROC SURVEYMEANS is of no interest, but the variance of the total is an estimate of $V(\hat{S})$, the variance of the estimate $\hat{S}$ (Särndal, Swensson, and Wretman 1992, chap. 5.5).

The following steps summarize how you estimate $S$, the finite population standard deviation of a variable $y$, and $V(\hat{S})$, the variance of the finite population standard deviation estimator (using the Taylor series linearization method):

1. Use PROC SURVEYMEANS to estimate the sample mean of the variable $y$, and save the estimated mean. PROC SURVEYMEANS also computes the sum of the sampling weights, which is the value of $\hat{N}$ in the analysis. Save that value also; it is used in the construction of $z$.
2. Using the sample mean from step 1, construct the variable $z$ as in equation (3).
3. Use PROC SURVEYMEANS to estimate the weighted total of the variable $z$. Save the estimated total, which is the estimate of the population variance ($\hat{S}^2$).
4. Take the square root of the weighted total. Save the result, which is the estimate of the finite population standard deviation.
5. Construct the variable $u$ as in equation (4).
6. Use PROC SURVEYMEANS to estimate the weighted total (and the variance of the total) of the variable $u$. The estimated variance of this total obtained from PROC SURVEYMEANS is an estimator of the variance of $\hat{S}$.

Example

Ice Cream Study Data Set

This example uses the IceCreamStudy data set from the example "Stratified Cluster Sample Design" in the chapter "The SURVEYMEANS Procedure" of the SAS/STAT User's Guide. The study population is a junior high school with a total of 4,000 students in grades 7, 8, and 9. In the original example, researchers want to know how much these students spend weekly for ice cream, on the average, and what percentage of students spend at least $10 weekly for ice cream. This example measures the variability of the students’ expenditures by estimating $S^2$, the variance of the variable that contains the students’ expenditures.

Suppose that every student belongs to a study group and that study groups are formed within each grade level. Each study group contains between two and four students. Table 1 shows the total number of study groups and the total number of students for each grade.

Table 1 Study Groups and Students by Grade
    Grade    Number of Study Groups    Number of Students
    7        608                       1,824
    8        252                       1,025
    9        403                       1,151

It is quicker and more convenient to collect data from students in the same study group than to collect data from students individually. Therefore, this study uses a stratified cluster sample design. The primary sampling units are study groups. The list of all study groups in the school is stratified by grade level. From each grade level, a sample of study groups is randomly selected, and all students in each selected study group are interviewed.
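Under this design, a student's sampling weight is the inverse of the probability that the student's study group is selected. Assuming study groups are drawn with equal probability within each grade, that is the number of study groups in the grade divided by the number sampled. The following DATA step is a small sketch that applies this rule to the counts in Table 1 and to the 8, 3, and 5 study groups that this example samples from grades 7, 8, and 9. The data set name DesignWeights and the variable names NGroups and nSampled are hypothetical.

```sas
/* Sketch: sampling weight = (study groups in grade) / (groups sampled). */
/* DesignWeights, NGroups, and nSampled are illustrative names.          */
data DesignWeights;
   input Grade NGroups nSampled;
   Weight = NGroups / nSampled;
   datalines;
7 608 8
8 252 3
9 403 5
;

proc print data=DesignWeights noobs;
run;
```

The computed values (76.0, 84.0, and 80.6) agree with the Weight variable in the IceCreamStudy data set.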
The sample consists of eight study groups from the 7th grade, three groups from the 8th grade, and five groups from the 9th grade. The SAS data set IceCreamStudy saves the responses of the selected students:

    data IceCreamStudy;
       input Grade StudyGroup Spending Weight @@;
       datalines;
    7  34  7 76.0   7  34  7 76.0   7 412  4 76.0   9  27 14 80.6
    7  34  2 76.0   9 230 15 80.6   9  27 15 80.6   7 501  2 76.0
    9 230  8 80.6   9 230  7 80.6   7 501  3 76.0   8  59 20 84.0
    7 403  4 76.0   7 403 11 76.0   8  59 13 84.0   8  59 17 84.0
    8 143 12 84.0   8 143 16 84.0   8  59 18 84.0   9 235  9 80.6
    8 143 10 84.0   9 312  8 80.6   9 235  6 80.6   9 235 11 80.6
    9 312 10 80.6   7 321  6 76.0   8 156 19 84.0   8 156 14 84.0
    7 321  3 76.0   7 321 12 76.0   7 489  2 76.0   7 489  9 76.0
    7  78  1 76.0   7  78 10 76.0   7 489  2 76.0   7 156  1 76.0
    7  78  6 76.0   7 412  6 76.0   7 156  2 76.0   9 301  8 80.6
    ;

Table 2 identifies the variables contained in the data set IceCreamStudy.

Table 2 Variables in IceCreamStudy Data Set
    Variable      Description
    Grade         Student’s grade (strata)
    StudyGroup    Student’s study group (PSU)
    Spending      Student’s expenditure per week for ice cream, in dollars
    Weight        Sampling weights

The SAS data set StudyGroups is created to provide PROC SURVEYMEANS with the sample design information shown in Table 1. The variable Grade identifies the strata, and the variable _TOTAL_ contains the total number of study groups in each stratum.

    data StudyGroups;
       input Grade _total_;
       datalines;
    7 608
    8 252
    9 403
    ;

Step 1: Compute $\hat{\bar{Y}}$ and $\hat{N}$

Use PROC SURVEYMEANS to obtain an estimate of the sample mean. Specify the MEAN and STACKING options in the PROC SURVEYMEANS statement. The STACKING option causes the procedure to create an output data set with a single observation. This table structure makes it easy in later steps to identify the saved estimates and to assign their values to macro variables. The WEIGHT statement specifies that the variable Weight contains the sampling weights. The STRATA statement specifies that the variable Grade identifies strata membership.
The CLUSTER statement specifies that the variable StudyGroup identifies cluster (or PSU) membership. The ODS OUTPUT statement requests output data sets for the statistics and data summary tables, to be named Statistics and Summary, respectively. The sample mean is stored in the data set Statistics. The data set Summary contains the sum of the sampling weights, the number of strata, and the number of clusters. The sum of the sampling weights is needed to compute $\hat{S}$; the number of strata and the number of clusters are used later to compute confidence limits for $S$.

    proc surveymeans data=IceCreamStudy mean stacking;
       weight Weight;
       strata Grade;
       cluster StudyGroup;
       var Spending;
       ods output Statistics=Statistics Summary=Summary;
    run;

The following DATA step saves the sample mean of the variable Spending in a macro variable named Spending_Mean:

    data _null_;
       set Statistics;
       call symput("Spending_Mean",Spending_Mean);
    run;

The next DATA step saves the sum of the sampling weights in a macro variable named N, the number of strata in a macro variable named H, and the number of clusters in a macro variable named C:

    data Summary;
       set Summary;
       if Label1="Sum of Weights" then call symput("N",cValue1);
       if Label1="Number of Strata" then call symput("H",cValue1);
       if Label1="Number of Clusters" then call symput("C",cValue1);
    run;

Step 2: Construct the Variable z

Construct the variable z in a DATA step by using the macro variables Spending_Mean and N:

    data Working;
       set IceCreamStudy;
       z=(1/(&N-1))*(Spending-&Spending_Mean)**2;
    run;

Step 3: Estimate the Total of z and Take the Square Root of the Total

Use PROC SURVEYMEANS to estimate the weighted total of the variable z. Specify the SUM and STACKING options in the PROC SURVEYMEANS statement. The ODS OUTPUT statement saves the statistics table to a data set named Result.
    proc surveymeans data=Working sum stacking;
       weight Weight;
       var z;
       ods output Statistics=Result;
    run;

The following DATA step retrieves the estimated total of z and stores it in a macro variable named Variance. The total of z is equal to $\hat{S}^2$. Take the square root of the estimated total and store it in a macro variable named StdDev. The square root of the estimated total is the estimate $\hat{S}$ of the finite population standard deviation.

    data Result;
       set Result;
       StdDev=sqrt(z_Sum);
       call symput("Variance",z_Sum);
       call symput("StdDev",StdDev);
    run;

Step 4: Construct the Variable u

Construct the variable u by using the macro variables Spending_Mean, N, Variance, and StdDev.

    data Taylor;
       set IceCreamStudy;
       u=((Spending-&Spending_Mean)**2 - &Variance)/(2*&StdDev*(&N-1));
    run;

Step 5: Estimate the Total of u

Use PROC SURVEYMEANS to estimate the total of the variable u. Specify the SUM, VARSUM, TOTAL=, and STACKING options in the PROC SURVEYMEANS statement. The VARSUM option computes the variance of the total. In this step, the computation of interest is the variance of the estimated total rather than the total itself. Therefore, the sampling design must be appropriately represented in the SURVEYMEANS procedure. The TOTAL= option enables the procedure to apply a finite population correction in the variance computation. The STRATA statement specifies that the strata be identified by the variable Grade, and the CLUSTER statement specifies that cluster membership be identified by the variable StudyGroup. The ODS OUTPUT statement saves the statistics table in a data set named Result.

    proc surveymeans data=Taylor sum varsum stacking total=StudyGroups;
       strata Grade;
       cluster StudyGroup;
       weight Weight;
       var u;
       ods output Statistics=Result;
    run;

The following DATA step creates the variable Estimate in the data set Result and assigns it the value of $\hat{S}$ that is stored in the macro variable StdDev. The 95% confidence limits are computed by using a t distribution whose degrees of freedom equal the number of clusters minus the number of strata, and the data set Result is prepared for printing.
    %let df=%eval(&C - &H);
    data Result;
       set Result(rename=(u_VarSum=Variance u_StdDev=StdErr));
       Estimate=&StdDev;
       LowerCL=Estimate + StdErr*TINV(.025,&df);
       UpperCL=Estimate + StdErr*TINV(.975,&df);
       label Estimate="Population Standard Deviation Estimate"
             Variance="Variance of Estimate"
             StdErr="Standard Error of Estimate"
             LowerCL="Lower Confidence Limit"
             UpperCL="Upper Confidence Limit";
       Variable='Spending';
    run;

Use PROC PRINT to print the contents of the data set Result:

    title 'Parameter Estimates';
    proc print data=Result label noobs;
       var Variable Estimate Variance StdErr LowerCL UpperCL;
    run;
    title;

Output 1 displays the results. The estimate of the population standard deviation of the variable Spending is 5.33. The variance of the estimate is 0.245. The standard error of the estimate is 0.49, and the estimated lower and upper 95% confidence limits are 4.27 and 6.40, respectively.

Output 1 Estimate of Finite Population Standard Deviation

Parameter Estimates
    Variable    Population Standard    Variance of    Standard Error    Lower Confidence    Upper Confidence
                Deviation Estimate     Estimate       of Estimate       Limit               Limit
    Spending    5.33483                0.244809       0.494782          4.26592             6.40374

Using the Delete-One Jackknife Method to Estimate $V(\hat{S})$

The delete-one jackknife resampling method of variance estimation deletes one primary sampling unit (PSU) at a time from the full sample to create $R$ replicates, where $R$ is the total number of PSUs. In each replicate, the sample weights of the remaining PSUs are modified by the jackknife coefficient $\alpha_r$. The modified weights are called replicate weights. If $\hat{S}_r$ is the estimate of $S$ obtained using only the data and the replicate weights from the $r$th replicate, the jackknife variance estimate is

$$\hat{V}_J(\hat{S}) = \sum_{r=1}^{R} \alpha_r \left(\hat{S}_r - \hat{S}\right)^2 \tag{5}$$

with $R - H$ degrees of freedom, where $\alpha_r$ is the jackknife coefficient for the $r$th replicate, $R$ is the number of replicates, and $H$ is the number of strata (or $R - 1$ degrees of freedom when there is no stratification).
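As a small numerical illustration of equation (5), the following statements compute each replicate's contribution $\alpha_r(\hat{S}_r - \hat{S})^2$ and add the contributions up. This is only a sketch: the three replicate estimates in the DATALINES block are invented for illustration (the real analysis has one replicate per PSU, 16 in this example), and the coefficient is assumed to follow the usual delete-one form for stratified designs, $\alpha_r = (n_h - 1)/n_h$, where $n_h$ is the number of sampled PSUs in the stratum of the deleted PSU. The names JKSketch, n_h, S_r, S_full, Alpha_r, and u_r are hypothetical.

```sas
/* Sketch of equation (5) with made-up replicate estimates S_r.       */
/* Alpha_r = (n_h - 1)/n_h is assumed for a stratified design.        */
data JKSketch;
   input Replicate Stratum n_h S_r;
   S_full  = 5.33483;                      /* full-sample estimate, Output 1 */
   Alpha_r = (n_h - 1) / n_h;              /* jackknife coefficient          */
   u_r     = Alpha_r * (S_r - S_full)**2;  /* contribution of replicate r    */
   datalines;
1 7 8 5.21
2 8 3 5.60
3 9 5 5.28
;

/* The unweighted sum of u_r over all replicates is the jackknife     */
/* variance estimate; here it covers only the three invented rows.    */
proc means data=JKSketch sum;
   var u_r;
run;
```

Because the replicate values are hypothetical, the resulting sum has no meaning for the ice cream study. The point is the structure: one row per replicate, a per-replicate contribution, and an unweighted total.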
See the section Jackknife Method in the chapter "The SURVEYMEANS Procedure" of the SAS/STAT User's Guide for more details. Recall that when you construct z k , you use estimates of and that are computed by using the full sample. However, the jackknife variance estimator requires that the be computed from the rth replicate. Thus, the jackknife estimate of the variance of the total of z is not equal to the jackknife estimate of the variance of . The following steps summarize how you estimate , the finite population standard deviation of a variable y, and , the variance of the finite population standard deviation estimator (using the delete-one jackknife method): Use PROC SURVEYMEANS to estimate the sample mean and the sum of the weights for the full sample. Save both estimates as they are used in the construction of z. Construct z k as in equation (3), using the full-sample estimates of and obtained in step 1. Use PROC SURVEYMEANS to estimate the weighted total of the variable z. Take the square root of the total, and save the result, which is the full-sample estimate of the population standard deviation ( ). When you estimate the total, specify the VARMETHOD=JACKKNIFE option and the OUTWEIGHTS= and OUTJKCOEFS= method-options in the PROC SURVEYMEANS statement. Both the OUTWEIGHTS= and OUTJKCOEFS= data sets are used in later steps. For each replicate, use PROC SURVEYMEANS to compute the sample mean and the sum of the weights by using only the data and replicate weights for the rth replicate. Save the estimates for later use. For each replicate, using the estimates for and that were obtained in step 4, construct the variable z such that (6) Use PROC SURVEYMEANS to estimate the weighted total of z by replicate. Take the square root of each estimated total, and save the results for later use. The square root of the estimated weighted total of z r is equal to for the rth replicate. 
Construct a variable (call it u) by using the estimates from step 6, the jackknife coefficients, and the full-sample estimate from step 3 such that Use PROC SURVEYMEANS to estimate the unweighted total of the variable u from step 7. The estimated unweighted total of u is , the delete-one jackknife estimate of the variance of . Example This example uses the same IceCreamStudy data set that was described in the section Ice Cream Study Data Set and reproduces the steps described in the section Using the Delete-One Jackknife Method to Estimate . Steps 1 and 2 are identical to the first two steps in the previous example but are repeated here for completeness. Step 1: Compute and for the Full Sample Use PROC SURVEYMEANS to obtain an estimate of the sample mean. Specify the MEAN and STACKING options in the PROC SURVEYMEANS statement. The WEIGHT statement specifies that the variable Weight contain the sampling weights. The STRATA statement specifies that the variable Grade identifies strata membership. The CLUSTER statement specifies that the variable StudyGroup identifies cluster (or PSU) membership. The ODS OUTPUT statement creates output data sets for the statistics and data summary tables, to be named Statistics and Summary, respectively. The sample mean is stored in the data set Statistics. The data set Summary contains the sum of the sampling weights and the number of strata. 
proc surveymeans data=IceCreamStudy mean stacking ; weight Weight; strata Grade; cluster StudyGroup; var Spending; ods output Statistics = Statistics Summary = Summary; run; The following DATA step saves the sample mean of the variable Spending in a macro variable named Spending_Mean: data _null_; set Statistics; call symput("Spending_Mean",Spending_Mean); run; The next DATA step saves the sum of the sampling weights in a macro variable named N and the number of strata in a macro variable named H: data Summary; set Summary; if Label1="Sum of Weights" then call symput("N",cValue1); if Label1="Number of Strata" then call symput("H",cValue1); run; Step 2: Construct the Variable z Using the Full-Sample Estimates of and Construct the variable z in a DATA step using the macro variables Spending_Mean and N: data Working; set IceCreamStudy; Z=(1/(&N-1))*(Spending-&Spending_Mean)**2; run; Step 3: Estimate the Total of z for the Full Sample Use PROC SURVEYMEANS to estimate the weighted total of the variable z. Specify the SUM and STACKING options in the PROC SURVEYMEANS statement. Also specify the VARMETHOD=JACKKNIFE option with the OUTJKCOEFS= and OUTWEIGHTS= method-options. The OUTJKCOEFS= method-option saves the jackknife coefficients in a SAS data set named Jkcoefs. The OUTWEIGHTS= method-option saves the replicate weights in a SAS data set named Jkweights. In this step you must fully specify the sampling design so that the jackknife coefficients and replicate weights are computed correctly. The STRATA statement specifies that the strata be identified by the variable Grade. The CLUSTER statement specifies that the PSUs be identified by the variable StudyGroup. The WEIGHT statement specifies that the full-sample sampling weights be contained in the variable Weight. The ODS OUTPUT statement saves the statistics table to a data set named Result and the variance estimation table to a data set named VarianceEstimation. 
   proc surveymeans data=Working sum stacking
                    varmethod=JACKKNIFE(outjkcoefs=Jkcoefs outweights=Jkweights);
      strata Grade / list;
      cluster StudyGroup;
      weight Weight;
      var z;
      ods output Statistics=Result VarianceEstimation=VarianceEstimation;
   run;

You can see from the "Variance Estimation" table in Output 2 that there are 16 replicates.

Output 2 Estimate of Population Variance

The SURVEYMEANS Procedure

   Data Summary
   Number of Strata             3
   Number of Clusters          16
   Number of Observations      40
   Sum of Weights          3162.6

   Variance Estimation
   Method               Jackknife
   Number of Replicates        16

The next DATA step retrieves the number of replicates and stores the value in a macro variable named R:

   data _null_;
      set VarianceEstimation;
      where label1="Number of Replicates";
      call symput("R",cvalue1);
   run;
   %let R=%eval(&R);

The data set Jkcoefs has 16 observations, one for each replicate. The rth observation contains the jackknife coefficient for the rth replicate. The data set Jkweights contains the original variables from the IceCreamStudy data set and 16 new variables named RepWt_1 through RepWt_16; there are n = 40 observations.

The following DATA step retrieves the estimated total of the variable z, takes the square root of the estimated total, and stores it in a macro variable named StdDev. The square root of the weighted total of the variable z is $\hat{S}$.

   data _null_;
      set Result;
      StdDev=sqrt(Z_Sum);
      call symput("StdDev",StdDev);
   run;

Step 4: Compute $\hat{\bar{y}}_r$ and $\hat{N}_r$ for Replicate Samples

Before computing $\hat{\bar{y}}_r$ and $\hat{N}_r$, use the following DATA step to convert the data set Jkweights from wide form to long form; doing so enables you to use BY-group processing with PROC SURVEYMEANS.

   data Long(drop=RepWt_1 - RepWt_&R Z);
      set Jkweights;
      array num (*) RepWt_1 - RepWt_&R;
      do replicate=1 to dim(num);
         Jkweight=num(replicate);
         output;
      end;
   run;

The data set Long has 40 x 16 = 640 observations.
There are 16 copies of the original variables from the IceCreamStudy data set stacked on top of each other, and each copy is identified by the variable Replicate. Instead of the 16 replicate weight variables, RepWt_1 through RepWt_16, there is now one variable, Jkweight, which is constructed by stacking the variables RepWt_1 through RepWt_16 on top of each other. Thus, the first 40 observations contain a copy of the original variables and the contents of RepWt_1, and the variable Replicate has a value of 1. The second 40 observations contain a copy of the original variables and the contents of RepWt_2, and the variable Replicate has a value of 2. The remaining observations are constructed and identified similarly.

Next, sort the data set Long by Replicate:

   proc sort data=Long out=Long;
      by Replicate;
   run;

Use PROC SURVEYMEANS to estimate the mean of Spending by Replicate. Doing so produces the estimates of $\hat{\bar{y}}_r$ and $\hat{N}_r$ for each replicate. The WEIGHT statement specifies that the sampling weights be contained in the variable Jkweight. The ODS OUTPUT statement saves the sample means ($\hat{\bar{y}}_r$) in a SAS data set named JKMeans and the sums of the replicate weights ($\hat{N}_r$) in a data set named JKN. By default, the means are stored in a variable named Mean; the sums of the replicate weights are stored in the Summary table variable nValue1, which is renamed N in step 5.

   proc surveymeans data=Long mean;
      weight Jkweight;
      var Spending;
      by Replicate;
      ods output Statistics=JKMeans(keep=Replicate Mean) Summary=JKN;
   run;

Step 5: Construct the Variable z for Replicate Samples

Before you can construct the variable z for the replicate samples, you must merge the data sets JKMeans and JKN with Long, by Replicate:

   proc sort data=JKMeans out=JKMeans;
      by Replicate;
   run;

   data JKN(keep=N replicate);
      set JKN(rename=(nvalue1=N));
      where Label1="Sum of Weights";
   run;

   proc sort data=JKN out=JKN;
      by Replicate;
   run;

   data Long;
      merge Long JKN JKMeans;
      by Replicate;
   run;

Now construct the variable z using the merged data set.
   data Long;
      set Long;
      z=(1/(N-1))*(Spending-Mean)**2;
   run;

Step 6: Estimate the Total of z for Replicate Samples

Use PROC SURVEYMEANS to estimate the total of the variable z by Replicate. The WEIGHT statement specifies that the sampling weights be contained in the variable Jkweight. You do not need to specify the STRATA and CLUSTER statements, because only the point estimates from this step are used. The ODS OUTPUT statement saves the estimated totals in the variable JKEstimate in a SAS data set named Statistics. The estimated totals are the estimates $\hat{S}_r^2$ for each replicate.

   proc surveymeans data=Long sum stacking;
      weight Jkweight;
      var z;
      by Replicate;
      ods output Statistics=Statistics(rename=(Z_Sum=JKEstimate));
   run;

Take the positive square roots of the estimated totals. The results are the estimates $\hat{S}_r$ for each replicate.

   data Statistics;
      set Statistics(drop=Z_StdDEV z);
      JKEstimate=sqrt(JKEstimate);
   run;

Step 7: Construct the Variable u

Before you can construct the variable u, you must sort and merge, by Replicate, the data sets Statistics and Jkcoefs:

   proc sort data=Statistics out=Statistics;
      by Replicate;
   run;

   proc sort data=Jkcoefs out=Jkcoefs;
      by Replicate;
   run;

   data Statistics;
      merge Statistics Jkcoefs;
      by Replicate;
   run;

The data set Statistics now contains the jackknife coefficients $\alpha_r$ in the variable JKCoefficient and the estimates $\hat{S}_r$ in the variable JKEstimate. Construct the variable u by using these variables and the full-sample estimate $\hat{S}$ that is saved in the macro variable StdDev.

   data Statistics;
      set Statistics;
      u=JKCoefficient*(JKEstimate-&StdDev)**2;
   run;

Step 8: Estimate the Total of u

Use PROC SURVEYMEANS to compute the unweighted total of u. Specify the SUM option in the PROC SURVEYMEANS statement. The ODS OUTPUT statement saves the total in a variable named Variance in a SAS data set named Result.

   proc surveymeans data=Statistics sum;
      var u;
      ods output Statistics=Result(rename=(sum=Variance));
   run;

The following DATA step computes the standard error of the estimate and the upper and lower 95% confidence limits.
In this example, the confidence limits are computed using a t distribution with R – H = 16 – 3 = 13 degrees of freedom. The variable Estimate is generated and assigned the estimated value of that is stored in the macro variable StdDev. Labels are created for the existing variables, a new variable Variable is generated, and its value is specified to be the name of the variable that is being analyzed (Spending). %let df=%eval(&R-&H); data Result; set Result; StdErr=sqrt(Variance); Estimate=&StdDev; UpperCL=Estimate + StdErr*TINV(.975,&df); LowerCL=Estimate + StdErr*TINV(.025,&df); label Estimate=Population Standard Deviation Estimate Variance=Variance of Estimate StdErr=Standard Error of Estimate LowerCL=Lower Confidence Limit UpperCL=Upper Confidence Limit; Variable='Spending'; run; Use the PRINT procedure to print the contents of the data set Result: title 'Parameter Estimates'; proc print data=Result label noobs; var Variable Estimate Variance StdErr LowerCL UpperCL; run; title ; Output 3 displays the results. The estimate of the population standard deviation for the variable Spending is 5.33. The variance of the estimate is 0.27, and the standard error of the estimate is 0.52. The estimated lower and upper 95% confidence limits are 4.21 and 6.46, respectively. Output 3 Estimate of Finite Population Standard Deviation Parameter Estimates Variable Population Standard Deviation Estimate Variance of Estimate Standard Error of Estimate Lower Confidence Limit Upper Confidence Limit Spending 5.33483 0.271465 0.52102 4.20923 6.46043 Using the BRR Method to Estimate The BRR method requires that the full sample be drawn by using a stratified sample design with two PSUs per stratum. If H is the total number of strata, the total number of replicates R is the smallest multiple of four that is greater than H. Each replicate is obtained by deleting one PSU per stratum according to the corresponding Hadamard matrix and adjusting the original weights for the remaining PSUs. 
The new weights are called replicate weights. If is the estimate of S obtained by using only the data and the replicate weights from the rth replicate, the BRR variance estimate is (7) with H degrees of freedom. See the section Balanced Repeated Replication (BRR) Method in the chapter "The SURVEYMEANS Procedure" of the SAS/STAT User's Guide for more details. Recall that when you construct z k , you use estimates of and that are computed by using the full sample. However, the BRR variance estimator requires that the be computed from the rth replicate. Thus, the BRR estimate of the variance of the total of z is not equal to the BRR estimate of the variance of . The following steps summarize how you estimate S, the finite population standard deviation of a variable y, and , the variance of the finite population standard deviation estimator (using the BRR method): Use PROC SURVEYMEANS to estimate the sample mean and the sum of the weights for the full sample. Save both estimates for later use: they are used in the construction of z. Also save the number of strata H for later use. Construct z k as in equation (3) by using the full-sample estimates of and obtained in step 1. Use PROC SURVEYMEANS to estimate the weighted total of the variable z, take the square root of the estimated total, and save the result. The square root of the estimated total is the full-sample estimate of the population standard deviation ( ). When you estimate the total, specify the VARMETHOD=BRR option and the OUTWEIGHTS= method-option in the PROC SURVEYMEANS statement. The OUTWEIGHTS= SAS data set is used in later steps. Also save the number of replicates R for later use. For each replicate, use PROC SURVEYMEANS to estimate the sample mean and the sum of the weights by using only the data and replicate weights for the rth replicate. Save the estimates for later use. 
For each replicate, using the estimates for and that were obtained in step 4, construct the variable z such that (8) Use PROC SURVEYMEANS to estimate the weighted total of z by replicate, take the positive square root of each estimated total, and save the results for later use. The square root of the estimated weighted total of z r is equal to for the rth replicate. Construct a variable (call it u) by using the estimates from step 6, the number of replicates R, and the full-sample estimate from step 3 such that Use PROC SURVEYMEANS to estimate the unweighted total of the variable u from step 7. The estimated unweighted total of u is , the BRR estimate of the variance of . Example This example uses the MUNIsurvey data set from the section Variance Estimation Using Replication Methods in the chapter "The SURVEYMEANS Procedure" of the SAS/STAT User's Guide. The data are not shown here, but a SAS program that generates the data is included in the sample SAS code that you can download for this example. In the original example, the San Francisco Municipal Railway (MUNI) conducted a survey to estimate the average waiting time for MUNI subway system’s passengers. This example estimates the standard deviation of the passengers’ waiting time. The study uses a stratified cluster sample design. Each MUNI subway line is a stratum. The subway lines included in the study are 'J-Church,' 'K-Ingleside,' 'L-Taraval,' 'M-Ocean View,' 'N-Judah,' and the street car 'F-Market & Wharves.' The MUNI vehicles in service for these lines during a day are the primary sampling units. Within each stratum, two vehicles (PSUs) are randomly selected. Then the waiting times of passengers for a selected MUNI vehicle are collected. The collected data are saved in the SAS data set MUNIsurvey. Table 3 identifies the variables contained in the data set. 
Table 3 Variables in MUNIsurvey Data Set Variable Description Line The MUNI line that a passenger is riding (strata) Vehicle The vehicle that a passenger is boarding (PSU) Waittime The time (in minutes) that a passenger waited Weight Sampling weights Step 1: Compute and for the Full Sample Use PROC SURVEYMEANS to obtain estimates of the sample mean ( ) and the sum of the sampling weights ( ) for the full sample. Specify the MEAN and STACKING options in the PROC SURVEYMEANS statement. The WEIGHT statement specifies that the sampling weights be contained in the variable Weight. The STRATA statement specifies that the strata be identified by the variable Line. The CLUSTER statement specifies that the PSUs be identified by the variable Vehicle. The ODS OUTPUT statement produces output data sets for the statistics and data summary tables, to be named Statistics and Summary, respectively. The sample mean is stored in the data set Statistics. The sum of the sampling weights and the number of strata are stored in the data set Summary. 
proc surveymeans data=MUNIsurvey mean stacking ; weight Weight; strata Line; cluster Vehicle; var Waittime; ods output Statistics = Statistics Summary = Summary; run; The following DATA step saves the sample mean ( ) of the variable Waittime in a macro variable named Waittime_Mean: data _null_; set Statistics; call symput("Waittime_Mean",Waittime_Mean); run; The next DATA step saves the sum of the sampling weights in a macro variable named N and the number of strata in a macro variable named H: data Summary; set Summary; if Label1="Sum of Weights" then call symput("N",cValue1); if Label1="Number of Strata" then call symput("H",cValue1); run; Step 2: Construct the Variable z Using the Full-Sample Estimates of and Construct the variable z in a DATA step by using the macro variables Waittime_Mean and N: data Working; set MUNIsurvey; Z=(1/(&N-1))*(Waittime-&Waittime_Mean)**2; run; Step 3: Estimate the Total of z for the Full Sample Use PROC SURVEYMEANS to estimate the total of the variable z. Specify the SUM and STACKING options in the PROC SURVEYMEANS statement. Also specify the VARMETHOD=BRR OUTWEIGHTS= method-options. The OUTWEIGHTS= method-option saves the replicate weights in a SAS data set named BRRweights. In this step you must fully specify the sampling design so that the replicate weights are computed correctly. The STRATA statement specifies that the strata be identified by the variable Line. The CLUSTER statement specifies that the PSUs be identified by the variable Vehicle. The WEIGHT statement specifies that the full-sample sampling weights be contained in the variable Weight. The ODS OUTPUT statement saves the statistics table to a data set named Estimate and the variance estimation table to a data set named VarianceEstimation. 
   proc surveymeans data=Working sum stacking varmethod=brr(outweights=BRRweights);
      strata Line;
      cluster Vehicle;
      weight Weight;
      var z;
      ods output Statistics=Estimate VarianceEstimation=VarianceEstimation;
   run;

Output 4 Estimate of Population Variance

The SURVEYMEANS Procedure

   Data Summary
   Number of Strata             6
   Number of Clusters          12
   Number of Observations    1937
   Sum of Weights          143040

   Variance Estimation
   Method                     BRR
   Number of Replicates         8

There are n = 1,937 observations and R = 8 replicates. The data set BRRweights contains the original variables from the MUNIsurvey data set and eight new variables named RepWt_1 through RepWt_8.

The following DATA step retrieves the estimated total of the variable z, takes the square root of the total, and stores the result in a macro variable named StdDev. The square root of the total of the variable z is equal to $\hat{S}$.

   data _null_;
      set Estimate;
      StdDev=sqrt(Z_Sum);
      call symput("StdDev",StdDev);
   run;

The next DATA step retrieves the number of replicates and stores the value in a macro variable named R:

   data _null_;
      set VarianceEstimation;
      where label1="Number of Replicates";
      call symput("R",cvalue1);
   run;
   %let R=%eval(&R);

Step 4: Compute $\hat{\bar{y}}_r$ and $\hat{N}_r$ for Replicate Samples

Before computing $\hat{\bar{y}}_r$ and $\hat{N}_r$, use the following DATA step to convert the data set BRRweights from wide form to long form; doing so enables you to use BY-group processing with PROC SURVEYMEANS.

   data Long(drop=RepWt_1 - RepWt_&R Z);
      set BRRweights;
      array num (*) RepWt_1 - RepWt_&R;
      do replicate=1 to dim(num);
         BRRweight=num(replicate);
         output;
      end;
   run;

The data set Long has 1,937 x 8 = 15,496 observations. There are eight copies of the original variables from the MUNIsurvey data set stacked on top of each other, and each copy is identified by the variable Replicate. Instead of the eight replicate weight variables, RepWt_1 through RepWt_8, there is now one variable, BRRweight, which is constructed by stacking the variables RepWt_1 through RepWt_8 on top of each other.
Thus, the first 1,937 observations contain a copy of the original variables and the contents of RepWt_1, and the variable Replicate has a value of 1. The second 1,937 observations contain a copy of the original variables and the contents of RepWt_2, and the variable Replicate has a value of 2. The remaining observations are constructed and identified similarly.

Next, sort the data set Long by Replicate:

   proc sort data=Long out=Long;
      by Replicate;
   run;

Use PROC SURVEYMEANS to estimate the mean of Waittime by Replicate. Doing so produces the estimates of $\hat{\bar{y}}_r$ and $\hat{N}_r$ for each replicate. The WEIGHT statement specifies that the sampling weights be contained in the variable BRRweight. The ODS OUTPUT statement saves the sample means in a SAS data set named BRRMeans and the sums of the replicate weights in a data set named BRRN.

   proc surveymeans data=Long mean;
      weight BRRweight;
      var Waittime;
      by Replicate;
      ods output Statistics=BRRMeans(keep=Replicate Mean) Summary=BRRN;
   run;

Step 5: Construct the Variable z

Before you can construct the variable z, you must merge the data sets BRRMeans and BRRN with Long by Replicate:

   proc sort data=BRRMeans out=BRRMeans;
      by Replicate;
   run;

   data BRRN(keep=N replicate);
      set BRRN(rename=(nvalue1=N));
      where Label1="Sum of Weights";
   run;

   proc sort data=BRRN out=BRRN;
      by Replicate;
   run;

   data Long;
      merge Long BRRN BRRMeans;
      by Replicate;
   run;

Now construct the variable z using the merged data set:

   data Long;
      set Long;
      z=(1/(N-1))*(Waittime-Mean)**2;
   run;

Step 6: Estimate the Total of z for the Replicate Samples

Use PROC SURVEYMEANS to estimate the total of the variable z by Replicate. The WEIGHT statement specifies that the sampling weights be contained in the variable BRRweight. You do not need to specify the STRATA and CLUSTER statements, because only the point estimates from this step are used. The ODS OUTPUT statement saves the estimated totals in the variable BRREstimate in a SAS data set named Statistics. The estimated totals are the estimates $\hat{S}_r^2$ for each replicate.
   proc surveymeans data=Long sum stacking;
      weight BRRweight;
      var z;
      by Replicate;
      ods output Statistics=Statistics(rename=(Z_Sum=BRREstimate));
   run;

Take the square root of each estimated total. The results are the estimates $\hat{S}_r$ for each replicate.

   data Statistics;
      set Statistics(drop=Z_StdDEV z);
      BRREstimate=sqrt(BRREstimate);
   run;

Step 7: Construct the Variable u

Construct the variable u by using the replicate estimates in the variable BRREstimate, the number of replicates that is saved in the macro variable R, and the full-sample estimate $\hat{S}$ that is saved in the macro variable StdDev:

   data Statistics;
      set Statistics;
      u=(1/&R)*(BRREstimate-&StdDev)**2;
   run;

Step 8: Estimate the Total of u

Use PROC SURVEYMEANS to compute the unweighted total of u. Specify the SUM option in the PROC SURVEYMEANS statement. The ODS OUTPUT statement saves the total in a variable named Variance in a SAS data set named Result.

   proc surveymeans data=Statistics sum;
      var u;
      ods output Statistics=Result(rename=(sum=Variance));
   run;

The following DATA step computes the standard error of the estimate and the upper and lower 95% confidence limits. The confidence limits for this example are computed by using a t distribution with H = 6 degrees of freedom. The variable Estimate is generated and assigned the estimated value of $\hat{S}$, which is stored in the macro variable StdDev. The data set is also prepared for printing.

   data Result;
      set Result;
      StdErr=sqrt(Variance);
      Estimate=&StdDev;
      UpperCL=Estimate + StdErr*TINV(.975,&H);
      LowerCL=Estimate + StdErr*TINV(.025,&H);
      Variable='Waittime';
      label Estimate="Population Standard Deviation Estimate"
            Variance="Variance of Estimate"
            StdErr="Standard Error of Estimate"
            LowerCL="Lower Confidence Limit"
            UpperCL="Upper Confidence Limit";
   run;

Use the PRINT procedure to print the contents of the data set Result:

   title 'Parameter Estimates';
   proc print data=Result label noobs;
      var Variable Estimate Variance StdErr LowerCL UpperCL;
   run;
   title;

Output 5 displays the results. The estimate of the population standard deviation for the variable Waittime is 4.24. The variance of the estimate is 0.03, and the standard error of the estimate is 0.17.
The estimated lower and upper 95% confidence limits are 3.82 and 4.67, respectively. Output 5 Estimate of Finite Population Standard Deviation Parameter Estimates Variable Population Standard Deviation Estimate Variance of Estimate Standard Error of Estimate Lower Confidence Limit Upper Confidence Limit Waittime 4.24495 0.029935 0.17302 3.82159 4.66831 References Särndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

AlexBeaver · ‎12-20-2023

Overview The finite population variance of a variable provides a measure of the amount of variation in the corresponding attribute of the study population’s members, thus helping to describe the distribution of a study variable. Whether you are studying a population’s income distribution in a socioeconomic study, rainfall distribution in a meteorological study, or scholastic aptitude test (SAT) scores of high school seniors, a small population variance is indicative of uniformity in the population while a large variance is indicative of a more diverse population. Another use for the population variance is to determine sample size. For example, the U.S. Environmental Protection Agency uses estimated population variances from pilot studies such as the Environmental Monitoring and Assessment Program–Surface Waters Northeast Lakes Pilot study to assist in planning future sampling strategies (Courbois and Urquhart; 2004). Suppose you have data that were sampled according to some complex survey design. The SURVEYMEANS procedure enables you to estimate finite population totals, means, and ratios in addition to the design-based variances of the estimated quantities, but it does not directly estimate the finite population variance of a variable. However, because a variance can be expressed mathematically as a total, you can easily estimate the finite population variance S 2 of a variable by using PROC SURVEYMEANS plus a little SAS programming. Whenever you estimate a population parameter such as a mean or a variance, you should also report the precision of the estimate. The most commonly reported measure of precision is the variance (or its square root, the standard error). The survey analysis procedures in SAS/STAT software currently provide three different variance estimation methods for complex survey designs: the Taylor series linearization method, the delete-one jackknife method, and the balanced repeated replication (BRR) method. 
This example demonstrates how to use all three methods to estimate the variance $V(\hat{S}^2)$. Because the finite population parameter of interest in this example is the variance of a variable, the measure of precision of the estimate is the variance of a variance. Therefore, as you consider the example, it is important to keep in mind the distinction between the two different meanings of the word variance. In one context, a variance is estimated in order to describe the distribution of a variable. A variance used in this context is denoted $S^2$ and its estimator is denoted $\hat{S}^2$. In the other context, a variance is estimated in order to describe the sampling distribution of an estimator. A variance used in this context is denoted $V(\hat{S}^2)$ and its estimator is denoted $\hat{V}(\hat{S}^2)$.

Analysis

Suppose you want to estimate the variance of a variable y from a finite population using data that were sampled according to some complex survey design. The finite population variance of y is

$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(y_i - \bar{Y}\right)^2 \tag{1}$$

where N is the total number of elements in the population, $y_i$ is the ith observation of the variable y, and $\bar{Y}$ is the population mean of y. A sample-based estimator of $S^2$ is

$$\hat{S}^2 = \frac{1}{\hat{N}-1} \sum_{k=1}^{n} \frac{\left(y_k - \hat{\bar{y}}\right)^2}{\pi_k} \tag{2}$$

where $\hat{N}$ is an estimator of the population total N, $\hat{\bar{y}}$ is an estimator of the population mean, n is the number of elements in the sample, and $\pi_k$ is the probability that element k is observed in the sample.

To estimate $S^2$, you first estimate both $N$ and $\bar{Y}$ with PROC SURVEYMEANS. Next, you generate a variable (call it z) such that each observation $z_k$ is equal to

$$z_k = \frac{\left(y_k - \hat{\bar{y}}\right)^2}{\hat{N}-1} \tag{3}$$

Now you use PROC SURVEYMEANS to estimate the total of z. The estimated weighted total of z is equal to $\hat{S}^2$. However, the variance of the weighted total of z that is computed by PROC SURVEYMEANS, regardless of which VARMETHOD= option you select, is not equal to $\hat{V}(\hat{S}^2)$, the variance of the estimate $\hat{S}^2$. Computing $\hat{V}(\hat{S}^2)$ requires some additional SAS programming.
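The statement that the weighted total of z reproduces the variance estimator can be checked in one line. The sketch below assumes, as in the examples that follow, that the sampling weight is the inverse of the inclusion probability, $w_k = 1/\pi_k$ (the symbol $w_k$ is notation introduced here), and that $\hat{N}$ and $\hat{\bar{y}}$ are held fixed at their full-sample estimates.

```latex
% z_k is the squared deviation scaled by (N-hat - 1), as in equation (3)
z_k = \frac{\left(y_k - \hat{\bar{y}}\right)^2}{\hat{N} - 1}

% Weighted total of z over the n sampled elements
\sum_{k=1}^{n} w_k z_k
  = \sum_{k=1}^{n} \frac{1}{\pi_k} \cdot \frac{\left(y_k - \hat{\bar{y}}\right)^2}{\hat{N} - 1}
  = \frac{1}{\hat{N} - 1} \sum_{k=1}^{n} \frac{\left(y_k - \hat{\bar{y}}\right)^2}{\pi_k}
  = \hat{S}^2
```

Because $z_k$ itself depends on the estimates $\hat{N}$ and $\hat{\bar{y}}$, a variance computed as if z were a fixed, observed variable does not account for how those estimates enter the estimator. This is one way to see why the variance of the total of z reported by PROC SURVEYMEANS is not the variance of $\hat{S}^2$.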
Using the Taylor Series Linearization Method to Estimate To estimate by using the Taylor series linearization method, construct a variable u, such that (4) where is computed as in equation (2). Use PROC SURVEYMEANS to estimate the total (and the variance of the total) of u. The total that is computed by PROC SURVEYMEANS is of no interest, but the variance of the total is equal to , the variance of the estimate (Särndal, Swensson, and Wretman 1992, chap. 5.5). The following steps summarize how you estimate S 2 , the finite population variance of a variable y, and , the variance of the finite population variance estimator (using the Taylor series linearization method): Use PROC SURVEYMEANS to estimate the sample mean of the variable y, and save the estimated mean. PROC SURVEYMEANS also computes the sum of the sampling weights, which is the value of in the analysis. Save that value also; it is used in the construction of z. Using the sample mean from step 1, construct the variable z as in equation (3). Use PROC SURVEYMEANS to estimate the weighted total of the variable z. Save the estimated total, which is the estimate of the population variance ( ). Using the sample mean from step 1 and the estimate of S 2 obtained in step 3, construct the variable u as in equation (4). Use PROC SURVEYMEANS to estimate the weighted total of the variable u. The estimated variance of this total obtained from PROC SURVEYMEANS is an estimator of the variance of . Example Ice Cream Study Data Set This example uses the IceCreamStudy data set from the example "Stratified Cluster Sample Design" in the chapter "The SURVEYMEANS Procedure" of the SAS/STAT User's Guide. The study population is a junior high school with a total of 4,000 students in grades 7, 8, and 9. In the original example, researchers want to know how much these students spend weekly for ice cream, on the average, and what percentage of students spend at least $10 weekly for ice cream. 
This example measures the variability of the students’ expenditures by estimating S 2 , the variance of the variable that contains the students’ expenditures. Suppose that every student belongs to a study group and that study groups are formed within each grade level. Each study group contains between two and four students. Table 1 shows the total number of study groups and the total number of students for each grade. Table 1 Study Groups and Students by Grade Grade Number of Study Groups Number of Students 7 608 1,824 8 252 1,025 9 403 1,151 It is quicker and more convenient to collect data from students in the same study group than to collect data from students individually. Therefore, this study uses a stratified clustered sample design. The primary sampling units are study groups. The list of all study groups in the school is stratified by grade level. From each grade level, a sample of study groups is randomly selected, and all students in each selected study group are interviewed. The sample consists of eight study groups from the 7th grade, three groups from the 8th grade, and five groups from the 9th grade. The SAS data set IceCreamStudy saves the responses of the selected students: data IceCreamStudy; input Grade StudyGroup Spending Weight @@; datalines; 7 34 7 76.0 7 34 7 76.0 7 412 4 76.0 9 27 14 80.6 7 34 2 76.0 9 230 15 80.6 9 27 15 80.6 7 501 2 76.0 9 230 8 80.6 9 230 7 80.6 7 501 3 76.0 8 59 20 84.0 7 403 4 76.0 7 403 11 76.0 8 59 13 84.0 8 59 17 84.0 8 143 12 84.0 8 143 16 84.0 8 59 18 84.0 9 235 9 80.6 8 143 10 84.0 9 312 8 80.6 9 235 6 80.6 9 235 11 80.6 9 312 10 80.6 7 321 6 76.0 8 156 19 84.0 8 156 14 84.0 7 321 3 76.0 7 321 12 76.0 7 489 2 76.0 7 489 9 76.0 7 78 1 76.0 7 78 10 76.0 7 489 2 76.0 7 156 1 76.0 7 78 6 76.0 7 412 6 76.0 7 156 2 76.0 9 301 8 80.6 ; Table 2 identifies the variables contained in the data set IceCreamStudy. 
Table 2 Variables in IceCreamStudy Data Set

Variable     Description
Grade        Student’s grade (strata)
StudyGroup   Student’s study group (PSU)
Spending     Student’s expenditure per week for ice cream, in dollars
Weight       Sampling weights

The SAS data set StudyGroups is created to provide PROC SURVEYMEANS with the sample design information shown in Table 1. The variable Grade identifies the strata, and the variable _TOTAL_ contains the total number of study groups in each stratum.

    data StudyGroups;
       input Grade _total_;
       datalines;
    7 608
    8 252
    9 403
    ;

Step 1: Compute $\hat{\bar{y}}$ and $\hat{N}$

Use PROC SURVEYMEANS to obtain an estimate of the sample mean. Specify the MEAN and STACKING options in the PROC SURVEYMEANS statement. The STACKING option causes the procedure to create an output data set with a single observation. This table structure makes it easy in later steps to identify the saved estimates and to assign their values to macro variables. The WEIGHT statement specifies that the variable Weight contains the sampling weights. The STRATA statement specifies that the variable Grade identifies strata membership. The CLUSTER statement specifies that the variable StudyGroup identifies cluster (or PSU) membership.

The ODS OUTPUT statement requests output data sets for the statistics and data summary tables, to be named Statistics and Summary, respectively. The sample mean is stored in the data set Statistics. The data set Summary contains the sum of the sampling weights, the number of strata, and the number of clusters. The sum of the sampling weights is needed to compute $\hat{S}^2$; the number of strata and the number of clusters are used later when computing confidence limits for $S^2$.
    proc surveymeans data=IceCreamStudy mean stacking;
       weight Weight;
       strata Grade;
       cluster StudyGroup;
       var Spending;
       ods output Statistics=Statistics Summary=Summary;
    run;

The following DATA step saves the sample mean of the variable Spending in a macro variable named Spending_Mean:

    data _null_;
       set Statistics;
       call symput("Spending_Mean",Spending_Mean);
    run;

The next DATA step saves the sum of the sampling weights in a macro variable named N, the number of strata in a macro variable named H, and the number of clusters in a macro variable named C:

    data Summary;
       set Summary;
       if Label1="Sum of Weights" then call symput("N",cValue1);
       if Label1="Number of Strata" then call symput("H",cValue1);
       if Label1="Number of Clusters" then call symput("C",cValue1);
    run;

Step 2: Construct the Variable z

Construct the variable z in a DATA step by using the macro variables Spending_Mean and N:

    data Working;
       set IceCreamStudy;
       z=(1/(&N-1))*(Spending-&Spending_Mean)**2;
    run;

Step 3: Estimate the Total of z

Use PROC SURVEYMEANS to estimate the weighted total of the variable z. Specify the SUM and STACKING options in the PROC SURVEYMEANS statement. The ODS OUTPUT statement saves the statistics table to a data set named Result.

    proc surveymeans data=Working sum stacking;
       weight Weight;
       var z;
       ods output Statistics=Result(keep=z_Sum);
    run;

The following DATA step retrieves the estimated total of z and stores it in a macro variable named Variance. The total of z is equal to $\hat{S}^2$.

    data _null_;
       set Result;
       call symput("Variance",z_Sum);
    run;

Step 4: Construct the Variable u

Construct the variable u by using the macro variables Spending_Mean, N, and Variance:

    data Taylor;
       set IceCreamStudy;
       u=(1/(&N-1))*((Spending-&Spending_Mean)**2 - &Variance);
    run;

Step 5: Estimate the Total of u

Use PROC SURVEYMEANS to estimate the total of the variable u. Specify the SUM, VARSUM, TOTAL=, and STACKING options in the PROC SURVEYMEANS statement. The VARSUM option computes the variance of the total.
In this step, the computation of interest is the variance of the estimated total rather than the total itself. Therefore, the sampling design must be appropriately represented in the SURVEYMEANS procedure. The TOTAL= option is specified to enable the procedure to apply a finite population correction in the variance computation. The STRATA statement specifies that the strata be identified by the variable Grade, and the CLUSTER statement specifies that cluster membership be identified by the variable StudyGroup. The ODS OUTPUT statement saves the statistics table in a data set named TaylorResult.

    proc surveymeans data=Taylor sum varsum stacking total=StudyGroups;
       strata Grade;
       cluster StudyGroup;
       weight Weight;
       var u;
       ods output Statistics=TaylorResult;
    run;

The following DATA step creates the variable Estimate in the TaylorResult data set and assigns it the value of $\hat{S}^2$, which is stored in the macro variable Variance. The 95% confidence limits are computed, and the TaylorResult data set is prepared for printing.

Note: The 95% confidence limits are computed in this example by using a t distribution with &df = &C - &H degrees of freedom. This results in a confidence interval that is symmetric about the estimated parameter. Confidence intervals constructed in this manner have good coverage properties; however, negative lower confidence limits are possible. There are alternative methods for computing confidence intervals that exclude the possibility of negative lower confidence limits. For example, if the study variable is approximately normally distributed, confidence limits can be computed by using a chi-square distribution. Another possibility is to use the t distribution with the lower confidence limit computed as $\max\{0,\ \hat{S}^2 - t_{0.975,\,df}\sqrt{\hat{V}(\hat{S}^2)}\}$. In the simple case that is presented in this example, the latter method is acceptable. However, there are situations where it is not.
Whatever method you choose, it is important that the confidence intervals be constructed in a manner that is consistent with any assumptions you make about the underlying data and the parameter estimation method.

    %let df=%eval(&C - &H);

    data TaylorResult;
       set TaylorResult(rename=(u_VarSum=Variance u_StdDev=StdErr));
       Estimate=&Variance;
       LowerCL=Estimate + StdErr*TINV(.025,&df);
       UpperCL=Estimate + StdErr*TINV(.975,&df);
       label Estimate="Population Variance Estimate"
             Variance="Variance of Estimate"
             StdErr="Standard Error of Estimate"
             LowerCL="Lower Confidence Limit"
             UpperCL="Upper Confidence Limit";
       Variable='Spending';
    run;

Use PROC PRINT to print the contents of the data set TaylorResult:

    title 'Parameter Estimates';
    proc print data=TaylorResult label noobs;
       var Variable Estimate Variance StdErr LowerCL UpperCL;
    run;
    title;

Output 1 displays the results. The estimate of the population variance of the variable Spending is 28.46. The variance of the estimate is 27.87. The standard error of the estimate is 5.28, and the estimated lower and upper 95% confidence limits are 17.05 and 39.86, respectively.

Output 1 Estimate of Population Variance

Parameter Estimates

Variable   Population Variance   Variance of   Standard Error   Lower Confidence   Upper Confidence
           Estimate              Estimate      of Estimate      Limit              Limit
Spending   28.4604               27.869473     5.279155         17.0555            39.8653

Using the Delete-One Jackknife Method to Estimate $V(\hat{S}^2)$

The delete-one jackknife resampling method of variance estimation deletes one primary sampling unit (PSU) at a time from the full sample to create R replicates, where R is the total number of PSUs. In each replicate, the sample weights of the remaining PSUs are modified by the jackknife coefficient $\alpha_r$. The modified weights are called replicate weights.
If $\hat{S}^2_r$ is the estimate of $S^2$ obtained by using only the data and the replicate weights from the rth replicate, the jackknife variance estimate is

$$\hat{V}_J(\hat{S}^2) = \sum_{r=1}^{R} \alpha_r \left(\hat{S}^2_r - \hat{S}^2\right)^2 \qquad (5)$$

with R - H degrees of freedom, where $\alpha_r$ is the jackknife coefficient for the rth replicate, R is the number of replicates, and H is the number of strata (the degrees of freedom are R - 1 when there is no stratification). See the section "Jackknife Method" in the chapter "The SURVEYMEANS Procedure" of the SAS/STAT User's Guide for more details.

Recall that when you construct $z_k$, you use estimates of $\bar{y}$ and $N$ that are computed by using the full sample. However, the jackknife variance estimator requires that the $\hat{S}^2_r$ be computed from the rth replicate. Thus, the jackknife estimate of the variance of the total of z is not equal to the jackknife estimate of the variance of $\hat{S}^2$.

The following steps summarize how you estimate $S^2$, the finite population variance of a variable y, and $V(\hat{S}^2)$, the variance of the finite population variance estimator (using the delete-one jackknife method):

1. Use PROC SURVEYMEANS to estimate the sample mean and the sum of the weights for the full sample. Save both estimates because they are used in the construction of z.
2. Construct $z_k$ as in equation (3), using the full-sample estimates of $\bar{y}$ and $N$ obtained in step 1.
3. Use PROC SURVEYMEANS to estimate the weighted total of the variable z. Save the estimated total, which is the full-sample estimate of the population variance ($\hat{S}^2$). When you estimate the total, specify the VARMETHOD=JACKKNIFE option and the OUTWEIGHTS= and OUTJKCOEFS= method-options in the PROC SURVEYMEANS statement. Both the OUTWEIGHTS= and OUTJKCOEFS= data sets are used in later steps.
4. For each replicate, use PROC SURVEYMEANS to compute the sample mean and the sum of the weights by using only the data and replicate weights for the rth replicate. Save the estimates for later use.
5. For each replicate, using the estimates for $\bar{y}$ and $N$ that were obtained in step 4, construct the variable z such that

$$z_{rk} = \frac{1}{\hat{N}_r-1}\left(y_k-\hat{\bar{y}}_r\right)^2 \qquad (6)$$

6. Use PROC SURVEYMEANS to estimate the weighted total of z by replicate, and save the estimates for later use. The estimated weighted total of $z_r$ is equal to $\hat{S}^2_r$ for the rth replicate.
7. Construct a variable (call it u) by using the estimates from step 6, the jackknife coefficients, and the full-sample estimate from step 3 such that

$$u_r = \alpha_r\left(\hat{S}^2_r - \hat{S}^2\right)^2$$

8. Use PROC SURVEYMEANS to estimate the unweighted total of the variable u from step 7. The estimated unweighted total of u is $\hat{V}_J(\hat{S}^2)$, the delete-one jackknife estimate of the variance of $\hat{S}^2$.

Example

This example uses the same IceCreamStudy data set that was described in the section Ice Cream Study Data Set and reproduces the steps described in the section Using the Delete-One Jackknife Method to Estimate $V(\hat{S}^2)$. Steps 1 and 2 are identical to the first two steps in the previous example but are repeated here for completeness.

Step 1: Compute $\hat{\bar{y}}$ and $\hat{N}$ for the Full Sample

Use PROC SURVEYMEANS to obtain an estimate of the sample mean. Specify the MEAN and STACKING options in the PROC SURVEYMEANS statement. The WEIGHT statement specifies that the variable Weight contains the sampling weights. The STRATA statement specifies that the variable Grade identifies strata membership. The CLUSTER statement specifies that the variable StudyGroup identifies cluster (or PSU) membership.

The ODS OUTPUT statement creates output data sets for the statistics and data summary tables, to be named Statistics and Summary, respectively. The sample mean is stored in the data set Statistics. The data set Summary contains the sum of the sampling weights and the number of strata. The sum of the sampling weights is needed to compute $\hat{S}^2$; the number of strata is used later when computing confidence limits for $S^2$.
    proc surveymeans data=IceCreamStudy mean stacking;
       weight Weight;
       strata Grade;
       cluster StudyGroup;
       var Spending;
       ods output Statistics=Statistics Summary=Summary;
    run;

The following DATA step saves the sample mean of the variable Spending in a macro variable named Spending_Mean:

    data _null_;
       set Statistics;
       call symput("Spending_Mean",Spending_Mean);
    run;

The next DATA step saves the sum of the sampling weights in a macro variable named N and the number of strata in a macro variable named H:

    data Summary;
       set Summary;
       if Label1="Sum of Weights" then call symput("N",cValue1);
       if Label1="Number of Strata" then call symput("H",cValue1);
    run;

Step 2: Construct the Variable z by Using the Full-Sample Estimates of $\bar{y}$ and $N$

Construct the variable z in a DATA step by using the macro variables Spending_Mean and N:

    data Working;
       set IceCreamStudy;
       z=(1/(&N-1))*(Spending-&Spending_Mean)**2;
    run;

Step 3: Estimate the Total of z for the Full Sample

Use PROC SURVEYMEANS to estimate the weighted total of the variable z. Specify the SUM and STACKING options in the PROC SURVEYMEANS statement. Also specify the VARMETHOD=JACKKNIFE option with the OUTJKCOEFS= and OUTWEIGHTS= method-options. The OUTJKCOEFS= method-option saves the jackknife coefficients in a SAS data set named Jkcoefs. The OUTWEIGHTS= method-option saves the replicate weights in a SAS data set named Jkweights.

In this step you must fully specify the sampling design so that the jackknife coefficients and replicate weights are computed correctly. The STRATA statement specifies that the strata be identified by the variable Grade. The CLUSTER statement specifies that the PSUs be identified by the variable StudyGroup. The WEIGHT statement specifies that the full-sample sampling weights be contained in the variable Weight. The ODS OUTPUT statement saves the statistics table to a data set named Result and the variance estimation table to a data set named VarianceEstimation.
    proc surveymeans data=Working sum stacking
                     varmethod=JACKKNIFE(outjkcoefs=Jkcoefs outweights=Jkweights);
       strata Grade;
       cluster StudyGroup;
       weight Weight;
       var z;
       ods output Statistics=Result VarianceEstimation=VarianceEstimation;
    run;

    data _null_;
       set Result;
       call symput("Variance",z_Sum);
    run;

You can see from the "Variance Estimation" table in Output 2 that there are 16 replicates.

Output 2 Estimate of Population Variance

The SURVEYMEANS Procedure

Data Summary
Number of Strata            3
Number of Clusters          16
Number of Observations      40
Sum of Weights              3162.6

Variance Estimation
Method                      Jackknife
Number of Replicates        16

The next DATA step retrieves the number of replicates and stores the value in a macro variable named R:

    data _null_;
       set VarianceEstimation;
       where label1="Number of Replicates";
       call symput("R",cvalue1);
    run;

    %let R=%eval(&R);

The data set Jkcoefs has 16 observations, one for each replicate. The rth observation contains the jackknife coefficient for the rth replicate. The data set Jkweights contains the original variables from the IceCreamStudy data set and 16 new variables named RepWt_1 through RepWt_16; there are n = 40 observations.

Step 4: Compute $\hat{\bar{y}}_r$ and $\hat{N}_r$ for Replicate Samples

Before computing $\hat{\bar{y}}_r$ and $\hat{N}_r$, use the following DATA step to convert the data set Jkweights from wide form to long form; doing so enables you to use BY-group processing with PROC SURVEYMEANS.

    data Long(drop=RepWt_1 - RepWt_&R Z);
       set Jkweights;
       array num (*) RepWt_1 - RepWt_&R;
       do replicate=1 to dim(num);
          Jkweight=num(replicate);
          output;
       end;
    run;

The data set Long has 40 x 16 = 640 observations. There are 16 copies of the original variables from the IceCreamStudy data set stacked on top of each other, and each copy is identified by the variable Replicate. Instead of the 16 replicate weight variables, RepWt_1 through RepWt_16, there is now one variable, Jkweight, which is constructed by stacking the variables RepWt_1 through RepWt_16 on top of each other.
Thus, the first 40 observations contain a copy of the original variables and the contents of RepWt_1, and the variable Replicate has a value of 1. The second 40 observations contain a copy of the original variables and the contents of RepWt_2, and the variable Replicate has a value of 2. The remaining observations are constructed and identified similarly.

Next, sort the data set Long by Replicate:

    proc sort data=Long out=Long;
       by Replicate;
    run;

Use PROC SURVEYMEANS to estimate the mean of Spending by Replicate. Doing so produces the estimates of $\bar{y}$ and $N$ for each replicate. The WEIGHT statement specifies that the sampling weights be contained in the variable Jkweight. The ODS OUTPUT statement saves the sample means ($\hat{\bar{y}}_r$) in a SAS data set named JKMeans and saves the sums of the replicate weights ($\hat{N}_r$) in a data set named JKN. By default, the means are stored in a variable named Mean; the sums of the replicate weights are stored in the variable nValue1 of the data summary table, which is renamed to N in step 5.

    proc surveymeans data=Long mean;
       weight Jkweight;
       var Spending;
       by Replicate;
       ods output Statistics=JKMeans(keep=Replicate Mean) Summary=JKN;
    run;

Step 5: Construct the Variable z for the Replicate Samples

Before you can construct the variable z for the replicate samples, you must merge the data sets JKMeans and JKN with Long, by Replicate:

    proc sort data=JKMeans out=JKMeans;
       by Replicate;
    run;

    data JKN(keep=N replicate);
       set JKN(rename=(nvalue1=N));
       where Label1="Sum of Weights";
    run;

    proc sort data=JKN out=JKN;
       by Replicate;
    run;

    data Long;
       merge Long JKN JKMeans;
       by Replicate;
    run;

Now construct the variable z by using the merged data set:

    data Long;
       set Long;
       z=(1/(N-1))*(Spending-Mean)**2;
    run;

Step 6: Estimate the Total of z for Replicate Samples

Use PROC SURVEYMEANS to estimate the total of the variable z by Replicate. The WEIGHT statement specifies that the sampling weights be contained in the variable Jkweight. You do not need to specify the STRATA and CLUSTER statements.
The ODS OUTPUT statement saves the estimated totals in the variable JKEstimate in a SAS data set named Statistics. The estimated totals are the estimates $\hat{S}^2_r$ for each replicate.

    proc surveymeans data=Long sum stacking;
       weight Jkweight;
       var z;
       by Replicate;
       ods output Statistics=Statistics(drop=Z_StdDEV rename=(Z_Sum=JKEstimate));
    run;

Step 7: Construct the Variable u

Before you can construct the variable u, you must sort and merge, by Replicate, the data sets Statistics and Jkcoefs:

    proc sort data=Statistics out=Statistics;
       by Replicate;
    run;

    proc sort data=Jkcoefs out=Jkcoefs;
       by Replicate;
    run;

    data Statistics;
       merge Statistics Jkcoefs;
       by Replicate;
    run;

The data set Statistics now contains the jackknife coefficients $\alpha_r$ in the variable JKCoefficient and the estimates $\hat{S}^2_r$ in the variable JKEstimate. Construct the variable u by using these variables and the full-sample estimate $\hat{S}^2$ that is saved in the macro variable Variance:

    data Statistics;
       set Statistics;
       u=JKCoefficient*(JKEstimate-&Variance)**2;
    run;

Step 8: Estimate the Total of u

Use PROC SURVEYMEANS to compute the unweighted total of u. Specify the SUM option in the PROC SURVEYMEANS statement. The ODS OUTPUT statement saves the total in a variable named Variance in a SAS data set named JKResult.

    proc surveymeans data=Statistics sum;
       var u;
       ods output Statistics=JKResult(rename=(sum=Variance));
    run;

The following DATA step computes the standard error of the estimate and the upper and lower 95% confidence limits. In this example, the confidence limits are computed by using a t distribution with R - H = 16 - 3 = 13 degrees of freedom. The variable Estimate is generated and assigned the estimated value of $\hat{S}^2$, which is stored in the macro variable Variance. Labels are created for the existing variables. A new variable named Variable is generated, and its value is set to the name of the variable that is being analyzed (Spending).
    %let df=%eval(&R-&H);

    data JKResult;
       set JKResult;
       StdErr=sqrt(Variance);
       Estimate=&Variance;
       LowerCL=Estimate + StdErr*TINV(.025,&df);
       UpperCL=Estimate + StdErr*TINV(.975,&df);
       label Estimate="Population Variance Estimate"
             Variance="Variance of Estimate"
             StdErr="Standard Error of Estimate"
             LowerCL="Lower Confidence Limit"
             UpperCL="Upper Confidence Limit";
       Variable='Spending';
    run;

Use the PRINT procedure to print the contents of the data set JKResult:

    title 'Parameter Estimates';
    proc print data=JKResult label noobs;
       var Variable Estimate Variance StdErr LowerCL UpperCL;
    run;
    title;

Output 3 displays the results. The estimate of the population variance for the variable Spending is 28.46. The variance of the estimate is 30.27, and the standard error of the estimate is 5.50. The estimated lower and upper 95% confidence limits are 16.57 and 40.35, respectively.

Output 3 Estimate of Population Variance

Parameter Estimates

Variable   Population Variance   Variance of   Standard Error   Lower Confidence   Upper Confidence
           Estimate              Estimate      of Estimate      Limit              Limit
Spending   28.4604               30.267500     5.50159          16.5750            40.3459

Using the BRR Method to Estimate $V(\hat{S}^2)$

The BRR method requires that the full sample be drawn by using a stratified sample design with two PSUs per stratum. If H is the total number of strata, the total number of replicates R is the smallest multiple of four that is greater than H. Each replicate is obtained by deleting one PSU per stratum according to the corresponding Hadamard matrix and adjusting the original weights for the remaining PSUs. The new weights are called replicate weights.

If $\hat{S}^2_r$ is the estimate of $S^2$ obtained by using only the data and the replicate weights from the rth replicate, the BRR variance estimate is

$$\hat{V}_{BRR}(\hat{S}^2) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{S}^2_r - \hat{S}^2\right)^2$$

with H degrees of freedom. See the section "Balanced Repeated Replication (BRR) Method" in the chapter "The SURVEYMEANS Procedure" of the SAS/STAT User's Guide for more details.
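The replicate-count rule and the BRR variance formula described in this section are simple enough to check by hand. The following Python sketch restates them as two small functions; Python is used only as a neutral calculator here, and the function names and the sample numbers are hypothetical and not part of SAS.

```python
def brr_replicate_count(num_strata):
    """Smallest multiple of four that is strictly greater than the number of strata."""
    return 4 * (num_strata // 4 + 1)


def brr_variance(full_estimate, replicate_estimates):
    """BRR variance: mean squared deviation of the replicate estimates
    from the full-sample estimate."""
    r = len(replicate_estimates)
    return sum((est - full_estimate) ** 2 for est in replicate_estimates) / r


# A design with six strata needs eight replicates; four strata also need eight,
# because the multiple of four must exceed the number of strata.
print(brr_replicate_count(6))
print(brr_replicate_count(4))

# Hypothetical replicate estimates around a full-sample estimate of 10.
print(brr_variance(10.0, [9.0, 11.0, 12.0, 8.0]))
```

The strict inequality matters: when the number of strata is itself a multiple of four, the rule as stated in the text moves to the next multiple.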
Recall that when you construct $z_k$, you use estimates of $\bar{y}$ and $N$ that are computed by using the full sample. However, the BRR variance estimator requires that the $\hat{S}^2_r$ be computed from the rth replicate. Thus, the BRR estimate of the variance of the total of z is not equal to the BRR estimate of the variance of $\hat{S}^2$.

The following steps summarize how you estimate $S^2$, the finite population variance of a variable y, and $V(\hat{S}^2)$, the variance of the finite population variance estimator (using the BRR method):

1. Use PROC SURVEYMEANS to estimate the sample mean and the sum of the weights for the full sample. Save both estimates for later use; they are used in the construction of z.
2. Construct $z_k$ as in equation (3) by using the full-sample estimates of $\bar{y}$ and $N$ obtained in step 1.
3. Use PROC SURVEYMEANS to estimate the weighted total of the variable z, and save the estimated total. This total is the full-sample estimate of the population variance ($\hat{S}^2$). When you estimate the total, specify the VARMETHOD=BRR option and the OUTWEIGHTS= method-option in the PROC SURVEYMEANS statement. The OUTWEIGHTS= SAS data set is used in later steps. Also save the number of strata H and the number of replicates R for later use.
4. For each replicate, use PROC SURVEYMEANS to estimate the sample mean and the sum of the weights by using only the data and replicate weights for the rth replicate. Save the estimates for later use.
5. For each replicate, using the estimates for $\bar{y}$ and $N$ that were obtained in step 4, construct the variable z such that

$$z_{rk} = \frac{1}{\hat{N}_r-1}\left(y_k-\hat{\bar{y}}_r\right)^2$$

6. Use PROC SURVEYMEANS to estimate the weighted total of z by replicate, and save the estimates for later use. The estimated weighted total of $z_r$ is equal to $\hat{S}^2_r$ for the rth replicate.
7. Construct a variable (call it u) by using the estimates from step 6, the number of replicates R, and the full-sample estimate from step 3 such that

$$u_r = \frac{1}{R}\left(\hat{S}^2_r - \hat{S}^2\right)^2$$

8. Use PROC SURVEYMEANS to estimate the unweighted total of the variable u from step 7. The estimated unweighted total of u is $\hat{V}_{BRR}(\hat{S}^2)$, the BRR estimate of the variance of $\hat{S}^2$.
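The point of steps 4 through 6 above is that the mean and the sum of the weights are recomputed inside every replicate before z is built. A minimal Python sketch of that flow follows. Everything in it is an assumption made for illustration: the eight observations, the four replicate weight vectors (which stand in for an OUTWEIGHTS= data set and simply double the weights of one PSU per stratum while zeroing the other), and the function name.

```python
def variance_estimate(y, weights):
    """Estimate S^2 from one set of weights: recompute the sum of weights and
    the weighted mean, build z, and return the weighted total of z."""
    n_hat = sum(weights)
    y_bar = sum(wk * yk for wk, yk in zip(weights, y)) / n_hat
    z = [(yk - y_bar) ** 2 / (n_hat - 1) for yk in y]
    return sum(wk * zk for wk, zk in zip(weights, z))


# Hypothetical data: two strata, two PSUs per stratum, two observations per PSU.
y = [2.0, 4.0, 6.0, 3.0, 9.0, 7.0, 5.0, 8.0]
w = [5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0]

# Hypothetical replicate weights (one row per replicate).
rep_weights = [
    [10.0, 10.0, 0.0, 0.0, 8.0, 8.0, 0.0, 0.0],
    [10.0, 10.0, 0.0, 0.0, 0.0, 0.0, 8.0, 8.0],
    [0.0, 0.0, 10.0, 10.0, 8.0, 8.0, 0.0, 0.0],
    [0.0, 0.0, 10.0, 10.0, 0.0, 0.0, 8.0, 8.0],
]

# Steps 1-3: full-sample estimate of S^2.
s2_full = variance_estimate(y, w)

# Steps 4-6: one estimate of S^2 per replicate, each using its own mean and N.
s2_reps = [variance_estimate(y, rw) for rw in rep_weights]

# Steps 7-8: build u for each replicate and take its unweighted total.
R = len(rep_weights)
u = [(s2_r - s2_full) ** 2 / R for s2_r in s2_reps]
var_brr = sum(u)

print(f"Full-sample estimate of S^2: {s2_full:.4f}")
print(f"BRR variance of the estimate: {var_brr:.6f}")
```

Calling the same function once with the full-sample weights and once per replicate keeps the full-sample and replicate estimates on the same footing, which is what the BY-group processing in the SAS steps achieves.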
Example

The MUNIsurvey Data Set

This example uses the MUNIsurvey data set from the section "Variance Estimation Using Replication Methods" in the chapter "The SURVEYMEANS Procedure" of the SAS/STAT User's Guide. The data are not shown here, but a SAS program that generates the data is included in the sample SAS code that you can download for this example.

In the original example, the San Francisco Municipal Railway (MUNI) conducted a survey to estimate the average waiting time for passengers of the MUNI subway system. This example estimates the variance of the passengers’ waiting time.

The study uses a stratified cluster sample design. Each MUNI subway line is a stratum. The subway lines included in the study are 'J-Church,' 'K-Ingleside,' 'L-Taraval,' 'M-Ocean View,' 'N-Judah,' and the street car 'F-Market & Wharves.' The MUNI vehicles in service for these lines during a day are the primary sampling units. Within each stratum, two vehicles (PSUs) are randomly selected. Then the waiting times of passengers for a selected MUNI vehicle are collected.

The collected data are saved in the SAS data set MUNIsurvey. Table 3 identifies the variables contained in the data set.

Table 3 Variables in MUNIsurvey Data Set

Variable   Description
Line       The MUNI line that a passenger is riding (strata)
Vehicle    The vehicle that a passenger is boarding (PSU)
Waittime   The time (in minutes) that a passenger waited
Weight     Sampling weights

Step 1: Compute $\hat{\bar{y}}$ and $\hat{N}$ for the Full Sample

Use PROC SURVEYMEANS to obtain estimates of the sample mean ($\hat{\bar{y}}$) and the sum of the sampling weights ($\hat{N}$) for the full sample. Specify the MEAN and STACKING options in the PROC SURVEYMEANS statement. The WEIGHT statement specifies that the variable Weight contains the sampling weights. The STRATA statement specifies that the variable Line identifies stratum membership. The CLUSTER statement specifies that the variable Vehicle identifies PSU or cluster membership.
The ODS OUTPUT statement produces output data sets for the statistics and data summary tables, to be named Statistics and Summary, respectively. The sample mean is stored in the data set Statistics, and the sum of the sampling weights is stored in the data set Summary.

    proc surveymeans data=MUNIsurvey mean stacking;
       weight Weight;
       strata Line;
       cluster Vehicle;
       var Waittime;
       ods output Statistics=Statistics Summary=Summary;
    run;

The following DATA step saves the sample mean ($\hat{\bar{y}}$) of the variable Waittime in a macro variable named Waittime_Mean:

    data _null_;
       set Statistics;
       call symput("Waittime_Mean",Waittime_Mean);
    run;

The next DATA step saves the sum of the sampling weights in a macro variable named N and the number of strata in a macro variable named H:

    data Summary;
       set Summary;
       if Label1="Sum of Weights" then call symput("N",cValue1);
       if Label1="Number of Strata" then call symput("H",cValue1);
    run;

Step 2: Construct the Variable z by Using the Full-Sample Estimates of $\bar{y}$ and $N$

Construct the variable z in a DATA step by using the macro variables Waittime_Mean and N:

    data Working;
       set MUNIsurvey;
       Z=(1/(&N-1))*(Waittime-&Waittime_Mean)**2;
    run;

Step 3: Estimate the Total of z for the Full Sample

Use PROC SURVEYMEANS to estimate the total of the variable z. Specify the SUM and STACKING options in the PROC SURVEYMEANS statement. Also specify the VARMETHOD=BRR(OUTWEIGHTS=) option. The OUTWEIGHTS= method-option saves the replicate weights in a SAS data set named BRRweights.

In this step you must fully specify the sampling design so that the replicate weights are computed correctly. The STRATA statement specifies that the strata be identified by the variable Line. The CLUSTER statement specifies that the PSUs be identified by the variable Vehicle. The WEIGHT statement specifies that the full-sample sampling weights be contained in the variable Weight.
The ODS OUTPUT statement saves the statistics table to a data set named Estimate and the variance estimation table to a data set named VarianceEstimation.

    proc surveymeans data=Working sum stacking
                     varmethod=brr(outweights=BRRweights);
       strata Line;
       cluster Vehicle;
       weight Weight;
       var z;
       ods output Statistics=Estimate VarianceEstimation=VarianceEstimation;
    run;

Output 4 Estimate of Population Variance

The SURVEYMEANS Procedure

Data Summary
Number of Strata            6
Number of Clusters          12
Number of Observations      1937
Sum of Weights              143040

Variance Estimation
Method                      BRR
Number of Replicates        8

You can see from Output 4 that there are eight replicates and 1,937 observations. The data set BRRweights contains the original variables from the MUNIsurvey data set and eight new variables named RepWt_1 through RepWt_8.

The following DATA step retrieves the estimated total of the variable z and stores it in a macro variable named Variance. The total of the variable z is equal to $\hat{S}^2$.

    data _null_;
       set Estimate;
       call symput("Variance",Z_Sum);
    run;

The next DATA step retrieves the number of replicates and stores the value in a macro variable named R. The number of replicates is used later to construct the variable u.

    data _null_;
       set VarianceEstimation;
       where label1="Number of Replicates";
       call symput("R",cvalue1);
    run;

    %let R=%eval(&R);

Step 4: Compute $\hat{\bar{y}}_r$ and $\hat{N}_r$ for Replicate Samples

Before computing $\hat{\bar{y}}_r$ and $\hat{N}_r$, use the following DATA step to convert the data set BRRweights from wide form to long form; doing so enables you to use BY-group processing with PROC SURVEYMEANS.

    data Long(drop=RepWt_1 - RepWt_&R Z);
       set BRRweights;
       array num (*) RepWt_1 - RepWt_&R;
       do replicate=1 to dim(num);
          BRRweight=num(replicate);
          output;
       end;
    run;

The data set Long has 1,937 x 8 = 15,496 observations. There are eight copies of the original variables from the MUNIsurvey data set stacked on top of each other, and each copy is identified by the variable Replicate.
Instead of the eight replicate weight variables, RepWt_1 through RepWt_8, there is now one variable, BRRweight, which is constructed by stacking the variables RepWt_1 through RepWt_8 on top of each other. Thus, the first 1,937 observations contain a copy of the original variables and the contents of RepWt_1, and the variable Replicate has a value of 1. The second 1,937 observations contain a copy of the original variables and the contents of RepWt_2, and the variable Replicate has a value of 2. The remaining observations are constructed and identified similarly.

Next, sort the data set Long by Replicate:

    proc sort data=Long out=Long;
       by Replicate;
    run;

Use PROC SURVEYMEANS to estimate the mean of Waittime by Replicate. Doing so produces the estimates of $\bar{y}$ and $N$ for each replicate. The WEIGHT statement specifies that the sampling weights be contained in the variable BRRweight. The ODS OUTPUT statement saves the sample means in a SAS data set named BRRMeans and the sums of the replicate weights in a data set named BRRN.

    proc surveymeans data=Long mean;
       weight BRRweight;
       var Waittime;
       by Replicate;
       ods output Statistics=BRRMeans(keep=Replicate Mean) Summary=BRRN;
    run;

Step 5: Construct the Variable z

Before you can construct the variable z, you must merge the data sets BRRMeans and BRRN with Long by Replicate:

    proc sort data=BRRMeans out=BRRMeans;
       by Replicate;
    run;

    data BRRN(keep=N replicate);
       set BRRN(rename=(nvalue1=N));
       where Label1="Sum of Weights";
    run;

    proc sort data=BRRN out=BRRN;
       by Replicate;
    run;

    data Long;
       merge Long BRRN BRRMeans;
       by Replicate;
    run;

Now construct the variable z by using the merged data set:

    data Long;
       set Long;
       z=(1/(N-1))*(Waittime-Mean)**2;
    run;

Step 6: Estimate the Total of z for the Replicate Samples

Use PROC SURVEYMEANS to estimate the total of the variable z by Replicate. The WEIGHT statement specifies that the variable BRRweight contains the sampling weights. You do not need to specify the STRATA and CLUSTER statements.
The ODS OUTPUT statement saves the estimated totals in the variable BRREstimate in a SAS data set named Statistics. The estimated totals are the estimates $\hat{S}^2_r$ for each replicate.

    proc surveymeans data=Long sum stacking;
       weight BRRweight;
       var z;
       by Replicate;
       ods output Statistics=Statistics(drop=Z_StdDEV rename=(Z_Sum=BRREstimate));
    run;

Step 7: Construct the Variable u

Construct the variable u by using the variable BRREstimate, the number of replicates that is saved in the macro variable R, and the full-sample estimate that is saved in the macro variable Variance:

    data Statistics;
       set Statistics;
       u=(1/&R)*(BRREstimate-&Variance)**2;
    run;

Step 8: Estimate the Total of u

Use PROC SURVEYMEANS to compute the unweighted total of u. Specify the SUM option in the PROC SURVEYMEANS statement. The ODS OUTPUT statement saves the total in a variable named Variance in a SAS data set named BRRResult.

    proc surveymeans data=Statistics sum;
       var u;
       ods output Statistics=BRRResult(rename=(sum=Variance));
    run;

The following DATA step computes the standard error of the estimate and the upper and lower 95% confidence limits. The confidence limits for this example are computed by using a t distribution with H = 6 degrees of freedom. The variable Estimate is generated and assigned the estimated value of $\hat{S}^2$, which is stored in the macro variable Variance. The data set is also prepared for printing.

    data BRRResult;
       set BRRResult;
       StdErr=sqrt(Variance);
       Estimate=&Variance;
       LowerCL=Estimate + StdErr*TINV(.025,&H);
       UpperCL=Estimate + StdErr*TINV(.975,&H);
       Variable='Waittime';
       label Estimate="Population Variance Estimate"
             Variance="Variance of Estimate"
             StdErr="Standard Error of Estimate"
             LowerCL="Lower Confidence Limit"
             UpperCL="Upper Confidence Limit";
    run;

Use the PRINT procedure to print the contents of the data set BRRResult:

    title 'Parameter Estimates';
    proc print data=BRRResult label noobs;
       var Variable Estimate Variance StdErr LowerCL UpperCL;
    run;
    title;

Output 5 displays the results. The estimate of the population variance for the variable Waittime is 18.02. The variance of the estimate is 2.17, and the standard error of the estimate is 1.47.
The estimated lower and upper 95% confidence limits are 14.41 and 21.63, respectively.

Output 5 Estimate of Population Variance

Parameter Estimates

| Variable | Population Variance Estimate | Variance of Estimate | Standard Error of Estimate | Lower Confidence Limit | Upper Confidence Limit |
| Waittime | 18.0196 | 2.172780 | 1.47404 | 14.4128 | 21.6264 |

Appendix: Estimating the Finite Population Standard Deviation and Computing Its Variance by Using the Delta Method

After you have an estimate $\hat{S}^2$ of the finite population variance of a variable and a design-based estimator $\hat{V}(\hat{S}^2)$ of its variance, you can estimate the finite population standard deviation of the variable and a design-based estimator of its variance by means of a simple transformation. Specifically, an estimator of the finite population standard deviation is

$$\hat{S} = \sqrt{\hat{S}^2}$$

and, by application of the so-called delta method, an estimator of the variance of $\hat{S}$ is

$$V(\hat{S}) \approx \left[ g'(S^2) \right]^2 V(\hat{S}^2)$$

where $g'(S^2) = 1/(2\sqrt{S^2})$ is the derivative of $g(S^2) = \sqrt{S^2}$ with respect to $S^2$ evaluated at $S^2$. Substituting the sample-based estimators $\hat{S}^2$ and $\hat{V}(\hat{S}^2)$ for $S^2$ and $V(\hat{S}^2)$, respectively, yields

$$\hat{S} = \sqrt{\hat{S}^2} \quad \text{and} \quad \hat{V}(\hat{S}) = \frac{1}{4\hat{S}^2}\,\hat{V}(\hat{S}^2)$$

Example

Consider the BRR example provided earlier in this article. The estimation results are stored in the data set BRRResult. To compute the finite population standard deviation, its variance, and confidence limits, perform the transformations in the following DATA step. Note that the order of the first two assignment statements is critical: the variance must be computed from the untransformed value of Estimate (the variance estimate) before Estimate is overwritten with its square root.

data BRRStdDev; set BRRResult; Variance=(1/(4*Estimate))*Variance; Estimate=sqrt(Estimate); StdErr=sqrt(Variance); LowerCL= Estimate + StdErr*TINV(.025,&H); UpperCL= Estimate + StdErr*TINV(.975,&H); label Estimate=Population Standard Deviation Estimate; run;

Use the PRINT procedure to print the contents of the data set BRRStdDev:

title 'Parameter Estimates'; proc print data=BRRStdDev label noobs; var Variable Estimate Variance StdErr LowerCL UpperCL; run; title ;

Output 6 displays the results. The estimate of the population standard deviation for the variable Waittime is 4.24.
The variance of the estimate is 0.03, and the standard error of the estimate is 0.17. The estimated lower and upper 95% confidence limits are 3.82 and 4.67, respectively.

Output 6 Estimate of Population Standard Deviation

Parameter Estimates

| Variable | Population Standard Deviation Estimate | Variance of Estimate | Standard Error of Estimate | Lower Confidence Limit | Upper Confidence Limit |
| Waittime | 4.24495 | 0.030145 | 0.17362 | 3.82011 | 4.66979 |

References

Courbois, J.-Y. P. and Urquhart, N. S. (2004), “Comparison of Survey Estimates of the Finite Population Variance,” Journal of Agricultural, Biological, and Environmental Statistics, 9(2), 236–251.

Särndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

AlexBeaver · ‎12-18-2023

Overview

PROC SPP and other spatial analysis procedures in SAS/STAT are designed to handle projected coordinate systems, where the distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ can be computed using the Euclidean formula, $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. If your data are collected in a spherical coordinate system (for example, longitude and latitude), then you should convert them to a projected system before applying PROC SPP. This example walks you through a sequence of steps that demonstrate how to handle data that have spherical coordinates in order to analyze them by using PROC SPP.

Example

You are a geologist studying the relationship between the locations of earthquakes and the locations of geothermal activity in the western United States. You have earthquake data, courtesy of the United States Geological Survey (USGS), and data about hot springs, courtesy of the National Oceanic and Atmospheric Administration (NOAA).

The Earthquakes data set is the collection of earthquakes with magnitude greater than 2.5 (on the Richter scale) in the continental United States collected from 2005 to 2015. In addition to the latitude and longitude of the epicenter of each earthquake, the data include other attributes, such as the earthquake’s magnitude. The following DATA step reads the data and creates the data set Earthquakes:

data earthquakes; length Type $ 10; infile "https://support.sas.com/rnd/app/data/earthquakes.txt" url; input Latitude Longitude Depth Magnitude dNearestStation RootMeanSquareTime Type $; run;

The Hotsprings data set is the collection of hot spring locations in the continental United States. Again, along with the latitude and longitude of each hot spring, the data include other attributes, such as its temperature and popular name.
The following DATA step reads the data and creates the data set Hotsprings: data hotsprings; length Type $ 10; infile "https://support.sas.com/rnd/app/data/hotsprings.txt" url; input Latitude 1-6 Longitude 8-15 TemperatureFarenheit $ 17-19 TemperatureCelsius $ 21-23; Type = "hotspring"; run; You can view both of the data sets as spatial point patterns that are given in spherical coordinates. To explore whether the locations of hot springs and earthquakes are correlated, you first merge the two data sets into a single marked spatial point pattern, with a type variable to denote an earthquake, explosion, landslide, or hot spring, as shown in the following code. data QuakesAndSprings; set earthquakes hotsprings; where (type in ('hotspring','earthquake')); run; The combined data set, QuakesAndSprings, contains locations in spherical coordinates, given as Longitude and Latitude. You can use the GPROJECT procedure in SAS/GRAPH® to transform these spherical coordinates to projected coordinates. PROC GPROJECT requires that the data have an identification variable and that the spherical coordinates be named X and Y, respectively. The following statements prepare the data, apply PROC GPROJECT, and then prepare the resulting projected data for analysis by PROC SPP: data GProjectIn; set QuakesAndSprings; ID = _N_; rename latitude=y longitude=x; run; proc gproject data=GProjectIn out=GProjectOut degrees; id ID; run; data GProjectOut; set GProjectOut; format _character_; informat _character_; run; The resulting data set, GProjectOut, contains the data in projected coordinates. 
In this form, you can use the projected data with PROC SPP to analyze the relationship between earthquakes and hot springs by using the following statements:

ods graphics on;
proc spp data=GProjectOut plots(unpack)=(all observ(attr=mark)); process p = (X,Y / mark=type) / G cross=types('hotspring','earthquake') maxdist=max; run;

Figure 1: Projected locations of earthquakes and hot springs

Figure 1 shows the locations of earthquakes, hot springs, landslides and explosions of different kinds.

Figure 2: Cross G-function between Hot Spring and Earthquake

Figure 2 shows the plot of the edge-corrected cross G-function computed between earthquakes and hot springs. The blue line, which represents the empirical cross G-function, is far above the dashed red line. The confidence band of the cross G-function, shown as the blue band around the blue line, also does not intersect the dashed red line. This suggests that earthquakes are indeed clustered around hot spring locations.
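The Overview's point about Euclidean distance can be checked with a small experiment. The following sketch is not part of the original analysis: the data set DistanceCheck and its two pairs of points are hypothetical, chosen only to illustrate the issue. Each pair is exactly one degree of longitude apart, once at latitude 30 and once at latitude 45. The Euclidean formula applied to raw degrees returns the same value for both pairs, whereas the Base SAS GEODIST function, which returns the geodetic distance (here in kilometers), shows that the second pair is physically closer.

```sas
data DistanceCheck;
   /* Hypothetical points: one degree of longitude apart at two latitudes */
   do Lat = 30, 45;
      Lat1 = Lat;  Lon1 = -120;
      Lat2 = Lat;  Lon2 = -119;
      /* Euclidean formula applied directly to degrees (not a real distance) */
      EuclidDeg = sqrt((Lon1 - Lon2)**2 + (Lat1 - Lat2)**2);
      /* Geodetic distance in kilometers; arguments are in degrees by default */
      GreatCircleKm = geodist(Lat1, Lon1, Lat2, Lon2, 'K');
      output;
   end;
   drop Lat;
run;

proc print data=DistanceCheck noobs;
run;
```

Because a degree of longitude shrinks toward the poles, distance-based summaries such as the cross G-function would be distorted if they were computed on unprojected longitude and latitude.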

AlexBeaver · ‎12-18-2023

Overview

Model-based clustering is one of the many uses for finite mixture models and SAS/STAT software’s FMM procedure. The finite mixture model approach to clustering assumes that the observations to be clustered are drawn from a mixture of a specified number of populations in varying proportions (McLachlan and Basford 1988). After the finite mixture model is fit to estimate the model parameters and the posterior probabilities of population membership, each observation is assigned membership to the population for which it has the highest estimated posterior probability of belonging.

As with any method of cluster analysis, the practitioner faces the problem of assessing the accuracy of the population membership allocations that are obtained, because in practice, population membership is not observed. To address this assessment problem, Basford and McLachlan (1985) propose a method of estimating the correct allocation rates for individual populations and for the overall mixture that is based on averaging appropriate functions of the maximum posterior probabilities. Because the proposed estimators for the allocation rates are known to be biased, bootstrap methods are used to estimate the bias, enabling the production of bias-adjusted estimates of the correct allocation rates.

This example shows how to produce the bias-adjusted allocation rate estimates by using the FMM procedure and a little SAS programming. The SAS output in this example is generated using SAS/STAT 12.1.

Analysis

The underlying premise of the finite mixture model approach to clustering is that the population from which the response variable of interest is sampled can be partitioned into J mutually exclusive subpopulations. It is assumed that there is a distinct probability distribution with a known parametric form associated with each subpopulation.
These subpopulation probability distributions govern the distribution of values of the response variable, given that an observation originates from a particular subpopulation. Thus, the marginal distribution of the response variable is a mixture of J distinct distributions. The prior probability that an observation is drawn from a particular subpopulation is equal to the proportion of the overall population that is a member of that particular subpopulation.

Finite mixture models represent the marginal distribution of the response variable as a linearly weighted sum of component probability distributions:

$$f(y) = \sum_{j=1}^{J} \pi_j f_j(y; \theta_j)$$

The component (subpopulation) distributions, $f_j(y; \theta_j)$, can be discrete or continuous distributions; $\theta_j$ is a vector of parameters for the jth component probability distribution. The mixing probabilities, $\pi_j$, measure the prior probabilities of component (subpopulation) membership.

Given a realization $y_i$ of the response variable and the parameters of the component distributions, the posterior probability that the ith observation is a member of the jth subpopulation is

$$\tau_{ij} = \frac{\pi_j f_j(y_i; \theta_j)}{\sum_{k=1}^{J} \pi_k f_k(y_i; \theta_k)}$$

Each realization is assigned membership to the subpopulation to which it has the highest posterior probability of belonging. Because $\sum_{j=1}^{J} \hat{\tau}_{ij} = 1$, the maximum estimated posterior probabilities have a range of (1/J, 1). If the maximum estimated posterior probabilities are near 1 for most of the observations in a sample, this is compelling evidence that the finite mixture model can cluster the sample with a high degree of certainty. As the maximum estimated posterior probabilities approach 1/J, this indicates that the components of the fitted mixture model are too close together for the sample to be clustered with any certainty.

A visual inspection of a graph of the maximum posterior probabilities can be informative, but such assessments are necessarily subjective. It would be useful to have a summary statistic that provides an objective measure of how well the data have been clustered.
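The posterior probability calculation described above can be sketched in a short DATA step. This is a minimal illustration, not the FMM procedure's own computation: the data set PosteriorSketch and the values of pi1, lambda1, and lambda2 are hypothetical, and a two-component Poisson mixture is assumed. For each count y, the step evaluates the two weighted component densities, normalizes them to obtain the posterior probabilities, and records the maximum posterior probability and the implied component assignment.

```sas
data PosteriorSketch;
   /* Hypothetical mixing probability and Poisson means (illustration only) */
   pi1 = 0.8;  lambda1 = 2;  lambda2 = 6;
   do y = 0 to 10;
      f1 = pdf('POISSON', y, lambda1);        /* component 1 density at y */
      f2 = pdf('POISSON', y, lambda2);        /* component 2 density at y */
      denom = pi1*f1 + (1 - pi1)*f2;          /* mixture density at y     */
      Post1 = pi1*f1 / denom;                 /* posterior for component 1 */
      Post2 = (1 - pi1)*f2 / denom;           /* posterior for component 2 */
      Maxpost = max(Post1, Post2);            /* maximum posterior probability */
      Class = ifn(Post1 >= Post2, 1, 2);      /* assigned component */
      output;
   end;
run;

proc print data=PosteriorSketch noobs;
   var y Post1 Post2 Maxpost Class;
run;
```

With J = 2, the printed values of Maxpost never fall below 1/2, and they are closest to 1/2 for the counts at which the two weighted densities are nearly equal.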
The correct allocation rates measure the proportion of observations that have been allocated to the correct subpopulation according to the maximum posterior probability criterion. If you could observe component (subpopulation) membership, computing the correct allocation rates would be straightforward. To do so you would create three sets of indicator variables.

Denote the first set as $z_1, \ldots, z_J$, where each $z_j$ is associated with a particular subpopulation. For each observation in the sample, set $z_{ij} = 1$ if the observation originates from the jth subpopulation, and set $z_{ij} = 0$ otherwise.

Denote the second set as $\hat{z}_1, \ldots, \hat{z}_J$, where each variable similarly corresponds to a particular subpopulation. For each observation in the sample, set $\hat{z}_{ij} = 1$ if $\hat{\tau}_{ij} = \max_k \hat{\tau}_{ik}$ and set $\hat{z}_{ij} = 0$ otherwise, where $\hat{\tau}_{ij}$ is the estimated posterior probability of subpopulation membership.

Denote the third set as $d_1, \ldots, d_J$, where $d_{ij} = 1$ if $z_{ij} = \hat{z}_{ij}$ and $d_{ij} = 0$ otherwise.

The correct allocation rate for the jth subpopulation is computed as

$$A_j = \frac{1}{n_j} \sum_{i=1}^{n} z_{ij} d_{ij}$$

where

$$n_j = \sum_{i=1}^{n} z_{ij}$$

The true overall correct allocation rate for the mixture is

$$A = \frac{1}{n} \sum_{j=1}^{J} n_j A_j$$

Because the correct allocation rates $A_j$ and $A$ depend on the unobserved indicator variables z, in practice they must be estimated. Ganesalingam and McLachlan (1980) propose estimating the overall correct allocation rate by

$$T = \frac{1}{n} \sum_{i=1}^{n} \max_j \hat{\tau}_{ij}$$

Basford and McLachlan (1985) propose estimating the individual correct allocation rates by

$$T_j = \frac{1}{n \hat{\pi}_j} \sum_{i=1}^{n} \hat{z}_{ij} \hat{\tau}_{ij}$$

If the parameters $\theta$ of the finite mixture model are estimated consistently, then the biases, $T - A$ and $T_j - A_j$, converge in probability to 0 as $n \to \infty$. However, results from both McLachlan and Basford (1988) and Basford and McLachlan (1985) indicate that $T$ and $T_j$ tend to overestimate the correct allocation rates in finite samples, and so some method of bias correction is recommended. Basford and McLachlan (1985) propose using the parametric bootstrap method to estimate the bias in order to produce bias-adjusted estimates of the correct allocation rates.
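To make the difference between the true rates and their estimators concrete, the following sketch computes both for a tiny artificial sample in which the component of origin is known. Everything here is hypothetical: the data set Toy, its six observations, and the mixing probability stored in the macro variable Pi1 are invented for illustration, and J = 2 is assumed. The variable Group plays the role of the unobserved indicators, Class is the allocation by maximum posterior probability, and PROC SQL evaluates the rates directly from their definitions.

```sas
%let Pi1 = 0.5;   /* hypothetical estimated mixing probability for component 1 */

data Toy;
   input Group Post1;   /* Group = true component, Post1 = estimated posterior for component 1 */
   Post2   = 1 - Post1;
   Class   = ifn(Post1 >= Post2, 1, 2);   /* allocation by maximum posterior */
   Maxpost = max(Post1, Post2);
   datalines;
1 0.95
1 0.80
1 0.40
2 0.30
2 0.10
2 0.60
;

proc sql;
   create table Rates as
   select sum(Group=1 and Class=1) / sum(Group=1)            as A1,  /* true rate, component 1 */
          sum(Group=2 and Class=2) / sum(Group=2)            as A2,  /* true rate, component 2 */
          sum(Group=Class) / count(*)                        as A,   /* true overall rate      */
          sum(Maxpost) / count(*)                            as T,   /* estimated overall rate */
          sum(Maxpost*(Class=1)) / (count(*)*&Pi1)           as T1,  /* estimated rate, comp. 1 */
          sum(Maxpost*(Class=2)) / (count(*)*(1 - &Pi1))     as T2   /* estimated rate, comp. 2 */
   from Toy;
quit;

proc print data=Rates noobs;
run;
```

In real applications only T, T1, and T2 can be computed, because Group is not observed; that is the gap the parametric bootstrap is meant to fill.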
The parametric bootstrap method works because, although you cannot observe the subpopulation of origin in the original sample, you can observe the subpopulation of origin in the bootstrap samples. This makes it possible to compute the overall correct allocation rate $A$ and the subpopulation correct allocation rates $A_j$ for each bootstrap sample. You can then estimate the bias by taking the average of the differences $T - A$ and $T_j - A_j$ over the bootstrap samples.

Conceptually, there are five steps in the process of computing the bias-adjusted estimates of the correct allocation rates by using the parametric bootstrap method:

1. Generate K parametric bootstrap samples. To do this, generate K replicates of the original data set. For each observation in the K data sets, generate a pseudorandom variable that takes the values 1,...,J with probabilities $\hat{\pi}_1, \ldots, \hat{\pi}_J$. Replace the response variable observations with a pseudorandom variable that has a mixture distribution equivalent to that of the finite mixture model that was fitted by using the original data.

2. Fit a finite mixture model to each of the K bootstrap samples by using the model specification that was applied to the original sample, and compute the posterior probabilities of component membership.

3. Compute $A$, $A_j$, $T$, and $T_j$ for each of the bootstrap samples.

4. Estimate the bias of $T$ and the standard error of the bias estimate as

$$b = \frac{1}{K} \sum_{k=1}^{K} (T_k - A_k), \qquad SE(b) = \sqrt{\frac{1}{K(K-1)} \sum_{k=1}^{K} (T_k - A_k - b)^2}$$

Then estimate the bias of each $T_j$ and the standard error of the bias estimate as

$$b_j = \frac{1}{K} \sum_{k=1}^{K} (T_{jk} - A_{jk}), \qquad SE(b_j) = \sqrt{\frac{1}{K(K-1)} \sum_{k=1}^{K} (T_{jk} - A_{jk} - b_j)^2}$$

5. Compute the bias-adjusted estimate of the overall correct allocation rate as

$$T^* = T - b$$

Then compute the bias-adjusted estimates of the subpopulation correct allocation rates as

$$T^*_j = T_j - b_j, \qquad j = 1, \ldots, J$$

You can also compute the root mean square errors for $T$ and $T_j$ to use as measures of precision:

$$RMSE(T) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (T_k - A_k)^2}$$

and

$$RMSE(T_j) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (T_{jk} - A_{jk})^2}$$

The ratio of the bias of an estimator to its root mean square error can be used to indicate how serious a problem bias is for a given estimator. If the ratio $b / RMSE(T) < 0.25$, then bias is probably not a serious problem.
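The arithmetic in steps 4 and 5 can be sketched on a handful of made-up numbers. The data set BootToy below is hypothetical: it holds K = 4 invented pairs of estimated and true overall rates, as if they came from four bootstrap samples, and the value 0.89 used for the original-sample estimate is likewise an assumption. The sketch computes the bias estimate, its standard error, the root mean square error, the bias-to-RMSE ratio, and the bias-adjusted rate.

```sas
data BootToy;
   input T A;        /* hypothetical estimated and true overall rates per bootstrap sample */
   B    = T - A;     /* per-sample difference */
   Bsqr = B**2;
   datalines;
0.90 0.88
0.87 0.89
0.91 0.86
0.88 0.87
;

proc means data=BootToy noprint;
   var B Bsqr;
   output out=BiasToy(drop=_TYPE_ _FREQ_)
          mean(B)=Bias var(B)=VarB mean(Bsqr)=MeanBsqr n(B)=K;
run;

data BiasToy;
   set BiasToy;
   SE       = sqrt(VarB/K);      /* standard error of the bias estimate */
   RMSE     = sqrt(MeanBsqr);    /* root mean square error of T */
   Ratio    = Bias/RMSE;         /* compare with the 0.25 guideline */
   Adjusted = 0.89 - Bias;       /* 0.89 = hypothetical original-sample T */
run;

proc print data=BiasToy noobs;
   var K Bias SE RMSE Ratio Adjusted;
run;
```

The VAR statistic in PROC MEANS uses the divisor K - 1, so dividing it by K reproduces the K(K - 1) denominator in the standard error formula.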
Example

The data used in this example are from a study by Symons, Grimson, and Yuan (1983) that investigates the incidence of sudden infant death syndrome (SIDS) and the problem of identifying the counties at high risk of SIDS in North Carolina. The data available are the number of deaths due to SIDS and the number of live births in the 100 counties of North Carolina for the years 1974–1978. The study models the number of SIDS deaths as a mixture of two Poisson distributions, one representing a "normal" subpopulation and the other representing a "high-risk" subpopulation.

The following SAS statements create a SAS data set named NCSIDS. In addition to the original variables SIDS and Births, the variables Logrisk and Rate are created. Logrisk is the natural logarithm of the number of births, and Rate is the incidence rate.

data ncsids; input County $ 1-12 Births SIDS; Rate=SIDS/Births; Logrisk=log(Births); datalines;
Alamance     4672 13
Alexander    1333 0
Alleghany    487 0
Anson        1570 15
Ashe         1091 1
Avery        781 0
Beaufort     2692 7
... more lines ...
Wayne        6638 18
Wilkes       3146 4
Wilson       3702 11
Yadkin       1269 1
Yancey       770 0
;

The following SAS statements use the FMM procedure to fit the two-component finite mixture model. The K= option in the MODEL statement specifies that two component distributions be fit. The DIST= option specifies that the two distributions be Poisson, and the OFFSET= option specifies that the variable Logrisk be used as an offset. The CLASS option in the OUTPUT statement adds the estimated component membership to the OUTPUT data set in a variable that is named Class by default. The MAXPOST option adds the maximum posterior probability to the output data set in a variable that is named Maxpost by default. The ODS OUTPUT statement saves the "Parameter Estimates," "Number of Observations," and "Mixing Probabilities" tables to SAS data sets.
proc fmm data=ncsids; model SIDS = / dist=poisson k=2 offset=Logrisk; output out=model class maxpost prior; ods output ParameterEstimates=Parameters NObs=NObs(where=(label="Number of Observations Used") keep=label N) MixingProbs=MixingProbs(keep=prob); run;

Output 1 displays selected portions of the FMM procedure’s output. In the Inverse Linked Estimates column of the "Parameter Estimates" table, you can see that the mean for the component 1 Poisson distribution is 0.001693 and the mean for the component 2 Poisson distribution is 0.003805. Thus, component 1 represents the "normal-risk" subpopulation and component 2 represents the "high-risk" subpopulation. You can also see from the "Mixing Probabilities" table that the prior probability for component 1 is 0.7969; the prior probability for component 2 can be inferred from the fact that the sum of the prior probabilities equals 1.

Output 1 Two Component Finite Mixture Model of SIDS Data

The FMM Procedure

Fit Statistics

| Statistic | Value |
| -2 Log Likelihood | 474.3 |
| AIC (smaller is better) | 480.3 |
| AICC (smaller is better) | 480.5 |
| BIC (smaller is better) | 488.1 |
| Pearson Statistic | 113.3 |
| Effective Parameters | 3 |
| Effective Components | 2 |

Parameter Estimates for 'Poisson' Model

| Component | Parameter | Estimate | Standard Error | z Value | Pr > \|z\| | Inverse Linked Estimate |
| 1 | Intercept | -6.3813 | 0.08812 | -72.42 | <.0001 | 0.001693 |
| 2 | Intercept | -5.5715 | 0.2288 | -24.35 | <.0001 | 0.003805 |

Parameter Estimates for Mixing Probabilities

| Parameter | Linked Scale Estimate | Standard Error | z Value | Pr > \|z\| | Probability |
| Probability | 1.3669 | 0.8024 | 1.70 | 0.0885 | 0.7969 |

Figure 1 is a plot of the ordered incidence
rates, with color-coded markers to indicate whether a county is classified by the finite mixture model as normal risk or high risk. The plot reveals that, with a couple of exceptions, the counties with the highest rates tend to be classified as high risk by the finite mixture model. Figure 1 Ordered Incidence Rates You can generate plots of the maximum posterior probabilities to visually assess how accurately the finite mixture model can cluster the data. For example, the following SAS statements generate a plot of the ordered maximum posterior probabilities and a plot of the maximum posterior probabilities versus the incidence rates: proc rank data=model out=model descending; var maxpost; ranks postorder; run; title "Ordered Maximum Posterior Probabilities"; proc sgplot data=model; scatter y=maxpost x=postorder / markerattrs=(symbol=CircleFilled size=4px) group=class; run; title; title "Maximum Posterior Probabilities by Incidence Rates"; proc sgplot data=model; scatter y=maxpost x=rate / markerattrs=(symbol=CircleFilled size=4px) group=class; run; title; The plot on the left in Figure 2 shows that about half the counties have relatively large (> 0.9) maximum posterior probabilities, indicating that they can be classified with a high degree of certainty. The maximum posterior probabilities then decline at an accelerating rate, indicating a deteriorating degree of certainty about the correct allocation. The counties that are classified as high risk appear to account for a disproportionate share of those with relatively low (< 0.7) maximum posterior probabilities. The plot on the right shows that counties with incidence rates between approximately 0.0026 and 0.005 tend to have the highest degree of uncertainty about the correct allocation. 
Figure 2 Maximum Posterior Probabilities (left panel: Ordered Maximum Posterior Probabilities; right panel: Maximum Posterior Probabilities by Incidence Rates)

You can compute the estimated correct allocation rates T, T1, and T2 as follows by using the data that are saved in the output data sets Model, Nobs, and MixingProbs. The results are saved in the data set ModelPost.

proc sort data=model out=model; by Class; run;
proc means data=model sum; by Class; var Maxpost; output out=ModelPost(drop=_TYPE_ _FREQ_) sum(Maxpost)=Maxpost; run;
proc transpose data=ModelPost prefix=MaxPost out=ModelPost(drop=_LABEL_ _NAME_); var Maxpost; id Class; run;
data ModelPost; merge ModelPost NObs(drop=label) MixingProbs; T=(MaxPost1 + MaxPost2)/N; T1=MaxPost1/(N*Prob); T2=MaxPost2/(N*(1-Prob)); label T=Estimated Overall Correct Allocation Rate T1=Estimated Component 1 Correct Allocation Rate T2=Estimated Component 2 Correct Allocation Rate; run;
proc print data=ModelPost noobs label; var T T1 T2; run;

Output 2 shows the estimated correct allocation rates. The estimate of the overall rate is 0.89, the rate for the normal-risk subpopulation is 0.96, and the rate for the high-risk subpopulation is 0.59. Even though these estimated allocation rates might be biased, they indicate that the finite mixture model can correctly classify observations from the normal-risk population with a high degree of accuracy, but they also indicate that the degree of accuracy is considerably lower for observations from the high-risk population. This implies that the model is much more likely to incorrectly classify a high-risk county as normal risk than to incorrectly classify a normal-risk county as high risk.
Output 2 Allocation Rates

| Estimated Overall Correct Allocation Rate | Estimated Component 1 Correct Allocation Rate | Estimated Component 2 Correct Allocation Rate |
| 0.88775 | 0.96290 | 0.59292 |

Computing Bias-Adjusted Correct Allocation Rates by Using the Parametric Bootstrap Method

The section Analysis describes a five-step procedure that uses the parametric bootstrap method to produce bias-adjusted correct allocation rates. The five subsections that follow describe each step in greater detail.

Step 1: Generate the Bootstrap Samples

Step 1 is to generate the bootstrap samples. One of the easiest ways to do this is to use the SURVEYSELECT procedure. If you use the METHOD=SRS option, SAMPRATE=1 option, and REPS=200 option, PROC SURVEYSELECT generates 200 exact replicates of the original data set. The procedure automatically generates a variable named Replicate that indexes the samples. Using the SEED= option ensures reproducibility. The OUT= option saves the 200 bootstrap samples in a data set named Bootstrap.

proc surveyselect data=model(keep=county sids logrisk) out=bootstrap seed=872398 method=srs samprate=1 rep=200; run;

Next, you retrieve the estimated Poisson parameters and the first component mixing probability from the output data sets Parameters and MixingProbs and store those values in macro variables for later use.

data parameters; set parameters; if component=1 then call symput('theta1',estimate); if component=2 then call symput('theta2',estimate); run;
data MixingProbs; set MixingProbs; call symput('pi1',prob); run;

Then, you generate the pseudorandom variable Group, which takes on the value 1 or 2 with the probability $\hat{\pi}_1$ or $1 - \hat{\pi}_1$, respectively. Also, you replace the response variable SIDS with the pseudorandom numbers that have a distribution equivalent to the estimated finite mixture model. Then, to prepare the data set Bootstrap for BY-group processing, you sort it by Replicate.
data bootstrap; set bootstrap; call streaminit(987346598); Group=rand('TABLE',&pi1); mu1=exp(&theta1+logrisk); mu2=exp(&theta2+logrisk); SIDS=ifn(group=1,rand('POISSON',mu1),rand('POISSON',mu2)); run; proc sort data=bootstrap out=bootstrap; by Replicate; run; Step 2: Fit a Finite Mixture Model to Each Bootstrap Sample Step 2 is to use PROC FMM with the original finite mixture model specification and BY-group processing to fit a finite mixture model to each of the 200 bootstrap samples. In an OUTPUT statement, you specify the OUT=, CLASS, MAXPOST, and PRIOR options. The ODS OUTPUT statement saves the parameter estimates for each bootstrap sample in the data set BootParameters and saves the model convergence status in the data set Converge. These two data sets and the output data set ML will be merged so that you can perform some recommended filtering of the output data set ML. ods select none; proc fmm data=bootstrap; by Replicate; model SIDS = / dist=poisson k=2 offset=Logrisk; output out=ml class maxpost prior; ods output ParameterEstimates=BootParameters(keep=replicate component estimate) ConvergenceStatus=Converge(keep=replicate status); run; Before you perform any analysis, it is recommended that you perform two filtering operations on the output data set ML. The first filtering operation concerns the convergence status of the models that have been fit to each bootstrap sample. Before you accept the estimates, you should check the convergence status of each model and discard any estimates that were generated by a model that failed to converge. The data set Converge contains a variable named Status that has a value of 0 if the model converged and a nonzero value if the model fails to converge. The second filtering operation requires a slightly more elaborate explanation. Output 1 shows that PROC FMM names the two component distributions component 1 and component 2. 
Because the parameter estimate for component 1 is smaller than the estimate for component 2, component 1 is referred to as the normal-risk component and component 2 is referred to as the high-risk component. Thus, when the variable Class has a value of 1, it is interpreted to mean that a particular observation has been assigned to the normal-risk subpopulation; when Class has a value of 2, it is interpreted to mean that a particular observation has been assigned to the high-risk subpopulation. However, this naming of the components by PROC FMM is completely arbitrary. If you fit the same model to a different sample, there is no guarantee that component 1 will be associated with the smaller parameter estimate and thus represent the normal-risk component. This means that when the variable Class has a value of 1, it cannot necessarily be interpreted to mean that the observation has been assigned to the normal-risk subpopulation. To ensure that the classifications have the same interpretations across the bootstrap samples, you must ensure that the magnitudes of the parameter estimates are in the same order. When they are not in the same order, you must switch the values of the variable Class and the values of the prior probabilities that are stored in the output data set ML. To switch these values, you first transpose the data set BootParameters and sort it by Replicate. Then you sort the data set Converge by Replicate and merge BootParameters and Converge with ML. proc transpose data=BootParameters out=BootParameters(drop=_NAME_) prefix=Estimate; by Replicate; id Component; run; proc sort data=BootParameters out=BootParameters; by Replicate; run; proc sort data=Converge; by Replicate; run; data ml; merge ml BootParameters Converge; by Replicate; run; Next, you can perform a DATA step and use a WHERE clause to exclude all observations where Status is nonzero. 
Then, you create a set of variables that are duplicates of the variables that contain the parameter estimates, the prior probabilities, and the variable Class. Finally, you perform a conditional (IF-THEN-DO) operation based on the relative magnitudes of the parameter estimates and switch the values in the original variables as needed. The parameter estimates are stored in the variables Estimate1 and Estimate2, and the prior probabilities are stored in the variables Prior_1 and Prior_2.

data ml(drop=Prior1 Prior2 Lambda1 Lambda2 Class2 Status); set ml(where=(Status=0)); Prior1=Prior_1; Prior2=Prior_2; Post1=Post_1; Post2=Post_2; Lambda1=Estimate1; Lambda2=Estimate2; Class2=Class; if Estimate1>Estimate2 then do; Prior_1=Prior2; Prior_2=Prior1; if Class2=1 then Class=2; if Class2=2 then Class=1; end; run;

Step 3: Compute the True Correct Allocation Rates and the Estimated Correct Allocation Rates for Each Bootstrap Sample

Step 3 is to compute the true correct allocation rates $A$, $A_1$, and $A_2$ and the estimated correct allocation rates $T$, $T_1$, and $T_2$ for the 200 bootstrap samples.
The following SAS statements compute the true correct allocation rates A, A 1 , and A 2 for the bootstrap samples and save the results in a data set named Allocation: data ml; set ml; z1=ifn(Group=1,1,0); zhat1=ifn(Class=1,1,0); d1=ifn(z1=zhat1,1,0); z2=ifn(Group=2,1,0); zhat2=ifn(Class=2,1,0); d2=ifn(z2=zhat2,1,0); A1=z1*d1; A2=z2*d2; run; proc sort data=ml out=ml; by Replicate Class; run; proc means data=ml(keep= replicate A1 A2 z1 z2) sum; by Replicate; var A1 A2 z1 z2; output out=allocation(drop=_TYPE_ _FREQ_) sum(A1)=A1 sum(A2)=A2 sum(z1)=n1 sum(z2)=n2; run; data allocation; set allocation; A1=A1/n1; A2=A2/n2; n=n1+n2; A=(n1*A1 + n2*A2)/n; run; The next group of SAS statements computes the estimated correct allocation rates T, T 1 , and T 2 for the bootstrap samples and saves the results to the same data set, Allocation: proc means data=ml sum; by Replicate Class; var maxpost; output out=PostSums(drop=_TYPE_ _FREQ_) sum(maxpost)=Maxpost; run; proc transpose data=PostSums prefix=SumMaxPost out=PostSums(drop=_LABEL_ _NAME_); var Maxpost; by Replicate; id Class; run; proc sort data=allocation out=allocation; by Replicate; run; proc sort data=PostSums out=PostSums; by Replicate; run; data prior; set ml(keep=Replicate Prior_1 Prior_2); by Replicate; if First.Replicate; run; proc sort data=prior out=prior; by Replicate; run; data allocation; merge allocation PostSums prior; by Replicate; run; data allocation; set allocation; T=(SumMaxPost1 + SumMaxPost2)/n; T1=SumMaxPost1/(n*Prior_1); T2=SumMaxPost2/(n*Prior_2); run; Step 4: Estimate the Biases of the Correct Allocation Rate Estimators Step 4 is to compute the biases b, b 1 , and b 2 , the standard errors of the bias estimates SE(b), SE(b 1 ), and SE(b 2 ), and the root mean square errors RMSE(T), RMSE(T 1 ), and RMSE(T 2 ). 
The following SAS statements accomplish this task and save the results in the data set Bias:

data allocation; set allocation; B=T-A; Bsqr=B**2; B1=T1-A1; B1sqr=B1**2; B2=T2-A2; B2sqr=B2**2; run;
proc means data=allocation mean; var B B1 B2 Bsqr B1sqr B2sqr; output out=bias mean(B)=B var(B)=varB mean(B1)=B1 var(B1)=varB1 mean(B2)=B2 var(B2)=varB2 mean(Bsqr)=Bsqr mean(B1sqr)=B1sqr mean(B2sqr)=B2sqr N=K; run;
data bias; set bias; se_B=sqrt(varB/K); se_B1=sqrt(varB1/K); se_B2=sqrt(varB2/K); RMSE_T1=sqrt(B1sqr); RMSE_T2=sqrt(B2sqr); RMSE_T=sqrt(Bsqr); run;

Step 5: Compute the Bias-Adjusted Estimates of the Correct Allocation Rates

Step 5 is to compute the bias-adjusted estimates of the correct allocation rates $T^*$, $T^*_1$, and $T^*_2$. You do this by merging the data set Bias with the data set ModelPost, which contains the estimated correct allocation rates from the original sample, and subtracting the bias estimates $b$, $b_1$, and $b_2$ from the corresponding correct allocation rate estimates $T$, $T_1$, and $T_2$. The results are saved in the data set BiasAdjusted.

data BiasAdjusted; merge bias(drop = _TYPE_ _FREQ_) ModelPost; BiasAdjT=T-B; BiasAdjT1=T1-B1; BiasAdjT2=T2-B2; run;

The data set BiasAdjusted holds the results, but it is not in the best form for printing. All the data are currently stored in a wide form that consists of a single row of values. To prepare the results for printing, it is recommended that you reshape the data set to a long form; that is, you stack the data so there are separate rows for the overall population and the two subpopulations. You can do this by breaking BiasAdjusted into three data sets and then appending those three data sets together.
   data T;
      set BiasAdjusted(keep=T B SE_B BiasAdjT RMSE_T);
      length Label $ 12;
      label='Overall';
      rename T=Estimate B=Bias SE_B=SE BiasAdjT=Adjusted RMSE_T=RMSE;
   run;

   data T1;
      set BiasAdjusted(keep=T1 B1 SE_B1 BiasAdjT1 RMSE_T1);
      length Label $ 12;
      Label='Normal Risk';
      rename T1=Estimate B1=Bias SE_B1=SE BiasAdjT1=Adjusted RMSE_T1=RMSE;
   run;

   data T2;
      set BiasAdjusted(keep=T2 B2 SE_B2 BiasAdjT2 RMSE_T2);
      length Label $ 12;
      Label='High Risk';
      rename T2=Estimate B2=Bias SE_B2=SE BiasAdjT2=Adjusted RMSE_T2=RMSE;
   run;

   proc append base=results data=T; run;
   proc append base=results data=T1; run;
   proc append base=results data=T2; run;

Next, you compute the ratios $b/\mathrm{RMSE}(T)$, $b_1/\mathrm{RMSE}(T_1)$, and $b_2/\mathrm{RMSE}(T_2)$ and create appropriate labels to prepare the data set Results for printing:

   data results;
      set results;
      ratio=bias/rmse;
      label Label='Population'
            Estimate='Estimated Correct Allocation Rate'
            Bias='Estimate of Bias'
            SE='Standard Error of Bias Estimate'
            Adjusted='Bias-Adjusted Correct Allocation Rate'
            RMSE='RMSE of Correct Allocation Rate'
            ratio='ratio of Bias to RMSE';
   run;

Finally, you print the data set Results:

   ods select all;
   proc print data=results noobs label;
      var Label Estimate RMSE Bias SE ratio Adjusted;
      title 'Estimated Correct Allocation Rates';
      title2 'Parametric Bootstrap Method';
   run;

Output 3 shows the results. For this example, the bias estimates are all positive but fairly small. The ratios of bias to RMSE are all below the 0.25 threshold, indicating that bias is not a significant issue for this model and sample.
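For reference, the quantities that Steps 4 and 5 compute can be written compactly. The notation below is a sketch read off from the SAS statements rather than taken from the original text: $K$ is the number of bootstrap replicates (the variable K), $T^{(k)}$ and $A^{(k)}$ are the estimated and true correct allocation rates in replicate $k$, and $s_B^2$ is the sample variance of the replicate differences (the variable varB).

```latex
\begin{align*}
b &= \frac{1}{K}\sum_{k=1}^{K}\bigl(T^{(k)} - A^{(k)}\bigr), &
\mathrm{SE}(b) &= \sqrt{s_B^{2}/K}, \\
\mathrm{RMSE}(T) &= \sqrt{\frac{1}{K}\sum_{k=1}^{K}\bigl(T^{(k)} - A^{(k)}\bigr)^{2}}, &
T^{*} &= T - b .
\end{align*}
```

The same expressions apply to the two subpopulations with subscripts 1 and 2, and the ratio that is printed for each row is the bias divided by the corresponding RMSE.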
Output 3 Parametric Bootstrap Results

Estimated Correct Allocation Rates
Parametric Bootstrap Method

Population | Estimated Correct Allocation Rate | RMSE of Correct Allocation Rate | Estimate of Bias | Standard Error of Bias Estimate | ratio of Bias to RMSE | Bias-Adjusted Correct Allocation Rate
Overall | 0.88775 | 0.04590 | .004583433 | .003237460 | 0.099858 | 0.88316
Normal Risk | 0.96290 | 0.01902 | .001455327 | .001344381 | 0.076513 | 0.96145
High Risk | 0.59292 | 0.12715 | .003564499 | .009010140 | 0.028033 | 0.58935

References

Basford, K. E. and McLachlan, G. J. (1985), “Estimation of Allocation Rates in a Cluster Analysis Context,” Journal of the American Statistical Association, 80(390), 286–293.
Ganesalingam, S. and McLachlan, G. J. (1980), “Error Rate Estimation on the Basis of Posterior Probabilities,” Pattern Recognition, 12, 405–413.
McLachlan, G. J. and Basford, K. E. (1988), Mixture Models, New York: Marcel Dekker.
Symons, M. J., Grimson, R. C., and Yuan, Y. C. (1983), “Clustering of Rare Events,” Biometrics, 39, 193–205.

AlexBeaver · ‎12-18-2023

Overview

Semicontinuous random variables are characterized by a continuous distribution that has point masses at one or more locations. One way to model semicontinuous data is to fit a generalized linear model by using a Tweedie distribution for the response variable. Tweedie distributions have been used in such diverse fields as actuarial science, economics, telecommunications, ecology, medicine, and meteorology. This example illustrates how to fit a Tweedie model to aggregate insurance claims payments data by using the HPGENSELECT procedure, which is available in SAS/STAT 12.3, recently released with SAS 9.4.

Analysis

Exponential dispersion models are the response distributions for generalized linear models. Any exponential dispersion model can be characterized by its variance function $V(\cdot)$, which describes the mean-variance relationship of the distribution when the dispersion is held constant. If $Y$ follows an exponential dispersion model distribution that has mean $\mu$, variance function $V(\cdot)$, and dispersion $\phi$, then the variance of $Y$ can be written as

$$\mathrm{Var}(Y) = \phi V(\mu)$$

Tweedie distributions are a special case of the exponential dispersion family for which $V(\mu) = \mu^p$ and $\mathrm{Var}(Y) = \phi \mu^p$ (Dunn and Smyth, 2005). The distribution is defined for all values of $p$ except values of $p$ in the open interval (0, 1). Many important known distributions are special cases of Tweedie distributions, including the normal ($p = 0$), Poisson ($p = 1$), gamma ($p = 2$), and inverse Gaussian ($p = 3$) distributions. Apart from these special cases, the probability density function of the Tweedie distribution does not have an analytical expression. For $p > 1$, it has the form

$$f(y; \mu, \phi, p) = a(y, \phi) \exp\left[ \frac{1}{\phi} \left( \frac{y\,\mu^{1-p}}{1-p} - \kappa(\mu, p) \right) \right]$$

where $\kappa(\mu, p) = \mu^{2-p}/(2-p)$ for $p \ne 2$ and $\kappa(\mu, p) = \log(\mu)$ for $p = 2$. The function $a(y, \phi)$ does not have an analytical expression. It is usually evaluated by using the series expansion methods that are described in Dunn and Smyth (2005).
For $1 < p < 2$, the Tweedie distribution is a compound Poisson-gamma mixture distribution, which is the distribution of $S$ defined as

$$S = \sum_{i=1}^{N} Y_i$$

where $N \sim \mathrm{Poisson}(\lambda)$ and $Y_i \sim \mathrm{gamma}(\alpha, \theta)$ are independently and identically distributed gamma random variables with the shape parameter $\alpha$ and the scale parameter $\theta$. At $Y = 0$, the density is a probability mass that is governed by the Poisson distribution, and for values of $Y > 0$, the density is a mixture of gamma variates with Poisson mixing probability. The parameters $\lambda$, $\alpha$, and $\theta$ are related to the natural parameters $\mu$, $\phi$, and $p$ of the Tweedie distribution as

$$\lambda = \frac{\mu^{2-p}}{\phi\,(2-p)}, \qquad \alpha = \frac{2-p}{p-1}, \qquad \theta = \phi\,(p-1)\,\mu^{p-1}$$

The mean of a Tweedie distribution is positive for $p > 1$.

Example: Modeling Insurance Claims Data

When modeling aggregate payments from insurance claims, if you assume that the arrival of claims follows a Poisson distribution, that the sizes of individual claims are independently and identically gamma distributed, and that the arrival and sizes are independent of one another, then the aggregate payments follow a Tweedie compound Poisson-gamma mixture distribution (Frees, 2010). This example uses PROC HPGENSELECT to fit a Tweedie model to the aggregate loss data from a Swedish study about third-party automobile insurance claims for 1977. The data were compiled by the Swedish Committee on the Analysis of Risk Premium in Motor Insurance (Andrews and Herzberg, 1985). The following SAS statements create the data set MotorIns, and Table 1 describes the variables:

   data motorins;
      input Kilometres Zone Bonus Make Insured Claims Payment;
      LogInsured=log(insured);
      Zeros=ifn(payment ne 0,1,0);
      datalines;
   1 1 1 1 455.13 108 392491
   1 1 1 2 69.17 19 46221
   1 1 1 3 72.88 13 15694
   1 1 1 4 1292.39 124 422201
   1 1 1 5 191.01 40 119373
   1 1 1 6 477.66 57 170913
   ... more lines ...
   5 7 7 8 13.06 0 0
   5 7 7 9 384.87 16 112252
   ;
   run;

Table 1: Example Data Set MotorIns

Variable | Type | Description
Kilometres | Class | Distance traveled
Zone | Class | Geographical zone
Make | Class | Make of automobile
Bonus | Continuous | No-claims bonus
Insured | Continuous | Number of insured drivers (years 100,000)
LogInsured | Continuous | Natural logarithm of Insured
Claims | Continuous | Number of insurance claims
Payment | Continuous | Sum of insurance claims payments, in Swedish kronor
Zeros | Binary | Indicator variable for Payment not equal to 0

Table 2 describes the levels of the classification variable Kilometres.

Table 2: Values and Labels of Kilometres

Value | Label
1 | Less than 1,000 km per year
2 | 1,000–15,000 km per year
3 | 15,000–20,000 km per year
4 | 20,000–25,000 km per year
5 | More than 25,000 km per year

Table 3 describes the levels of the classification variable Zone. The zones are given from a detailed investigation of 100 areas in 1972 and represent combinations of traffic intensity, state of roads, climatic differences, and so on (Andrews and Herzberg, 1985).

Table 3: Values and Labels of Zone

Value | Label
1 | Stockholm, Göteborg, Malmö with surroundings
2 | Other large cities with surroundings
3 | Smaller cities with surroundings in southern Sweden
4 | Rural areas in southern Sweden
5 | Smaller cities with surroundings in northern Sweden
6 | Rural areas in northern Sweden
7 | Gotland

The models of cars are classified into 10 premium classes, but in a special investigation for 1977 eight common pure models were chosen and the rest were put in a combined class for reference (Andrews and Herzberg, 1985). The levels 1–8 of the classification variable Make represent the eight pure models, and level 9 is the combined class.

The variable Bonus is a measure of individual claim history. The insured motorist starts in the class Bonus = 1. Every year that no claim is filed, the insured moves up one class (Andrews and Herzberg, 1985).
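The statements that produce the histogram and the frequency table discussed next are not shown in the text. The following is a minimal sketch of how comparable output could be generated from the MotorIns data set; it is an illustration under that assumption, not necessarily the code that was used to create the original figure, and the axis label is an arbitrary choice.

```sas
/* Sketch only: tabulate the zero indicator and draw a histogram
   of Payment. Assumes the MotorIns data set created above. */
proc freq data=motorins;
   tables Zeros;
run;

proc sgplot data=motorins;
   histogram Payment;
   xaxis label="Payment (Swedish kronor)";   /* illustrative label */
run;
```

The PROC FREQ step reports how many observations have a zero payment, and the histogram gives a visual impression of the right-skewed positive part of the distribution.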
Figure 1 shows a histogram of the response variable Payment and the proportion of zeros. The variable Payment exhibits a distribution that is fairly typical of semicontinuous variables: a significant density mass at zero and a continuous, right-skewed distribution elsewhere. Figure 1: Distribution of Payment Output 1: Frequency of Zeros The FREQ Procedure Zeros Frequency Percent Cumulative Frequency Cumulative Percent 0 385 17.64 385 17.64 1 1797 82.36 2182 100.00 The following SAS statements fit a Tweedie compound Poisson-gamma mixture model to the response variable Payment. The CLASS statement specifies that the variables Kilometres, Zone, and Make are categorical variables. The SPLIT option requests that the columns of the design matrix that correspond to any effect that contains a split classification variable be able to be selected to enter or leave a model independently of the other design columns of that effect. The PARAM= option specifies a reference cell encoding for the classification variables. The MODEL statement specifies that the response variable have a Tweedie distribution with a log link function. The candidates for the linear predictor include the main effects and the interactions between the classification variables Kilometres, Zone, and Make and the continuous variable Bonus. The OFFSET= option specifies that the variable LogInsured be included in the linear predictor with a coefficient of 1. The SELECTION statement requests that stepwise selection be used and that the final model be chosen based on the AICC criterion. The DETAILS=SUMMARY option requests that only a summary of the selection process be displayed rather than the details from each step of the selection process. The OUTPUT statement requests that the selected model’s prediction and residuals be saved to the SAS data set Tweedie. The ID statement requests that the variable Payment also be included in the output data set. 
proc hpgenselect data=motorins; class Kilometres Zone Make / split param=reference; model payment = Kilometres|Zone|Make|Bonus / dist=tweedie link=log offset=loginsured; selection method=stepwise(choose=aicc) details=summary; output out=tweedie P R; id payment; run; The “Performance Information” table in Output 2 shows that the procedure executed in single-machine mode (that is, on the server where SAS is installed). When high-performance procedures run in single-machine mode, they use concurrently scheduled threads. In this case, four threads were used. The “Model Information” table reports that a Tweedie model was fit with a log link function. The “Selection Information” table reports that stepwise selection was used, the selection and stopping criteria are the significance level of each individual effect, the entry significance level is the default value of 0.05, and the choose criterion is AICC. Output 2: Performance, Model, and Selection Information The HPGENSELECT Procedure Performance Information Execution Mode Single-Machine Number of Threads 4 Model Information Data Source WORK.MOTORINS Response Variable Payment Offset Variable LogInsured Class Parameterization Reference Distribution Tweedie Link Function Log Optimization Technique Quasi-Newton Selection Information Selection Method Stepwise Select Criterion Significance Level Stop Criterion Significance Level Choose Criterion AICC Effect Hierarchy Enforced None Entry Significance Level (SLE) 0.05 Stay Significance Level (SLS) 0.05 Stop Horizon 1 Output 3 shows that the sample size is 2,182, and the “Class Level Information” table shows the number of levels and the level values of the three classification variables. 
Output 3: Sample Size and Classification Variable Levels Number of Observations Read 2182 Number of Observations Used 2182 Class Level Information Class Levels Reference Value Values Kilometres 5 * 5 1 2 3 4 5 Zone 7 * 7 1 2 3 4 5 6 7 Make 9 * 9 1 2 3 4 5 6 7 8 9 * Associated Parameters Split The “Selection Summary” table in Output 4 reports the variables that are added at each step of the selection process. The summary shows that 30 effects plus an intercept were selected and that the selection process terminated because the sequence of effect additions and removals began cycling. Output 4: Effect Selection Summary The HPGENSELECT Procedure Selection Summary Step Effect Entered Effect Removed Number Effects In AICC p Value 0 Intercept 1 44299.3018 . 1 Bonus 2 43694.2249 <.0001 2 Zone_1 3 43576.8235 <.0001 3 Kilometres_1 4 43414.2670 <.0001 4 Make_4 5 43251.5268 <.0001 5 Make_6 6 43186.2263 <.0001 6 Bonus*Kilometres_2 7 43128.0129 <.0001 7 Zone_5*Make_8 8 43103.4200 <.0001 8 Bonus*Kilometres_2*Zone_1*Make_8 9 43087.3135 <.0001 9 Zone_2 10 43057.5628 <.0001 10 Bonus*Kilometres_3 11 43035.8693 <.0001 11 Make_5 12 43021.0275 <.0001 12 Kilometres_1*Make_1 13 43010.8831 0.0003 13 Make_8 14 43000.7509 0.0003 14 Make_2 15 42990.3324 0.0003 15 Bonus*Kilometres_4*Zone_6*Make_4 16 42987.4687 0.0007 16 Kilometres_1*Zone_5*Make_7 17 42981.7743 0.0008 17 Kilometres_1*Zone_6*Make_7 18 42976.6767 0.0019 18 Bonus*Zone_5 19 42969.6789 0.0020 19 Bonus*Zone_3 20 42961.8306 0.0015 20 Bonus*Kilometres_1*Make_1 21 42954.6503 0.0019 21 Kilometres_4 22 42945.4183 0.0008 22 Make_1 23 42937.8402 0.0017 23 Bonus*Make_5 24 42931.7907 0.0039 24 Bonus*Kilometres_1 25 42925.8663 0.0045 25 Zone_2*Make_6 26 42920.2139 0.0075 26 Bonus*Kilometres_1*Zone_6 27 42915.7147 0.0082 27 Bonus*Kilometres_3*Zone_5*Make_2 28 42913.6024 0.0183 28 Bonus*Kilometres_4*Zone_1*Make_5 29 42911.4418 0.0190 29 Zone_1*Make_2 30 42908.4035 0.0202 30 Bonus*Zone_1*Make_2 31 42904.3219* 0.0099 31 Kilometres_1*Zone_6*Make_7 
30 42907.6359 0.0534 * Optimal Value of Criterion Stepwise selection stopped because the sequence of effect additions and removals is cycling. The model at step 30 is selected where AICC is 42904.32. The “Selected Effects” note in Output 5 lists the effects that are selected for the final model. The “Dimensions” table reports that 31 effects are included in the final model and 33 parameters are estimated. The “Fit Statistics” table reports that the value of AICC for the final model is 42,904. Output 5: Selected Effects, Dimensions, Convergence Status, and Fit Statistics Selected Effects: Intercept Kilometres_1 Kilometres_4 Zone_1 Zone_2 Make_1 Make_2 Make_4 Make_5 Make_6 Make_8 Kilometres_1*Make_1 Zone_1*Make_2 Zone_2*Make_6 Zone_5*Make_8 Kilometres_1*Zone_5*Make_7 Kilometres_1*Zone_6*Make_7 Bonus Bonus*Kilometres_1 Bonus*Kilometres_2 Bonus*Kilometres_3 Bonus*Zone_3 Bonus*Zone_5 Bonus*Kilometres_1*Zone_6 Bonus*Make_5 Bonus*Kilometres_1*Make_1 Bonus*Zone_1*Make_2 Bonus*Kilometres_2*Zone_1*Make_8 Bonus*Kilometres_3*Zone_5*Make_2 Bonus*Kilometres_4*Zone_1*Make_5 Bonus*Kilometres_4*Zone_6*Make_4 Dimensions Number of Effects 31 Number of Effects after Splits 31 Number of Parameters 33 Columns in X 31 Fit Statistics -2 Log Likelihood 42837 AIC (smaller is better) 42903 AICC (smaller is better) 42904 BIC (smaller is better) 43091 Pearson Chi-Square 1015375 Pearson Chi-Square/DF 472.04777 Convergence criterion (GCONV=1E-8) satisfied. Output 6 displays the estimates of the model parameters. The estimate of the dispersion parameter, φ, is 349.65 and the estimate of the power, p, is 1.36. The effect of using the SPLIT option in the CLASS statement is apparent. None of the classification variables have all their main effects or complete sets of interactions included in the model. The result is a more parsimonious model than you would achieve without enabling the design columns to enter and leave the model independently. 
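Because the estimated power lies between 1 and 2, the fitted distribution has a compound Poisson-gamma interpretation. The following DATA step is a sketch of how the estimates of φ and p quoted above could be translated into a Poisson claim rate and gamma claim-size parameters. It assumes the standard relations λ = μ^(2−p)/(φ(2−p)), α = (2−p)/(p−1), and θ = φ(p−1)μ^(p−1); the mean value mu is hypothetical and is chosen only for illustration.

```sas
/* Sketch: convert Tweedie parameters (mu, phi, p) into compound
   Poisson-gamma parameters (lambda, alpha, theta) for 1 < p < 2.
   phi and p are the reported estimates; mu is a HYPOTHETICAL mean. */
data tweedie_cp;
   phi = 349.647997;                     /* estimated dispersion      */
   p   = 1.363043;                       /* estimated power           */
   mu  = 50000;                          /* hypothetical mean payment */

   lambda = mu**(2-p) / (phi*(2-p));     /* Poisson mean claim count  */
   alpha  = (2-p) / (p-1);               /* gamma shape               */
   theta  = phi*(p-1)*mu**(p-1);         /* gamma scale               */

   p_zero     = exp(-lambda);            /* implied Pr(Payment = 0)   */
   check_mean = lambda*alpha*theta;      /* should reproduce mu       */
   check_var  = lambda*alpha*(1+alpha)*theta**2;  /* compound Poisson variance */
   tweedie_var = phi*mu**p;              /* should equal check_var    */
run;

proc print data=tweedie_cp noobs;
run;
```

The two check variables are included so that the conversion can be verified: the compound Poisson mean λαθ should reproduce mu, and the compound Poisson variance should agree with φμ^p.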
Output 6: Parameter Estimates Parameter Estimates Parameter DF Estimate Standard Error Chi-Square Pr > ChiSq Intercept 1 6.422717 0.030076 45604.6548 <.0001 Kilometres_1 1 -0.396625 0.050252 62.2944 <.0001 Kilometres_4 1 -0.163970 0.038900 17.7679 <.0001 Zone_1 1 0.431027 0.028089 235.4780 <.0001 Zone_2 1 0.234752 0.028513 67.7853 <.0001 Make_1 1 0.102037 0.031974 10.1842 0.0014 Make_2 1 0.104421 0.048899 4.5601 0.0327 Make_4 1 -0.676695 0.051711 171.2483 <.0001 Make_5 1 0.467489 0.091598 26.0480 <.0001 Make_6 1 -0.200767 0.039391 25.9776 <.0001 Make_8 1 0.240254 0.056656 17.9826 <.0001 Kilometres_1*Make_1 1 0.443491 0.132307 11.2358 0.0008 Zone_1*Make_2 1 0.728101 0.202768 12.8939 0.0003 Zone_2*Make_6 1 -0.286627 0.101679 7.9464 0.0048 Zone_5*Make_8 1 0.520894 0.160295 10.5599 0.0012 Kilometres_1*Zone_5*Make_7 1 0.736407 0.250920 8.6132 0.0033 Kilometres_1*Zone_6*Make_7 1 0.453568 0.230985 3.8558 0.0496 Bonus 1 -0.138215 0.006689 426.9209 <.0001 Bonus*Kilometres_1 1 -0.035637 0.010883 10.7225 0.0011 Bonus*Kilometres_2 1 -0.069006 0.006272 121.0511 <.0001 Bonus*Kilometres_3 1 -0.045452 0.006450 49.6621 <.0001 Bonus*Zone_3 1 0.019307 0.005369 12.9336 0.0003 Bonus*Zone_5 1 0.027528 0.007469 13.5859 0.0002 Bonus*Kilometres_1*Zone_6 1 0.031829 0.012152 6.8603 0.0088 Bonus*Make_5 1 -0.055964 0.018012 9.6539 0.0019 Bonus*Kilometres_1*Make_1 1 -0.063791 0.025037 6.4914 0.0108 Bonus*Zone_1*Make_2 1 -0.107665 0.039459 7.4448 0.0064 Bonus*Kilometres_2*Zone_1*Make_8 1 0.166930 0.043453 14.7578 0.0001 Bonus*Kilometres_3*Zone_5*Make_2 1 0.101962 0.045120 5.1068 0.0238 Bonus*Kilometres_4*Zone_1*Make_5 1 0.098160 0.045496 4.6550 0.0310 Bonus*Kilometres_4*Zone_6*Make_4 1 0.233105 0.094307 6.1097 0.0134 Dispersion 1 349.647997 28.740411 . . Power 1 1.363043 0.008891 . . The following SAS statements generate a scatter plot that compares the model predictions with the observed values of the response variable. 
proc sort data=tweedie out=tweedie; by payment; run; proc sgplot data=tweedie; scatter x=payment y=pred / legendlabel="Predicted"; series x=payment y=payment / lineattrs=(pattern=solid color=red) legendlabel="45 degree line"; yaxis label="Predicted"; run; Figure 2 shows that the predictions of the final model compare favorably with the observed responses. Figure 2: Scatter Plot of Predicted versus Observed Payments References Andrews, D. F. and Herzberg, A. M. (1985), A Collection of Problems from Many Fields for the Student and Research Worker, New York: Springer-Verlag. Dunn, P. K. and Smyth, G. K. (2005), “Series Evaluation of Tweedie Exponential Dispersion Model Densities,” Statistics and Computing, 15, 267–280. Frees, E. W. (2010), Regression Modeling with Actuarial and Financial Applications, Cambridge: Cambridge University Press. Jørgensen, B. and Paes de Souza, M. C. (1994), “Fitting Tweedie’s Compound Poisson Model to Insurance Claims Data,” Scandinavian Actuarial Journal, 1, 69–93.

AlexBeaver · ‎12-18-2023

Overview

The HPGENSELECT procedure, available in SAS/STAT 12.3 (which runs on Base SAS 9.4), performs model selection for generalized linear models (GLMs). It fits models for standard distributions in the exponential family, such as the normal, Poisson, and Tweedie distributions. In addition, PROC HPGENSELECT fits multinomial models for ordinal and nominal responses, and it fits zero-inflated Poisson and negative binomial models for count data. For all these models, the HPGENSELECT procedure provides forward, backward, and stepwise variable selection and includes Akaike’s information criterion (AIC), a small-sample bias-corrected version of Akaike’s information criterion (AICC), and the Schwarz Bayesian criterion (SBC) as selection criteria. PROC HPGENSELECT enables you to run in distributed mode on a cluster of machines that distribute the data and the computations or in single-machine mode on the server where SAS is installed.

Analysis

Many of the probability distributions that the HPGENSELECT procedure fits are members of an exponential family of distributions, which have probability distributions that are expressed as follows for some functions $b$ and $c$ that determine the specific distribution:

$$f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right\}$$

For fixed $\phi$, this is a one-parameter exponential family of distributions. The response variable can be discrete or continuous, so $f(y)$ represents either a probability mass function or a probability density function. A more useful parameterization of generalized linear models is by the mean and variance of the distribution:

$$E(Y) = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\,\phi$$

In generalized linear models, the mean of the response distribution is related to linear regression parameters through a link function,

$$g(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta}$$

for the $i$th observation, where $\mathbf{x}_i$ is a fixed known vector of explanatory variables and $\boldsymbol{\beta}$ is a vector of regression parameters. The HPGENSELECT procedure parameterizes models in terms of the regression parameters $\boldsymbol{\beta}$ and either the dispersion parameter $\phi$ or a parameter that is related to $\phi$, depending on the model.
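As a worked illustration (not part of the original text), take the exponential-family density in its usual form, $f(y) = \exp\{(y\theta - b(\theta))/\phi + c(y,\phi)\}$, and consider the Poisson distribution with mean $\mu$. Matching terms identifies each component, and differentiating $b$ recovers the familiar mean and variance:

```latex
\begin{align*}
f(y) &= \frac{\mu^{y}e^{-\mu}}{y!}
      = \exp\{\, y\log\mu - \mu - \log y! \,\}, \\
\theta &= \log\mu, \quad b(\theta) = e^{\theta}, \quad \phi = 1, \quad
c(y,\phi) = -\log y!, \\
E(Y) &= b'(\theta) = e^{\theta} = \mu, \qquad
\mathrm{Var}(Y) = \phi\, b''(\theta) = e^{\theta} = \mu .
\end{align*}
```

The variance depends on the mean only through $\mu$ itself, and $\theta = \log\mu$ explains why the log link is the natural choice for Poisson regression.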
For exponential family models, the distribution variance is $\mathrm{Var}(Y) = \phi V(\mu)$, where $V(\mu)$ is a variance function that depends only on $\mu$. The zero-inflated models and the multinomial models are not exponential family models, but they are closely related models that are useful and are included in the HPGENSELECT procedure.

Zero-Inflated Models

Count data that have an incidence of zeros greater than expected for the underlying probability distribution of counts can be modeled by using a zero-inflated distribution. In PROC HPGENSELECT, the underlying distribution can be either Poisson or negative binomial. The population is considered to consist of two types of individuals. The first type gives Poisson or negative binomial distributed counts, which might contain zeros. The second type always gives a zero count. Suppose $\lambda$ is the underlying distribution mean and $\omega$ is the probability of an individual being of the second type. The parameter $\omega$, which is called the zero-inflation probability, is the probability of zero counts in excess of the frequency that the underlying distribution predicts. The probability distribution of a zero-inflated Poisson random variable $Y$ is given by

$$\Pr(Y = y) = \begin{cases} \omega + (1-\omega)\,e^{-\lambda} & \text{for } y = 0 \\ (1-\omega)\,\dfrac{\lambda^{y} e^{-\lambda}}{y!} & \text{for } y = 1, 2, \ldots \end{cases}$$

The probability distribution of a zero-inflated negative binomial random variable $Y$ is given by

$$\Pr(Y = y) = \begin{cases} \omega + (1-\omega)\,(1 + k\lambda)^{-1/k} & \text{for } y = 0 \\ (1-\omega)\,\dfrac{\Gamma(y + 1/k)}{\Gamma(y+1)\,\Gamma(1/k)}\,\dfrac{(k\lambda)^{y}}{(1 + k\lambda)^{y + 1/k}} & \text{for } y = 1, 2, \ldots \end{cases}$$

where $k$ is the negative binomial dispersion parameter. You can model the parameters $\omega$ and $\lambda$ in PROC HPGENSELECT by using the regression models,

$$h(\omega_i) = \mathbf{z}_i'\boldsymbol{\gamma}, \qquad g(\lambda_i) = \mathbf{x}_i'\boldsymbol{\beta}$$

where $h$ is one of the binary link functions: logit, probit, or complementary log-log. You usually use the log link function for $g$ when you are fitting a Poisson or a negative binomial model. The mean and variance of $Y$ for the zero-inflated Poisson are given by

$$E(Y) = \mu = (1-\omega)\lambda, \qquad \mathrm{Var}(Y) = \mu + \frac{\omega}{1-\omega}\,\mu^{2}$$

The mean and variance of $Y$ for the zero-inflated negative binomial are given by

$$E(Y) = \mu = (1-\omega)\lambda, \qquad \mathrm{Var}(Y) = \mu + \left(\frac{\omega}{1-\omega} + \frac{k}{1-\omega}\right)\mu^{2}$$

Multinomial Models

Multinomial models apply to cases where an observation can fall into one of $k$ categories. Binary data occur in the special case where $k = 2$.
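As a brief numerical aside on the zero-inflated Poisson distribution described above, the following sketch tabulates its probability mass function and checks the mean and variance numerically. The values of omega and lambda are hypothetical, the support is truncated at 50 (where the remaining probability is negligible for these values), and the closed forms used for comparison are the standard ones, E(Y) = (1 − ω)λ and Var(Y) = (1 − ω)λ(1 + ωλ).

```sas
/* Sketch: tabulate a zero-inflated Poisson pmf for HYPOTHETICAL
   parameter values and check its first two moments. */
data zip_pmf;
   omega  = 0.3;     /* hypothetical zero-inflation probability */
   lambda = 2;       /* hypothetical Poisson mean               */
   do y = 0 to 50;
      prob = (1 - omega) * pdf('POISSON', y, lambda);
      if y = 0 then prob = prob + omega;    /* extra mass at zero */
      output;
   end;
run;

/* Probability-weighted mean and variance of y. VARDEF=WEIGHT divides
   by the sum of the weights, which is essentially 1 here. */
proc means data=zip_pmf mean var vardef=weight;
   var y;
   weight prob;
run;
```

For these values the closed forms give a mean of 1.4 and a variance of 2.24, so the PROC MEANS results should agree closely with those numbers. The variance exceeds the mean, which is the overdispersion that zero inflation introduces relative to an ordinary Poisson distribution.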
If there are $m_i$ observations in a subpopulation $i$, then the probability distribution of the number that falls into the $k$ categories $y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})$ can be modeled by the multinomial distribution, where $\sum_j y_{ij} = m_i$. The multinomial model is an ordinal model if the categories have a natural order. If $(p_{i1}, p_{i2}, \ldots, p_{ik})$ are the category probabilities, the cumulative category probabilities are modeled by using the same link functions that are used for binomial data. Suppose that $P_{ir} = \sum_{j=1}^{r} p_{ij}$, $r = 1, 2, \ldots, k-1$, are the cumulative category probabilities. The ordinal model is

$$g(P_{ir}) = \mu_r + \mathbf{x}_i'\boldsymbol{\beta} \quad \text{for } r = 1, 2, \ldots, k-1$$

where $\mu_1, \mu_2, \ldots, \mu_{k-1}$ are intercept terms that depend only on the categories and $\mathbf{x}_i$ is a vector of covariates that does not include an intercept term. The link function $g$ can be specified as a logit, probit, log-log, or complementary log-log function.

Model Selection

The HPGENSELECT procedure supports three methods of effect selection: forward selection, backward elimination, and stepwise selection.

In forward selection, the model-fitting process begins with only the intercept and then sequentially adds the effect that most improves the fit. The process terminates when adding an effect produces no significant improvement. The statistic that determines whether to add an effect is the significance level of a hypothesis test that indicates an effect’s potential contribution to the model. At each step, the effect that is most significant is added. The process stops when the significance level for adding any effect is greater than some specified entry significance level.

Backward elimination starts from the full model, which includes all independent effects. Then effects are deleted one by one until a stopping condition is satisfied. At each step, the effect that makes the smallest contribution to the model is deleted. The significance level of an effect determines whether to drop that effect.
At any step, the least significant predictor is dropped, and the process continues until all effects that remain in the model are significant at a specified stay significance level. Stepwise selection is a modification of forward selection in which effects already in the model do not necessarily stay there. In the HPGENSELECT procedure’s implementation of stepwise selection, the same entry and removal significance levels for forward selection and backward elimination are used to assess contributions of effects as they are added to or removed from a model. If, at a step of the selection process, any effects in the model are not significant, then the least significant of these effects is removed from the model and the algorithm proceeds to the next step. This ensures that no effect can be added to a model while an effect currently in the model is not deemed significant. Only after all necessary deletions have been made can another effect be added to the model. In this case the effect whose addition is the most significant is added to the model, and the algorithm proceeds to the next step. The stepwise process ends when none of the effects outside the model are significant and every effect in the model is significant. In some cases, neither of these two stopping conditions is met and the sequence of models cycles. In these cases, the stepwise method terminates at the end of the sequence. Example: Modeling Automobile Insurance Claims Frequency models are commonly used in the insurance industry to predict how often claims are made. This example uses a sample of real automobile insurance policy data to model the number of claims. 
The following DATA step reads the data, and Table 1 describes the variables in the data set Claim_History: data claim_history; input ID $ 1-10 Kids_Drive 11 Birth $ 15-25 Age 27-30 Home_Kids 31 YOJ 34-37 Income 38-48 Parent1 $ 49-52 Home_Value 53-63 MStatus $ 64-67 Gender $ 68 #2 Education $ 1-14 Occupation $ 15-27 Travel_Time 28-34 Car_Use $ 35-45 Bluebook 46-55 TIF 56 Car_Type $ 60-72 #3 Red_Car $ 1-3 OldClaim 5-13 Claims 14-18 Revoked $ 19-23 Mvr_Pts 24-26 Clm_Amt 27-35 Car_Age 36-39 Claim_Flag 40 Urbanicity $ 43-48; datalines; 77382913 0 11/22/1964 34 0 10 62977.82 No 0.00 No F ... more lines ... 121441578 0 7/1/1964 35 0 11 43111.84 No 0.00 No M High School Blue Collar 51.00 Commercial 27330.00 10 Panel Truck Yes 0.00 0 No 0 0.00 8 0 Rural ; Table 1: Claim_History Data Set Variable Name Description ID Policy identification number Kids_Drive Number of driving children Birth Date of birth of insured Age Age of insured Home_Kids Number of children at home YOJ Years on job Income Income of insured Parent1 Single parent Home_Value Value of home MStatus Marital status Gender Gender of insured Education Maximum education level of insured Occupation Occupation of insured Travel_Time Distance to work Car_Use Vehicle use Bluebook Value of vehicle TIF Time in force Car_Type Type of vehicle Red_Car A red car OldClaim Total dollar value of claims in past five years Claims Number of claims in past five years Revoked License revoked in past seven years Mvr_Pts Motor vehicle record points Clm_Amt Claim amount Car_Age Age of vehicle Claim_Flag Claim indicator Urbanicity Home/Work area You can use PROC FREQ as follows to generate a histogram of the response variable Claims for a visual inspection of its marginal distribution: ods graphics on; proc freq data=claim_history; table claims / plots(only)=freqplot(scale=percent); run; Figure 1 shows that the marginal distribution of Claims resembles a Poisson distribution that has excess zeros, suggesting that a zero-inflated Poisson 
(ZIP) model might be appropriate.

Figure 1: Distribution of Claims

The following SAS statements fit a zero-inflated Poisson (ZIP) model and use forward selection to find the best subset of effects for both the conditional mean of the Poisson distribution and the zero-inflation probability. The CLASS statement specifies that the variables Education, Gender, Car_Type, Car_Use, MStatus, Occupation, Parent1, Red_Car, Revoked, and Urbanicity are categorical variables. The MODEL statement specifies that the response variable has a zero-inflated Poisson distribution with a log link function. The ZEROMODEL statement requests that a probit link function be used for the zero-inflation probability equation. The SELECTION statement requests that forward selection be used and that the final model be chosen based on the AICC criterion. The DETAILS=SUMMARY option requests that only a summary of the selection process be displayed rather than the details from each step of the selection process. The ID statement requests that the variable Claims be included in the output data set. The OUTPUT statement requests that the selected model’s prediction and the estimate of the zero-inflation probability for each observation be saved to the SAS data set Zip. The ODS OUTPUT statement requests that the selected model’s “Fit Statistics” table be saved to the SAS data set Fit, the “Number of Observations” table be saved to the SAS data set Nobs, and the “Dimensions” table be saved to the SAS data set Dimensions.
proc hpgenselect data=claim_history; class education gender car_type car_use mstatus occupation parent1 red_car revoked urbanicity; model claims = education gender car_type car_use mstatus occupation parent1 red_car revoked urbanicity bluebook age car_age home_kids home_value income kids_drive mvr_pts tif travel_time yoj / distribution=zip link=log; zeromodel education gender car_type car_use mstatus occupation parent1 red_car revoked urbanicity bluebook age car_age home_kids home_value income kids_drive mvr_pts tif travel_time yoj / link=probit; selection method=forward(choose=aicc) details=summary; id claims; output out=zip pred=pred pzero=pzero; ods output fitstatistics=fit nobs=nobs dimensions=dimensions; run; The “Performance Information” table in Output 1 shows that the procedure executed in single-machine mode (that is, on the server where SAS is installed). When high-performance procedures run in single-machine mode, they use concurrently scheduled threads. In this case, four threads were used. The “Model Information” table reports that a zero-inflated Poisson model was fit with a log link function for the mean equation and a probit link function for the zero-inflation probability equation. Output 1: Performance and Model Information The HPGENSELECT Procedure Performance Information Execution Mode Single-Machine Number of Threads 4 Model Information Data Source WORK.CLAIM_HISTORY Response Variable Claims Class Parameterization GLM Distribution Zero-Inflated Poisson Link Function Log Zero Model Link Function Probit Optimization Technique Newton-Raphson with Ridging Number of Observations Read 2500 Number of Observations Used 2500 Output 2 lists the 10 class variables that are specified in the CLASS statement, along with the number of levels and the values for each variable. 
Output 2: Class Level Information

Class Level Information
   Class        Levels  Values
   Education    5       < High School, Bachelors, High School, Masters, PhD
   Gender       2       F, M
   Car_Type     6       Minivan, Panel Truck, Pickup, SUV, Sports Car, Van
   Car_Use      2       Commercial, Private
   MStatus      2       No, Yes
   Occupation   8       Blue Collar, Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student
   Parent1      2       No, Yes
   Red_Car      2       No, Yes
   Revoked      2       No, Yes
   Urbanicity   2       Rural, Urban

Output 3 displays a summary of the selection process. The “Selection Information” table reports that forward selection was used, the selection and stopping criteria are the significance level of each individual effect, and the entry significance level is the default value of 0.05. The “Selection Summary” table reports the variables that are added at each step of the selection process. Because the ZIP model is a mixture model and has two equations, the variables that are selected for the zero-inflation probability equation are distinguished from the variables that are selected for the mean equation by having “_Zero” appended to their name. The summary shows that four variables plus an intercept were selected for the mean equation and that eight variables plus an intercept were selected for the zero-inflation probability equation. The “Dimensions” table reports that 14 effects are included in the final model and 18 parameters are estimated.

Output 3: Summary of the Selection Process

Selection Information
   Selection Method                Forward
   Select Criterion                Significance Level
   Stop Criterion                  Significance Level
   Choose Criterion                AICC
   Effect Hierarchy Enforced       None
   Entry Significance Level (SLE)  0.05
   Stop Horizon                    1

The HPGENSELECT Procedure

Selection Summary
   Step  Effect Entered     Number Effects In  AICC        p Value
   0     Intercept          1
         Intercept_Zero     2                  5686.6821   .
   1     Mvr_Pts_Zero       3                  5140.5067   <.0001
   2     Urbanicity_Zero    4                  4979.1354   <.0001
   3     Income_Zero        5                  4954.3377   <.0001
   4     Car_Use_Zero       6                  4937.4125   <.0001
   5     Revoked_Zero       7                  4926.9092   0.0005
   6     Travel_Time_Zero   8                  4920.1708   0.0031
   7     Car_Type_Zero      9                  4914.0826   0.0064
   8     Home_Value_Zero    10                 4908.7648   0.0070
   9     Red_Car            11                 4904.6657   0.0126
   10    MStatus            12                 4901.7134   0.0254
   11    Travel_Time        13                 4899.3634   0.0358
   12    Mvr_Pts            14                 4897.1158*  0.0383
   * Optimal Value of Criterion

Selection stopped because no candidate for entry is significant at the 0.05 level.

The model at step 12 is selected where AICC is 4897.116.

Selected Effects: Intercept MStatus Red_Car Mvr_Pts Travel_Time Intercept_Zero Car_Type_Zero Car_Use_Zero Revoked_Zero Urbanicity_Zero Home_Value_Zero Income_Zero Mvr_Pts_Zero Travel_Time_Zero

Dimensions
   Number of Effects     14
   Number of Parameters  18
   Columns in X          24

Output 4 displays the estimates of the model parameters. The “Parameter Estimates” table displays the estimates of the mean equation’s parameters, and the “Zero-Inflation Parameter Estimates” table displays the estimates of the zero-inflation probability equation’s parameters. The selected variables for the mean equation are MStatus, Red_Car, Mvr_Pts, and Travel_Time. The selected variables for the zero-inflation probability equation are Car_Type, Car_Use, Revoked, Urbanicity, Home_Value, Income, Mvr_Pts, and Travel_Time.

Output 4: Parameter Estimates

Parameter Estimates
   Parameter      DF  Estimate    Standard Error  Chi-Square  Pr > ChiSq
   Intercept      1   0.393835    0.087850        20.0977     <.0001
   MStatus No     1   0.111187    0.052546        4.4775      0.0343
   MStatus Yes    0   0           .               .           .
   Red_Car No     1   -0.142290   0.056806        6.2743      0.0123
   Red_Car Yes    0   0           .               .           .
   Mvr_Pts        1   0.021451    0.010358        4.2888      0.0384
   Travel_Time    1   0.003631    0.001707        4.5255      0.0334

Zero-Inflation Parameter Estimates
   Parameter                  DF  Estimate     Standard Error  Chi-Square  Pr > ChiSq
   Intercept_Zero             1   0.281308     0.199009        1.9981      0.1575
   Car_Type_Zero Minivan      1   0.095028     0.148198        0.4112      0.5214
   Car_Type_Zero Panel Truck  1   -0.034481    0.200538        0.0296      0.8635
   Car_Type_Zero Pickup       1   -0.072832    0.157468        0.2139      0.6437
   Car_Type_Zero SUV          1   -0.297481    0.151435        3.8589      0.0495
   Car_Type_Zero Sports Car   1   -0.297123    0.179088        2.7526      0.0971
   Car_Type_Zero Van          0   0            .               .           .
   Car_Use_Zero Commercial    1   -0.367408    0.091243        16.2144     <.0001
   Car_Use_Zero Private       0   0            .               .           .
   Revoked_Zero No            1   0.380971     0.111722        11.6281     0.0006
   Revoked_Zero Yes           0   0            .               .           .
   Urbanicity_Zero Rural      1   1.390559     0.108296        164.8738    <.0001
   Urbanicity_Zero Urban      0   0            .               .           .
   Home_Value_Zero            1   0.000000814  0.000000362     5.0499      0.0246
   Income_Zero                1   0.000002556  0.000001024     6.2366      0.0125
   Mvr_Pts_Zero               1   -0.363820    0.023392        241.9059    <.0001
   Travel_Time_Zero           1   -0.005264    0.002499        4.4386      0.0351

Convergence criterion (GCONV=1E-8) satisfied.

Output 5 displays the fit statistics for the final model.

Output 5: Fit Statistics

Fit Statistics
   -2 Log Likelihood          4860.84011
   AIC (smaller is better)    4896.84011
   AICC (smaller is better)   4897.11580
   BIC (smaller is better)    5001.67293
   Pearson Chi-Square         2398.50471
   Pearson Chi-Square/DF      0.96636

Most of the criteria are useful only for comparing the model fit among given alternative models. However, the Pearson statistic can be used to determine whether there is evidence of overdispersion or underdispersion. If the model is correctly specified and there is no overdispersion or underdispersion, the Pearson chi-square statistic divided by the degrees of freedom has an expected value of 1. The obvious question is whether the observed value of 0.96636 is significantly less than 1 and thus indicates underdispersion.
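As a consistency check, the values in Output 5 can be reproduced by hand from the reported -2 log likelihood. The sketch below assumes the usual definitions of AIC, AICC, and BIC, with p = 18 estimated parameters (from the “Dimensions” table) and n = 2500 observations used. The symbols p, n, L, and X^2 (the Pearson chi-square statistic) are notation introduced here for illustration and do not appear in the procedure output.

```latex
\begin{align*}
\mathrm{AIC}  &= -2\log L + 2p = 4860.84011 + 2(18) = 4896.84011 \\
\mathrm{AICC} &= \mathrm{AIC} + \frac{2p(p+1)}{n-p-1}
               = 4896.84011 + \frac{684}{2481} \approx 4897.1158 \\
\mathrm{BIC}  &= -2\log L + p\ln n
               = 4860.84011 + 18\ln(2500) \approx 5001.6729 \\
\frac{X^2}{n-p} &= \frac{2398.50471}{2500-18}
                 = \frac{2398.50471}{2482} \approx 0.96636
\end{align*}
```

The last line makes the degrees of freedom explicit: 2482 = 2500 - 18, the number of observations used minus the number of estimated parameters.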
The Pearson statistic for a zero-inflated Poisson model has a limiting chi-square distribution under certain regularity conditions, with degrees of freedom equal to the number of observations minus the number of estimated parameters. A formal one-sided test for underdispersion is performed by computing Pr( χ²(df) < X² ), where X² is the observed Pearson statistic; this is the probability of observing a smaller value of the statistic. The following SAS statements compute the p-value for the test:

/* Save the number of observations used as macro variable n */
data _null_;
   set nobs(where=(label="Number of Observations Used"));
   call symput('n',NobsUsed);
run;

/* Save the number of estimated parameters as macro variable parms */
data _null_;
   set dimensions(where=(description="Number of Parameters"));
   call symput('parms',value);
run;

/* Lower-tail chi-square probability of the Pearson statistic */
data fit;
   set fit(where=(label="Pearson Chi-Square"));
   format pvalue pvalue6.4;
   df=%eval(&n) - %eval(&parms);
   pvalue=probchi(value,df);
   label pvalue="Pr < ChiSq";
run;

proc print data=fit noobs label;
   var label value df pvalue;
run;

Output 6 displays the test result. The p-value for the one-sided test for underdispersion is 0.1172 (about 0.12), so you fail to reject the null hypothesis of no underdispersion at the most commonly used significance levels.

Output 6: Test for Underdispersion

   Description         Value       df    Pr < ChiSq
   Pearson Chi-Square  2398.50471  2482  0.1172

A common method of assessing the goodness of fit of a model is to compare the observed relative frequencies of the various counts to the maximum likelihood estimates of their respective probabilities. The following SAS statements demonstrate one method of computing the estimated probabilities and generating two comparative plots. The first step is to observe the value of the largest count and save it as a macro variable:

proc means data=zip(where=(~missing(pred))) noprint;
   var claims;
   output out=maxcount max=max;
run;

data _null_;
   set maxcount;
   call symput('max',max);
run;

/* Remove the leading blanks that CALL SYMPUT leaves in the value */
%let max=%sysfunc(strip(&max));

Next, you use the model predictions and the estimated zero-inflation probabilities that are stored in the output data set Zip to compute, for each observation, the conditional probabilities Pr( y = i ) given that observation's covariate values.
These are the variables ep0–ep&max in the following DATA step. You also generate an indicator variable for each count i, i = 0, 1,..., &max, where each observation is assigned a value of 1 if count i is observed, and 0 otherwise. These are the variables c0–c&max.

data zip(drop= i);
   set zip(where=(~missing(pred)));
   lambda=pred/(1-pzero);
   array ep{0:&max} ep0-ep&max;
   array c{0:&max} c0-c&max;
   do i = 0 to &max;
      if i=0 then ep{i}= pzero + (1-pzero)*pdf('POISSON',i,lambda);
      else ep{i}= (1-pzero)*pdf('POISSON',i,lambda);
      c{i}=ifn(claims=i,1,0);
   end;
run;

Now you can use PROC MEANS to compute the means of the variables ep0,..., ep&max and c0,..., c&max. The means of ep0,..., ep&max are the maximum likelihood estimates of Pr( y = i ). The means of c0,..., c&max are the observed relative frequencies.

proc means data=zip noprint;
   var ep0 - ep&max c0-c&max;
   output out=ep(drop=_TYPE_ _FREQ_) mean(ep0-ep&max)=ep0-ep&max;
   output out=p(drop=_TYPE_ _FREQ_) mean(c0-c&max)=p0-p&max;
run;

The output data sets from PROC MEANS are in what is commonly referred to as wide form. That is, each data set contains a single observation that has one variable for each count. In order to generate comparative plots, the data need to be in what is referred to as long form, with one observation for each count. Ultimately, you need four variables: one whose observations are an index of the values of the counts, a second whose observations are the observed relative frequencies, a third whose observations contain the ZIP model estimates of the probabilities Pr( y = i ), and a fourth whose observations contain the difference between the observed relative frequencies and the estimated probabilities. The following SAS statements transpose the two output data sets so that they are in long form. Then, the two data sets are merged, and the variables that index the count values and record the difference between the observed relative frequencies and the estimated probabilities are generated.
proc transpose data=ep out=ep(rename=(col1=zip) drop=_NAME_);
run;

proc transpose data=p out=p(rename=(col1=p) drop=_NAME_);
run;

data zipprob;
   merge ep p;
   zipdiff=p-zip;
   claims=_N_ -1;
   label zip='ZIP Probabilities'
         p='Relative Frequencies'
         zipdiff='Observed minus Predicted';
run;

Now you can use the SGPLOT procedure to produce the comparative plots:

proc sgplot data=zipprob;
   scatter x=claims y=p /
      markerattrs=(symbol=CircleFilled size=5px color=blue);
   scatter x=claims y=zip /
      markerattrs=(symbol=TriangleFilled size=5px color=red);
   xaxis type=discrete;
run;

Figure 2 shows that the ZIP model that has the selected effects captures the shape of the distribution of the relative frequencies and accounts for the excess zeros quite well.

Figure 2: Comparison of ZIP Probabilities to Observed Relative Frequencies
(Panels: ZIP Probabilities versus Relative Frequencies; Observed Relative Frequencies Minus ZIP Probabilities)
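Figure 2 includes a second panel, Observed Relative Frequencies Minus ZIP Probabilities, but the SGPLOT step shown above draws only the first panel. The following is a minimal sketch of one way to plot the differences from the Zipprob data set. It assumes that the variables claims and zipdiff exist as created in the preceding DATA step. The choice of a SERIES plot, the reference line at zero, and the styling are assumptions made for illustration and are not necessarily what was used for the published figure.

```sas
/* Sketch only: plot observed-minus-predicted differences (zipdiff) */
/* by count value. Plot type and styling are illustrative choices.  */
proc sgplot data=zipprob;
   series x=claims y=zipdiff / markers
          markerattrs=(symbol=CircleFilled size=5px color=blue)
          lineattrs=(color=blue);
   /* A difference of zero means the model matches the data exactly */
   refline 0 / axis=y lineattrs=(pattern=dash);
   xaxis type=discrete;
run;
```

Points close to the reference line indicate counts whose estimated probability agrees with the observed relative frequency.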

