CHIS dataset. How to clean?

elissak1 · Posted 06-27-2022 07:06 PM

I have a dataset called California Health Interview Survey (CHIS), a state survey. I have some trouble starting the data management and cleaning process. The data provides formatted data and regular SAS data files. I wondered if there is an efficient way of filtering your data to the variable of interest (I have to refer back to the codebook many times) and pooling previous survey years? Since this is survey data, I also have to keep weights and ranks in mind. It has been a while since I worked with big data (my last attempt was BRFSS).

Thank you!

ballardw · Posted 06-27-2022 07:57 PM

I would start by running frequencies on any sort of categorical variable and check 1) are any responses out of range (here is where you compare things to your codebook) 2) are expected values missing and 3) are any supposedly unique values (respondent id, possibly phone number or such) are duplicated. Proc freq does this stuff pretty easily.

This may also provide a clue for whether skip patterns are followed. If Qxx is supposed to be asked only when question Qaa was answered 1 or 2 (to grab a simple example) then the number of 1 and 2 responses to Qaa should match the number of responses for Qxx. More complicated skip patterns are up to you to provide logic for.

If you have lots of questions with the same responses, such as BRFSS has many Yes, No, Don't Know or Refused you might create custom format for those questions that returns Valid for the expected answers and "Invalid" for anything else (up to including missing for questions that should not have missing responses). Then Proc freq output with that sort of format applied to those variables can significantly reduce first pass output. Again I would expect the organization to have done this but if that is your job ...

For any sort of continuous variables (income for example) run summary statistics like mean, max, min to see if you have some obvious outliers. Proc means, summary or univariate will do these.

I typically would not "filter" to variable of interest at the first pass. A weights are not needed for cleaning, at least none of my data ever did.

You can always use data set options to select variables (keep or drop) at any given step. You can select values of variables with either data set option where or where statement in most procedures. Or for many procedures like Proc Freq and means you can provide the variables you are interested in on the Tables statement (Freq) of Var statement(Means/Summary and Univariate)

I would hope that the above had already been done but having BRFSS data one year with a person 11 inches tall weighing 150 lbs after CDC weighted it, I understand the principal of "verify" the data.

Combining data and re-weighting isn't difficult but it is also not trivial. First you need to pick a population to re-weight to. There are different approaches but the easiest to get acceptance for is often to weight to the "middle" of the time period population. It is basically a calculate proportion of each population characteristic combination used, determine the weighted total for each year and derive the proportion of the target year population each survey period includes. This should most likely only be undertaken to combine small survey samples to combine years so that you can now say something about the responses of those populations.

The fun part is finding the questions that changed from year to year and working out what sort of recoding may be needed to get similar responses from that. (I looked at 20 years of BRFSS data related to Body Mass Index. In that time frame the BRSS survey used something like 8 different coding schemes for number of digits and where implied decimals went, so combining that was a headache). For trend type data make sure that different years have a Year variable and you can use that as strata variable along with whatever is normally used. If you use the SAS survey procs you will want two variables with the year information as the variables on the STRATA statements cannot also be on any analysis or domain type statement.

There are no quick and dirty tricks though. It is routine and tedious.

CHIS dataset. How to clean?

Re: CHIS dataset. How to clean?

Catch up on SAS Innovate 2026