ciaohermosa
Fluorite | Level 6

This is more of a question about general data best practices than a SAS-specific one.

 

I am managing a large survey dataset with data across multiple years. In more recent years of the survey, certain variables have been added and others have been removed. I often have to analyze all the data from all years at once, and I am wondering if it is better practice to create dummy variables for all variables across all years or to just maintain the variables used for that particular year. I have a separate Excel file for a data dictionary that I maintain, so I can add notes for any differences there.

 

As an example, the survey has a language section. In year 2018, there was a variable for lang_chinese (0 or 1) whereas in 2024 there are options for lang_mandarin (0 or 1) and lang_cantonese (0 or 1) and lang_chinese was removed. So is it better to create dummy variables for all 3 across years or not?

 

I know there may not be a "right" answer for this; I'm just looking for others' opinions. Also, yes, I realize this speaks to an issue in overall survey development across years, but that is another topic!

 

Thanks!

PaigeMiller
Diamond | Level 26

I often have to analyze all the data from all years at once, and I am wondering if it is better practice to create dummy variables for all variables across all years or to just maintain the variables used for that particular year.

Not sure what you mean by "dummy variables", but I would have all the variables in the data set, and missing values when the question isn't asked. The phrase "dummy variables", as I am understanding its meaning, is usually used to indicate a binary (or 0/1) variable, and that doesn't make any sense to do here when the value is actually missing.

 

As an example, the survey has a language section. In year 2018, there was a variable for lang_chinese (0 or 1) whereas in 2024 there are options for lang_mandarin (0 or 1) and lang_cantonese (0 or 1) and lang_chinese was removed. So is it better to create dummy variables for all 3 across years or not?

There is no way to answer this, as we don't know what analysis you intend to do. If we did know what analysis you are going to do (hint hint hint), we could provide some possible suggestions.

 

I would also comment on your use of terminology ... you say "there was a variable for lang_chinese," but what the variable represents is really "Speaks Chinese". Please do not use lang_chinese (the name of the variable) to indicate the meaning of the variable, which is "Speaks Chinese". You want to create a label for this variable, so that in your reports and plots the words "Speaks Chinese" appear rather than the variable name lang_chinese. The variable name and the variable meaning should not be used interchangeably.
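
For example, a permanent label can be attached without rewriting the data, e.g., with PROC DATASETS (the data set name below is just a placeholder, not from your project):

proc datasets library=work nolist;
    modify survey2018;
    label lang_chinese = "Speaks Chinese";   /* reports now show the meaning, not the variable name */
quit;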

 

--
Paige Miller
quickbluefish
Barite | Level 11

This may be overkill for your situation, but when I used to work with US Medicare data, which has separate files for each year within a given domain (e.g., inpatient, outpatient, etc.), I built a macro that generated SAS views on the fly given the start and end year of the analysis in question. What it generally did was align the variables according to the least granular data available for that time range. So, in your example, if lang_cantonese and lang_mandarin replaced lang_chinese in 2021, and your current analysis uses data from 2021 to 2024, then the view would keep the more granular variables only. However, if that same analysis went back to 2020, then the view would translate the granular variables into coarser ones:

** NOTE: this is a *very* rough, untested sketch and the assumption is 
that this data step is inside a macro ;

%macro mkview(startyear=, endyear=);

proc format;
value fvers
    2016-2020="J"
    2021-2024="K"
    ;
run;

data _null_;
call symputx("version", put(&startyear, fvers.));
run;

data vw.want view=vw.want;
set 
    %do yyyy=&startyear %to &endyear;
        have&yyyy (in=year&yyyy)
    %end;
    ;
array yr {&startyear:&endyear} year&startyear-year&endyear;
length year 3;
do yyyy=lbound(yr) to hbound(yr);
    if yr[yyyy] then year=yyyy;
end;
%if &version=J %then %do;
    * do version J stuff, e.g., ;
    lang_chinese=max(lang_chinese, lang_mandarin, lang_cantonese);
    * other version J stuff... ;
%end;
%else %if &version=K %then %do;
    * version K stuff ;
%end;
run;

%mend;

Again, that's just a sketch - obviously some situations will be more complicated, but that's the general premise. These views would then be saved in a permanent directory (if needed) that's specific to a given analysis and used as if they were regular datasets.
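
To give a sense of how the sketch might be called (the VW library, the path, and the have&yyyy data set names are assumptions that just follow the sketch, not a real project):

libname vw "/path/to/analysis/views";   /* illustrative location for the permanent views */

/* 2020 falls in the "J" range of the FVERS format, so this view would
   collapse lang_mandarin / lang_cantonese into lang_chinese across all years. */
%mkview(startyear=2020, endyear=2024);

/* the view is then used like an ordinary data set */
proc freq data=vw.want;
    tables year*lang_chinese;
run;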

Yough1967
Fluorite | Level 6

If it were me, I’d probably lean toward keeping the variables consistent across all years, even if it means creating some dummy variables for the older datasets. It makes analysis way smoother later on, especially if you’re comparing trends over time. You can always document the differences in your data dictionary, but having a unified set of variables saves headaches when running models or aggregating results.

ballardw
Super User

Make sure every data set has a variable, with the same name in all of them, containing the source data year.

Make sure that all of the same-named variables existing in the data sets have the same properties: type, length, label and range of values. I always found changes in value ranges much more of a problem than variables that are intermittently missing across versions of the data set, such as you describe.

 

Append the data sets with a DATA step; then all of the variables will be in the resulting data set, with missing values for the years a variable isn't in the source data. The DATA step appending the data can also be used to create new variables to hold the analysis values, such as when the ranges of values change (see the sketch below).
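
A minimal sketch of that approach (data set and variable names are placeholders borrowed from the language example in the original post):

/* Combine the yearly files with a DATA step. Variables absent from a
   given year's file (e.g., lang_mandarin in 2018) come through as missing. */
data all_years;
    set survey2018 (in=in2018)
        survey2024 (in=in2024);
    if in2018 then year = 2018;        /* source-year variable, as suggested above */
    else if in2024 then year = 2024;
run;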

 

I would be very tempted to assign special missing values to variables that were not in a given year so the data set documents the reason they are missing instead of having to go to an external dictionary.
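
Continuing the illustrative sketch above, with .N chosen arbitrarily to mean "question not asked that year":

/* Special missing values document the reason for the missingness
   inside the data set itself.                                     */
data all_years;
    set all_years;
    if year = 2018 then do;
        lang_mandarin  = .N;
        lang_cantonese = .N;
    end;
    else if year = 2024 then lang_chinese = .N;
run;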

 

I suspect any dummy variable approach, assuming you mean to add an indicator variable for each of the actual data points, is likely to add complexity to any analysis, increase the data set size significantly and likely not be very helpful in the long run.

 

 

 

Depending on the analysis, you may want to use options that treat missing values as an analysis level.
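
For example, PROC FREQ's MISSING option treats missing values (including special missing values) as a level in the table, again using the illustrative all_years data set from above:

proc freq data=all_years;
    tables year*lang_chinese / missing;   /* missing values become a table level */
run;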
