I often have to analyze all the data from all years at once, and I am wondering if it is better practice to create dummy variables for all variables across all years or to just maintain the variables used for that particular year.
I'm not sure what you mean by "dummy variables", but I would keep all the variables in the data set, with missing values in the years when the question wasn't asked. As I understand it, "dummy variable" usually means a binary (0/1) variable, and creating one doesn't make sense here, because the value is actually missing rather than 0.
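To make the "all variables, missing when not asked" layout concrete, here is a minimal sketch in plain Python. The records and the `stack_years` helper are made up for illustration; only the variable names come from the language example in your question, and `None` stands in for a missing value.

```python
# Hypothetical per-year records; only the variable names come from the question.
survey_2018 = [{"id": 1, "lang_chinese": 1}, {"id": 2, "lang_chinese": 0}]
survey_2024 = [{"id": 3, "lang_mandarin": 1, "lang_cantonese": 0}]


def stack_years(surveys):
    """Combine per-year record lists into one list of rows.

    Every row gets every variable seen in any year; a question that was
    not asked in a given year is stored as None (missing), not as 0.
    """
    # Collect every variable name seen in any year, in first-seen order.
    columns = ["year"]
    for records in surveys.values():
        for rec in records:
            for name in rec:
                if name not in columns:
                    columns.append(name)

    stacked = []
    for year, records in surveys.items():
        for rec in records:
            # rec.get(name) returns None when the variable is absent that year.
            row = {name: rec.get(name) for name in columns}
            row["year"] = year
            stacked.append(row)
    return stacked


combined = stack_years({2018: survey_2018, 2024: survey_2024})
for row in combined:
    print(row)
```

The point of keeping `None` rather than 0 is that a 2018 respondent was never asked about Mandarin, so recording "does not speak Mandarin" would be wrong.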
As an example, the survey has a language section. In year 2018, there was a variable for lang_chinese (0 or 1) whereas in 2024 there are options for lang_mandarin (0 or 1) and lang_cantonese (0 or 1) and lang_chinese was removed. So is it better to create dummy variables for all 3 across years or not?
There is no way to answer this, as we don't know what analysis you intend to do. If we did know what analysis you are going to do (hint hint hint), we could provide some possible suggestions.
A comment on terminology as well: you say "there was a variable for lang_chinese", but lang_chinese is only the name of the variable. Its meaning is "Speaks Chinese". Please don't use the variable name to stand for the meaning. Instead, create a label for the variable, so that "Speaks Chinese" appears in your reports and plots rather than lang_chinese. The variable name and the variable meaning should not be used interchangeably.
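As a rough illustration of keeping names and labels separate, here is a small plain-Python sketch. The `VARIABLE_LABELS` map and both helper functions are hypothetical, and the Mandarin and Cantonese labels are my guess at the wording, not something taken from your survey.

```python
# Hypothetical label map: variable name -> meaning shown in reports.
# Only "Speaks Chinese" comes from the discussion; the other two are assumed.
VARIABLE_LABELS = {
    "lang_chinese": "Speaks Chinese",
    "lang_mandarin": "Speaks Mandarin",
    "lang_cantonese": "Speaks Cantonese",
}


def label_for(name):
    """Return the label for a variable, falling back to the name itself."""
    return VARIABLE_LABELS.get(name, name)


def describe(name, values):
    """Summarize a 0/1 variable, ignoring missing (None) values."""
    asked = [v for v in values if v is not None]
    return f"{label_for(name)}: {sum(asked)} of {len(asked)}"


print(describe("lang_chinese", [1, 0, None]))
```

The code keeps working with the short variable name, while anything shown to a reader goes through the label.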