I'm currently working with SAS University online . I have 6 datasets from NHANES in my SAS University folder: /folders/myfolders/SASDatasets/ the datasets are named:
cdq_j.sas7bdat
dbq_j.sas7bdat
demo_j.sas7bdat
dxx_j.sas7bdat
paq_j.sas7bdat
whq_j.sas7bdat
all of the datasets have the common variable SEQN (the respondents identifying #) and I would like to merge them all by this variable.
Please assist
You should be able to just use MERGE with the just the variable SEQN on the BY statement. None of those tables look like they will have repeated measures per SEQN. https://wwwn.cdc.gov/nchs/nhanes/tutorials/Module1.aspx
Use a LIBNAME statement to point to the /folders/myfolders/SASDatasets/ directory. Then a data step to combine them.
libname nhanes '/folders/myfolders/SASDatasets/';
data want;
merge
nhanes.cdq_j
nhanes.dbq_j
nhanes.demo_j
nhanes.dxx_j
nhanes.paq_j
nhanes.whq_j
;
by seqn;
run;
If SAS complains the data is not sort by SEQN then add PROC SORT steps for each dataset. Example:
proc sort data=nhanes.whq_j ;
by seqn;
run;
libname sasdata '/folders/myfolders/SASDatasets';
data want;
merge sasdata.cdq_i
sasdata.dnq_i
.....
;
by SEQN;
run;
You should be aware in cases:
- there are the same SEQN in more then one input dataset,
would you like to keep all of the observations of that SEQN
or do you prefer the that from a specific dataset or any of the dataset
due to some priority order?
Alternative code can be done by concatenating all input dataset and
sort the output by SEQN. That will be the same as merging and keeping all
the observations even if there are duplicates.
Before doing so you really want to make sure that there are no other variables with the same name.
Reasons: If they have the same name and different values the result may not have the value you expect because you will only have one variable of that name in the result. Also if the variables of the same name happen to be different types you will get an error.
Variables of particular concern would be WEIGHT variables. If you are doing a one-to-many, which I believe NHANES does with family members vs household then the matched records would mean that the Household values are duplicated and the weights are incorrect as well as the results of any analysis of household variables.
Did you download any of the code for analysis from the CDC NHANES website. https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx
@stephcolter wrote:
Im well aware... ive done this before i'm just rusty on my sas... SEQN is a unique identifer for each individual person regardless of them being apart of a household... these numbers are not duplicated thus why we used SEQN to merge datasets
The issue is not the variable SEQN. The issues are 1) all of the other variables and 2) the numbers of observations
Here is some code that relatively easily shows you whether a variable is duplicated between data sets and the variable type.
Proc sql; create table work.vars as select memname, name, type from dictionary.columns where libname='WORK' and memname in ("CDQ_J", "DBQ_J", "DEMO_J", "DXX_J", "PAQ_J", "WHQ_J") ; run; proc format; value putx 1-High ='X' other = ' ' ; run; proc tabulate data=work.vars; class memname name type; table name*type, memname*n=' '*f=putx. ; run;
You didn't mention what your library where the data sets are stored would be. Put that in the LIBNAME= in the proc sql. This should be in uppercase as the dictionary tables, which stores the characteristics of all the variables available in your current SAS session, stores library and data set (Memname) as upper case. The format places an X in the body of a report table with the data sets as column headers. Where you see an X then the variable exists in the data set.
The type, N=numeric and C=character may help prevent errors when combining this data in any way.
If you have Labels assigned to the variables you may want to incorporate those as well to show common names with different labels (often indicating different meaning).
Large economy size hint: If you use data from CDC data sets from multiple years you really really want to do such a comparison. CDC routinely in some of the projects reuses variable names with very drastic differences from year to year.
I developed the above report matrix to deal with CDC BRFSS data where I was going to use 20 years of data. Some of the variables with the exact same name appeared in 15 or more sets with as many as 8 different meanings (think question).
Strongly agree with @ballardw
Get to know your data (Maxim 3) before you proceed:
proc sort
data=dataset_to_be_checked (keep=SEQN)
out=test
nodupkey
;
by SEQN;
run;
You should be able to just use MERGE with the just the variable SEQN on the BY statement. None of those tables look like they will have repeated measures per SEQN. https://wwwn.cdc.gov/nchs/nhanes/tutorials/Module1.aspx
Use a LIBNAME statement to point to the /folders/myfolders/SASDatasets/ directory. Then a data step to combine them.
libname nhanes '/folders/myfolders/SASDatasets/';
data want;
merge
nhanes.cdq_j
nhanes.dbq_j
nhanes.demo_j
nhanes.dxx_j
nhanes.paq_j
nhanes.whq_j
;
by seqn;
run;
If SAS complains the data is not sort by SEQN then add PROC SORT steps for each dataset. Example:
proc sort data=nhanes.whq_j ;
by seqn;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.