BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
stephcolter
Fluorite | Level 6

I'm currently working with SAS University online . I have 6 datasets from NHANES in my SAS University folder:  /folders/myfolders/SASDatasets/  the datasets are named:

cdq_j.sas7bdat

dbq_j.sas7bdat

demo_j.sas7bdat

dxx_j.sas7bdat

paq_j.sas7bdat

whq_j.sas7bdat

all of the datasets have the common variable SEQN (the respondents identifying #) and I would like to merge them all by this variable. 

 

Please assist 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You should be able to just use MERGE with the just the variable SEQN on the BY statement. None of those tables look like they will have repeated measures per SEQN.  https://wwwn.cdc.gov/nchs/nhanes/tutorials/Module1.aspx

 

Use a LIBNAME statement to point to the /folders/myfolders/SASDatasets/ directory. Then a data step to combine them.  

 

libname nhanes '/folders/myfolders/SASDatasets/';
data want;
  merge
    nhanes.cdq_j 
    nhanes.dbq_j
    nhanes.demo_j
    nhanes.dxx_j
    nhanes.paq_j
    nhanes.whq_j
  ;
  by seqn;
run;

 

If SAS complains the data is not sort by SEQN then add PROC SORT steps for each dataset. Example:

proc sort data=nhanes.whq_j  ;
  by seqn;
run;

View solution in original post

7 REPLIES 7
Shmuel
Garnet | Level 18
libname sasdata '/folders/myfolders/SASDatasets';

data want;
 merge sasdata.cdq_i
           sasdata.dnq_i
          .....
  ;
  by SEQN;
run;

You should be aware in cases:

- there are the same SEQN in more then one input dataset,

   would you like to keep all of the observations of that SEQN

   or do you prefer the that from a specific dataset or any of the dataset

   due to some priority order?

 

Alternative code can be done by concatenating all input dataset and

sort the output by SEQN. That will be the same as merging and keeping all

the observations even if there are duplicates.

stephcolter
Fluorite | Level 6
I want to keep all the information from each datatset and just organize it by the seqn number
the respondent answers all of these surveys seperately but they are identified by teir SEQN number so now I'm trying to merge all of those survey responses together for each SEQN number
ballardw
Super User

Before doing so you really want to make sure that there are no other variables with the same name.

 

Reasons: If they have the same name and different values the result may not have the value you expect because you will only have one variable of that name in the result. Also if the variables of the same name happen to be different types you will get an error.

 

Variables of particular concern would be WEIGHT variables. If you are doing a one-to-many, which I believe NHANES does with family members vs household then the matched records would mean that the Household values are duplicated and the weights are incorrect as well as the results of any analysis of household variables.

 

Did you download any of the code for analysis from the CDC NHANES website. https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx

stephcolter
Fluorite | Level 6
Im well aware... ive done this before i'm just rusty on my sas... SEQN is a unique identifer for each individual person regardless of them being apart of a household... these numbers are not duplicated thus why we used SEQN to merge datasets
ballardw
Super User

@stephcolter wrote:
Im well aware... ive done this before i'm just rusty on my sas... SEQN is a unique identifer for each individual person regardless of them being apart of a household... these numbers are not duplicated thus why we used SEQN to merge datasets

The issue is not the variable SEQN. The issues are 1) all of the other variables and 2) the numbers of observations

 

Here is some code that relatively easily shows you whether a variable is duplicated between data sets and the variable type.

Proc sql;
   create table work.vars as
   select memname, name, type
   from dictionary.columns
   where libname='WORK' and memname in ("CDQ_J", "DBQ_J", "DEMO_J", "DXX_J", "PAQ_J", "WHQ_J")
   ;
run;

proc format;
   value putx
      1-High ='X'
      other = ' '
   ;
run;
proc tabulate data=work.vars;
   class memname name type;
   table name*type,
         memname*n=' '*f=putx.
   ;
run;

You didn't mention what your library where the data sets are stored would be. Put that in the LIBNAME= in the proc sql. This should be in uppercase as the dictionary tables, which stores the characteristics of all the variables available in your current SAS session, stores library and data set (Memname) as upper case. The format places an X in the body of a report table with the data sets as column headers. Where you see an X then the variable exists in the data set.

The type, N=numeric and C=character may help prevent errors when combining this data in any way.

 

If you have Labels assigned to the variables you may want to incorporate those as well to show common names with different labels (often indicating different meaning).

 

Large economy size hint: If you use data from CDC data sets from multiple years you really really want to do such a comparison. CDC routinely in some of the projects reuses variable names with very drastic differences from year to year.

 

I developed the above report matrix to deal with CDC BRFSS data where I was going to use 20 years of data. Some of the variables with the exact same name appeared in 15 or more sets with as many as 8 different meanings (think question).

Kurt_Bremser
Super User

Strongly agree with @ballardw 

Get to know your data (Maxim 3) before you proceed:

  • is the key (SEQN) unique in all those datasets? Run this and check the log
    proc sort
      data=dataset_to_be_checked (keep=SEQN)
      out=test
      nodupkey
    ;
    by SEQN;
    run;
  • if not, is the data to be summarized, or a special observation picked, or shall data from the other datasets be multiplied?
  • if more than one dataset is non-unique, do you want a cartesian join?
  • do variables (aside from the key) exist in more than one dataset? If yes, decide which information needs to be kept
Tom
Super User Tom
Super User

You should be able to just use MERGE with the just the variable SEQN on the BY statement. None of those tables look like they will have repeated measures per SEQN.  https://wwwn.cdc.gov/nchs/nhanes/tutorials/Module1.aspx

 

Use a LIBNAME statement to point to the /folders/myfolders/SASDatasets/ directory. Then a data step to combine them.  

 

libname nhanes '/folders/myfolders/SASDatasets/';
data want;
  merge
    nhanes.cdq_j 
    nhanes.dbq_j
    nhanes.demo_j
    nhanes.dxx_j
    nhanes.paq_j
    nhanes.whq_j
  ;
  by seqn;
run;

 

If SAS complains the data is not sort by SEQN then add PROC SORT steps for each dataset. Example:

proc sort data=nhanes.whq_j  ;
  by seqn;
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2354 views
  • 1 like
  • 5 in conversation