Solved: Re: Merge Six Sas Datasets Together

stephcolter · Posted 09-18-2020 11:00 PM

I'm currently working with SAS University online . I have 6 datasets from NHANES in my SAS University folder: /folders/myfolders/SASDatasets/ the datasets are named:

cdq_j.sas7bdat

dbq_j.sas7bdat

demo_j.sas7bdat

dxx_j.sas7bdat

paq_j.sas7bdat

whq_j.sas7bdat

all of the datasets have the common variable SEQN (the respondents identifying #) and I would like to merge them all by this variable.

Please assist

Tom · Posted 09-19-2020 01:28 PM

You should be able to just use MERGE with the just the variable SEQN on the BY statement. None of those tables look like they will have repeated measures per SEQN. https://wwwn.cdc.gov/nchs/nhanes/tutorials/Module1.aspx

Use a LIBNAME statement to point to the /folders/myfolders/SASDatasets/ directory. Then a data step to combine them.

libname nhanes '/folders/myfolders/SASDatasets/';
data want;
  merge
    nhanes.cdq_j 
    nhanes.dbq_j
    nhanes.demo_j
    nhanes.dxx_j
    nhanes.paq_j
    nhanes.whq_j
  ;
  by seqn;
run;

If SAS complains the data is not sort by SEQN then add PROC SORT steps for each dataset. Example:

proc sort data=nhanes.whq_j  ;
  by seqn;
run;

View solution in original post

Shmuel · Posted 09-19-2020 12:46 AM

libname sasdata '/folders/myfolders/SASDatasets';

data want;
 merge sasdata.cdq_i
           sasdata.dnq_i
          .....
  ;
  by SEQN;
run;

You should be aware in cases:

- there are the same SEQN in more then one input dataset,

would you like to keep all of the observations of that SEQN

or do you prefer the that from a specific dataset or any of the dataset

due to some priority order?

Alternative code can be done by concatenating all input dataset and

sort the output by SEQN. That will be the same as merging and keeping all

the observations even if there are duplicates.

stephcolter · Posted 09-19-2020 06:05 AM

I want to keep all the information from each datatset and just organize it by the seqn number
the respondent answers all of these surveys seperately but they are identified by teir SEQN number so now I'm trying to merge all of those survey responses together for each SEQN number

ballardw · Posted 09-19-2020 02:59 AM

Before doing so you really want to make sure that there are no other variables with the same name.

Reasons: If they have the same name and different values the result may not have the value you expect because you will only have one variable of that name in the result. Also if the variables of the same name happen to be different types you will get an error.

Variables of particular concern would be WEIGHT variables. If you are doing a one-to-many, which I believe NHANES does with family members vs household then the matched records would mean that the Household values are duplicated and the weights are incorrect as well as the results of any analysis of household variables.

Did you download any of the code for analysis from the CDC NHANES website. https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx

stephcolter · Posted 09-19-2020 06:09 AM

Im well aware... ive done this before i'm just rusty on my sas... SEQN is a unique identifer for each individual person regardless of them being apart of a household... these numbers are not duplicated thus why we used SEQN to merge datasets

ballardw · Posted 09-19-2020 01:43 PM

@stephcolter wrote:
Im well aware... ive done this before i'm just rusty on my sas... SEQN is a unique identifer for each individual person regardless of them being apart of a household... these numbers are not duplicated thus why we used SEQN to merge datasets

The issue is not the variable SEQN. The issues are 1) all of the other variables and 2) the numbers of observations

Here is some code that relatively easily shows you whether a variable is duplicated between data sets and the variable type.

Proc sql;
   create table work.vars as
   select memname, name, type
   from dictionary.columns
   where libname='WORK' and memname in ("CDQ_J", "DBQ_J", "DEMO_J", "DXX_J", "PAQ_J", "WHQ_J")
   ;
run;

proc format;
   value putx
      1-High ='X'
      other = ' '
   ;
run;
proc tabulate data=work.vars;
   class memname name type;
   table name*type,
         memname*n=' '*f=putx.
   ;
run;

You didn't mention what your library where the data sets are stored would be. Put that in the LIBNAME= in the proc sql. This should be in uppercase as the dictionary tables, which stores the characteristics of all the variables available in your current SAS session, stores library and data set (Memname) as upper case. The format places an X in the body of a report table with the data sets as column headers. Where you see an X then the variable exists in the data set.

The type, N=numeric and C=character may help prevent errors when combining this data in any way.

If you have Labels assigned to the variables you may want to incorporate those as well to show common names with different labels (often indicating different meaning).

Large economy size hint: If you use data from CDC data sets from multiple years you really really want to do such a comparison. CDC routinely in some of the projects reuses variable names with very drastic differences from year to year.

I developed the above report matrix to deal with CDC BRFSS data where I was going to use 20 years of data. Some of the variables with the exact same name appeared in 15 or more sets with as many as 8 different meanings (think question).

Kurt_Bremser · Posted 09-19-2020 03:50 AM

Strongly agree with @ballardw

Get to know your data (Maxim 3) before you proceed:

is the key (SEQN) unique in all those datasets? Run this and check the log

proc sort
  data=dataset_to_be_checked (keep=SEQN)
  out=test
  nodupkey
;
by SEQN;
run;

if not, is the data to be summarized, or a special observation picked, or shall data from the other datasets be multiplied?
if more than one dataset is non-unique, do you want a cartesian join?
do variables (aside from the key) exist in more than one dataset? If yes, decide which information needs to be kept

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 09-19-2020 01:28 PM

You should be able to just use MERGE with the just the variable SEQN on the BY statement. None of those tables look like they will have repeated measures per SEQN. https://wwwn.cdc.gov/nchs/nhanes/tutorials/Module1.aspx

Use a LIBNAME statement to point to the /folders/myfolders/SASDatasets/ directory. Then a data step to combine them.

libname nhanes '/folders/myfolders/SASDatasets/';
data want;
  merge
    nhanes.cdq_j 
    nhanes.dbq_j
    nhanes.demo_j
    nhanes.dxx_j
    nhanes.paq_j
    nhanes.whq_j
  ;
  by seqn;
run;

If SAS complains the data is not sort by SEQN then add PROC SORT steps for each dataset. Example:

proc sort data=nhanes.whq_j  ;
  by seqn;
run;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away