file merge

ChuksManuel · Posted 02-05-2023 06:32 PM

Hello programmers,

I am want to do an analysis to estimate the risk of developing stroke in those who have a heart disease in the recently released NHIS longitudinal dataset. The NHIS released a longitudinal dataset last year that makes it possible for researchers to merge the 2019 and 2020 together. So my plan is to have a sample of people with heart disease (after applying all exclusion criteria) in the 2019 dataset and check to see if they have the outcome (stroke) in the 2020 dataset. I intend to use cox regression while controlling for baseline covariates.

I have however not been able to merge the 2019 dataset and the 2020 dataset with the joining file. I will appreciate if anyone can help me merge it in a way for me to conduct my analysis. The 2019 dataset has over 300k obs, 2020 has 31k and the join dataset (people followed up from 2019 to 2020) is 10k.

My initial code is as follows but i know I'm missing something. I have attached the 2019, 2020 and join dataset in excel as well.

data merged_data;
   merge nhis.adultlong20 (in=a)
         one (in=b)
         two(in=c);
   by HHX_2019 ; /* the common variable */
run;

ChuksManuel · Posted 02-05-2023 06:33 PM

And for the files, I have selected just the first 20 observations

andreas_lds · Posted 02-06-2023 01:43 AM

Please post data in usable form (data step with datalines), many users won't open office files due to security threads caused by the file-format.

ChuksManuel · Posted 02-06-2023 02:14 PM

/*2019 Dataset has this structure*/
RECTYPE	SRVY_YR	HHX	REGION	PSTRAT	PPSU
10 Sample Adult	2019	H048109	3 South	122	2
10 Sample Adult	2019	H027044	3 South	122	2
10 Sample Adult	2019	H058855	3 South	122	2
10 Sample Adult	2019	H031993	3 South	122	2
10 Sample Adult	2019	H007122	3 South	115	2
10 Sample Adult	2019	H007736	3 South	115	2
10 Sample Adult	2019	H040698	3 South	115	1
10 Sample Adult	2019	H022161	3 South	115	1
10 Sample Adult	2019	H017054	3 South	115	1

/*2020 dataset has this structure*/
	RECTYPE	SRVY_YR	HHX	REGION	PSTRAT	PPSU
	10	2020	H066706	3 Almost everything	103	2
	10	2020	H034928	3 Almost everything	103	2
	10	2020	H018289	3 Almost everything	103	2
	10	2020	H006876	3 Almost everything	103	2
	10	2020	H028842	3 Almost everything	103	2
	10	2020	H004811	3 Almost everything	103	2
	10	2020	H068043	3 Almost everything	103	2

/*joining dataset has this structure*/
RECTYPE	SRVY_YR	HHX_2019	HHX_2020	WTSA_L
60 Sample Adult Longitudinal	Survey year	H000003	H038763	33091.61
60 Sample Adult Longitudinal	Survey year	H000008	H060250	15476.27
60 Sample Adult Longitudinal	Survey year	H000009	H018743	15994.46
60 Sample Adult Longitudinal	Survey year	H000010	H067593	13108.81
60 Sample Adult Longitudinal	Survey year	H000011	H021952	8211.29
60 Sample Adult Longitudinal	Survey year	H000012	H007892	40972.78
60 Sample Adult Longitudinal	Survey year	H000014	H041303	16721.16
60 Sample Adult Longitudinal	Survey year	H000023	H004538	1852.21
60 Sample Adult Longitudinal	Survey year	H000027	H007281	4790.91

Reeza · Posted 02-06-2023 03:37 PM

In the data shown in the Excel the variable is HHX, not HHX 2019 and there are no overlaps/matches.

1. Are you using code not shown?
2. Have you confirmed you have overlaps in the IDs?

Query to identify records in both files:

proc sql;
create table in_both_files as
select hhx
from have1
where hhx in (select hhx from have2);
quit;

ChuksManuel · Posted 02-06-2023 03:54 PM

1. Can't i use a simple merge to execute this instead of proc sql?
2. There's no overlap in ID in this small dataset (obs=7 to 10) because the dataset is large. But the joining dataset has all the unique IDs (HHX_2019) for the 2019 data and HHX_2020 for the 2020 data.

I renamed the HHX variables in the 2020 and 2019 datasets to HHX_2020 and HHX_2019 to correspond with the one in the ID in the joining datatset.

Reeza · Posted 02-06-2023 05:55 PM

Show your full code.
You can use a simple merge, but according to your comments you renamed the matching variables to differently named variables. In that situation you need to use SQL. If they have the same name you can use a data step.

file merge

Re: file merge

Re: file merge

Re: file merge

Re: file merge

Re: file merge

Re: file merge

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!