I have a dataset (temp: cancer file) and a yearly mbsf_abcd files (2006-2019 demographic file).
I will like to merge temp with yearly files by patient_id in the year of their diagnosis to identify demographic and baseline characteristics of individuals who had evidence of cancer.
The yearly files are large so I only want to select variables of interest and pull in demographic variables as of the year of cancer diagnoses. If data are missing in the year of cancer diagnosis, I will pull data from year prior to cancer diagnosis.
I tried to use macro but am not proficient enough with it and I don't know how to do section in bold.
%macro around(start_year=,end_year=); %do year=&start_year %to &end_year; data want; keep PATIENT_ID BENE_BIRTH_DT BENE_DEATH_DT BENE_ENROLLMT_REF_YR AGE_AT_END_REF_YR BENE_RACE_CD COUNTY_CD RTI_RACE_CD SAMPLE_GROUP SEX_IDENT_CD STATE_CODE ZIP_CD dsn VALID_DEATH_DT_SW; merge temp a.mbsf_abcd_&year.; by patient_id; if year_of_diagnosis=. then call missing (year_of_diagnosis); else year_of_diagnosis=&year.; %end; %mend; %around(start_year=2006,end_year=2019)
Appreciate your help.
Something like this perhaps:
data easy; set a.mbsf_abcd_2006 - a.mbsf_abcd_2019 ; where not missing(year_of_diagnosis); run; proc sql; create data want as select a.*, b.* from temp as a left join easy as b on a.patient_id = b.patient_id and a.year_of_diagnosis =b.year_of_diagnosis ; quit;
You will get a note about patient_id and year_of_diagnosis already on the data set. But without knowing the names of the variable in the combined data of all the records that you want I can't provide any of these.
The a.* refers to all the variables in set aliased as A. So if there are none of the other variables from the details data set then there weren't any matches found. This will return multiple records if the patient had multiple records with the same year of diagnosis.
@ballardw Thank you. The temp file is a unique ID file(no dup record) The mbsf_abcd_YYYY files has the demographics for each patient but no year of diagnosis. This is why I want to map each patient demographic with their year of diagnosis.
Your sql code is great but I cannot do this below because temp is the only file that has year of diagnosis.
and a.year_of_diagnosis =b.year_of_diagnosis
Concatenate all yearly datasets into one (use a data step view, and derive the year with INDSNAME=), merge with your unique dataset, and select the demographics data in a DO UNTIL loop.
For code, please supply usable example data in data steps with DATALINES.
Not sure what you are "merging", but if you just want to take the YEAR from the TEMP then just list it first and remember it into another variable. THen just test if the year matches the suffix on the current dataset's name.
data want; set temp(in=in1) x_2006-x_2019 indsname=dsname ; by id;
retain year; if first.id then call missing(year); if in1 then year=year_of_diagnosis; else if year = input(scan(dsname,-1,'_'),32.) ; run;
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.