I have four datasets that I'm combining into one dataset.
data FinalDataSet;
set AC4 EP4 IP4 CD4;
run;
Each dataset has a common set of variables. They all have 130 variables that are named the same and are all formatted to character variables. They each contain list of doctors from different health networks.
However, some doctors are only on one dataset, and others are found in multiple datasets. For the doctors that are found in multiple, one dataset may information that might differ or be missing from another dataset. For example, Doctor Smith might have their name, office address, and eye color in dataset AC4 but the CD4 dataset, which also has the same information, is saying the eye color is "blue" when dataset AC4 lists eye color as "brown".
Question: how do I go about concatenating these multiple datasets into one so that the data will always default to CD4 dataset when there is conflicting information or if the data is missing? Basically, I want my final dataset to say Dr. Smith has "brown" eyes regardless of what the other datasets have down for that doctor. But if the CD4 dataset doesn't have an observation for Dr. Smith, then I want to keep the observation in the other datasets.
The four datasets can be matched / linked by a variable called "NPI" which is a unique identifier for doctors.
I understand that CD4 is a masterfile. Merge data by NPI.
While merging a null variable will not replace a non-null and last non-null value will hold.
all datasets should be sorted by NPI then:
data FinalDataSet;
merge AC4 EP4 IP4 CD4;
by NPI;
run;
> While merging a null variable will not replace a non-null and last non-null value will hold.
Untrue. In this MERGE, the CD4 value will always overwrite other values if the record is present in CD4.
UPDATE is the way to obtain the behaviour you describe.
Here, MERGE is required as CD$ should always overwrite the other values, if available.
You are off to a good start, concatenating the data sets and naming CD4 last. Continue with:
proc sort data=FinalDataSet;
by doctor;
run;
data TrulyFinalDataSet;
update FinalDataSet (obs=0) FinalDataSet;
by doctor;
run;
You didn't specify variable names, so I'm assuming DOCTOR is the name of the variable that identifies a doctor.
Ok I think this is the right answer but there is one other thing I need to take into account. Out of the 130 common variables, I only need to update 29 variables from the CD4 dataset. How do I tell SAS to only update those 29 variables?
Keep in mind that the right answer requires defining what to do what the other 101 variables.
If you start with 3 observations for the same doctor, and there is conflicting data among various sources for those other 101 variables, what should be done?
Remember, there will be only one observation per doctor in the final data set.
So spell out the rules, and the programming should be short but perhaps mildly more complex than what I posted the first time.
@InspectahDex wrote:
Ok I think this is the right answer but there is one other thing I need to take into account. Out of the 130 common variables, I only need to update 29 variables from the CD4 dataset. How do I tell SAS to only update those 29 variables?
Am I correct - for those doctores in CD4 you want update only 29 variables but for those who are not in CD4 you need all variables?
If true then you should do it within 3 steps:
(1) Update - keeping NPI and the 29 variables of AC4 EP4 and CD4 - update CD4 keeping all variables but if NPI is in CD4.
(2) Merge or update CD4, keeping all variables, but only those NPI who are not in CD4.
(3) Merge the result of step1 and step2.
data Final1t
temp1(keep=NPI);
update AC4 (in=in1 keep=NPI ... <29 variables to update> )
EP4 (in=in2 keep=NPI ... <29 variables to update> )
IP4 (in=in3 keep=NPI ... <29 variables to update> )
CD4 (in=in4);
by NPI;
if in4 then output Final1;
else output temp1;
run;
data Final2;
Update AC4 (in=in1)
EP4 (in=in2)
IP4 (in=in3)
temp1 (in=in4);
by NPI;
if in4;
run;
data FinalDataSet;
merge Final1 Final2;
by NPI;
run;
This is how I would approach the problem identify the 29 variables that need to be kept track of
%let list_of_vars_to_track=<some varlist separated by spaces>;
%let list_of_vars_len_to_track=<some varlist separated by spaces>;
%macro test;
data FinalDataSet(rename=(
%do i = 1 %to 29;
ret_%scan(list_of_vars_to_track,&i.,%str( )) =%scan(list_of_vars_to_track,&i.,%str( ))
%end;
));
length %do i = 1 %to 29;
ret_%scan(list_of_vars_to_track,&i.,%str( )) %scan(list_of_vars_len_to_track,&i.,%str( ))
%end;;
set AC4 EP4 IP4 CD4;
by doctor;
retain %do i = 1 %to 29;
ret_%scan(list_of_vars_to_track,&i.,%str( ))
%end; ' ';
if first.doctor then do;
%do i = 1 %to 29;
ret_%scan(list_of_vars_to_track,&i.,%str( )) = '';
%end;
end;
%do i = 1 %to 29;
if not missing(%scan(list_of_vars_to_track,&i.,%str( ))) then
ret_%scan(list_of_vars_to_track,&i.,%str( )) =%scan(list_of_vars_to_track,&i.,%str( ));
%end;
if last.doctor then output;
drop %do i = 1 %to 29;
%scan(list_of_vars_to_track,&i.,%str( ))
%end; ;
run;
%mend test;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.