Combining multiple datasets into one but choosing values based on a co...

InspectahDex · Posted 07-07-2020 07:02 PM

I have four datasets that I'm combining into one dataset.

data FinalDataSet;
set AC4 EP4 IP4 CD4;
run;

Each dataset has a common set of variables. They all have 130 variables that are named the same and are all formatted to character variables. They each contain list of doctors from different health networks.

However, some doctors are only on one dataset, and others are found in multiple datasets. For the doctors that are found in multiple, one dataset may information that might differ or be missing from another dataset. For example, Doctor Smith might have their name, office address, and eye color in dataset AC4 but the CD4 dataset, which also has the same information, is saying the eye color is "blue" when dataset AC4 lists eye color as "brown".

Question: how do I go about concatenating these multiple datasets into one so that the data will always default to CD4 dataset when there is conflicting information or if the data is missing? Basically, I want my final dataset to say Dr. Smith has "brown" eyes regardless of what the other datasets have down for that doctor. But if the CD4 dataset doesn't have an observation for Dr. Smith, then I want to keep the observation in the other datasets.

The four datasets can be matched / linked by a variable called "NPI" which is a unique identifier for doctors.

Shmuel · Posted 07-07-2020 08:54 PM

I understand that CD4 is a masterfile. Merge data by NPI.

While merging a null variable will not replace a non-null and last non-null value will hold.

all datasets should be sorted by NPI then:

data FinalDataSet;
merge AC4 EP4 IP4 CD4;
  by NPI;
run;

ChrisNZ · Posted 07-07-2020 09:39 PM

> While merging a null variable will not replace a non-null and last non-null value will hold.

Untrue. In this MERGE, the CD4 value will always overwrite other values if the record is present in CD4.

UPDATE is the way to obtain the behaviour you describe.

Here, MERGE is required as CD$ should always overwrite the other values, if available.

High-Performance SAS Coding - Third Edition

Astounding · Posted 07-07-2020 10:18 PM

You are off to a good start, concatenating the data sets and naming CD4 last. Continue with:

proc sort data=FinalDataSet;
   by doctor;
run;

data TrulyFinalDataSet;
   update FinalDataSet (obs=0) FinalDataSet;
   by doctor;
run;

You didn't specify variable names, so I'm assuming DOCTOR is the name of the variable that identifies a doctor.

Astounding · Posted 07-07-2020 10:24 PM

Sorry, should have seen ... use NPI instead of DOCTOR as the identifier.

InspectahDex · Posted 07-08-2020 03:05 PM

Ok I think this is the right answer but there is one other thing I need to take into account. Out of the 130 common variables, I only need to update 29 variables from the CD4 dataset. How do I tell SAS to only update those 29 variables?

Astounding · Posted 07-08-2020 03:23 PM

Keep in mind that the right answer requires defining what to do what the other 101 variables.

If you start with 3 observations for the same doctor, and there is conflicting data among various sources for those other 101 variables, what should be done?

Remember, there will be only one observation per doctor in the final data set.

So spell out the rules, and the programming should be short but perhaps mildly more complex than what I posted the first time.

InspectahDex · Posted 07-08-2020 04:48 PM

Shmuel · Posted 07-08-2020 05:21 PM

@InspectahDex wrote:

Ok I think this is the right answer but there is one other thing I need to take into account. Out of the 130 common variables, I only need to update 29 variables from the CD4 dataset. How do I tell SAS to only update those 29 variables?

Am I correct - for those doctores in CD4 you want update only 29 variables but for those who are not in CD4 you need all variables?

If true then you should do it within 3 steps:

(1) Update - keeping NPI and the 29 variables of AC4 EP4 and CD4 - update CD4 keeping all variables but if NPI is in CD4.

(2) Merge or update CD4, keeping all variables, but only those NPI who are not in CD4.

(3) Merge the result of step1 and step2.

data Final1t  
     temp1(keep=NPI);
update AC4 (in=in1 keep=NPI ... <29 variables to update> )
       EP4 (in=in2 keep=NPI ... <29 variables to update> )
       IP4  (in=in3 keep=NPI ... <29 variables to update> )
       CD4 (in=in4);
  by NPI;
     if in4 then output Final1;
     else output temp1;
run;
data Final2;
Update AC4 (in=in1)
       EP4 (in=in2)
       IP4  (in=in3)
       temp1 (in=in4);
  by NPI;
      if  in4;
run;
data FinalDataSet;
merge Final1 Final2;
  by NPI;
run;

smantha · Posted 07-08-2020 04:19 PM

This is how I would approach the problem identify the 29 variables that need to be kept track of

%let list_of_vars_to_track=<some varlist separated by spaces>;
%let list_of_vars_len_to_track=<some varlist separated by spaces>;

%macro  test;
data FinalDataSet(rename=(
%do i = 1 %to 29;
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  =%scan(list_of_vars_to_track,&i.,%str( ))
            %end;
));
length %do i = 1 %to 29;
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  %scan(list_of_vars_len_to_track,&i.,%str( ))
            %end;;
set AC4 EP4 IP4 CD4;
by doctor;
retain %do i = 1 %to 29;
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  
            %end; ' ';
if first.doctor then do;
%do i = 1 %to 29;
              
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  = '';
            %end; 
end;
%do i = 1 %to 29;
              if not missing(%scan(list_of_vars_to_track,&i.,%str( ))) then 
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  =%scan(list_of_vars_to_track,&i.,%str( ));
            %end; 
if last.doctor then output;
drop %do i = 1 %to 29;
             %scan(list_of_vars_to_track,&i.,%str( ))  
            %end; ;
run;
%mend test;

Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

Re: Combining multiple datasets into one but choosing values based on a condition

SAS Innovate 2025: Call for Content

Classroom Training Available!