BookmarkSubscribeRSS Feed
InspectahDex
Obsidian | Level 7

I have four datasets that I'm combining into one dataset. 

 

data FinalDataSet;
set AC4 EP4 IP4 CD4;
run;

 

Each dataset has a common set of variables. They all have 130 variables that are named the same and are all formatted to character variables.  They each contain list of doctors from different health networks.

 

However, some doctors are only on one dataset, and others are found in multiple datasets. For the doctors that are found in multiple, one dataset may information that might differ or be missing from another dataset. For example, Doctor Smith might have their name, office address, and eye color in dataset AC4 but the CD4 dataset, which also has the same information, is saying the eye color is "blue" when dataset AC4 lists eye color as "brown". 

 

Question: how do I go about concatenating these multiple datasets into one so that the data will always default to CD4 dataset when there is conflicting information or if the data is missing? Basically, I want my final dataset to say Dr. Smith has "brown" eyes regardless of what the other datasets have down for that doctor. But if the CD4 dataset doesn't have an observation for Dr. Smith, then I want to keep the observation in the other datasets. 

 

The four datasets can be matched / linked by a variable called "NPI" which is a unique identifier for doctors.

9 REPLIES 9
Shmuel
Garnet | Level 18

I understand that CD4 is a masterfile. Merge data by NPI. 

While merging a null variable will not replace a non-null and last non-null value will hold.

all datasets should be sorted by NPI then:

data FinalDataSet;
merge AC4 EP4 IP4 CD4;
  by NPI;
run;
ChrisNZ
Tourmaline | Level 20

> While merging a null variable will not replace a non-null and last non-null value will hold.

Untrue. In this MERGE, the CD4 value will always overwrite other values if the record is present in CD4.

UPDATE is the way to obtain the behaviour you describe.

 

Here, MERGE is required as CD$ should always overwrite the other values, if available.

 

 

Astounding
PROC Star

You are off to a good start, concatenating the data sets and naming CD4 last.  Continue with:

 

proc sort data=FinalDataSet;
   by doctor;
run;

data TrulyFinalDataSet;
   update FinalDataSet (obs=0) FinalDataSet;
   by doctor;
run;

You didn't specify variable names, so I'm assuming DOCTOR is the name of the variable that identifies a doctor.

Astounding
PROC Star
Sorry, should have seen ... use NPI instead of DOCTOR as the identifier.
InspectahDex
Obsidian | Level 7

Ok I think this is the right answer but there is one other thing I need to take into account. Out of the 130 common variables, I only need to update 29 variables from the CD4 dataset. How do I tell SAS to only update those 29 variables?

Astounding
PROC Star

Keep in mind that the right answer requires defining what to do what the other 101 variables.

 

If you start with 3 observations for the same doctor, and there is conflicting data among various sources for those other 101 variables, what should be done?

 

Remember, there will be only one observation per doctor in the final data set.

 

So spell out the rules, and the programming should be short but perhaps mildly more complex than what I posted the first time.

InspectahDex
Obsidian | Level 7

"If you start with 3 observations for the same doctor, and there is conflicting data among various sources for those other 101 variables, what should be done?"

These observations with conflicting data would write to another observation.

So for example, if Dr. Smith sees patients at two different locations then I want both locations in my dataset as two observations.

Example:
Dataset AC4 says:
Name | OfficeAddress | EyeColor | NPI
John Smith | 123 Main St. | Blue |1234567890

Dataset CD4 says:
Name | OfficeAddress | EyeColor | NPI
John Smith | 789 Any St. | Brown |1234567890

I want the final dataset to say:
Name | OfficeAddress | EyeColor | NPI
John Smith | 789 Any St. | Brown |1234567890
John Smith | 123 Main St. | Brown |1234567890
Shmuel
Garnet | Level 18

@InspectahDex wrote:

Ok I think this is the right answer but there is one other thing I need to take into account. Out of the 130 common variables, I only need to update 29 variables from the CD4 dataset. How do I tell SAS to only update those 29 variables?


Am I correct - for those doctores in CD4 you want update only 29 variables but for those who are not in CD4 you need all variables?

If true then you should do it within 3 steps:

(1) Update - keeping NPI and the 29 variables of  AC4 EP4 and CD4 - update CD4 keeping all variables but if NPI is in CD4.

(2) Merge or update CD4, keeping all variables, but only those NPI who are not in CD4.

(3) Merge the result of step1 and step2.

data Final1t  
     temp1(keep=NPI);
update AC4 (in=in1 keep=NPI ... <29 variables to update> )
       EP4 (in=in2 keep=NPI ... <29 variables to update> )
       IP4  (in=in3 keep=NPI ... <29 variables to update> )
       CD4 (in=in4);
  by NPI;
     if in4 then output Final1;
     else output temp1;
run;
data Final2;
Update AC4 (in=in1)
       EP4 (in=in2)
       IP4  (in=in3)
       temp1 (in=in4);
  by NPI;
      if  in4;
run;
data FinalDataSet;
merge Final1 Final2;
  by NPI;
run;
smantha
Lapis Lazuli | Level 10

This is how I would approach the problem identify the 29 variables that need to be kept track of

%let list_of_vars_to_track=<some varlist separated by spaces>;
%let list_of_vars_len_to_track=<some varlist separated by spaces>;

%macro  test;
data FinalDataSet(rename=(
%do i = 1 %to 29;
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  =%scan(list_of_vars_to_track,&i.,%str( ))
            %end;
));
length %do i = 1 %to 29;
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  %scan(list_of_vars_len_to_track,&i.,%str( ))
            %end;;
set AC4 EP4 IP4 CD4;
by doctor;
retain %do i = 1 %to 29;
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  
            %end; ' ';
if first.doctor then do;
%do i = 1 %to 29;
              
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  = '';
            %end; 
end;
%do i = 1 %to 29;
              if not missing(%scan(list_of_vars_to_track,&i.,%str( ))) then 
              ret_%scan(list_of_vars_to_track,&i.,%str( ))  =%scan(list_of_vars_to_track,&i.,%str( ));
            %end; 
if last.doctor then output;
drop %do i = 1 %to 29;
             %scan(list_of_vars_to_track,&i.,%str( ))  
            %end; ;
run;
%mend test;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 926 views
  • 0 likes
  • 5 in conversation