Solved: Re: data merge

mayasak · Posted 07-29-2022 06:27 PM

I have a project to import multiple excel files daily into SAS and merge them into one SAS file. I want to do this using macro. The result should be match-merged and have all variables, observations, and values from both datasets. In the following example there are some missing values and what I need is to fill in the missing values from the dataset that have this value.

Here are two datasets as an example:

data Inc;
infile datalines truncover;
input Person_id Id MRN $ Last_Name $ First_Name $ specimen_source $ Gender $ Race $;
cards;
122345 56788 M776 Doe John blood M W
34567 12344 L87765 . Mickey nasal .
90876 36476 009875 Mouse Minnie blood F I
777654 227364 26376 Duck Donald . M B
56761 23676 M765 Pippin Mary . M W
667785 3624554 L97960 Smith Oliver nasal M W
345564 34567 . Johns Karen saliva F W
79908 123540 M78765 Doe Kevin . M A
677654 378754 L8544 Wayne Bruce . M B
run;

data Lab;
infile datalines truncover;
input ID Last_Name $ First_Name $ specimen_source $ Gender $;
datalines;
56788 Doe John blood M
127655 Johns Trevor nasal M
12344 Mouse Mickey nasal .
36476 Mouse Minnie . F
297654 Gatzby Jay saliva M
23676 Pippin Mary . F
3624554 Smith Oliver . M
34857 Miller Edward blood M
176255 Roberts Linda nasal .
378754 Wayne Bruce blood M
854324 Hill Jane . F
run;

I tried different codes but the one below was the best in terms of the result needed.

%MACRO MSORT1(DATASET,VAR);
PROC SORT DATA=&DATASET;
BY &VAR;
RUN;
%MEND MSORT1;

%MACRO MMERGE(DEST,SOURCE1,SOURCE2,VAR);

%msort1(&source1, &var);
%msort1(&source2, &var);

DATA &DEST;
MERGE &SOURCE1 &SOURCE2;
BY &VAR;
RUN;
%MEND MMERGE;
%MMERGE (WORK.labinc, WORK.lab, WORK.inc, id);

Here's the result I got:

This code only got the values from "Inc" dataset and not from "lab". Please check the highlighted values. My question is: Can we manipulate the code so that it can get the values filled from both datasets? For example: in the result above for observation 14 the specimen_source was filled in from "Inc" but I need to get the last name and specimen_source in observations 1 and 12 from "lab" dataset as well.

Thank you

mkeintz · Posted 07-29-2022 08:56 PM

Why do you highlight the blank lastname cell for ID 12344? The lastname is blank for this id in work.ind, and the id is completely absent from work.lab.

Now the blank for specimen for id 378754 is a different issue, because the code

DATA WORK.labinc;
   MERGE WORK.lab WORK.inc;
   BY id;
RUN;

will always have the value in the left dataset overwritten by the value in the right dataset, even if the value on the right is missing and the value on the left is not missing.

Instead use

DATA WORK.labinc;
   UPDATE WORK.lab WORK.inc;
   BY id;
RUN;

which will propagate the right-side value only when it is not missing. I.e. a non-missing value will never be replaced by a missing value.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 07-29-2022 08:20 PM

When you merge the data set on the right* will overwrite the ones on the left* if they have the same name.
If you want a different handling you need to:

Rename the variables so they are different and explicitly deal with them
Use PROC SQL and COALESCE() to explicitly deal with this case.

Or is there a data set you can assume will be the 'master' correct data set. If you can, then merge with that data set as the left most one.

*right/left refer to order listed in the MERGE statement.

mkeintz · Posted 07-29-2022 08:56 PM

Why do you highlight the blank lastname cell for ID 12344? The lastname is blank for this id in work.ind, and the id is completely absent from work.lab.

Now the blank for specimen for id 378754 is a different issue, because the code

DATA WORK.labinc;
   MERGE WORK.lab WORK.inc;
   BY id;
RUN;

will always have the value in the left dataset overwritten by the value in the right dataset, even if the value on the right is missing and the value on the left is not missing.

Instead use

DATA WORK.labinc;
   UPDATE WORK.lab WORK.inc;
   BY id;
RUN;

which will propagate the right-side value only when it is not missing. I.e. a non-missing value will never be replaced by a missing value.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mayasak · Posted 07-29-2022 11:36 PM

Thank you Mkeintz,
This is exactly what I needed.
Quick question. Does this work on more than one key variable such as first_name and last_name?
Thank you again

mkeintz · Posted 07-29-2022 11:59 PM

@mayasak wrote:
Thank you Mkeintz,
This is exactly what I needed.
Quick question. Does this work on more than one key variable such as first_name and last_name?
Thank you again

What do you mean by "key variable"?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mayasak · Posted 07-30-2022 12:27 AM

The variable that links both dataset. For example, I used ID.

Kurt_Bremser · Posted 07-30-2022 03:17 AM

The BY statement accepts a list of variables. Make sure you use the same BY when you sort the datasets.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 07-30-2022 11:13 AM

@mayasak wrote:
Thank you Mkeintz,
This is exactly what I needed.
Quick question. Does this work on more than one key variable such as first_name and last_name?
Thank you again

It does not matter how many variables it takes to form a unique key. But they must form a unique key in the ORIGINAL dataset, that is there is only one observation per by group (per unique combination of the values of all variables listed in the BY statement). Note that you can have multiple observations per by group in the TRANSACTIONS dataset, they will be applied in the order they appear. The resulting dataset will have only one observation per by group.

data want;
  update original transactions;
  by id1 id2 id3;
run;

SAS Training: Just a Click Away