Solved: Merging two data sets

Kunko · Posted 10-25-2021 07:16 AM

Hello,

I am trying to merge two data sets with different observations (Data1 (fulldata_pca_2021)=2760 with lifestyle variables, Data2 (blood301774)=1774 with blood biomarkers). A new data set (Data3=fulldata_pca_2021a), as expected, had 2760 observations and all lifestyle and blood biomarkers. However, Data3 (fulldata_pca_2021a) gave higher observations for blood biomarkers (eg.2462 for low density lipoprotein level) than supposed to be 1774 or less.

Below is a SAS code I used to merge:

Proc sort data = blood301774; by codea; run;

Proc sort data = fulldata_pca_2021a; by codea; run;

Data fulldata_pca_2021a;
merge fulldata_pca_2021 blood301774;
by codea;
run;

NOTE: Format M211A was not found or could not be loaded.
NOTE: Format MUSPGENA was not found or could not be loaded.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 2760 observations read from the data set WORK.FULLDATA_PCA_2021.
NOTE: There were 1774 observations read from the data set WORK.BLOOD301774.
NOTE: The data set WORK.FULLDATA_PCA_2021A has 2760 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.03 seconds

Here is a proc means to check mean LDL in Data3;

3933 Proc means data = fulldata_pca_2021a;
3934 Var ldl ;
3935 run;

NOTE: There were 2760 observations read from the data set WORK.FULLDATA_PCA_2021A.
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds

Analysis Variable : ldl LDL Cholesterol (mmol/L)
N	Mean	Std Dev	Minimum	Maximum
2462	2.8373680	0.8309344	0.3000000	8.9000000

Thanks

PaigeMiller · Posted 10-25-2021 08:35 AM

@Kunko wrote:

How can I address this, please? Thank you once again for your unreserved support.

That's up to you. How do you want to address duplicates?

You are exactly right; the blood biomarkers data replicated themselves in places that they are supposed to be missing in an output data set.

But they are not supposed to be missing if there are duplicates of the value of CODEA, the real problem is the duplicates of the values of CODEA.

--
Paige Miller

View solution in original post

PaigeMiller · Posted 10-25-2021 07:28 AM

Have you actually looked at the data set fulldata_pca_2021a with your own eyes? Do you see missing values for LDL or not? From the PROC MEANS output, I suspect you will not see missing values, this is telling you something. You can investigate what the merge did by looking at this output data set named fulldata_pca_2021a.

One possibility is that the dataset fulldata_pca_2021 has replicates of each value of CODEA and so the merge is not a one-to-one merge, which should produce 1774 values of LDL, but it is a one-to-many merge. Is that what you see when you look at the data sets?

--
Paige Miller

Kunko · Posted 10-25-2021 07:49 AM

Hi Paige,
Thank you so much for your swift reply.
Yes, I have looked at an output data set and there is no as many missing as I expected (nmiss=298).
You are exactly right; the blood biomarkers data replicated themselves in places that they are supposed to be missing in an output data set.
How can I address this, please? Thank you once again for your unreserved support.
Kind Regards,
Tol

Kurt_Bremser · Posted 10-25-2021 08:11 AM

This is your alert:

NOTE: MERGE statement has more than one data set with repeats of BY values.

This is suspicious whenever it happens. It means that the biomarkers dataset has more than one observation for some key value, which means that those blood markers can be multiplied. You need to first deduplicate at least one of the datasets. This may involve "interesting" code depending on how multiple values can be coalesced into one.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PaigeMiller · Posted 10-25-2021 08:35 AM

@Kunko wrote:

How can I address this, please? Thank you once again for your unreserved support.

That's up to you. How do you want to address duplicates?

You are exactly right; the blood biomarkers data replicated themselves in places that they are supposed to be missing in an output data set.

But they are not supposed to be missing if there are duplicates of the value of CODEA, the real problem is the duplicates of the values of CODEA.

--
Paige Miller

Kunko · Posted 10-25-2021 10:29 AM

Hi Paige,
Sorry, I was thinking the issue from another perspective, NOT just addressing the DUPLICATES.
I finally removed DUPLICATES in blood biomarker using PROC SORT and fixed my issue!
Thank you so much again!

Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Re: Merging two data sets

Registration is open