Hello,
I am trying to merge two data sets with different observations (Data1 (fulldata_pca_2021)=2760 with lifestyle variables, Data2 (blood301774)=1774 with blood biomarkers). A new data set (Data3=fulldata_pca_2021a), as expected, had 2760 observations and all lifestyle and blood biomarkers. However, Data3 (fulldata_pca_2021a) gave higher observations for blood biomarkers (eg.2462 for low density lipoprotein level) than supposed to be 1774 or less.
Below is a SAS code I used to merge:
Proc sort data = blood301774; by codea; run;
Proc sort data = fulldata_pca_2021a; by codea; run;
Data fulldata_pca_2021a;
merge fulldata_pca_2021 blood301774;
by codea;
run;
NOTE: Format M211A was not found or could not be loaded.
NOTE: Format MUSPGENA was not found or could not be loaded.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 2760 observations read from the data set WORK.FULLDATA_PCA_2021.
NOTE: There were 1774 observations read from the data set WORK.BLOOD301774.
NOTE: The data set WORK.FULLDATA_PCA_2021A has 2760 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.03 seconds
Here is a proc means to check mean LDL in Data3;
3933 Proc means data = fulldata_pca_2021a;
3934 Var ldl ;
3935 run;
NOTE: There were 2760 observations read from the data set WORK.FULLDATA_PCA_2021A.
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
Analysis Variable : ldl LDL Cholesterol (mmol/L) | ||||
N | Mean | Std Dev | Minimum | Maximum |
2462 | 2.8373680 | 0.8309344 | 0.3000000 | 8.9000000 |
Thanks
@Kunko wrote:
How can I address this, please? Thank you once again for your unreserved support.
That's up to you. How do you want to address duplicates?
You are exactly right; the blood biomarkers data replicated themselves in places that they are supposed to be missing in an output data set.
But they are not supposed to be missing if there are duplicates of the value of CODEA, the real problem is the duplicates of the values of CODEA.
Have you actually looked at the data set fulldata_pca_2021a with your own eyes? Do you see missing values for LDL or not? From the PROC MEANS output, I suspect you will not see missing values, this is telling you something. You can investigate what the merge did by looking at this output data set named fulldata_pca_2021a.
One possibility is that the dataset fulldata_pca_2021 has replicates of each value of CODEA and so the merge is not a one-to-one merge, which should produce 1774 values of LDL, but it is a one-to-many merge. Is that what you see when you look at the data sets?
This is your alert:
NOTE: MERGE statement has more than one data set with repeats of BY values.
This is suspicious whenever it happens. It means that the biomarkers dataset has more than one observation for some key value, which means that those blood markers can be multiplied. You need to first deduplicate at least one of the datasets. This may involve "interesting" code depending on how multiple values can be coalesced into one.
@Kunko wrote:
How can I address this, please? Thank you once again for your unreserved support.
That's up to you. How do you want to address duplicates?
You are exactly right; the blood biomarkers data replicated themselves in places that they are supposed to be missing in an output data set.
But they are not supposed to be missing if there are duplicates of the value of CODEA, the real problem is the duplicates of the values of CODEA.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.