BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Kunko
Obsidian | Level 7

Hello, 

I am trying to merge two data sets with different observations (Data1 (fulldata_pca_2021)=2760 with lifestyle variables, Data2 (blood301774)=1774 with blood biomarkers). A new data set (Data3=fulldata_pca_2021a), as expected, had 2760 observations and all lifestyle and blood biomarkers. However, Data3 (fulldata_pca_2021a) gave higher observations for blood biomarkers (eg.2462 for low density lipoprotein level) than supposed to be 1774 or less. 

 Below is a SAS code I used to merge:

Proc sort data = blood301774; by codea; run;

Proc sort data = fulldata_pca_2021a; by codea; run;

Data fulldata_pca_2021a;
merge fulldata_pca_2021 blood301774;
by codea;
run;

NOTE: Format M211A was not found or could not be loaded.
NOTE: Format MUSPGENA was not found or could not be loaded.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 2760 observations read from the data set WORK.FULLDATA_PCA_2021.
NOTE: There were 1774 observations read from the data set WORK.BLOOD301774.
NOTE: The data set WORK.FULLDATA_PCA_2021A has 2760 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.03 seconds

Here is a proc means to check mean LDL in Data3;

3933 Proc means data = fulldata_pca_2021a;
3934 Var ldl ;
3935 run;

NOTE: There were 2760 observations read from the data set WORK.FULLDATA_PCA_2021A.
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds

 

Analysis Variable : ldl LDL Cholesterol (mmol/L)

N

Mean

Std Dev

Minimum

Maximum

2462

2.8373680

0.8309344

0.3000000

8.9000000

Thanks

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

@Kunko wrote:

How can I address this, please? Thank you once again for your unreserved support.

That's up to you. How do you want to address duplicates?

 

You are exactly right; the blood biomarkers data replicated themselves in places that they are supposed to be missing in an output data set.

 

But they are not supposed to be missing if there are duplicates of the value of CODEA, the real problem is the duplicates of the values of CODEA.

--
Paige Miller

View solution in original post

5 REPLIES 5
PaigeMiller
Diamond | Level 26

Have you actually looked at the data set fulldata_pca_2021a with your own eyes? Do you see missing values for LDL or not? From the PROC MEANS output, I suspect you will not see missing values, this is telling you something. You can investigate what the merge did by looking at this output data set named fulldata_pca_2021a.

 

One possibility is that the dataset fulldata_pca_2021 has replicates of each value of CODEA and so the merge is not a one-to-one merge, which should produce 1774 values of LDL, but it is a one-to-many merge. Is that what you see when you look at the data sets?

--
Paige Miller
Kunko
Obsidian | Level 7
Hi Paige,
Thank you so much for your swift reply.
Yes, I have looked at an output data set and there is no as many missing as I expected (nmiss=298).
You are exactly right; the blood biomarkers data replicated themselves in places that they are supposed to be missing in an output data set.
How can I address this, please? Thank you once again for your unreserved support.
Kind Regards,
Tol
Kurt_Bremser
Super User

This is your alert:

NOTE: MERGE statement has more than one data set with repeats of BY values.

This is suspicious whenever it happens. It means that the biomarkers dataset has more than one observation for some key value, which means that those blood markers can be multiplied. You need to first deduplicate at least one of the datasets. This may involve "interesting" code depending on how multiple values can be coalesced into one.

PaigeMiller
Diamond | Level 26

@Kunko wrote:

How can I address this, please? Thank you once again for your unreserved support.

That's up to you. How do you want to address duplicates?

 

You are exactly right; the blood biomarkers data replicated themselves in places that they are supposed to be missing in an output data set.

 

But they are not supposed to be missing if there are duplicates of the value of CODEA, the real problem is the duplicates of the values of CODEA.

--
Paige Miller
Kunko
Obsidian | Level 7
Hi Paige,
Sorry, I was thinking the issue from another perspective, NOT just addressing the DUPLICATES.
I finally removed DUPLICATES in blood biomarker using PROC SORT and fixed my issue!
Thank you so much again!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 715 views
  • 3 likes
  • 3 in conversation