Hello,
I have two dataset which are merged by different methods from the same original datasets.
I want to match two datasets if there was an error made by me while merging the original datasets.
I compared "aflatoxin_r1" in excel merged data (hiv_survival) to "afb1" in sas merged data (positive_s).
But the printed output includes a number of datapoints which has exactly matched values(aflatoxin_r1=afb1 or aflatoxin_r2=afb2 or aflatoxin_r3=afb3).
I matched the format of aflatoxin_r1..., afb1....
Here is my log.
535
536 *import the file merged with excel;
537 proc import
537! datafile="E:\Research\Manuscript\HIVprogression\DATA_FOR_JOURNAL\02_Merged\HIV_participants_s
537! urvival_analysis.xls" dbms=xls out=hiv_survival replace;
538 run;
NOTE: The import data set has 258 observations and 102 variables.
NOTE: WORK.HIV_SURVIVAL data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.02 seconds
cpu time 0.03 seconds
539 data hiv_survival (KEEP=RHSP_ID aflatoxin_r1 aflatoxin_r2 aflatoxin_r3);
540 retain RHSP_ID;
541 format aflatoxin_r1 7.3 aflatoxin_r2 7.3 aflatoxin_r3 7.3;*for matching the format of
541! number (aflatoxin B1-lysine);
542 set hiv_survival;
543 run;
NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: The data set WORK.HIV_SURVIVAL has 258 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
544
545 proc sort data=hiv_survival;
546 by RHSP_ID;
547 run;
NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: The data set WORK.HIV_SURVIVAL has 258 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
548
549 data positive_s (KEEP=RHSP_ID afb1 afb2 afb3);
550 retain RHSP_ID;
551 format afb1 7.3 afb2 7.3 afb3 7.3; *for matching the format of number (aflatoxin
551! B1-lysine);
552 set positive_s;
553 run;
NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.POSITIVE_S has 257 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
554 proc contents data=positive_s;
555 run;
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.11 seconds
cpu time 0.03 seconds
556 proc sort data=positive_s;
557 by RHSP_ID;
558 run;
NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.POSITIVE_S has 257 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
559 data both_merged;
560 merge hiv_survival positive_s;
561 by RHSP_ID;
562 run;
NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.BOTH_MERGED has 258 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
563
564 /*compare the aflatoxin B1-Lysine*/
565 proc print data=both_merged;
566 where aflatoxin_r1^=afb1;
567 run;
NOTE: There were 43 observations read from the data set WORK.BOTH_MERGED.
WHERE aflatoxin_r1 not = afb1;
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.17 seconds
cpu time 0.01 seconds
Here is my output.
RHSP_ID aflatoxin_r1 aflatoxin_r2 aflatoxin_r3 afb1 afb2 afb3
A004954 | 5.362 | 1.960 | 0.400 | 5.362 | 1.960 | 0.400 |
A021248 | 24.171 | 5.550 | 21.098 | 24.180 | 5.550 | 21.990 |
A040728 | 10.394 | 0.877 | 7.373 | 10.394 | 0.877 | 7.373 |
A060669 | 12.662 | 5.416 | 0.826 | 12.662 | 5.416 | 0.826 |
B027150 | 9.477 | . | 1.353 | 9.477 | . | 1.353 |
B052316 | 7.867 | 9.746 | 14.607 | 7.867 | 9.746 | 14.607 |
B052900 | 1.012 | 5.686 | 6.888 | 1.012 | 5.686 | 6.888 |
B056007 | 3.999 | 1.696 | 0.439 | 3.999 | 1.696 | 0.439 |
C020229 | 0.815 | 0.400 | 19.354 | 0.815 | 0.400 | 19.354 |
C036387 | 4.209 | 1.587 | 2.510 | . | . | . |
C046885 | 5.349 | 1.188 | 0.627 | 5.349 | 1.188 | 0.627 |
C049602 | 4.060 | 13.423 | 0.718 | 4.060 | 13.423 | 0.718 |
D004232 | 1.036 | 40.516 | 1.398 | 1.036 | 40.516 | 1.398 |
D037600 | 12.030 | 88.491 | 0.559 | 12.983 | 88.491 | 0.559 |
D039650 | 3.621 | 1.387 | 0.400 | 3.621 | 1.387 | 0.400 |
D051606 | 1.911 | 2.620 | 0.832 | 1.911 | 2.620 | 0.832 |
E006754 | 7.589 | 2.168 | 0.848 | 7.589 | 2.168 | 0.848 |
E012692 | 2.895 | 3.940 | 8.352 | 2.895 | 3.940 | 8.352 |
E028637 | 0.875 | 0.502 | 4.781 | 0.875 | 0.502 | 4.781 |
E045442 | 6.209 | 3.802 | 0.683 | 6.299 | 3.802 | 0.683 |
E058165 | 3.657 | 3.452 | 10.600 | 3.657 | 3.452 | 10.600 |
F007718 | 1.807 | 3.126 | 3.201 | 1.807 | 3.126 | 3.201 |
F022971 | 10.182 | 21.432 | 9.867 | 10.182 | 21.432 | 9.867 |
F038479 | 1.971 | 8.225 | 1.888 | 1.971 | 8.225 | 1.888 |
F056441 | 1.885 | 1.762 | 1.457 | 1.885 | 1.762 | 1.457 |
G002112 | 2.226 | 1.576 | 7.927 | 2.226 | 1.576 | 7.927 |
G008798 | 35.478 | 19.261 | . | 35.479 | 19.261 | . |
G049698 | 1.684 | 0.400 | 1.038 | 1.684 | 0.400 | 1.038 |
G052882 | 1.470 | 0.860 | 6.081 | 1.470 | 0.860 | 6.081 |
H001710 | 1.484 | 0.400 | 3.763 | 1.484 | 0.400 | 3.763 |
H005998 | 0.734 | 0.400 | . | 0.734 | 0.400 | . |
H023821 | 3.721 | 2.831 | 0.400 | 3.721 | 2.831 | 0.400 |
H052320 | 0.491 | 0.600 | 1.941 | 0.491 | 0.600 | 1.941 |
J011405 | 3.805 | 2.089 | 5.182 | 3.805 | 2.089 | 5.182 |
J052622 | 1.910 | 8.742 | 0.400 | 1.910 | 8.742 | 0.400 |
J054876 | 10.743 | 3.648 | 2.606 | 10.743 | 3.648 | 2.606 |
J056981 | 15.430 | 43.753 | . | 15.430 | 43.753 | . |
K021302 | 1.392 | 11.240 | 28.519 | 1.392 | 11.250 | 28.519 |
K025881 | 1.054 | 1.213 | . | 1.054 | 1.213 | . |
K051830 | 1.301 | . | 6.757 | 1.301 | . | 6.757 |
K052110 | 10.158 | 2.922 | 0.400 | 10.158 | 2.922 | 0.400 |
K055929 | 1.150 | 2.019 | 0.878 | 1.150 | 2.019 | 0.878 |
K057547 | 2.764 | 5.790 | 7.658 | 2.764 | 5.790 | 7.658 |
I would recommend sorting your two merged datasets by a common set of variables and then use PROC COMPARE. There are lots of options for getting different types of details out but the base
Proc compare base=onedataset compare=otherdataset listall; run;
would give you unique records in both sets. BUT the sort is important when comparing as the procedure looks at the first record from each set, then the second, third and so on.
I would recommend sorting your two merged datasets by a common set of variables and then use PROC COMPARE. There are lots of options for getting different types of details out but the base
Proc compare base=onedataset compare=otherdataset listall; run;
would give you unique records in both sets. BUT the sort is important when comparing as the procedure looks at the first record from each set, then the second, third and so on.
thank you.
Your comment is very helpful!
I'm glad you found some useful info, m1ny! If one of the replies was the exact solution to your problem, can you "Accept it as a solution"? Or if one was particularly helpful, feel free to "Like" it. This will help other community members who may run into the same issue know what worked.
Thanks!
Anna
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.