Solved: compare two datasets merged by two different methods

m1ny · Posted 02-09-2016 12:15 PM

Hello,

I have two dataset which are merged by different methods from the same original datasets.

I want to match two datasets if there was an error made by me while merging the original datasets.

I compared "aflatoxin_r1" in excel merged data (hiv_survival) to "afb1" in sas merged data (positive_s).

But the printed output includes a number of datapoints which has exactly matched values(aflatoxin_r1=afb1 or aflatoxin_r2=afb2 or aflatoxin_r3=afb3).

I matched the format of aflatoxin_r1..., afb1....

Here is my log.

535
536 *import the file merged with excel;
537 proc import
537! datafile="E:\Research\Manuscript\HIVprogression\DATA_FOR_JOURNAL\02_Merged\HIV_participants_s
537! urvival_analysis.xls" dbms=xls out=hiv_survival replace;
538 run;

NOTE: The import data set has 258 observations and 102 variables.
NOTE: WORK.HIV_SURVIVAL data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds

539 data hiv_survival (KEEP=RHSP_ID aflatoxin_r1 aflatoxin_r2 aflatoxin_r3);
540      retain RHSP_ID;
541      format aflatoxin_r1 7.3 aflatoxin_r2 7.3 aflatoxin_r3 7.3;*for matching the format of
541! number (aflatoxin B1-lysine);
542      set hiv_survival;
543 run;

NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: The data set WORK.HIV_SURVIVAL has 258 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

544
545 proc sort data=hiv_survival;
546      by RHSP_ID;
547 run;

NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: The data set WORK.HIV_SURVIVAL has 258 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

548
549 data positive_s (KEEP=RHSP_ID afb1 afb2 afb3);
550      retain RHSP_ID;
551      format afb1 7.3 afb2 7.3 afb3 7.3; *for matching the format of number (aflatoxin
551! B1-lysine);
552      set positive_s;
553 run;

NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.POSITIVE_S has 257 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

554 proc contents data=positive_s;
555 run;

NOTE: PROCEDURE CONTENTS used (Total process time):
      real time           0.11 seconds
      cpu time            0.03 seconds

556 proc sort data=positive_s;
557      by RHSP_ID;
558 run;

NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.POSITIVE_S has 257 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

559 data both_merged;
560      merge hiv_survival positive_s;
561      by RHSP_ID;
562 run;

NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.BOTH_MERGED has 258 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

563
564 /*compare the aflatoxin B1-Lysine*/
565 proc print data=both_merged;
566      where aflatoxin_r1^=afb1;
567 run;

NOTE: There were 43 observations read from the data set WORK.BOTH_MERGED.
      WHERE aflatoxin_r1 not = afb1;
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.17 seconds
      cpu time            0.01 seconds

Here is my output.

RHSP_ID aflatoxin_r1 aflatoxin_r2 aflatoxin_r3 afb1 afb2 afb3

A004954	5.362	1.960	0.400	5.362	1.960	0.400
A021248	24.171	5.550	21.098	24.180	5.550	21.990
A040728	10.394	0.877	7.373	10.394	0.877	7.373
A060669	12.662	5.416	0.826	12.662	5.416	0.826
B027150	9.477	.	1.353	9.477	.	1.353
B052316	7.867	9.746	14.607	7.867	9.746	14.607
B052900	1.012	5.686	6.888	1.012	5.686	6.888
B056007	3.999	1.696	0.439	3.999	1.696	0.439
C020229	0.815	0.400	19.354	0.815	0.400	19.354
C036387	4.209	1.587	2.510	.	.	.
C046885	5.349	1.188	0.627	5.349	1.188	0.627
C049602	4.060	13.423	0.718	4.060	13.423	0.718
D004232	1.036	40.516	1.398	1.036	40.516	1.398
D037600	12.030	88.491	0.559	12.983	88.491	0.559
D039650	3.621	1.387	0.400	3.621	1.387	0.400
D051606	1.911	2.620	0.832	1.911	2.620	0.832
E006754	7.589	2.168	0.848	7.589	2.168	0.848
E012692	2.895	3.940	8.352	2.895	3.940	8.352
E028637	0.875	0.502	4.781	0.875	0.502	4.781
E045442	6.209	3.802	0.683	6.299	3.802	0.683
E058165	3.657	3.452	10.600	3.657	3.452	10.600
F007718	1.807	3.126	3.201	1.807	3.126	3.201
F022971	10.182	21.432	9.867	10.182	21.432	9.867
F038479	1.971	8.225	1.888	1.971	8.225	1.888
F056441	1.885	1.762	1.457	1.885	1.762	1.457
G002112	2.226	1.576	7.927	2.226	1.576	7.927
G008798	35.478	19.261	.	35.479	19.261	.
G049698	1.684	0.400	1.038	1.684	0.400	1.038
G052882	1.470	0.860	6.081	1.470	0.860	6.081
H001710	1.484	0.400	3.763	1.484	0.400	3.763
H005998	0.734	0.400	.	0.734	0.400	.
H023821	3.721	2.831	0.400	3.721	2.831	0.400
H052320	0.491	0.600	1.941	0.491	0.600	1.941
J011405	3.805	2.089	5.182	3.805	2.089	5.182
J052622	1.910	8.742	0.400	1.910	8.742	0.400
J054876	10.743	3.648	2.606	10.743	3.648	2.606
J056981	15.430	43.753	.	15.430	43.753	.
K021302	1.392	11.240	28.519	1.392	11.250	28.519
K025881	1.054	1.213	.	1.054	1.213	.
K051830	1.301	.	6.757	1.301	.	6.757
K052110	10.158	2.922	0.400	10.158	2.922	0.400
K055929	1.150	2.019	0.878	1.150	2.019	0.878
K057547	2.764	5.790	7.658	2.764	5.790	7.658

ballardw · Posted 02-09-2016 12:45 PM

I would recommend sorting your two merged datasets by a common set of variables and then use PROC COMPARE. There are lots of options for getting different types of details out but the base

Proc compare base=onedataset compare=otherdataset listall; run;

would give you unique records in both sets. BUT the sort is important when comparing as the procedure looks at the first record from each set, then the second, third and so on.

View solution in original post

ballardw · Posted 02-09-2016 12:45 PM

I would recommend sorting your two merged datasets by a common set of variables and then use PROC COMPARE. There are lots of options for getting different types of details out but the base

Proc compare base=onedataset compare=otherdataset listall; run;

would give you unique records in both sets. BUT the sort is important when comparing as the procedure looks at the first record from each set, then the second, third and so on.

m1ny · Posted 02-09-2016 01:32 PM

thank you.

Your comment is very helpful!

m1ny · Posted 02-09-2016 01:34 PM

I'm glad you found some useful info, m1ny! If one of the replies was the exact solution to your problem, can you "Accept it as a solution"? Or if one was particularly helpful, feel free to "Like" it. This will help other community members who may run into the same issue know what worked.

Thanks!
Anna

Access SAS Innovate on-demand content now!

compare two datasets merged by two different methods

Re: compare two datasets merged by two different methods

Re: compare two datasets merged by two different methods

Re: compare two datasets merged by two different methods

Re: compare two datasets merged by two different methods

compare two datasets merged by two different methods

Re: compare two datasets merged by two different methods

Re: compare two datasets merged by two different methods

Re: compare two datasets merged by two different methods

Re: compare two datasets merged by two different methods

Ready to join fellow brilliant minds for the SAS Hackathon?