SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

compare two datasets merged by two different methods

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

compare two datasets merged by two different methods

[ Edited ]

Hello,

 

I have two dataset which are merged by different methods from the same original datasets.

I want to match  two datasets if there was an error made by me while merging the original datasets.

 

I compared "aflatoxin_r1" in excel merged data (hiv_survival) to "afb1" in sas merged data (positive_s).

But the printed output includes a number of datapoints which has exactly matched values(aflatoxin_r1=afb1 or aflatoxin_r2=afb2 or aflatoxin_r3=afb3).

I matched the format of aflatoxin_r1..., afb1....

 

 

Here is my log.



535
536  *import the file merged with excel;
537  proc import
537! datafile="E:\Research\Manuscript\HIVprogression\DATA_FOR_JOURNAL\02_Merged\HIV_participants_s
537! urvival_analysis.xls" dbms=xls out=hiv_survival replace;
538  run;

NOTE: The import data set has 258 observations and 102 variables.
NOTE: WORK.HIV_SURVIVAL data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds


539  data hiv_survival (KEEP=RHSP_ID aflatoxin_r1 aflatoxin_r2 aflatoxin_r3);
540      retain RHSP_ID;
541      format aflatoxin_r1 7.3  aflatoxin_r2 7.3 aflatoxin_r3 7.3;*for matching the format of
541! number (aflatoxin B1-lysine);
542      set hiv_survival;
543  run;

NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: The data set WORK.HIV_SURVIVAL has 258 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


544
545  proc sort data=hiv_survival;
546      by RHSP_ID;
547  run;

NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: The data set WORK.HIV_SURVIVAL has 258 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


548
549  data positive_s (KEEP=RHSP_ID afb1 afb2 afb3);
550      retain RHSP_ID;
551      format afb1 7.3 afb2 7.3 afb3 7.3; *for matching the format of number (aflatoxin
551! B1-lysine);
552      set positive_s;
553  run;

NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.POSITIVE_S has 257 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


554  proc contents data=positive_s;
555  run;

NOTE: PROCEDURE CONTENTS used (Total process time):
      real time           0.11 seconds
      cpu time            0.03 seconds


556  proc sort data=positive_s;
557      by RHSP_ID;
558  run;

NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.POSITIVE_S has 257 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


559  data both_merged;
560      merge hiv_survival positive_s;
561      by RHSP_ID;
562  run;

NOTE: There were 258 observations read from the data set WORK.HIV_SURVIVAL.
NOTE: There were 257 observations read from the data set WORK.POSITIVE_S.
NOTE: The data set WORK.BOTH_MERGED has 258 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


563
564  /*compare the aflatoxin B1-Lysine*/
565  proc print data=both_merged;
566      where aflatoxin_r1^=afb1;
567  run;

NOTE: There were 43 observations read from the data set WORK.BOTH_MERGED.
      WHERE aflatoxin_r1 not = afb1;
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.17 seconds
      cpu time            0.01 seconds

 

 

Here is my output.

 

 

RHSP_ID     aflatoxin_r1  aflatoxin_r2  aflatoxin_r3  afb1  afb2  afb3

A0049545.3621.9600.4005.3621.9600.400
A02124824.1715.55021.09824.1805.55021.990
A04072810.3940.8777.37310.3940.8777.373
A06066912.6625.4160.82612.6625.4160.826
B0271509.477.1.3539.477.1.353
B0523167.8679.74614.6077.8679.74614.607
B0529001.0125.6866.8881.0125.6866.888
B0560073.9991.6960.4393.9991.6960.439
C0202290.8150.40019.3540.8150.40019.354
C0363874.2091.5872.510...
C0468855.3491.1880.6275.3491.1880.627
C0496024.06013.4230.7184.06013.4230.718
D0042321.03640.5161.3981.03640.5161.398
D03760012.03088.4910.55912.98388.4910.559
D0396503.6211.3870.4003.6211.3870.400
D0516061.9112.6200.8321.9112.6200.832
E0067547.5892.1680.8487.5892.1680.848
E0126922.8953.9408.3522.8953.9408.352
E0286370.8750.5024.7810.8750.5024.781
E0454426.2093.8020.6836.2993.8020.683
E0581653.6573.45210.6003.6573.45210.600
F0077181.8073.1263.2011.8073.1263.201
F02297110.18221.4329.86710.18221.4329.867
F0384791.9718.2251.8881.9718.2251.888
F0564411.8851.7621.4571.8851.7621.457
G0021122.2261.5767.9272.2261.5767.927
G00879835.47819.261.35.47919.261.
G0496981.6840.4001.0381.6840.4001.038
G0528821.4700.8606.0811.4700.8606.081
H0017101.4840.4003.7631.4840.4003.763
H0059980.7340.400.0.7340.400.
H0238213.7212.8310.4003.7212.8310.400
H0523200.4910.6001.9410.4910.6001.941
J0114053.8052.0895.1823.8052.0895.182
J0526221.9108.7420.4001.9108.7420.400
J05487610.7433.6482.60610.7433.6482.606
J05698115.43043.753.15.43043.753.
K0213021.39211.24028.5191.39211.25028.519
K0258811.0541.213.1.0541.213.
K0518301.301.6.7571.301.6.757
K05211010.1582.9220.40010.1582.9220.400
K0559291.1502.0190.8781.1502.0190.878
K0575472.7645.7907.6582.7645.7907.658

Accepted Solutions
Solution
‎02-09-2016 01:54 PM
Super User
Posts: 10,538

Re: compare two datasets merged by two different methods

I would recommend sorting your two merged datasets by a common set of variables and then use PROC COMPARE. There are lots of options for getting different types of details out but the base

 

Proc compare base=onedataset compare=otherdataset listall; run;

would give you unique records in both sets. BUT the sort is important when comparing as the procedure looks at the first record from each set, then the second, third and so on.

View solution in original post


All Replies
Solution
‎02-09-2016 01:54 PM
Super User
Posts: 10,538

Re: compare two datasets merged by two different methods

I would recommend sorting your two merged datasets by a common set of variables and then use PROC COMPARE. There are lots of options for getting different types of details out but the base

 

Proc compare base=onedataset compare=otherdataset listall; run;

would give you unique records in both sets. BUT the sort is important when comparing as the procedure looks at the first record from each set, then the second, third and so on.

Occasional Contributor
Posts: 7

Re: compare two datasets merged by two different methods

thank you.

Your comment is very helpful!

Community Manager
Posts: 486

Re: compare two datasets merged by two different methods

I'm glad you found some useful info, m1ny! If one of the replies was the exact solution to your problem, can you "Accept it as a solution"? Or if one was particularly helpful, feel free to "Like" it. This will help other community members who may run into the same issue know what worked.

Thanks!
Anna

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 311 views
  • 1 like
  • 3 in conversation