Hi everyone
I have two datasets from phase I and phase 2 trials. I need to combine the data and compare each variable value from phase2 to each variable from phase 1 and high-light all mismatch value with red color.
Along with this, a new column needs to be inserted ' comment' where if data are matching, it should populate 'matching'. For every mismatch, it will display which variable mismatch and the respective value in parenthesis. Multiple mismatches should be separated by comma.
I need to repeat the same process for 500 variables and lakh of records in new dataset
data phase1;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 1 asthma parace Yes
102 32 1 1 asthma sinarest Yes
101 23 2 1 headache drg1 No
103 43 1 1 vomit drg4 Yes
103 43 2 1 pain drg5 No
103 43 3 1 pain drg5 Yes
;
run;
data phase2;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 2 asthma parace Yes
102 32 1 2 asthma . Yes
101 23 2 2 acidity drg1 No
103 43 1 2 vomit drg4 Yes
103 43 2 2 pain drg5 No
103 43 4 2 pan drg5 Yes
107 43 3 2 pain drg5 Yes
;
run;
proc sort data=phase1;by pid visit phase;run;
proc sort data=phase2;by pid visit phase;run;
proc sql;
create table final as
select * from phase1 union all
select * from phase2;
run;
Output
data phase1;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 1 asthma parace Yes
102 32 1 1 asthma sinarest Yes
101 23 2 1 headache drg1 No
103 43 1 1 vomit drg4 Yes
103 43 2 1 pain drg5 No
103 43 3 1 pain drg5 Yes
;
run;
data phase2;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 2 asthma parace Yes
102 32 1 2 asthma . Yes
101 23 2 2 acidity drg1 No
103 43 1 2 vomit drg4 Yes
103 43 2 2 pain drg5 No
103 43 4 2 pan drg5 Yes
107 43 3 2 pain drg5 Yes
;
run;
libname old v9 'c:\temp\old';
data old.phase;
set phase1;
drop phase;
run;
libname new v9 'c:\temp\new';
data new.phase;
set phase2;
drop phase;
run;
%macro type_two(dsn=,subjid=,key=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn. out=old;by &subjid. &key. ;run;
proc sort data=new.&dsn. out=new;by &subjid. &key. ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid. &key. ;
run;
data call_define(where=(upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)" )));
set diff_&dsn.(drop=_TYPE_ _OBS_ );
length vname $ 40;
array n{*} _numeric_;
array c{*} _character_;
do i=1 to dim(n);
if n{i} not in (. 0) then do;vname=vname(n{i});output;end;
end;
do i=1 to dim(c);
if findc(c{i},'X') then do;vname=vname(c{i});output;end;
end;
keep &subjid. &key. vname;
run;
filename t temp;
data _null_;
set call_define;
file t;
put "if &subjid.='" &subjid. "' and &key.='" &key. "' then call define('" vname "','style','style={background=red}');";
run;
data &dsn.;
if _n_=1 then do;
if 0 then set new.&dsn.;
if 0 then set old.&dsn.;
declare hash h(dataset:"old.&dsn.");
h.definekey("&subjid.","&key.");
h.definedone();
declare hash h1(dataset:'call_define');
h1.definekey("&subjid.","&key.");
h1.definedone();
end;
set new.&dsn.;
length new_change $ 40;
if h.check() ne 0 then new_change='new' ;
if h1.check() = 0 then new_change='change' ;
label new_change='new/change';
run;
ods excel file="c:\temp\want.xlsx" options(sheet_name="&dsn." autofilter='all' ) ;
proc report data=&dsn. split='*' nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
if new_change='new' then call define(_row_,'style','style={background=red}');
endcomp;
run;
ods excel close;
%mend;
%type_two(dsn=phase,subjid=pid ,key=visit )
Why is "comment" empty in the first six obs in the result?
You can't change the background colour of cells in a dataset, so you need to create a report. Which file format is expected?
In the data shown all variables that need to be compared are alphanumeric. Is this true for all variables in your original data?
EDIT
Please check the data you have posted an align it with the expected result. The value "upset" is not in the data shown.
Thank you sir for checking
As I need to compare new dataset i.e. phase 2 to phase 1 (master sheet), the comment column should display discrepancy only on new dataset. Therefore phase 1 comment column kept blank
I need to generate the report in excel format for analysis.
No, some variables contain values of numeric, some character and some alphanumeric.
@abraham1 wrote:
Thank you sir for checking
As I need to compare new dataset i.e. phase 2 to phase 1 (master sheet), the comment column should display discrepancy only on new dataset. Therefore phase 1 comment column kept blank
I need to generate the report in excel format for analysis.
No, some variables contain values of numeric, some character and some alphanumeric.
Then please change the data you have posted to match what you really have.
I have added one numeric variable ('dose'). please have a look.
The discrepancy in last variable if found during comparison will be added into comment column. I have not updated the final output. Hope it is fine.
data phase1; input PID AGE Visit phase event $ drug $ study $ dose; cards; 101 23 1 1 asthma parace Yes 5 102 32 1 1 asthma sinarest Yes 10 101 23 2 1 headache drg1 No 5 103 43 1 1 vomit drg4 Yes 5 103 43 2 1 pain drg5 No 100 103 43 3 1 pain drg5 Yes 500 ; run; data phase2; input PID AGE Visit phase event $ drug $ study $ dose; cards; 101 23 1 2 asthma parace Yes 5 102 32 1 2 asthma . Yes 5 101 23 2 2 acidity drg1 No 5 103 43 1 2 vomit drg4 Yes 100 103 43 2 2 pain drg5 No 100 103 43 4 2 pan drg5 Yes 500 107 43 3 2 pain drg5 Yes 200 ; run; proc sort data=phase1;by pid visit phase;run; proc sort data=phase2;by pid visit phase;run; proc sql; create table final as select * from phase1 union all select * from phase2; run;
Have checked the PROC COMPARE to achieve the differences, though it is not reported in the required format.
data phase1;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 1 asthma parace Yes
102 32 1 1 asthma sinarest Yes
101 23 2 1 headache drg1 No
103 43 1 1 vomit drg4 Yes
103 43 2 1 pain drg5 No
103 43 3 1 pain drg5 Yes
;
run;
data phase2;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 2 asthma parace Yes
102 32 1 2 asthma . Yes
101 23 2 2 acidity drg1 No
103 43 1 2 vomit drg4 Yes
103 43 2 2 pain drg5 No
103 43 4 2 pan drg5 Yes
107 43 3 2 pain drg5 Yes
;
run;
libname old v9 'c:\temp\old';
data old.phase;
set phase1;
drop phase;
run;
libname new v9 'c:\temp\new';
data new.phase;
set phase2;
drop phase;
run;
%macro type_two(dsn=,subjid=,key=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn. out=old;by &subjid. &key. ;run;
proc sort data=new.&dsn. out=new;by &subjid. &key. ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid. &key. ;
run;
data call_define(where=(upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)" )));
set diff_&dsn.(drop=_TYPE_ _OBS_ );
length vname $ 40;
array n{*} _numeric_;
array c{*} _character_;
do i=1 to dim(n);
if n{i} not in (. 0) then do;vname=vname(n{i});output;end;
end;
do i=1 to dim(c);
if findc(c{i},'X') then do;vname=vname(c{i});output;end;
end;
keep &subjid. &key. vname;
run;
filename t temp;
data _null_;
set call_define;
file t;
put "if &subjid.='" &subjid. "' and &key.='" &key. "' then call define('" vname "','style','style={background=red}');";
run;
data &dsn.;
if _n_=1 then do;
if 0 then set new.&dsn.;
if 0 then set old.&dsn.;
declare hash h(dataset:"old.&dsn.");
h.definekey("&subjid.","&key.");
h.definedone();
declare hash h1(dataset:'call_define');
h1.definekey("&subjid.","&key.");
h1.definedone();
end;
set new.&dsn.;
length new_change $ 40;
if h.check() ne 0 then new_change='new' ;
if h1.check() = 0 then new_change='change' ;
label new_change='new/change';
run;
ods excel file="c:\temp\want.xlsx" options(sheet_name="&dsn." autofilter='all' ) ;
proc report data=&dsn. split='*' nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
if new_change='new' then call define(_row_,'style','style={background=red}');
endcomp;
run;
ods excel close;
%mend;
%type_two(dsn=phase,subjid=pid ,key=visit )
Hi Sir,
Thank you for detailed information on the question. I have one question, can we please help me what to do if the data from target is deleted while comparing to source? For the current context only. I just add a new record in the phase1 dataset with subject record 104. What to do if the 104 record is deleted in the phase2 dataset.
data phase1;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 1 asthma parace Yes
102 32 1 1 asthma sinarest Yes
101 23 2 1 headache drg1 No
103 43 1 1 vomit drg4 Yes
103 43 2 1 pain drg5 No
103 43 3 1 pain drg5 Yes
104 20 3 1 fever drg6 No
;
run;
data phase2;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 2 asthma parace Yes
102 32 1 2 asthma . Yes
101 23 2 2 acidity drg1 No
103 43 1 2 vomit drg4 Yes
103 43 2 2 pain drg5 No
103 43 4 2 pan drg5 Yes
107 43 3 2 pain drg5 Yes
;
run;
Thank you
/*Try this one*/
data phase1;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 1 asthma parace Yes
102 32 1 1 asthma sinarest Yes
101 23 2 1 headache drg1 No
103 43 1 1 vomit drg4 Yes
103 43 2 1 pain drg5 No
103 43 3 1 pain drg5 Yes
104 20 3 1 fever drg6 No
;
run;
data phase2;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 2 asthma parace Yes
102 32 1 2 asthma . Yes
101 23 2 2 acidity drg1 No
103 43 1 2 vomit drg4 Yes
103 43 2 2 pain drg5 No
103 43 4 2 pan drg5 Yes
107 43 3 2 pain drg5 Yes
;
run;
libname old v9 'c:\temp\old';
data old.phase;
set phase1;
drop phase;
run;
libname new v9 'c:\temp\new';
data new.phase;
set phase2;
drop phase;
run;
%macro type_two(dsn=,subjid=,key=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn. out=old;by &subjid. &key. ;run;
proc sort data=new.&dsn. out=new;by &subjid. &key. ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid. &key. ;
run;
data call_define(where=(upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)" )));
set diff_&dsn.(drop=_TYPE_ _OBS_ );
length vname $ 40;
array n{*} _numeric_;
array c{*} _character_;
do i=1 to dim(n);
if n{i} not in (. 0) then do;vname=vname(n{i});output;end;
end;
do i=1 to dim(c);
if findc(c{i},'X') then do;vname=vname(c{i});output;end;
end;
keep &subjid. &key. vname;
run;
filename t temp;
data _null_;
set call_define;
file t;
put "if &subjid.='" &subjid. "' and &key.='" &key. "' then call define('" vname "','style','style={background=red}');";
run;
data &dsn.;
if _n_=1 then do;
if 0 then set new.&dsn.;
if 0 then set old.&dsn.;
declare hash h(dataset:"old.&dsn.");
h.definekey("&subjid.","&key.");
h.definedone();
declare hash h1(dataset:'call_define');
h1.definekey("&subjid.","&key.");
h1.definedone();
end;
set new.&dsn.;
length new_change $ 40;
if h.check() ne 0 then new_change='new' ;
if h1.check() = 0 then new_change='change' ;
label new_change='new/change/delete';
run;
proc sql;
create table &dsn.2 as
select * from &dsn.
union
(select *,'delete' from old.&dsn. where &subjid. not in (select &subjid. from new.&dsn.));
quit;
ods excel file="c:\temp\want.xlsx" options(sheet_name="&dsn." autofilter='all' ) ;
proc report data=&dsn.2 split='*' nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
if new_change='new' then call define(_row_,'style','style={background=verylightblue}');
if new_change='delete' then call define(_row_,'style','style={background=gold}');
endcomp;
run;
ods excel close;
%mend;
%type_two(dsn=phase,subjid=pid ,key=visit )
Thank you so much sir.
Sorry for the trouble. I have one more question, With the above program we would be able to get the subject only if the subject has one occurrence right?
I have a scenario that the subject has multiple occurrence in the source file like 5 time and in the target couple of record are deleted. So, I tried adding a variable to the where cause for teh select statement but i think its not working. i have multiple datapoint with same variable value. How can filter the deleted record if the multiple variable are same. Based on my scenario, updated few data point below. Can you please review.
data phase1;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 1 asthma parace Yes
102 32 1 1 asthma sinarest Yes
101 23 2 1 headache drg1 No
103 43 1 1 vomit drg4 Yes
103 43 2 1 pain drg5 No
103 43 3 1 pain drg5 Yes
104 20 3 1 fever drg6 No
104 20 3 1 headache drg1 No
104 20 3 1 pain drg3 yes
104 23 3 1 headache drg3 yes
104 20 2 1 pain drg5 No
104 20 4 1 pain drg1 No
104 43 1 1 vomit drg4 Yes
104 23 5 1 headache drg1 No
;
run;
data phase2;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 2 asthma parace Yes
102 32 1 2 asthma . Yes
101 23 2 2 acidity drg1 No
103 43 1 2 vomit drg4 Yes
103 43 2 2 pain drg5 No
103 43 4 2 pan drg5 Yes
107 43 3 2 pain drg5 Yes
104 20 3 1 fever drg6 No
104 20 3 1 headache drg1 No
104 20 2 1 pain drg5 No
;
run;
/*
" we would be able to get the subject only if the subject has one occurrence right? "
That is not true. My code would pick up all the subject's obs if it is not in OLD dataset.
If you want include VISIT variable too ,try the following code.
NOTE: SUBJID= and KEY= identity a unique obs.
You cann't have two obs have the same PID and same VISIT
Therefore,I modify your data to conform this rule.
*/
data phase1;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 1 asthma parace Yes
102 32 1 1 asthma sinarest Yes
101 23 2 1 headache drg1 No
103 43 1 1 vomit drg4 Yes
103 43 2 1 pain drg5 No
103 43 3 1 pain drg5 Yes
104 20 6 1 fever drg6 No
104 20 7 1 headache drg1 No
104 20 8 1 pain drg3 yes
104 23 3 1 headache drg3 yes
104 20 2 1 pain drg5 No
104 20 4 1 pain drg1 No
104 43 1 1 vomit drg4 Yes
104 23 5 1 headache drg1 No
;
run;
data phase2;
input PID AGE Visit phase event $ drug $ study $;
cards;
101 23 1 2 asthma parace Yes
102 32 1 2 asthma . Yes
101 23 2 2 acidity drg1 No
103 43 1 2 vomit drg4 Yes
103 43 2 2 pain drg5 No
103 43 4 2 pan drg5 Yes
107 43 3 2 pain drg5 Yes
104 20 1 1 fever drg6 No
104 20 3 1 headache drg1 No
104 20 2 1 pain drg5 No
;
run;
libname old v9 'c:\temp\old';
data old.phase;
set phase1;
drop phase;
run;
libname new v9 'c:\temp\new';
data new.phase;
set phase2;
drop phase;
run;
%macro type_two(dsn=,subjid=,key=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn. out=old;by &subjid. &key. ;run;
proc sort data=new.&dsn. out=new;by &subjid. &key. ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid. &key. ;
run;
data call_define(where=(upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)" )));
set diff_&dsn.(drop=_TYPE_ _OBS_ );
length vname $ 40;
array n{*} _numeric_;
array c{*} _character_;
do i=1 to dim(n);
if n{i} not in (. 0) then do;vname=vname(n{i});output;end;
end;
do i=1 to dim(c);
if findc(c{i},'X') then do;vname=vname(c{i});output;end;
end;
keep &subjid. &key. vname;
run;
filename t temp;
data _null_;
set call_define;
file t;
put "if &subjid.='" &subjid. "' and &key.='" &key. "' then call define('" vname "','style','style={background=red}');";
run;
data &dsn.;
if _n_=1 then do;
if 0 then set new.&dsn.;
if 0 then set old.&dsn.;
declare hash h(dataset:"old.&dsn.");
h.definekey("&subjid.","&key.");
h.definedone();
declare hash h1(dataset:'call_define');
h1.definekey("&subjid.","&key.");
h1.definedone();
end;
set new.&dsn.;
length new_change $ 40;
if h.check() ne 0 then new_change='new' ;
if h1.check() = 0 then new_change='change' ;
label new_change='new/change/delete';
run;
proc sql;
create table &dsn.2 as
select * from &dsn.
union
(select *,'delete' from old.&dsn. where catx('|',&subjid.,&key.) not in (select catx('|',&subjid.,&key.) from new.&dsn.));
quit;
ods excel file="c:\temp\want.xlsx" options(sheet_name="&dsn." autofilter='all' ) ;
proc report data=&dsn.2 split='*' nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
if new_change='new' then call define(_row_,'style','style={background=verylightblue}');
if new_change='delete' then call define(_row_,'style','style={background=gold}');
endcomp;
run;
ods excel close;
%mend;
%type_two(dsn=phase,subjid=pid ,key=visit )
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.