BookmarkSubscribeRSS Feed
Sandeep77
Lapis Lazuli | Level 10

Hi Experts,

I have two similar datasets which has most of the values as same but few of the values in some columns are different. I want to compare both of them and have only those rows which have those values as different. I have imported both the datasets and sorted it. Then I have merged the dataset based on the common variable i.e., DM9_Consumer_ID. I tried Proc Compare but it does the overall comparison and does not give the DM9_Consumer_IDs which have inconsistency.  Can you please suggest what is the best way to find the DM9_Consumer_ID which have data difference? Here are both the sample datasets:

Data Existing;
infile cards expandtabs;
input DM9_Consumer_ID $ Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment;
datalines ;
D0461 -59972.47 30617.98 30617.98 30617.98 57882.02 30617.98
D0482 38926.17 24643.95 20899.13 20899.13 38926.17 41600.88
D0645 16856.41 499.50 16491.03 16491.03 16856.41 31175.61
D4146 45732.36 -103457.96 45732.36 45732.36 45732.36 86455.14
D9007 7045.11 -15894.38 7045.11 7045.11 7045.11 13318.49
D9012 24650.07 -28707.57 24650.07 24650.07 24650.07 46599.93
D3024 8495.44 -24975.15 8495.44 8495.44 8495.44 16910.65
D3063 7098.62 302.89 6795.73 6795.73 7098.62 12847.05
;
run;

Data Revised;
infile cards expandtabs;
input DM9_Consumer_ID $ Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment;
datalines ;
D0461 -59972.47 30617.98 30617.98 30617.98 57882.02 30617.98
D0482 24643.95 0.00 0.00 18027.05 41600.88 18027.05
D0645 499.50 0.00 0.00 365.38 31175.61 365.38
D4146 -103457.96 45732.36 45732.36 45732.36 86455.14 45732.36
D9007 -15894.38 7045.11 7045.11 7045.11 13318.49 7045.11
D9012 -28707.57 24650.07 24650.07 24650.07 46599.93 24650.07
D3024 -24975.15 8495.44 8495.44 8495.44 16910.65 8495.44
D3063 302.89 6795.73 6795.73 7098.62 12847.05 7098.62
;
run;
Proc sort data=existing;
by DM9_Consumer_ID;
run;

Proc sort data=revised;
by DM9_Consumer_ID;
run;

Data merged;
merge existing (in=a) revised (in=b);
by DM9_Consumer_ID;
if a and b;
run;

/*Compare the difference*/
proc compare
 base = existing
 compare = revised;
run;
9 REPLIES 9
sbxkoenk
SAS Super FREQ
Use 
PROC COMPARE statement options: 
NOPRINT
OUT=
OUTBASE
OUTCOMP
OUTDIF
OUTNOEQUAL 
OUTSTATS= ;

Koen

ballardw
Super User

Use the BY DM_Consumer_ID option in Proc Compare as well.

 

If you are particularly interested in just a few particular variables use a VAR statement Proc Compare to reduce unwanted output.

Sandeep77
Lapis Lazuli | Level 10
They both are big dataset with 81 columns and 100k rows. I am only looking at the variables, I have mentioned in the sample datasets. Could you please elaborate your response? Thanks
ballardw
Super User

Data step merge is going to be completely useless of you want to find different values of the variables other than BY variables because of the merge treats like-named variables.

 

Start with

proc compare
 base = existing
 compare = revised
OUTNOEQUAL OUT=WORK.NOEQUAL
; by DM9_Consumer_ID; var Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment; run;

The output data set WORK.NOEQUAL will have the value of DM9_consumer_id and the Var variables with the difference of values or the special missing value that displays as E for equal values.

The key part is that Work.Noequal has the By values that have at least one non-equal value in the VAR variables.

If that is sufficient for your "inconsistent" then add a KEEP=DM9_consumer_id data set option to Work.Noequal.

 

Ksharp
Super User

 

 

Data Existing;
infile cards expandtabs;
input DM9_Consumer_ID $ Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment;
datalines ;
D0461 -59972.47 30617.98 30617.98 30617.98 57882.02 30617.98
D0482 38926.17 24643.95 20899.13 20899.13 38926.17 41600.88
D0645 16856.41 499.50 16491.03 16491.03 16856.41 31175.61
D4146 45732.36 -103457.96 45732.36 45732.36 45732.36 86455.14
D9007 7045.11 -15894.38 7045.11 7045.11 7045.11 13318.49
D9012 24650.07 -28707.57 24650.07 24650.07 24650.07 46599.93
D3024 8495.44 -24975.15 8495.44 8495.44 8495.44 16910.65
D3063 7098.62 302.89 6795.73 6795.73 7098.62 12847.05
;
run;

Data Revised;
infile cards expandtabs;
input DM9_Consumer_ID $ Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment;
datalines ;
D0461 -59972.47 30617.98 30617.98 30617.98 57882.02 30617.98
D0482 24643.95 0.00 0.00 18027.05 41600.88 18027.05
D0645 499.50 0.00 0.00 365.38 31175.61 365.38
D4146 -103457.96 45732.36 45732.36 45732.36 86455.14 45732.36
D9007 -15894.38 7045.11 7045.11 7045.11 13318.49 7045.11
D9012 -28707.57 24650.07 24650.07 24650.07 46599.93 24650.07
D3024 -24975.15 8495.44 8495.44 8495.44 16910.65 8495.44
D3063 302.89 6795.73 6795.73 7098.62 12847.05 7098.62
;
run;







%let old_data= c:\temp\a  ;   *前一次数据 的路径;
%let new_data= c:\temp\b ;   *新数据 的路径;
libname old v9 "&old_data." ;
data old.have;
 set Existing;
run;
libname new v9 "&new_data." ;
data new.have;
 set Revised;
run;



%let subjid= DM9_Consumer_ID;   *受试者编号的变量名. 注意:所有的数据集中是一样的,且值是唯一的;
%let drop= UPDATE_TIME ;   *需要去掉的变量名;




options validvarname=any validmemname=extend mrecall nofmterr nonumber nodate dkricond=nowarn dkrocond=nowarn;

/*******比较 前一次数据 和 新数据********/
libname old v9 "&old_data." access=readonly;
libname new v9 "&new_data." access=readonly;

%macro type_one(idx=,dsn=,title=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn.(drop=&drop.) out=old;by  &subjid.  ;run;
proc sort data=new.&dsn.(drop=&drop.) out=new;by  &subjid.  ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid.;
run;
data call_define;
 set diff_&dsn.(drop=_TYPE_ _OBS_);
 retain _dummy_ .;
 array n{*} _numeric_;
 array c{*} _character_;
 length vname $ 40;
 do i=1 to dim(n);
   if n{i} not in (. 0) then do;vname=vname(n{i}); if upcase(vname) not in ("%upcase(&subjid.)") then output;end;
 end;
 do i=1 to dim(c);
   if findc(c{i},'X') then do;vname=vname(c{i}); if upcase(vname) not in ("%upcase(&subjid.)") then output;end;
 end;
keep &subjid. vname;
run;
filename t temp;
data _null_;
 set call_define;
 file t;
/* put 'if &subjid.="' &subjid. '" then call define("' vname '","style","style={background=yellow}");';*/
 put "if &subjid.='" &subjid. "' then call define('" vname "','style','style={background=yellow}');";
run;
data &dsn.;
if _n_=1 then do;
 length new_change $ 40;
 if 0 then set old.&dsn.;
 declare hash h(dataset:"old.&dsn.");
 h.definekey("&subjid.");
 h.definedone();

 declare hash h1(dataset:'call_define');
 h1.definekey("&subjid.");
 h1.definedone();
end;
 set new.&dsn.;
 new_change='old' ;
 if h.check() ne 0 then new_change='new' ;
 if h1.check() = 0 then new_change='change' ;
label new_change='数据状态';
run;
/*ods excel file="&out.\&idx..&dsn..xlsx" options(sheet_name="&dsn." autofilter='all' ) ;*/
proc report data=&dsn. split='*'  nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
endcomp;
run;
/*ods excel close;*/
%mend;



%type_one(idx=1,dsn=have,title=标题)

 

 

 

Ksharp_0-1721700352309.png

 

Sandeep77
Lapis Lazuli | Level 10
Thanks for the detailed information. But what is libname old v9 "&old_data." and New Libname? When I try to run the data step after the new and Old linrary, it shows ERROR: Library OLD does not exist. Could you please explain? Thanks
Ksharp
Super User
For the comparing these two tables:
I put table “Existing” under path " c:\temp\a " ,and put table “Revised” under path " c:\temp\b " .
If you do not have these two directory, you should firstly create them manually .
And You see I have these two table the same name 'have' to make my code to run.
Sandeep77
Lapis Lazuli | Level 10
Apologies as it is still not clear to me. I have changed the old_data and New_data file path to my existing files.
%let old_data= c:\Users\56029\Desktop\Compare\Existing.xlsx ; *Existing data path;
%let new_data= c:\Users\56029\Desktop\Compare\Revised.xlsx ; *Revised data path;
Are you referring to change the old and new data path to where the actual files are?
Ksharp
Super User

You could see I compare two SAS TABLES, not EXCEL file, I think you should import these excel file into sas firstly.
And %let old_data= is a path/directory/folder , NOT a file.
So is "%let new_data= ".

Here is an example:
%let old_data= c:\Users\56029\Desktop\Compare1\ ;
%let new_data= c:\Users\56029\Desktop\Compare2\ ;

Create two folders "Compare1\" and "Compare2\" by hand, if you do not have under your OS.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 2218 views
  • 3 likes
  • 4 in conversation