Re: Comparing similar datasets and finding the difference between the ...

Sandeep77 · Posted 07-22-2024 09:51 AM

Hi Experts,

I have two similar datasets which has most of the values as same but few of the values in some columns are different. I want to compare both of them and have only those rows which have those values as different. I have imported both the datasets and sorted it. Then I have merged the dataset based on the common variable i.e., DM9_Consumer_ID. I tried Proc Compare but it does the overall comparison and does not give the DM9_Consumer_IDs which have inconsistency. Can you please suggest what is the best way to find the DM9_Consumer_ID which have data difference? Here are both the sample datasets:

Data Existing;

infile cards expandtabs;

input DM9_Consumer_ID $ Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment;

datalines ;

D0461 -59972.47 30617.98 30617.98 30617.98 57882.02 30617.98

D0482 38926.17 24643.95 20899.13 20899.13 38926.17 41600.88

D0645 16856.41 499.50 16491.03 16491.03 16856.41 31175.61

D4146 45732.36 -103457.96 45732.36 45732.36 45732.36 86455.14

D9007 7045.11 -15894.38 7045.11 7045.11 7045.11 13318.49

D9012 24650.07 -28707.57 24650.07 24650.07 24650.07 46599.93

D3024 8495.44 -24975.15 8495.44 8495.44 8495.44 16910.65

D3063 7098.62 302.89 6795.73 6795.73 7098.62 12847.05

;

run;

Data Revised;

infile cards expandtabs;

input DM9_Consumer_ID $ Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment;

datalines ;

D0461 -59972.47 30617.98 30617.98 30617.98 57882.02 30617.98

D0482 24643.95 0.00 0.00 18027.05 41600.88 18027.05

D0645 499.50 0.00 0.00 365.38 31175.61 365.38

D4146 -103457.96 45732.36 45732.36 45732.36 86455.14 45732.36

D9007 -15894.38 7045.11 7045.11 7045.11 13318.49 7045.11

D9012 -28707.57 24650.07 24650.07 24650.07 46599.93 24650.07

D3024 -24975.15 8495.44 8495.44 8495.44 16910.65 8495.44

D3063 302.89 6795.73 6795.73 7098.62 12847.05 7098.62

;

run;

Proc sort data=existing;
by DM9_Consumer_ID;
run;

Proc sort data=revised;
by DM9_Consumer_ID;
run;

Data merged;
merge existing (in=a) revised (in=b);
by DM9_Consumer_ID;
if a and b;
run;

/*Compare the difference*/
proc compare
 base = existing
 compare = revised;
run;

sbxkoenk · Posted 07-22-2024 10:05 AM

Use 
PROC COMPARE statement options: 
NOPRINT
OUT=
OUTBASE
OUTCOMP
OUTDIF
OUTNOEQUAL 
OUTSTATS= ;

Koen

ballardw · Posted 07-22-2024 11:23 AM

Use the BY DM_Consumer_ID option in Proc Compare as well.

If you are particularly interested in just a few particular variables use a VAR statement Proc Compare to reduce unwanted output.

Sandeep77 · Posted 07-22-2024 12:01 PM

They both are big dataset with 81 columns and 100k rows. I am only looking at the variables, I have mentioned in the sample datasets. Could you please elaborate your response? Thanks

ballardw · Posted 07-22-2024 12:28 PM

Data step merge is going to be completely useless of you want to find different values of the variables other than BY variables because of the merge treats like-named variables.

Start with

proc compare
 base = existing
 compare = revised
 OUTNOEQUAL OUT=WORK.NOEQUAL
;
 by DM9_Consumer_ID;
 var Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL 
     Calculated_Provision_Final collectvalcovid Impairment;
run;

The output data set WORK.NOEQUAL will have the value of DM9_consumer_id and the Var variables with the difference of values or the special missing value that displays as E for equal values.

The key part is that Work.Noequal has the By values that have at least one non-equal value in the VAR variables.

If that is sufficient for your "inconsistent" then add a KEEP=DM9_consumer_id data set option to Work.Noequal.

Ksharp · Posted 07-22-2024 10:07 PM

Data Existing;
infile cards expandtabs;
input DM9_Consumer_ID $ Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment;
datalines ;
D0461 -59972.47 30617.98 30617.98 30617.98 57882.02 30617.98
D0482 38926.17 24643.95 20899.13 20899.13 38926.17 41600.88
D0645 16856.41 499.50 16491.03 16491.03 16856.41 31175.61
D4146 45732.36 -103457.96 45732.36 45732.36 45732.36 86455.14
D9007 7045.11 -15894.38 7045.11 7045.11 7045.11 13318.49
D9012 24650.07 -28707.57 24650.07 24650.07 24650.07 46599.93
D3024 8495.44 -24975.15 8495.44 8495.44 8495.44 16910.65
D3063 7098.62 302.89 6795.73 6795.73 7098.62 12847.05
;
run;

Data Revised;
infile cards expandtabs;
input DM9_Consumer_ID $ Calculated_Balance_Non_Covid Calculated_Provision_Covid Calculated_Provision_CBIL Calculated_Provision_Final collectvalcovid Impairment;
datalines ;
D0461 -59972.47 30617.98 30617.98 30617.98 57882.02 30617.98
D0482 24643.95 0.00 0.00 18027.05 41600.88 18027.05
D0645 499.50 0.00 0.00 365.38 31175.61 365.38
D4146 -103457.96 45732.36 45732.36 45732.36 86455.14 45732.36
D9007 -15894.38 7045.11 7045.11 7045.11 13318.49 7045.11
D9012 -28707.57 24650.07 24650.07 24650.07 46599.93 24650.07
D3024 -24975.15 8495.44 8495.44 8495.44 16910.65 8495.44
D3063 302.89 6795.73 6795.73 7098.62 12847.05 7098.62
;
run;







%let old_data= c:\temp\a  ;   *前一次数据 的路径;
%let new_data= c:\temp\b ;   *新数据 的路径;
libname old v9 "&old_data." ;
data old.have;
 set Existing;
run;
libname new v9 "&new_data." ;
data new.have;
 set Revised;
run;



%let subjid= DM9_Consumer_ID;   *受试者编号的变量名. 注意：所有的数据集中是一样的,且值是唯一的;
%let drop= UPDATE_TIME ;   *需要去掉的变量名;




options validvarname=any validmemname=extend mrecall nofmterr nonumber nodate dkricond=nowarn dkrocond=nowarn;

/*******比较 前一次数据 和 新数据********/
libname old v9 "&old_data." access=readonly;
libname new v9 "&new_data." access=readonly;

%macro type_one(idx=,dsn=,title=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn.(drop=&drop.) out=old;by  &subjid.  ;run;
proc sort data=new.&dsn.(drop=&drop.) out=new;by  &subjid.  ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid.;
run;
data call_define;
 set diff_&dsn.(drop=_TYPE_ _OBS_);
 retain _dummy_ .;
 array n{*} _numeric_;
 array c{*} _character_;
 length vname $ 40;
 do i=1 to dim(n);
   if n{i} not in (. 0) then do;vname=vname(n{i}); if upcase(vname) not in ("%upcase(&subjid.)") then output;end;
 end;
 do i=1 to dim(c);
   if findc(c{i},'X') then do;vname=vname(c{i}); if upcase(vname) not in ("%upcase(&subjid.)") then output;end;
 end;
keep &subjid. vname;
run;
filename t temp;
data _null_;
 set call_define;
 file t;
/* put 'if &subjid.="' &subjid. '" then call define("' vname '","style","style={background=yellow}");';*/
 put "if &subjid.='" &subjid. "' then call define('" vname "','style','style={background=yellow}');";
run;
data &dsn.;
if _n_=1 then do;
 length new_change $ 40;
 if 0 then set old.&dsn.;
 declare hash h(dataset:"old.&dsn.");
 h.definekey("&subjid.");
 h.definedone();

 declare hash h1(dataset:'call_define');
 h1.definekey("&subjid.");
 h1.definedone();
end;
 set new.&dsn.;
 new_change='old' ;
 if h.check() ne 0 then new_change='new' ;
 if h1.check() = 0 then new_change='change' ;
label new_change='数据状态';
run;
/*ods excel file="&out.\&idx..&dsn..xlsx" options(sheet_name="&dsn." autofilter='all' ) ;*/
proc report data=&dsn. split='*'  nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
endcomp;
run;
/*ods excel close;*/
%mend;



%type_one(idx=1,dsn=have,title=标题)

Sandeep77 · Posted 07-23-2024 04:52 AM

Thanks for the detailed information. But what is libname old v9 "&old_data." and New Libname? When I try to run the data step after the new and Old linrary, it shows ERROR: Library OLD does not exist. Could you please explain? Thanks

Ksharp · Posted 07-23-2024 05:04 AM

For the comparing these two tables:
I put table “Existing” under path " c:\temp\a " ,and put table “Revised” under path " c:\temp\b " .
If you do not have these two directory, you should firstly create them manually .
And You see I have these two table the same name 'have' to make my code to run.

Sandeep77 · Posted 07-23-2024 05:14 AM

Apologies as it is still not clear to me. I have changed the old_data and New_data file path to my existing files.
%let old_data= c:\Users\56029\Desktop\Compare\Existing.xlsx ; *Existing data path;
%let new_data= c:\Users\56029\Desktop\Compare\Revised.xlsx ; *Revised data path;
Are you referring to change the old and new data path to where the actual files are?

Ksharp · Posted 07-23-2024 05:21 AM

You could see I compare two SAS TABLES, not EXCEL file, I think you should import these excel file into sas firstly.
And %let old_data= is a path/directory/folder , NOT a file.
So is "%let new_data= ".

Here is an example:
%let old_data= c:\Users\56029\Desktop\Compare1\ ;
%let new_data= c:\Users\56029\Desktop\Compare2\ ;

Create two folders "Compare1\" and "Compare2\" by hand, if you do not have under your OS.

Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

Re: Comparing similar datasets and finding the difference between the two datasets

SAS Innovate 2025: Call for Content