Solved: Re: HASH vs. PROC COMPARE

FK1 · Posted 10-04-2018 11:48 AM

Hello Everybody,

I was trying to verify a hash method for comparing dataset-observations (via keys), thereby revealing a somewhat odd inconsistency:

data cars ;
set sashelp.cars;
run;

data cars2 (keep = Make Model compare_digest);
length compare_digest $ 32;
set sashelp.cars;
compare_digest = put(md5(catq(' ', Type, Origin, DriveTrain, Invoice)), $hex32.);
run;
 
/* source/compare lookup: hash lookup method  */ 
data 
     work.changed_obs
     work.new_obs
     work.same_obs;

   length source_digest $ 32;
   
   drop source_digest compare_digest ;
   
   if 0 then
      set work.cars2;
      
   if _N_ eq 1 then
   do;
      declare hash hct(dataset: 'work.cars2', hashexp: 20);
                   hct.defineKey("Make", "Model");
                   hct.defineData("Make", "Model", "compare_digest");
                   hct.defineDone();
   end;
   
   set work.CARS (keep = Make Model Type Origin DriveTrain Invoice) end = eof ;
   
   /* create digest  */ 
   source_digest = put(md5(catq(' ', Type, Origin, DriveTrain, Invoice)), $hex32.);
   
   /* source/compare match  */ 
   if hct.find() eq 0 then
   do;
      /* source/compare match: no change detected  */ 
      if source_digest eq compare_digest then
         output work.same_obs;
         
      /* source/compare match: change detected  */ 
      else if source_digest ne compare_digest then
         output work.changed_obs;
         
   end;
   /* source table: new records  */ 
   else if hct.find() ne 0 then
      output work.new_obs;

      
run;


data cars_for_comp cars_for_comp2;
set sashelp.cars;
run;

proc sort data=work.cars_for_comp out=cars_for_comp_s; by make Model;run;
proc sort data=work.cars_for_comp2 out=cars_for_comp2_s; by make Model;run;

proc compare noprint base=work.cars_for_comp_s compare=work.cars_for_comp2_s   out=diffs outnoeq    ; by make Model ;
run;

The 1 million $ question is:

why do I get 3 observations with the hash method that are regarded different between the two "car files" whereas when using the COMPARE PROCEDURE I do get the correct result, namely, THERE ARE NO DIFFERENCES!

Cheers,

FK1

mkeintz · Posted 10-04-2018 12:35 PM

This is because there are 3 make/model duplicates in the original data set, so you hash object has only 425 data items.

42  proc sort data=sashelp.cars out=test nodupkey; by make model;run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 425 observations and 15 variables.

Which also means there is one dataitem in the hash object for a given make/model, with one value of the MD5 results. But the MD5 function for those duplicates depends on

Type, Origin, DriveTrain, Invoice

which no doubt have differing values for the records with duplicate make/model.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 10-04-2018 12:12 PM

Are the differences significant?

FK1 · Posted 10-04-2018 12:28 PM

Well, there 428 observations and 3 are not being indentified correctly... so, I would say, it is not significantly much, but I do not know the pattern when matching works and when not. So it remains a result of pure chance to me, which is not acceptable when creataing quality controlled processes...

ballardw · Posted 10-04-2018 01:25 PM

@FK1 wrote:
Well, there 428 observations and 3 are not being indentified correctly... so, I would say, it is not significantly much, but I do not know the pattern when matching works and when not. So it remains a result of pure chance to me, which is not acceptable when creataing quality controlled processes...

Make           Model                                     DriveTrain

Infiniti        G35 4dr                                  All
Infiniti        G35 4dr                                  Rear
Mercedes-Benz   C240 4dr                                 All
Mercedes-Benz   C240 4dr                                 Rear
Mercedes-Benz   C320 4dr                                 All
Mercedes-Benz   C320 4dr                                 Rear

mkeintz · Posted 10-04-2018 12:35 PM

This is because there are 3 make/model duplicates in the original data set, so you hash object has only 425 data items.

42  proc sort data=sashelp.cars out=test nodupkey; by make model;run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 425 observations and 15 variables.

Which also means there is one dataitem in the hash object for a given make/model, with one value of the MD5 results. But the MD5 function for those duplicates depends on

Type, Origin, DriveTrain, Invoice

which no doubt have differing values for the records with duplicate make/model.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

HASH vs. PROC COMPARE

Re: HASH vs. PROC COMPARE

Re: HASH vs. PROC COMPARE

Re: HASH vs. PROC COMPARE

Re: HASH vs. PROC COMPARE

Re: HASH vs. PROC COMPARE

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away