BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
FK1
Lapis Lazuli | Level 10 FK1
Lapis Lazuli | Level 10

Hello Everybody,

 

I was trying to verify a hash method for comparing dataset-observations (via keys), thereby revealing a somewhat odd inconsistency:

 

data cars ;
set sashelp.cars;
run;

data cars2 (keep = Make Model compare_digest);
length compare_digest $ 32;
set sashelp.cars;
compare_digest = put(md5(catq(' ', Type, Origin, DriveTrain, Invoice)), $hex32.);
run;
 
/* source/compare lookup: hash lookup method  */ 
data 
     work.changed_obs
     work.new_obs
     work.same_obs;

   length source_digest $ 32;
   
   drop source_digest compare_digest ;
   
   if 0 then
      set work.cars2;
      
   if _N_ eq 1 then
   do;
      declare hash hct(dataset: 'work.cars2', hashexp: 20);
                   hct.defineKey("Make", "Model");
                   hct.defineData("Make", "Model", "compare_digest");
                   hct.defineDone();
   end;
   
   set work.CARS (keep = Make Model Type Origin DriveTrain Invoice) end = eof ;
   
   /* create digest  */ 
   source_digest = put(md5(catq(' ', Type, Origin, DriveTrain, Invoice)), $hex32.);
   
   /* source/compare match  */ 
   if hct.find() eq 0 then
   do;
      /* source/compare match: no change detected  */ 
      if source_digest eq compare_digest then
         output work.same_obs;
         
      /* source/compare match: change detected  */ 
      else if source_digest ne compare_digest then
         output work.changed_obs;
         
   end;
   /* source table: new records  */ 
   else if hct.find() ne 0 then
      output work.new_obs;

      
run;


data cars_for_comp cars_for_comp2;
set sashelp.cars;
run;

proc sort data=work.cars_for_comp out=cars_for_comp_s; by make Model;run;
proc sort data=work.cars_for_comp2 out=cars_for_comp2_s; by make Model;run;

proc compare noprint base=work.cars_for_comp_s compare=work.cars_for_comp2_s   out=diffs outnoeq    ; by make Model ;
run;



The 1 million $ question is:

why do I get 3 observations with the hash method that are regarded different between the two "car files" whereas when using the COMPARE PROCEDURE I do get the correct result, namely, THERE ARE NO DIFFERENCES!

 

Cheers,

FK1

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

This is because there are 3   make/model duplicates in the original data set, so you hash object has only 425 data items.

 

42  proc sort data=sashelp.cars out=test nodupkey; by make model;run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 425 observations and 15 variables.

 

Which also means there is one dataitem in the hash object for a given make/model, with one value of the MD5 results.  But the MD5 function for those duplicates depends on

      Type, Origin, DriveTrain, Invoice

which no doubt have differing values for the records with duplicate make/model.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
Reeza
Super User
Are the differences significant?
FK1
Lapis Lazuli | Level 10 FK1
Lapis Lazuli | Level 10
Well, there 428 observations and 3 are not being indentified correctly... so, I would say, it is not significantly much, but I do not know the pattern when matching works and when not. So it remains a result of pure chance to me, which is not acceptable when creataing quality controlled processes...
ballardw
Super User

@FK1 wrote:
Well, there 428 observations and 3 are not being indentified correctly... so, I would say, it is not significantly much, but I do not know the pattern when matching works and when not. So it remains a result of pure chance to me, which is not acceptable when creataing quality controlled processes...
Make           Model                                     DriveTrain

Infiniti        G35 4dr                                  All
Infiniti        G35 4dr                                  Rear
Mercedes-Benz   C240 4dr                                 All
Mercedes-Benz   C240 4dr                                 Rear
Mercedes-Benz   C320 4dr                                 All
Mercedes-Benz   C320 4dr                                 Rear


mkeintz
PROC Star

This is because there are 3   make/model duplicates in the original data set, so you hash object has only 425 data items.

 

42  proc sort data=sashelp.cars out=test nodupkey; by make model;run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 425 observations and 15 variables.

 

Which also means there is one dataitem in the hash object for a given make/model, with one value of the MD5 results.  But the MD5 function for those duplicates depends on

      Type, Origin, DriveTrain, Invoice

which no doubt have differing values for the records with duplicate make/model.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1317 views
  • 2 likes
  • 4 in conversation