Hello Everybody,
I was trying to verify a hash method for comparing dataset-observations (via keys), thereby revealing a somewhat odd inconsistency:
data cars ;
set sashelp.cars;
run;
data cars2 (keep = Make Model compare_digest);
length compare_digest $ 32;
set sashelp.cars;
compare_digest = put(md5(catq(' ', Type, Origin, DriveTrain, Invoice)), $hex32.);
run;
/* source/compare lookup: hash lookup method */
data
work.changed_obs
work.new_obs
work.same_obs;
length source_digest $ 32;
drop source_digest compare_digest ;
if 0 then
set work.cars2;
if _N_ eq 1 then
do;
declare hash hct(dataset: 'work.cars2', hashexp: 20);
hct.defineKey("Make", "Model");
hct.defineData("Make", "Model", "compare_digest");
hct.defineDone();
end;
set work.CARS (keep = Make Model Type Origin DriveTrain Invoice) end = eof ;
/* create digest */
source_digest = put(md5(catq(' ', Type, Origin, DriveTrain, Invoice)), $hex32.);
/* source/compare match */
if hct.find() eq 0 then
do;
/* source/compare match: no change detected */
if source_digest eq compare_digest then
output work.same_obs;
/* source/compare match: change detected */
else if source_digest ne compare_digest then
output work.changed_obs;
end;
/* source table: new records */
else if hct.find() ne 0 then
output work.new_obs;
run;
data cars_for_comp cars_for_comp2;
set sashelp.cars;
run;
proc sort data=work.cars_for_comp out=cars_for_comp_s; by make Model;run;
proc sort data=work.cars_for_comp2 out=cars_for_comp2_s; by make Model;run;
proc compare noprint base=work.cars_for_comp_s compare=work.cars_for_comp2_s out=diffs outnoeq ; by make Model ;
run;
The 1 million $ question is:
why do I get 3 observations with the hash method that are regarded different between the two "car files" whereas when using the COMPARE PROCEDURE I do get the correct result, namely, THERE ARE NO DIFFERENCES!
Cheers,
FK1
This is because there are 3 make/model duplicates in the original data set, so you hash object has only 425 data items.
42 proc sort data=sashelp.cars out=test nodupkey; by make model;run;
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 425 observations and 15 variables.
Which also means there is one dataitem in the hash object for a given make/model, with one value of the MD5 results. But the MD5 function for those duplicates depends on
Type, Origin, DriveTrain, Invoice
which no doubt have differing values for the records with duplicate make/model.
@FK1 wrote:
Well, there 428 observations and 3 are not being indentified correctly... so, I would say, it is not significantly much, but I do not know the pattern when matching works and when not. So it remains a result of pure chance to me, which is not acceptable when creataing quality controlled processes...
Make Model DriveTrain Infiniti G35 4dr All Infiniti G35 4dr Rear Mercedes-Benz C240 4dr All Mercedes-Benz C240 4dr Rear Mercedes-Benz C320 4dr All Mercedes-Benz C320 4dr Rear
This is because there are 3 make/model duplicates in the original data set, so you hash object has only 425 data items.
42 proc sort data=sashelp.cars out=test nodupkey; by make model;run;
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 425 observations and 15 variables.
Which also means there is one dataitem in the hash object for a given make/model, with one value of the MD5 results. But the MD5 function for those duplicates depends on
Type, Origin, DriveTrain, Invoice
which no doubt have differing values for the records with duplicate make/model.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.