Hi Hans
An old dog just learnt a new trick
🙂
I couldn't resist using it for the code I've posted:
a "self-interleaving data set" for populating the hash - the code just looks better.
https://groups.google.com/group/comp.soft-sys.sas/browse_thread/thread/71ff40f1a21e05ad?hl=en#
By the way: The doco says the following about spedis():
The SPEDIS function is similar to the COMPLEV and COMPGED functions, but COMPLEV and COMPGED are much faster, especially for long strings.
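For anyone without SAS at hand: the metric COMPLEV computes is plain Levenshtein edit distance, while SPEDIS uses an asymmetric, operation-weighted cost. A minimal Python sketch of the Levenshtein side (my own illustration, not the SAS implementation):

```python
# Minimal Levenshtein edit distance - the metric COMPLEV computes.
# (SPEDIS instead charges different costs per operation and position.)
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Because it is a single dynamic-programming pass with no per-operation weighting, it is easy to see why COMPLEV is faster than SPEDIS on long strings.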
And here is the amended code:
options ls=120 ps=max;
data have;
key+1;
infile datalines dsd truncover;
attrib JournalTitle ArticleTitle FirstWordArticle FirstWordAuthor informat=$30.
Volume Issue PubYear StartPage informat=8.;
input JournalTitle ArticleTitle Volume Issue PubYear StartPage FirstWordArticle FirstWordAuthor;
datalines;
LANCET,BETA CAROTENE DEFICIENCY,324,1,2005,131,BETA,SILMAN
LANCET,B CAROTENE DEF,324,1,2005,131,B,SILMAN
LANCET,B CAROTENE DEF,,1,2005,,B,SILMAN
LANCET,THROMBOSIS AND NEUROPATHY,,,2006,,THROMBOSIS,
SCIENCE,IMMUNOGLOBULIN ALLOTYPES,11,4,2005,,IMMUNOGLOBULIN,RHYS
;
run;
data work.PossibleDuplicates;
attrib score key length=8
;
set have (in=hash
keep= JournalTitle ArticleTitle Volume FirstWordArticle FirstWordAuthor Issue PubYear StartPage key
rename=(ArticleTitle=_ArticleTitle Volume=_Volume FirstWordArticle=_FirstWordArticle
FirstWordAuthor=_FirstWordAuthor Issue=_Issue PubYear=_PubYear StartPage=_StartPage key=_key)
)
have (in=data)
end=last
;
by JournalTitle;
/* load all relevant rows of a Journal into hash table */
if first.JournalTitle and hash then
do;
/* Declare and instantiate hash object "h1" */
declare hash h1(multidata:'Y');
_rc = h1.defineKey('JournalTitle');
_rc = h1.defineData('_ArticleTitle','_Volume','_FirstWordArticle','_FirstWordAuthor','_Issue','_PubYear','_StartPage','_key');
_rc = h1.defineDone( );
declare hiter h1_iter('h1');
/* avoid uninitialized variable notes */
call missing(of _:);
end;
if hash then
do;
/* load data for current journal into hash*/
_rc= h1.add();
return;
end;
/* iterate over all rows in hash table (same journal) */
if data then
do;
_rc = h1_iter.first();
do while (_rc=0);
if key ne _key then /* do not compare an article with itself */
do;
/*** calculate scores ***/
/* assume same title if condition true: set score to 0 */
if _Volume=Volume AND _Issue=Issue AND _PubYear=PubYear AND _StartPage=StartPage then
do;
score=0;
end;
/* calculate score for all other cases */
else
do;
score=spedis(cats(ArticleTitle,Volume,FirstWordArticle),cats(_ArticleTitle,_Volume,_FirstWordArticle));
end;
/* write possible duplicates to target table */
if score<20 then
do;
ParentKey=_key;
output;
end;
end; /* end: key ne _key */
_rc = h1_iter.next();
end; /* end: do while */
end; /* end: if data */
run;
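The within-journal pairwise comparison that the hash table and iterator perform can be sketched outside SAS as well. A rough Python equivalent (hypothetical record layout and helper names; the threshold of 20 carries over only loosely, since SPEDIS scores are normalized while most plain string distances are not):

```python
from itertools import groupby

def possible_duplicates(records, threshold=20, dist=None):
    """records: list of dicts sorted by 'JournalTitle' (like the BY group).
    Returns (key, parent_key, score) triples for likely duplicates."""
    if dist is None:
        # stand-in for SPEDIS/COMPLEV; any string distance works here
        import difflib
        dist = lambda a, b: int((1 - difflib.SequenceMatcher(None, a, b).ratio()) * 100)
    out = []
    for _, grp in groupby(records, key=lambda r: r["JournalTitle"]):
        grp = list(grp)                      # the "hash table" for one journal
        for r in grp:                        # the "data" pass over the same rows
            for h in grp:                    # the iterator pass over the hash
                if r["key"] == h["key"]:
                    continue                 # do not compare an article with itself
                if all(r[f] == h[f] for f in ("Volume", "Issue", "PubYear", "StartPage")):
                    score = 0                # identical bibliographic fields
                else:
                    a = f'{r["ArticleTitle"]}{r["Volume"]}{r["FirstWordArticle"]}'
                    b = f'{h["ArticleTitle"]}{h["Volume"]}{h["FirstWordArticle"]}'
                    score = dist(a, b)
                if score < threshold:
                    out.append((r["key"], h["key"], score))
    return out
```

The nested loops make the quadratic cost per journal explicit - the same cost the hash iterator incurs, which is why restricting the hash to one journal at a time matters.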
proc sql;
create view V_PossDup as
select *
from work.PossibleDuplicates
order by _key,key,score
;
quit;
title 'List of possible duplicate records';
proc print data=V_PossDup noobs uniform;
run;
title;
I made the assumption that sorting out duplicates will have to be a manual process, so the result of the string comparison is a data set of possible duplicates for manual review.
HTH
Patrick