Hi Hans
An old dog just learnt a new trick
🙂
I couldn't resist using it for the code I've posted:
a "self-interleaving data set" for populating the hash - the code just looks better.
https://groups.google.com/group/comp.soft-sys.sas/browse_thread/thread/71ff40f1a21e05ad?hl=en#
By the way: The doco says the following about spedis():
The SPEDIS function is similar to the COMPLEV and COMPGED functions, but COMPLEV and COMPGED are much faster, especially for long strings.
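For anyone without SAS at hand: the metric COMPLEV computes is plain Levenshtein edit distance, while SPEDIS uses an asymmetric, operation-weighted cost. A minimal Python sketch of the Levenshtein side (my own illustration, not the SAS implementation):

```python
# Minimal Levenshtein edit distance - the metric COMPLEV computes.
# (SPEDIS instead charges different costs per operation and position.)
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Because it is a single dynamic-programming pass with no per-operation weighting, it is easy to see why COMPLEV is faster than SPEDIS on long strings.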
And here is the amended code:
options ls=120 ps=max;
data have;
key+1;
infile datalines dsd truncover;
attrib JournalTitle ArticleTitle FirstWordArticle FirstWordAuthor informat=$30.
Volume Issue PubYear StartPage informat=8.;
input JournalTitle ArticleTitle Volume Issue PubYear StartPage FirstWordArticle FirstWordAuthor;
datalines;
LANCET,BETA CAROTENE DEFICIENCY,324,1,2005,131,BETA,SILMAN
LANCET,B CAROTENE DEF,324,1,2005,131,B,SILMAN
LANCET,B CAROTENE DEF,,1,2005,,B,SILMAN
LANCET,THROMBOSIS AND NEUROPATHY,,,2006,,THROMBOSIS,
SCIENCE,IMMUNOGLOBULIN ALLOTYPES,11,4,2005,,IMMUNOGLOBULIN,RHYS
;
run;
data work.PossibleDuplicates;
attrib score key length=8
;
set have (in=hash
keep= JournalTitle ArticleTitle Volume FirstWordArticle FirstWordAuthor Issue PubYear StartPage key
rename=(ArticleTitle=_ArticleTitle Volume=_Volume FirstWordArticle=_FirstWordArticle
FirstWordAuthor=_FirstWordAuthor Issue=_Issue PubYear=_PubYear StartPage=_StartPage key=_key)
)
have (in=data)
end=last
;
by JournalTitle;
/* load all relevant rows of a Journal into hash table */
if first.JournalTitle and hash then
do;
/* Declare and instantiate hash object "h1" */
declare hash h1(multidata:'Y');
_rc = h1.defineKey('JournalTitle');
_rc = h1.defineData('_ArticleTitle','_Volume','_FirstWordArticle','_FirstWordAuthor','_Issue','_PubYear','_StartPage','_key');
_rc = h1.defineDone( );
declare hiter h1_iter('h1');
/* avoid uninitialized variable notes */
call missing(of _:);
end;
if hash then
do;
/* load data for current journal into hash*/
_rc= h1.add();
return;
end;
/* iterate over all rows in hash table (same journal) */
if data then
do;
_rc = h1_iter.first();
do while (_rc=0);
if key ne _key then /* do not compare an article with itself */
do;
/*** calculate scores ***/
/* assume same title if condition true: set score to 0 */
if _Volume=Volume AND _Issue=Issue AND _PubYear=PubYear AND _StartPage=StartPage then
do;
score=0;
end;
/* calculate score for all other cases */
else
do;
score=spedis(cats(ArticleTitle,Volume,FirstWordArticle),cats(_ArticleTitle,_Volume,_FirstWordArticle));
end;
/* write possible duplicates to target table */
if score<20 then
do;
ParentKey=_key;
output;
end;
end; /* end: key ne _key */
_rc = h1_iter.next();
end; /* end: do while */
end; /* end: if data */
run;
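The within-journal pairwise comparison that the hash table and iterator perform can be sketched outside SAS as well. A rough Python equivalent (hypothetical record layout and helper names; the threshold of 20 carries over only loosely, since SPEDIS scores are normalized while most plain string distances are not):

```python
from itertools import groupby

def possible_duplicates(records, threshold=20, dist=None):
    """records: list of dicts sorted by 'JournalTitle' (like the BY group).
    Returns (key, parent_key, score) triples for likely duplicates."""
    if dist is None:
        # stand-in for SPEDIS/COMPLEV; any string distance works here
        import difflib
        dist = lambda a, b: int((1 - difflib.SequenceMatcher(None, a, b).ratio()) * 100)
    out = []
    for _, grp in groupby(records, key=lambda r: r["JournalTitle"]):
        grp = list(grp)                      # the "hash table" for one journal
        for r in grp:                        # the "data" pass over the same rows
            for h in grp:                    # the iterator pass over the hash
                if r["key"] == h["key"]:
                    continue                 # do not compare an article with itself
                if all(r[f] == h[f] for f in ("Volume", "Issue", "PubYear", "StartPage")):
                    score = 0                # identical bibliographic fields
                else:
                    a = f'{r["ArticleTitle"]}{r["Volume"]}{r["FirstWordArticle"]}'
                    b = f'{h["ArticleTitle"]}{h["Volume"]}{h["FirstWordArticle"]}'
                    score = dist(a, b)
                if score < threshold:
                    out.append((r["key"], h["key"], score))
    return out
```

The nested loops make the quadratic cost per journal explicit - the same cost the hash iterator incurs, which is why restricting the hash to one journal at a time matters.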
proc sql;
create view V_PossDup as
select *
from work.PossibleDuplicates
order by _key,key,score
;
quit;
title 'List of possible duplicate records';
proc print data=V_PossDup noobs uniform;
run;
title;
I made the assumption that sorting out duplicates will have to be a manual process, so the result of the string comparison is a data set of possible duplicates for manual review.
HTH
Patrick