Proc sql code to help link fuzzy records

Occasional Contributor
Posts: 17

Proc sql code to help link fuzzy records

I'm working with fuzzy data and trying to link prenatal care records within a dataset. See my original question post with the link below, where unfortunately none of the proposed solutions worked for me. My latest attempt is to use a working paper on "Linking records matching on at least m of n identifying keys" ( Unfortunately, I have yet to learn SQL so I'm sort of blindly submitting these statements. If someone could see what my error might be, I'd appreciate it. It creates a table with nothing in it.



proc sql;

      create table pairs as

            select       datasetX.subjectid as subjectid1,

                        datasetY.subjectid as subjectid2

            from datasetX datasetY

            where       (      (datasetX.name1 eq datasetY.name1)

                        +       (datasetX.name2 eq datasetY.name2)

                        +       (datasetX.name3 eq datasetY.name3)

                        +       ( eq

                        +       (datasetX.dob eq datasetY.dob)

                        +      (datasetX.village eq datasetY.village) ) >= 2

            and      (datasetX.subjectid < datasetY.subjectid)

            order by subjectid1, subjectid2;



Thanks so much.

Super User
Posts: 13,523

Re: Proc sql code to help link fuzzy records

You might investigate Link Plus is a tool CDC uses to match data from different sources. It may save you a lot of work.


It does exact matches and then assigns probability of matches for non-exact matches. if you need to process very large numbers of records you can set a thershold for the probability to treat as 'assumed' matched in a batch mode.


Cost is free.

Super User
Posts: 23,700

Re: Proc sql code to help link fuzzy records

There's a helpful post on here from FriedEgg

Occasional Contributor
Posts: 17

Re: Proc sql code to help link fuzzy records

Thanks, I'll try those today andhopefully one of the strategies will work. 


Just as an FYI, I had trouble with the Link King because it required some elements that my database didn't have, and wasn't able to use more than one "flex" variable (ie. not one of the core identifier variables that it felt like I should be using) to identify my records, which would have been helpful in my circumstance. However, even when I was successful in getting most of my data to work within its parameters, it mysteriously deleted about 1/2 of my records. I spent half a day trouble shooting that, to no success.

Ask a Question
Discussion stats
  • 3 replies
  • 1 like
  • 3 in conversation