BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
HB
Barite | Level 11 HB
Barite | Level 11

I am doing some basic data cleaning- addresses.

I have a base file with all the observations.

I can write little code blocks that will peel off a subset of the data that meet this or that specific condition- this block deals with PO boxes, that block deals with Rural Routes, that block attacks things without spaces, etc.

So what I have now is:

data poboxes;

set addresses;

if some conditions;

data manipulation;

data ruralroutes;

set addresses;

if some conditions;

data manipulation;

data oneswithoutspaces;

set addresses;

if some conditions;

data manipulation;

etc etc

what I want is to be able to break off this little part, deal with it, and then break off the next little part FROM WHAT IS LEFT and deal with that, and then break off another part FROM WHAT IS LEFT AFTER THAT, etc all the way down.  Creating an ever slimming "base" file.

More like:

data poboxes;

set addresses;

if some conditions;

data manipulation;

data ruralroutes;

set addresses minus the stuff I took out and put in poboxes;

if some conditions;

data manipulation;

data oneswithoutspaces;

set (addresses minus the stuff I took out and put in poboxes) minus the stuff I took out and put in ruralroutes;

if some conditions;

data manipulation;

etc etc

But I don't know the most elegant/best practice in coding structure to do this.

Suggestions?  Please?

Mike

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

Just a slight change in how you approach the problem:

data poboxes ruralroutes oneswithoutspaces;

  set addresses;

  if some conditions then do;

    data manipulation;

                   output poboxes;;

                 end;

  else if some conditions then do;

    data manipulation;

                   output ruralroutes;

              end;

  else if some conditions then do;

    data manipulation;

                   output oneswithoutspacess;

              end;

            run;

View solution in original post

7 REPLIES 7
art297
Opal | Level 21

Just a slight change in how you approach the problem:

data poboxes ruralroutes oneswithoutspaces;

  set addresses;

  if some conditions then do;

    data manipulation;

                   output poboxes;;

                 end;

  else if some conditions then do;

    data manipulation;

                   output ruralroutes;

              end;

  else if some conditions then do;

    data manipulation;

                   output oneswithoutspacess;

              end;

            run;

HB
Barite | Level 11 HB
Barite | Level 11

That did it!.  Thanks.

The "if..ends" and "else if ...ends" do peel away the small groups I want to work with.

I just had to figure out the syntax to leave me with the leftovers, as it were.

if some conditions then

do;

    data manipulation;

                   output poboxes;;

               end;


else if some conditions then

do;

    data manipulation;

                   output ruralroutes;

               end;

 

else if some conditions then

do;

    data manipulation;

                   output oneswithoutspacess;

               end;

           

               else

               do;

    data manipulation;

                   output everythingelse;

               end;

               run;

Hao
Calcite | Level 5 Hao
Calcite | Level 5

addresses data  is really hard to clean;

currently, i do just try to scan first one or two words as key words write your if condition; hopefully these help

e.g.

data poboxes;

set addresses;

if scan(address,1) in ('PO', 'POBOX', 'PBOX') THEN

                              DO;

                              ADDRSS = 'POBOX'||' '||SCAN(ADDRESS,-1) ;

                              END;

ELSE IF .....THEN DO;

               ....

               END;

RUN;

HB
Barite | Level 11 HB
Barite | Level 11

I agree.  I think the data will never clean 100 percent on SCAN, TRANWRD, STRIP, SUBSTR, etc alone.

I have to match a target file to a reference file.

My plan is to auto clean down the relatively easy stuff (upper-lower case, PO boxes, rural routes), do all the exact matches first, then fuzzy match (using compged), and then hand clean.

If I can get the match rate in the high 90's percent wise with less than 100 hand cleanings required I will call it a success.

I'm starting with at this point no more than 150,000 records in the target file..   

TomKari
Onyx | Level 15

Hi, HB

I've done this sort of thing a couple of times. Your idea is sound, it'll just come down to how messy the data is.

One very good tool for this kind of thing is the PRX suite of functions, to use Regular Expressions. The good news is, they are INCREDIBLY powerful and flexible. The bad news is, they can be INCREDIBLY difficult to figure out exactly what you want, and to get them working properly.

Good luck!

  Tom

HB
Barite | Level 11 HB
Barite | Level 11

Thanks for the encouragement.  I have used regular expressions before in ASP/VBscript on unrelated stuff and I may end up trying to use them again.  I agree with your assessment.  I liken them to a good sharp knife- an excellent tool but quite capable of slicing your leg if you aren't paying attention.  I actually did that- 4 staples.

Addresses seem to have a few more rules than names (I think names are awful).  Most addresses start with a number, for example, and I'm hoping I can work with stuff like that.  I don't actually have my reference file for matching yet, so I'm really just sort of messing around with ground work and a dumy match file prior to getting in the game.

My next step is to search the Net for somebody else's work I can steal/borrow from. This task absolutely has been done before, the questions is has anybody put it out on the Net for little old me to use.  Off to look.

Hao
Calcite | Level 5 Hao
Calcite | Level 5

Hi HB,

I did these process for acqusition campaigns for a year; matched third party several million records with existing customers  database; the way I did kind of like baysien rules hopfully these help you or any other people who have the same issues;

e.g. column required: name / address/ postal code;

1. match all records by postal code level (really depends on database size, if you have more than million records it will take really long time for entire fuzzy matching process),

2. most of address read as suit # - street number  street name or street number  street name suit #; try to scan the first / second word (street number) to reduce the records numbers; (Po Box may run seperately)

3. scan two or three street name words use function compged to fuzzy match; then you will have matched at building level;

4. if you have name almost do the same thing as step3; I pretty sure you may have a really good match rate;

PS. necessary data cleanning is required before the whole process; e.g six -> 6, highway -> hwy

Hopefully these work for u;

Good luck

Hao

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 962 views
  • 6 likes
  • 4 in conversation