Help using Base SAS procedures

Subsetting data- slowly slimming down a base file

Accepted Solution Solved
Reply
Regular Contributor
Regular Contributor
Posts: 162
Accepted Solution

Subsetting data- slowly slimming down a base file

I am doing some basic data cleaning- addresses.

I have a base file with all the observations.

I can write little code blocks that will peel off a subset of the data that meet this or that specific condition- this block deals with PO boxes, that block deals with Rural Routes, that block attacks things without spaces, etc.

So what I have now is:

data poboxes;

set addresses;

if some conditions;

data manipulation;

data ruralroutes;

set addresses;

if some conditions;

data manipulation;

data oneswithoutspaces;

set addresses;

if some conditions;

data manipulation;

etc etc

what I want is to be able to break off this little part, deal with it, and then break off the next little part FROM WHAT IS LEFT and deal with that, and then break off another part FROM WHAT IS LEFT AFTER THAT, etc all the way down.  Creating an ever slimming "base" file.

More like:

data poboxes;

set addresses;

if some conditions;

data manipulation;

data ruralroutes;

set addresses minus the stuff I took out and put in poboxes;

if some conditions;

data manipulation;

data oneswithoutspaces;

set (addresses minus the stuff I took out and put in poboxes) minus the stuff I took out and put in ruralroutes;

if some conditions;

data manipulation;

etc etc

But I don't know the most elegant/best practice in coding structure to do this.

Suggestions?  Please?

Mike


Accepted Solutions
Solution
‎03-08-2013 01:09 PM
PROC Star
Posts: 7,492

Re: Subsetting data- slowly slimming down a base file

Just a slight change in how you approach the problem:

data poboxes ruralroutes oneswithoutspaces;

  set addresses;

  if some conditions then do;

    data manipulation;

                   output poboxes;;

                 end;

  else if some conditions then do;

    data manipulation;

                   output ruralroutes;

              end;

  else if some conditions then do;

    data manipulation;

                   output oneswithoutspacess;

              end;

            run;

View solution in original post


All Replies
Solution
‎03-08-2013 01:09 PM
PROC Star
Posts: 7,492

Re: Subsetting data- slowly slimming down a base file

Just a slight change in how you approach the problem:

data poboxes ruralroutes oneswithoutspaces;

  set addresses;

  if some conditions then do;

    data manipulation;

                   output poboxes;;

                 end;

  else if some conditions then do;

    data manipulation;

                   output ruralroutes;

              end;

  else if some conditions then do;

    data manipulation;

                   output oneswithoutspacess;

              end;

            run;

Regular Contributor
Regular Contributor
Posts: 162

Re: Subsetting data- slowly slimming down a base file

That did it!.  Thanks.

The "if..ends" and "else if ...ends" do peel away the small groups I want to work with.

I just had to figure out the syntax to leave me with the leftovers, as it were.

if some conditions then

do;

    data manipulation;

                   output poboxes;;

               end;


else if some conditions then

do;

    data manipulation;

                   output ruralroutes;

               end;

 

else if some conditions then

do;

    data manipulation;

                   output oneswithoutspacess;

               end;

           

               else

               do;

    data manipulation;

                   output everythingelse;

               end;

               run;

New Contributor Hao
New Contributor
Posts: 3

Re: Subsetting data- slowly slimming down a base file

addresses data  is really hard to clean;

currently, i do just try to scan first one or two words as key words write your if condition; hopefully these help

e.g.

data poboxes;

set addresses;

if scan(address,1) in ('PO', 'POBOX', 'PBOX') THEN

                              DO;

                              ADDRSS = 'POBOX'||' '||SCAN(ADDRESS,-1) ;

                              END;

ELSE IF .....THEN DO;

               ....

               END;

RUN;

Regular Contributor
Regular Contributor
Posts: 162

Re: Subsetting data- slowly slimming down a base file

I agree.  I think the data will never clean 100 percent on SCAN, TRANWRD, STRIP, SUBSTR, etc alone.

I have to match a target file to a reference file.

My plan is to auto clean down the relatively easy stuff (upper-lower case, PO boxes, rural routes), do all the exact matches first, then fuzzy match (using compged), and then hand clean.

If I can get the match rate in the high 90's percent wise with less than 100 hand cleanings required I will call it a success.

I'm starting with at this point no more than 150,000 records in the target file..   

PROC Star
Posts: 1,167

Re: Subsetting data- slowly slimming down a base file

Hi, HB

I've done this sort of thing a couple of times. Your idea is sound, it'll just come down to how messy the data is.

One very good tool for this kind of thing is the PRX suite of functions, to use Regular Expressions. The good news is, they are INCREDIBLY powerful and flexible. The bad news is, they can be INCREDIBLY difficult to figure out exactly what you want, and to get them working properly.

Good luck!

  Tom

Regular Contributor
Regular Contributor
Posts: 162

Re: Subsetting data- slowly slimming down a base file

Thanks for the encouragement.  I have used regular expressions before in ASP/VBscript on unrelated stuff and I may end up trying to use them again.  I agree with your assessment.  I liken them to a good sharp knife- an excellent tool but quite capable of slicing your leg if you aren't paying attention.  I actually did that- 4 staples.

Addresses seem to have a few more rules than names (I think names are awful).  Most addresses start with a number, for example, and I'm hoping I can work with stuff like that.  I don't actually have my reference file for matching yet, so I'm really just sort of messing around with ground work and a dumy match file prior to getting in the game.

My next step is to search the Net for somebody else's work I can steal/borrow from. This task absolutely has been done before, the questions is has anybody put it out on the Net for little old me to use.  Off to look.

New Contributor Hao
New Contributor
Posts: 3

Re: Subsetting data- slowly slimming down a base file

Hi HB,

I did these process for acqusition campaigns for a year; matched third party several million records with existing customers  database; the way I did kind of like baysien rules hopfully these help you or any other people who have the same issues;

e.g. column required: name / address/ postal code;

1. match all records by postal code level (really depends on database size, if you have more than million records it will take really long time for entire fuzzy matching process),

2. most of address read as suit # - street number  street name or street number  street name suit #; try to scan the first / second word (street number) to reduce the records numbers; (Po Box may run seperately)

3. scan two or three street name words use function compged to fuzzy match; then you will have matched at building level;

4. if you have name almost do the same thing as step3; I pretty sure you may have a really good match rate;

PS. necessary data cleanning is required before the whole process; e.g six -> 6, highway -> hwy

Hopefully these work for u;

Good luck

Hao

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 261 views
  • 6 likes
  • 4 in conversation