Re: Cleaning data using SAS

anureddy10 · Posted 12-06-2015 07:35 PM

Hello Guys,

I need your help. I am new to SAS and I got a project regarding Data cleaning . My project is to check the spellings and update the unique name of all the employers of almost 75000 new hires and enter the new unique new employer data into new column(Variable in SAS) , I am right now using If else , but I need to write almost 100 If then statements.

Example -

If Employer = 'Apple' OR Employer = 'Apple Inc' --- then 'Apple'

else Employer

as New Employer

For this the employer distinct names are almost 25000 rows . So have no idea how many days this project might take to complete . Any ideas what will be the estimated timeline to inform to my Manager?Is there a easy method to clean the data in SAS?

SAS Pro's please help me

SASKiwi · Posted 12-06-2015 09:22 PM

SAS has a product just for this:

http://www.sas.com/content/dam/SAS/en_us/doc/factsheet/sas-data-quality-101422.pdf

If you don't already have SAS Data Quality / Dataflux then this would be difficult to justify for the one task you have.

Of course you can also build your own cleaning processes as well in Base SAS as you are doing. And you could use lookup techniques and a lookup table to hold the employer distinct names. It would be reasonably easy to use this lookup table to validate your new hires' employers. A lot depends on how messy your employer data is to begin with.

anureddy10 · Posted 12-07-2015 10:59 PM

Thanks for the help.

My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. I tried using distinct to sort out the unique employers but that gave half of the total rows (almost 35000 rows). I thought it might take more of excel work than SAS.

PGStats · Posted 12-07-2015 12:04 AM

SAS base offers you many tool to facilitate this task. String functions (e.g. upcase, compbl, translate, findw), edit distance functions (e.g. soundex, complev, compged), and most importantly, regular expressions functions (e.g. prxmatch, prxchange). For example:

data test;
input name :&$32.;
datalines;
IBM
IBM Corp.
Apple
Apple, Inc
Apple INC
;

data clean;
length newName $32;
set test;
newName = upcase(compbl(translate(name, " ", ";.,:")));
newName = prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, newName);
run;

proc print data=clean; run;

                                     new
                             Obs    Name     name

                              1     IBM      IBM
                              2     IBM      IBM Corp.
                              3     APPLE    Apple
                              4     APPLE    Apple, Inc
                              5     APPLE    Apple INC

PG

anureddy10 · Posted 12-07-2015 11:03 PM

Thanks for the suggestion.

I did try the same code but using Proc SQL and wrote compress(upcase(Column)).But My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. It has something like

Example

Apple/Best Buy

HSBC/Best Buy

In this situation it is tough to analyze the data. So,I thought it might take more of excel work than SAS.

Can yu suggest me anything for this kind of messy data??

PGStats · Posted 12-07-2015 11:20 PM

Sure, if multiple names are separated with a known set of characters, it could even be done in SQL (not as flexible as the data step) with the help of a small auxiliary dataset:

data test;
input name :&$32.;
datalines;
IBM
IBM Corp.
Apple
Apple, Inc
Apple INC
Apple/Best Buy
HSBC/Best Buy 
;

data onetwothree;
do i = 1, 2, 3; output; end;
run;

proc sql;
create table clean as
select 
    name, 
    scan(name,i,"/") as single,
    prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, 
        upcase(compbl(translate(calculated single, " ", ";.,:")))) as newName
from test, onetwothree
where calculated single is not missing;
select * from clean;
quit;

name                              single                            newName


IBM                               IBM                               IBM
IBM Corp.                         IBM Corp.                         IBM
Apple                             Apple                             APPLE
Apple, Inc                        Apple, Inc                        APPLE
Apple INC                         Apple INC                         APPLE
Apple/Best Buy                    Apple                             APPLE
HSBC/Best Buy                     HSBC                              HSBC
Apple/Best Buy                    Best Buy                          BEST BUY
HSBC/Best Buy                     Best Buy                          BEST BUY

PG

RW9 · Posted 12-07-2015 04:10 AM

Do you have a list of those what values map to what data elements? What I mean is what is your logic for the mapping, Apple = Apple, ok, but what about Aple, apple inc, etc.

If you have a list then its very simple to use a lookup list:

proc sql;
  create table WANT as
  select  A.*,
             B.VALUE
  from    HAVE A
  left join CODED B
  on        upcase(compress(A.CODED_ITEM))=B.CODED_ITEM;
quit;

In the above I just handle casing and spaces, but you could strip out special characters, do a best case match etc.

anureddy10 · Posted 12-07-2015 11:09 PM

Thanks for the Suggestion.

The thing is I dont have a list if data is like ( Apple, apple inc, apple company, apple store) I have to take a unique Name ie "Apple" for all this kind of Data and same goes with all the other employers which has spellings spelt wrong or company has diff or new names. So even I thought to create a Look up table . But I am kinda struck with records having two or three employers like "Apple/Best Buy/HSBC". In this case I am not sure what code I have to write to take either the first company or might be the last company .

Like in this situations if it has to consider lets suppose only first company then I am not sure how to make SAS understand that logic ?

RW9 · Posted 12-08-2015 04:26 AM

I would suggest then that stage 1 would be to create one unique row per item, something like:

data inter; 
  set have;
  i=1;
  do until (scan(employers,i,"/,")="");
    emp=scan(employers,i,"/,");
    output;
  end;
run;

And also get rid of ambiguities like capitilisation, spacing etc. This should get rid of a lot of differences. You could also take out known differences, INC or STORE for example. The simple answer is its going to take a bit of work.

ballardw · Posted 12-07-2015 12:47 PM

Possibly hopeless for this but do you by chance have one of the company identification codes associated such as a tax number or employer identification ID? Much better to use one of those and build a single look up table. One reason is that companies may change the names on logos and such but retain the other Ids.

anureddy10 · Posted 12-07-2015 11:12 PM

No for this data I dont have any unique ID for the companies . It is just the company Name like apple, best buy . Whatever name the Google gives or I give the company

Lety08 · Posted 04-22-2021 09:25 AM

hello,

what do you ended doing? I am in the same situation with about 10000 entries but more variation in the data names.

Thanks

SASKiwi · Posted 04-22-2021 04:10 PM

Try the techniques suggested and see what works best. I find PROC FREQ works well to count distinct values and variations.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away