DATA Step, Macro, Functions and more

Cleaning data using SAS

Reply
Occasional Contributor
Posts: 5

Cleaning data using SAS

Hello Guys, 

I need your help. I am new to SAS and I got a project regarding Data cleaning . My project is to check the spellings and update the unique name of all the employers of almost 75000 new hires and enter the new unique new employer data into new column(Variable in SAS) , I am right now using If else , but I need to write almost 100 If then statements. 

 

Example -

If Employer = 'Apple' OR Employer = 'Apple Inc' --- then 'Apple' 

else Employer

as New Employer

 

For this the employer distinct names are almost 25000 rows . So have no idea how many days this project might take to complete . Any ideas what will be the estimated timeline to inform to my Manager?Is there a easy method to clean the data in SAS? 

SAS Pro's please help me 

Super User
Posts: 3,256

Re: Cleaning data using SAS

[ Edited ]
Posted in reply to anureddy10

SAS has a product just for this:

 

http://www.sas.com/content/dam/SAS/en_us/doc/factsheet/sas-data-quality-101422.pdf

 

If you don't already have SAS Data Quality / Dataflux then this would be difficult to justify for the one task you have.

 

Of course you can also build your own cleaning processes as well in Base SAS as you are doing. And you could use lookup techniques and a lookup table to hold the employer distinct names. It would be reasonably easy to use this lookup table to validate your new hires' employers. A lot depends on how messy your employer data is to begin with.  

Occasional Contributor
Posts: 5

Re: Cleaning data using SAS

Thanks for the help. 

My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. I tried using distinct to sort out the unique employers but that gave  half of the total rows (almost 35000 rows). I thought it might take more of excel work than SAS. 

Respected Advisor
Posts: 4,927

Re: Cleaning data using SAS

Posted in reply to anureddy10

SAS base offers you many tool to facilitate this task. String functions (e.g. upcase, compbl, translate, findw), edit distance functions (e.g. soundex, complev, compged), and most importantly, regular expressions functions (e.g. prxmatch, prxchange). For example:

 

data test;
input name :&$32.;
datalines;
IBM
IBM Corp.
Apple
Apple, Inc
Apple INC
;

data clean;
length newName $32;
set test;
newName = upcase(compbl(translate(name, " ", ";.,:")));
newName = prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, newName);
run;

proc print data=clean; run;
                                     new
                             Obs    Name     name

                              1     IBM      IBM
                              2     IBM      IBM Corp.
                              3     APPLE    Apple
                              4     APPLE    Apple, Inc
                              5     APPLE    Apple INC
PG
Occasional Contributor
Posts: 5

Re: Cleaning data using SAS

Thanks for the suggestion. 

 

I did try the same code but using Proc SQL and wrote compress(upcase(Column)).But My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. It has something like

 

Example

Apple/Best Buy

HSBC/Best Buy 

 

In this situation it is tough to analyze the data. So,I thought it might take more of excel work than SAS. 

Can yu suggest me anything for this kind of messy data??

Respected Advisor
Posts: 4,927

Re: Cleaning data using SAS

Posted in reply to anureddy10

Sure, if multiple names are separated with a known set of characters, it could even be done in SQL (not as flexible as the data step) with the help of a small auxiliary dataset:

 

data test;
input name :&$32.;
datalines;
IBM
IBM Corp.
Apple
Apple, Inc
Apple INC
Apple/Best Buy
HSBC/Best Buy 
;

data onetwothree;
do i = 1, 2, 3; output; end;
run;

proc sql;
create table clean as
select 
    name, 
    scan(name,i,"/") as single,
    prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, 
        upcase(compbl(translate(calculated single, " ", ";.,:")))) as newName
from test, onetwothree
where calculated single is not missing;
select * from clean;
quit;
name                              single                            newName


IBM                               IBM                               IBM
IBM Corp.                         IBM Corp.                         IBM
Apple                             Apple                             APPLE
Apple, Inc                        Apple, Inc                        APPLE
Apple INC                         Apple INC                         APPLE
Apple/Best Buy                    Apple                             APPLE
HSBC/Best Buy                     HSBC                              HSBC
Apple/Best Buy                    Best Buy                          BEST BUY
HSBC/Best Buy                     Best Buy                          BEST BUY
PG
Super User
Super User
Posts: 7,977

Re: Cleaning data using SAS

Posted in reply to anureddy10

Do you have a list of those what values map to what data elements?  What I mean is what is your logic for the mapping, Apple = Apple, ok, but what about Aple, apple inc, etc.

 

If you have a list then its very simple to use a lookup list:

proc sql;
  create table WANT as
  select  A.*,
             B.VALUE
  from    HAVE A
  left join CODED B
  on        upcase(compress(A.CODED_ITEM))=B.CODED_ITEM;
quit;

In the above I just handle casing and spaces, but you could strip out special characters, do a best case match etc.

Occasional Contributor
Posts: 5

Re: Cleaning data using SAS

Thanks for the Suggestion.

The thing is I dont have a list if data is like ( Apple, apple inc, apple company, apple store) I have to take a unique Name ie "Apple" for all this kind of Data and same goes with all the other employers which has spellings spelt wrong or company has diff or new names. So even I thought to create a Look up table . But I am kinda struck with records having two or three employers like "Apple/Best Buy/HSBC". In this case I am not sure what code I have to write to take either the first company or might be the last company .

 

Like in this situations if it has to consider lets suppose only first company then I am not sure how to make SAS understand that logic ?

Super User
Super User
Posts: 7,977

Re: Cleaning data using SAS

Posted in reply to anureddy10

I would suggest then that stage 1 would be to create one unique row per item, something like:

data inter; 
  set have;
  i=1;
  do until (scan(employers,i,"/,")="");
    emp=scan(employers,i,"/,");
    output;
  end;
run;

And also get rid of ambiguities like capitilisation, spacing etc.  This should get rid of a lot of differences.  You could also take out known differences, INC or STORE for example.  The simple answer is its going to take a bit of work.

Super User
Posts: 11,343

Re: Cleaning data using SAS

Posted in reply to anureddy10

Possibly hopeless for this but do you by chance have one of the company identification codes associated such as a tax number or employer identification ID? Much better to use one of those and build a single look up table. One reason is that companies may change the names on logos and such but retain the other Ids.

Occasional Contributor
Posts: 5

Re: Cleaning data using SAS

No for this data I dont have any unique ID for the companies . It is just the company Name like apple, best buy . Whatever name the Google gives or I give the company

Ask a Question
Discussion stats
  • 10 replies
  • 474 views
  • 4 likes
  • 5 in conversation