BookmarkSubscribeRSS Feed
anureddy10
Fluorite | Level 6

Hello Guys, 

I need your help. I am new to SAS and I got a project regarding Data cleaning . My project is to check the spellings and update the unique name of all the employers of almost 75000 new hires and enter the new unique new employer data into new column(Variable in SAS) , I am right now using If else , but I need to write almost 100 If then statements. 

 

Example -

If Employer = 'Apple' OR Employer = 'Apple Inc' --- then 'Apple' 

else Employer

as New Employer

 

For this the employer distinct names are almost 25000 rows . So have no idea how many days this project might take to complete . Any ideas what will be the estimated timeline to inform to my Manager?Is there a easy method to clean the data in SAS? 

SAS Pro's please help me 

12 REPLIES 12
SASKiwi
PROC Star

SAS has a product just for this:

 

http://www.sas.com/content/dam/SAS/en_us/doc/factsheet/sas-data-quality-101422.pdf

 

If you don't already have SAS Data Quality / Dataflux then this would be difficult to justify for the one task you have.

 

Of course you can also build your own cleaning processes as well in Base SAS as you are doing. And you could use lookup techniques and a lookup table to hold the employer distinct names. It would be reasonably easy to use this lookup table to validate your new hires' employers. A lot depends on how messy your employer data is to begin with.  

anureddy10
Fluorite | Level 6

Thanks for the help. 

My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. I tried using distinct to sort out the unique employers but that gave  half of the total rows (almost 35000 rows). I thought it might take more of excel work than SAS. 

PGStats
Opal | Level 21

SAS base offers you many tool to facilitate this task. String functions (e.g. upcase, compbl, translate, findw), edit distance functions (e.g. soundex, complev, compged), and most importantly, regular expressions functions (e.g. prxmatch, prxchange). For example:

 

data test;
input name :&$32.;
datalines;
IBM
IBM Corp.
Apple
Apple, Inc
Apple INC
;

data clean;
length newName $32;
set test;
newName = upcase(compbl(translate(name, " ", ";.,:")));
newName = prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, newName);
run;

proc print data=clean; run;
                                     new
                             Obs    Name     name

                              1     IBM      IBM
                              2     IBM      IBM Corp.
                              3     APPLE    Apple
                              4     APPLE    Apple, Inc
                              5     APPLE    Apple INC
PG
anureddy10
Fluorite | Level 6

Thanks for the suggestion. 

 

I did try the same code but using Proc SQL and wrote compress(upcase(Column)).But My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. It has something like

 

Example

Apple/Best Buy

HSBC/Best Buy 

 

In this situation it is tough to analyze the data. So,I thought it might take more of excel work than SAS. 

Can yu suggest me anything for this kind of messy data??

PGStats
Opal | Level 21

Sure, if multiple names are separated with a known set of characters, it could even be done in SQL (not as flexible as the data step) with the help of a small auxiliary dataset:

 

data test;
input name :&$32.;
datalines;
IBM
IBM Corp.
Apple
Apple, Inc
Apple INC
Apple/Best Buy
HSBC/Best Buy 
;

data onetwothree;
do i = 1, 2, 3; output; end;
run;

proc sql;
create table clean as
select 
    name, 
    scan(name,i,"/") as single,
    prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, 
        upcase(compbl(translate(calculated single, " ", ";.,:")))) as newName
from test, onetwothree
where calculated single is not missing;
select * from clean;
quit;
name                              single                            newName


IBM                               IBM                               IBM
IBM Corp.                         IBM Corp.                         IBM
Apple                             Apple                             APPLE
Apple, Inc                        Apple, Inc                        APPLE
Apple INC                         Apple INC                         APPLE
Apple/Best Buy                    Apple                             APPLE
HSBC/Best Buy                     HSBC                              HSBC
Apple/Best Buy                    Best Buy                          BEST BUY
HSBC/Best Buy                     Best Buy                          BEST BUY
PG
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Do you have a list of those what values map to what data elements?  What I mean is what is your logic for the mapping, Apple = Apple, ok, but what about Aple, apple inc, etc.

 

If you have a list then its very simple to use a lookup list:

proc sql;
  create table WANT as
  select  A.*,
             B.VALUE
  from    HAVE A
  left join CODED B
  on        upcase(compress(A.CODED_ITEM))=B.CODED_ITEM;
quit;

In the above I just handle casing and spaces, but you could strip out special characters, do a best case match etc.

anureddy10
Fluorite | Level 6

Thanks for the Suggestion.

The thing is I dont have a list if data is like ( Apple, apple inc, apple company, apple store) I have to take a unique Name ie "Apple" for all this kind of Data and same goes with all the other employers which has spellings spelt wrong or company has diff or new names. So even I thought to create a Look up table . But I am kinda struck with records having two or three employers like "Apple/Best Buy/HSBC". In this case I am not sure what code I have to write to take either the first company or might be the last company .

 

Like in this situations if it has to consider lets suppose only first company then I am not sure how to make SAS understand that logic ?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

I would suggest then that stage 1 would be to create one unique row per item, something like:

data inter; 
  set have;
  i=1;
  do until (scan(employers,i,"/,")="");
    emp=scan(employers,i,"/,");
    output;
  end;
run;

And also get rid of ambiguities like capitilisation, spacing etc.  This should get rid of a lot of differences.  You could also take out known differences, INC or STORE for example.  The simple answer is its going to take a bit of work.

ballardw
Super User

Possibly hopeless for this but do you by chance have one of the company identification codes associated such as a tax number or employer identification ID? Much better to use one of those and build a single look up table. One reason is that companies may change the names on logos and such but retain the other Ids.

anureddy10
Fluorite | Level 6

No for this data I dont have any unique ID for the companies . It is just the company Name like apple, best buy . Whatever name the Google gives or I give the company

Lety08
Calcite | Level 5

hello,

what do you ended doing? I am in the same situation with about 10000 entries but more variation in the data names.

 

Thanks

SASKiwi
PROC Star

Try the techniques suggested and see what works best. I find PROC FREQ works well to count distinct values and variations.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 3662 views
  • 4 likes
  • 6 in conversation