12-06-2015 07:35 PM
I need your help. I am new to SAS and I got a project regarding Data cleaning . My project is to check the spellings and update the unique name of all the employers of almost 75000 new hires and enter the new unique new employer data into new column(Variable in SAS) , I am right now using If else , but I need to write almost 100 If then statements.
If Employer = 'Apple' OR Employer = 'Apple Inc' --- then 'Apple'
as New Employer
For this the employer distinct names are almost 25000 rows . So have no idea how many days this project might take to complete . Any ideas what will be the estimated timeline to inform to my Manager?Is there a easy method to clean the data in SAS?
SAS Pro's please help me
12-06-2015 09:22 PM - edited 12-06-2015 10:17 PM
SAS has a product just for this:
If you don't already have SAS Data Quality / Dataflux then this would be difficult to justify for the one task you have.
Of course you can also build your own cleaning processes as well in Base SAS as you are doing. And you could use lookup techniques and a lookup table to hold the employer distinct names. It would be reasonably easy to use this lookup table to validate your new hires' employers. A lot depends on how messy your employer data is to begin with.
12-07-2015 10:59 PM
Thanks for the help.
My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. I tried using distinct to sort out the unique employers but that gave half of the total rows (almost 35000 rows). I thought it might take more of excel work than SAS.
12-07-2015 12:04 AM
SAS base offers you many tool to facilitate this task. String functions (e.g. upcase, compbl, translate, findw), edit distance functions (e.g. soundex, complev, compged), and most importantly, regular expressions functions (e.g. prxmatch, prxchange). For example:
data test; input name :&$32.; datalines; IBM IBM Corp. Apple Apple, Inc Apple INC ; data clean; length newName $32; set test; newName = upcase(compbl(translate(name, " ", ";.,:"))); newName = prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, newName); run; proc print data=clean; run;
new Obs Name name 1 IBM IBM 2 IBM IBM Corp. 3 APPLE Apple 4 APPLE Apple, Inc 5 APPLE Apple INC
12-07-2015 11:03 PM
Thanks for the suggestion.
I did try the same code but using Proc SQL and wrote compress(upcase(Column)).But My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. It has something like
In this situation it is tough to analyze the data. So,I thought it might take more of excel work than SAS.
Can yu suggest me anything for this kind of messy data??
12-07-2015 11:20 PM
Sure, if multiple names are separated with a known set of characters, it could even be done in SQL (not as flexible as the data step) with the help of a small auxiliary dataset:
data test; input name :&$32.; datalines; IBM IBM Corp. Apple Apple, Inc Apple INC Apple/Best Buy HSBC/Best Buy ; data onetwothree; do i = 1, 2, 3; output; end; run; proc sql; create table clean as select name, scan(name,i,"/") as single, prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, upcase(compbl(translate(calculated single, " ", ";.,:")))) as newName from test, onetwothree where calculated single is not missing; select * from clean; quit;
name single newName IBM IBM IBM IBM Corp. IBM Corp. IBM Apple Apple APPLE Apple, Inc Apple, Inc APPLE Apple INC Apple INC APPLE Apple/Best Buy Apple APPLE HSBC/Best Buy HSBC HSBC Apple/Best Buy Best Buy BEST BUY HSBC/Best Buy Best Buy BEST BUY
12-07-2015 04:10 AM
Do you have a list of those what values map to what data elements? What I mean is what is your logic for the mapping, Apple = Apple, ok, but what about Aple, apple inc, etc.
If you have a list then its very simple to use a lookup list:
proc sql; create table WANT as select A.*, B.VALUE from HAVE A left join CODED B on upcase(compress(A.CODED_ITEM))=B.CODED_ITEM; quit;
In the above I just handle casing and spaces, but you could strip out special characters, do a best case match etc.
12-07-2015 11:09 PM
Thanks for the Suggestion.
The thing is I dont have a list if data is like ( Apple, apple inc, apple company, apple store) I have to take a unique Name ie "Apple" for all this kind of Data and same goes with all the other employers which has spellings spelt wrong or company has diff or new names. So even I thought to create a Look up table . But I am kinda struck with records having two or three employers like "Apple/Best Buy/HSBC". In this case I am not sure what code I have to write to take either the first company or might be the last company .
Like in this situations if it has to consider lets suppose only first company then I am not sure how to make SAS understand that logic ?
12-08-2015 04:26 AM
I would suggest then that stage 1 would be to create one unique row per item, something like:
data inter; set have; i=1; do until (scan(employers,i,"/,")=""); emp=scan(employers,i,"/,"); output; end; run;
And also get rid of ambiguities like capitilisation, spacing etc. This should get rid of a lot of differences. You could also take out known differences, INC or STORE for example. The simple answer is its going to take a bit of work.
12-07-2015 12:47 PM
Possibly hopeless for this but do you by chance have one of the company identification codes associated such as a tax number or employer identification ID? Much better to use one of those and build a single look up table. One reason is that companies may change the names on logos and such but retain the other Ids.
12-07-2015 11:12 PM
No for this data I dont have any unique ID for the companies . It is just the company Name like apple, best buy . Whatever name the Google gives or I give the company