Hello Guys,
I need your help. I am new to SAS and I got a project regarding Data cleaning . My project is to check the spellings and update the unique name of all the employers of almost 75000 new hires and enter the new unique new employer data into new column(Variable in SAS) , I am right now using If else , but I need to write almost 100 If then statements.
Example -
If Employer = 'Apple' OR Employer = 'Apple Inc' --- then 'Apple'
else Employer
as New Employer
For this the employer distinct names are almost 25000 rows . So have no idea how many days this project might take to complete . Any ideas what will be the estimated timeline to inform to my Manager?Is there a easy method to clean the data in SAS?
SAS Pro's please help me
SAS has a product just for this:
http://www.sas.com/content/dam/SAS/en_us/doc/factsheet/sas-data-quality-101422.pdf
If you don't already have SAS Data Quality / Dataflux then this would be difficult to justify for the one task you have.
Of course you can also build your own cleaning processes as well in Base SAS as you are doing. And you could use lookup techniques and a lookup table to hold the employer distinct names. It would be reasonably easy to use this lookup table to validate your new hires' employers. A lot depends on how messy your employer data is to begin with.
Thanks for the help.
My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. I tried using distinct to sort out the unique employers but that gave half of the total rows (almost 35000 rows). I thought it might take more of excel work than SAS.
SAS base offers you many tool to facilitate this task. String functions (e.g. upcase, compbl, translate, findw), edit distance functions (e.g. soundex, complev, compged), and most importantly, regular expressions functions (e.g. prxmatch, prxchange). For example:
data test;
input name :&$32.;
datalines;
IBM
IBM Corp.
Apple
Apple, Inc
Apple INC
;
data clean;
length newName $32;
set test;
newName = upcase(compbl(translate(name, " ", ";.,:")));
newName = prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1, newName);
run;
proc print data=clean; run;
new Obs Name name 1 IBM IBM 2 IBM IBM Corp. 3 APPLE Apple 4 APPLE Apple, Inc 5 APPLE Apple INC
Thanks for the suggestion.
I did try the same code but using Proc SQL and wrote compress(upcase(Column)).But My data is really very messy. Have all three/two employers in a single record. Not exactly sure which Employer I have to consider. It has something like
Example
Apple/Best Buy
HSBC/Best Buy
In this situation it is tough to analyze the data. So,I thought it might take more of excel work than SAS.
Can yu suggest me anything for this kind of messy data??
Sure, if multiple names are separated with a known set of characters, it could even be done in SQL (not as flexible as the data step) with the help of a small auxiliary dataset:
data test;
input name :&$32.;
datalines;
IBM
IBM Corp.
Apple
Apple, Inc
Apple INC
Apple/Best Buy
HSBC/Best Buy
;
data onetwothree;
do i = 1, 2, 3; output; end;
run;
proc sql;
create table clean as
select
name,
scan(name,i,"/") as single,
prxchange("s/\b(INC|CORP|PTY|LTD)\b//o", -1,
upcase(compbl(translate(calculated single, " ", ";.,:")))) as newName
from test, onetwothree
where calculated single is not missing;
select * from clean;
quit;
name single newName IBM IBM IBM IBM Corp. IBM Corp. IBM Apple Apple APPLE Apple, Inc Apple, Inc APPLE Apple INC Apple INC APPLE Apple/Best Buy Apple APPLE HSBC/Best Buy HSBC HSBC Apple/Best Buy Best Buy BEST BUY HSBC/Best Buy Best Buy BEST BUY
Do you have a list of those what values map to what data elements? What I mean is what is your logic for the mapping, Apple = Apple, ok, but what about Aple, apple inc, etc.
If you have a list then its very simple to use a lookup list:
proc sql; create table WANT as select A.*, B.VALUE from HAVE A left join CODED B on upcase(compress(A.CODED_ITEM))=B.CODED_ITEM; quit;
In the above I just handle casing and spaces, but you could strip out special characters, do a best case match etc.
Thanks for the Suggestion.
The thing is I dont have a list if data is like ( Apple, apple inc, apple company, apple store) I have to take a unique Name ie "Apple" for all this kind of Data and same goes with all the other employers which has spellings spelt wrong or company has diff or new names. So even I thought to create a Look up table . But I am kinda struck with records having two or three employers like "Apple/Best Buy/HSBC". In this case I am not sure what code I have to write to take either the first company or might be the last company .
Like in this situations if it has to consider lets suppose only first company then I am not sure how to make SAS understand that logic ?
I would suggest then that stage 1 would be to create one unique row per item, something like:
data inter; set have; i=1; do until (scan(employers,i,"/,")=""); emp=scan(employers,i,"/,"); output; end; run;
And also get rid of ambiguities like capitilisation, spacing etc. This should get rid of a lot of differences. You could also take out known differences, INC or STORE for example. The simple answer is its going to take a bit of work.
Possibly hopeless for this but do you by chance have one of the company identification codes associated such as a tax number or employer identification ID? Much better to use one of those and build a single look up table. One reason is that companies may change the names on logos and such but retain the other Ids.
No for this data I dont have any unique ID for the companies . It is just the company Name like apple, best buy . Whatever name the Google gives or I give the company
hello,
what do you ended doing? I am in the same situation with about 10000 entries but more variation in the data names.
Thanks
Try the techniques suggested and see what works best. I find PROC FREQ works well to count distinct values and variations.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.