12-02-2015 06:14 AM
I am working on a dataset with 1 million observations.
During the process of booking, the booking team manually enteres the Company name in the system which further integrates to our ERP.
Now a name could be written like "ABC Ltd", "*ABC Ltd", "abcltd" "abc ltd" and in many other formats.
my objetive is to create one single entry lets say "ABC Ltd" which should have all the relavant count of entires.
A case statement would have been perect here however there are like 1000 duplicates hence for these i cannot implement a case statement.
If you could please suggest an alternate.
12-02-2015 06:45 AM
It depends very much on your logic really and what you will allow. In the example you give, simple functions can be used:
if upcase(compress(<value>," "))="ABCLTD" then coded="ABC Ltd"
You can also do pattern matching using perl regular expressions (do a Google searh and you will see lots of examples), but that requires more complicated syntax.
One other thing to bear in mind is how many options you will have. What I would suggest is you create a coded table, maybe something like:
ABCLTD ABC Ltd
You can then do one of two things:
1) Merge this data onto the orginal based on upcase(compress(<value>," "))=code
2) Generate the datastep out:
set coded end=last;
if _n_=1 then call execute('data want; set have;');
call execute(' if upcase(compress(<value>," "))='||code||' then uncoded="'||coded||'";');
if last then call execute(' run;');
The second one allows more flexibility, for instance you could put the logic match in another variable if it is different for different types:
CODE UNCODED LOGIC
ABCLTD ABC Ltd upcase(compress(<value>," "))
Its hard to say without knowing your data and rules.
12-04-2015 01:12 AM
Thanks for suggestion. I havent used perl expressions earlier hence might have to get some learning on the same - seems like a new venture all together.
I came to know that there are few paid packages available in the market which help you with these kind of activities. Is anyone in the community know about the same.
Kindly let me know as these kind of cases are very often with the data that we deal with.
12-04-2015 03:56 AM
With regards to Perl, one of the benefits of knowing it is that it is not SAS dependant, it is quite widely used in most programming languages, and other packages, so worth knowing the basics at least.
You would need to narrow down what you mean by "paid packages", in my field there are lots of packages which I won't name, and the vast majority are really not worth it.
Yes, this comes under the heading of data cleaning. I can only speak for the industry I am in, but to me it is very important and carried out at all levels. It leads directly into standard data structures, standard outputs etc. for the required models, so clean data is essential. This can take many forms however, from database entry checks, edit checks, data listings/review, protocol deviation outputs, modelling, CDISC checks, analysis review etc. They share some fundamentals, but actual application may vary.
12-02-2015 09:54 AM
If this is a problem for a single column/attribute, I would suggest that build a look up table, with incorrect value - corrected value pairs. You can use thin in look-up/join operations, and perhaps together with pattern match logic.
If this is a recurring issue, you might want to look at a Data Quality SW, which helps you a lot in these situations.