- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello All,
Problem:
I have a Data Cleaning problem. I have this data where There is a Product manufacturer and Product Brand.
Each manufacturer has several Brands. The brands data is manually entered and is messy. I need to check the data which is manually entered and create standard brand name.
Here is the example,
Product Manufacturer | Product Brand |
Abila | Taquila |
Abila | TAQUILA |
Abila | tAqueela |
Tambola | Daqueela |
Tambola | Taqueela |
Abila | daqueela |
Tambola | Samila |
Here is the desired output,
Product Manufacturer | Product Brand | Standard Brand |
Abila | Taquila | Taquila |
Abila | TAQUILA | Taquila |
Abila | tAqueela | Taquila |
Tambola | Daqueela | Daqueela |
Tambola | Taqueela | Taquila |
Abila | daqueela | Daqueela |
Tambola | SAMmila | Samila |
The next step would be get the count of unique combinations. Which I can achieve once I got the above output.
The problem here is, I need to do this for only 2 Manufacturers but all brands there are (~3,000) of them.
Approach:
I did run a select distinct brand name and got around 200,000 names which are different combinations of same name. For example, the result retuned as
Abila
abila
abla
cabla
Cabl
Cabla
..
So, Since I cannot extract all possible combinations, I extracted count as well, which gave me most frequent used notation, I ran a program to delete all less frequent notations of brand name.
Now I have list of Product Brands ~90,000
But, I cannot code multiple WHEN … THEN Statements like below (I should code (90,000) of them, which is impossible)
CASE
WHEN MISC.PRD_BRAND_NM Like 'ABila%' OR MISC.PRD_BRAND_NM Like 'abila%' OR MISC.PRD_BRAND_NM Like 'ab%' THEN 'Abila' AS Standard_name
I’m looking is there a way that I create a macro and pass all variables through a function to create standard name?
Some thing like below,
CASE
WHEN MISC.PRD_BRAND_NM Like &pass. OR MISC.PRD_BRAND_NM Like &pass. OR MISC.PRD_BRAND_NM Like &pass. THEN &pass. AS Standard_name
Do <this function> until <end of all brands>
<this function>
To replace macro variables with all combinations.
Or any other way I can achieve the objective?
Thanks a ton!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Do you have a master list of what should be the names?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You have two choices. You can either labouriously code a custom, do-it-yourself solution to your problem or you could consider a tool that has this functionality built-in such as SAS Data Quality / Dataflux. The latter solution is definitely a better one if you are doing a lot of data cleansing. Does your organisation have any such tools?