BookmarkSubscribeRSS Feed
mahes9687
Calcite | Level 5

Working on data cleaning activity, I have 100k observation data and I have to replace the duplication in one of the variable. the duplication is possible by spelling mistakes too.

Consider a table

id fruits price

1 grape 100

2 orange 120

3 grepe 110

4 apple 200

5 grapes 90

6 appll 190

how the above data will be cleaned and should be as below

id fruits price

1 grapes 100

2 orange 120

3 grapes 110

4 apple 200

5 grapes 90

6 apple 190

Consider same way I have to clean 100k of data, how it is cleansed?

Thanks in Advance....

2 REPLIES 2
PeterClemmensen
Tourmaline | Level 20

Here is how I would do it.

 

Save all the correct spellings of your fruits in a data set called fruits.

 

Read that data set into a hash object.

 

Read your have data set and look for the current value of fruit in the hash object. If it is there, do nothing (the spelling is correct). If it is not there, then run through the entire hash object and compare the current value of fruit to the correct spellings. If the distance between them is appropriately low, then replace the current value of fruit.

 

Let me know if this works for you 🙂

 

data have;
input id fruits $ price;
datalines;
1 grape 100
2 orange 120
3 grepe 110
4 apple 200
5 grapes 90
6 appll 190
;

data fruits;
input _fruits $;
datalines;
grapes
orange
apple
;

data want(drop=_:);
    length _fruits $8;
    if _N_=1 then do;
        declare hash h(dataset:'fruits');
        h.definekey('_fruits');
        h.definedone();
        declare hiter hi('h');
        call missing(_fruits);
    end;

    set have;

    if h.find(key:fruits) ne 0 then do;
        do while (hi.next() = 0);
           if compged(fruits, _fruits) le 200 then fruits=_fruits;
        end;
    end;
run;
Astounding
PROC Star

Here's the low-tech, very easy to understand way.  Find what is actually in the data:

 

proc freq data=have;
   tables fruits;
run;

Look through the results, and find the ones you would like to change.  Then write a program to change them:

 

data want;
   set have;
   if fruits='grape' then fruits='grapes';
run;

Then repeat the process to double-check that you found everything you want to change.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 2 replies
  • 729 views
  • 1 like
  • 3 in conversation