Good day,
I have a data set like the following :
data testing;
infile datalines dlm="09"x;
input Name :$80. Category :$40.;
datalines;
A Cote restaurant
A Cote restaurant
7-ELEVEN service
7ELEVEN service
A BIG BIG restaurant
A BIG restaurant
A CUT ABOVEPARKSON PVL restaurant
A CUT ABOVANGSAR VIL restaurant
A DRIVING SCHOOL service
A DRIVING service
A LI SHAN XI DING JI shop
A LI SHAN shop
A PLUS BEAUTY MEDICAL medical
A PLUS BEAUTY HOUSE medical
;
run;
i am writing a program to group the similar merchant into one name
here is my program but not able to solve it
data testing2;
set testing;
dif=compged(Name,lag(name));
dif2=compged(Category,lag(Category));
if dif<=70 and dif2 <=70 then match="ok";
run;
what I trying to do is
if N+1 name and Category are both similar to N
then assign the same name to N+1 as N
and keep comparing the data as follow logic:
row_n to row_n+1
row_n+1 to row_n+2
row_n+2 to row_n+3
can some help or give me some insight?
what i expecting is
A Cote restaurant
A Cote restaurant
7-ELEVEN service
7-ELEVEN service
A BIG BIG restaurant
A BIG BIG restaurant
A CUT ABOVEPARKSON PVL restaurant
A CUT ABOVANGSAR VIL restaurant
A DRIVING SCHOOL service
A DRIVING SCHOOL service
A LI SHAN XI DING JI shop
A LI SHAN shop
A PLUS BEAUTY MEDICAL medical
A PLUS BEAUTY HOUSE medical
thanks in advance
Harry
Hi @harrylui ,
Something like this:
data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A Cote\restaurant
A Cote\restaurant
7-ELEVEN\service
7ELEVEN\service
A BIG BIG\restaurant
A BIG\restaurant
A CUT ABOVEPARKSON PVL\restaurant
A CUT ABOVANGSAR VIL\restaurant
A DRIVING SCHOOL\service
A DRIVING\service
A LI SHAN XI DING JI\shop
A LI SHAN\shop
A PLUS BEAUTY MEDICAL\medical
A PLUS BEAUTY HOUSE\medical
;
run;
data testing2(drop=_:);
set testing;
_Name = lag(Name);
_Category = lag(Category);
dif=compged(Name, _Name);
dif2=compged(Category, _Category);
if dif<=70 and dif2 <=70 then
do;
Name = _Name;
Category = _Category;
match = "ok";
end;
run;
proc print;
run;
?
All the best
Bart
HI @harrylui
I suggest that you proceed in two main steps:
1/ group similar categories and rename them accordingly (NB: in your entry dataset, I have introduced some misspelling in the categories to test the code)
2/ group similar names insides these 'clean' categories and rename them accordingly
I have indicated the assumption I made to determine what is the 'true' value to impute. You can change them for example by sorting data by length of string, etc.
Hope this helps.
My best,
data testing;
infile datalines dlm="09"x;
input Name :$80. Category :$40.;
datalines;
A Cote restaurant
A Cote restaurant
7-ELEVEN service
7ELEVEN service
A BIG BIG restaurant
A BIG restaurant
A CUT ABOVEPARKSON PVL restaurat
A CUT ABOVANGSAR VIL restaurant
A DRIVING SCHOOL service
A DRIVING services
A LI SHAN XI DING JI shop
A LI SHAN sho
A PLUS BEAUTY MEDICAL medical
A PLUS BEAUTY HOUSE medilal
;
run;
proc sort data=testing out=testing1;
by category; /* Assumption 1: the 'true' category is the first in alphabetical order */
run;
/* Step1 : group similar categories */
data testing2;
set testing1;
length New_category $ 40.;
retain New_category;
_lag_c = lag(Category);
dif = compged(Category,_lag_c);
if dif > 100 then New_category = Category;
drop _lag_c dif;
run;
/* Step2: Sort data by these 'clean' categories */
proc sort data=testing2 out=testing3;
by category descending Name; /* Assumption 2: the 'true' name is the first in descending alphabetical order */
run;
/* Step3 : group similar names */
data testing4;
set testing3;
length New_Name $ 80.;
by New_category;
retain New_Name;
_lag_n = lag(Name);
dif = compged(Name,_lag_n);
if first.New_category or dif > 100 then New_Name = Name;
drop _lag_n dif;
run;
Hi @harrylui ,
Something like this:
data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A Cote\restaurant
A Cote\restaurant
7-ELEVEN\service
7ELEVEN\service
A BIG BIG\restaurant
A BIG\restaurant
A CUT ABOVEPARKSON PVL\restaurant
A CUT ABOVANGSAR VIL\restaurant
A DRIVING SCHOOL\service
A DRIVING\service
A LI SHAN XI DING JI\shop
A LI SHAN\shop
A PLUS BEAUTY MEDICAL\medical
A PLUS BEAUTY HOUSE\medical
;
run;
data testing2(drop=_:);
set testing;
_Name = lag(Name);
_Category = lag(Category);
dif=compged(Name, _Name);
dif2=compged(Category, _Category);
if dif<=70 and dif2 <=70 then
do;
Name = _Name;
Category = _Category;
match = "ok";
end;
run;
proc print;
run;
?
All the best
Bart
hi all,
tried both of your program, still encounter an issue
if there are three or more observations pass the condition test,
nothing will be changed after the second observation.
i believe the program set the lag(name) at the first initialization and when the program working, the lag(name) never change and cause to this problem. i build a stupid program but how can i enhance it to make it recycle?
hopefully, i can get all the observations =A DRIVING SCHOOL if i pass the dif test.
data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A DRIVING SCHOOL\service
A DRIVING\service
A DRIVINGA\service
A DRIVINGb\service
A DRIVINGc\service
;
run;
data testing2;
set testing;
_Name = lag(Name);
_Category = lag(Category);
dif=compged(Name, _Name);
dif2=compged(Category, _Category);
if dif<=100 and dif2 <=100 then
do;
Name = _Name;
Category = _Category;
match = "ok";
/**/
/* _Name = lag(Name);*/
/* _Category = lag(Category);*/
end;
run;
proc print;
run;
data testing3;
set testing2;
drop
_Name
_Category;
run;
data testing4;
set testing3;
_Name = lag(Name);
_Category = lag(Category);
run;
data testing5;
set testing4;
_Name = lag(Name);
_Category = lag(Category);
dif=compged(Name, _Name);
dif2=compged(Category, _Category);
if dif<=100 and dif2 <=100 then
do;
Name = _Name;
Category = _Category;
match = "ok";
/**/
/* _Name = lag(Name);*/
/* _Category = lag(Category);*/
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.