BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
harrylui
Obsidian | Level 7

Good day,

 

I have a data set like the following :

 

data testing;
infile datalines dlm="09"x;
input Name :$80. Category :$40.;
datalines;
A Cote restaurant
A Cote restaurant
7-ELEVEN service
7ELEVEN service
A BIG BIG restaurant
A BIG restaurant
A CUT ABOVEPARKSON PVL restaurant
A CUT ABOVANGSAR VIL restaurant
A DRIVING SCHOOL service
A DRIVING service
A LI SHAN XI DING JI shop
A LI SHAN shop
A PLUS BEAUTY MEDICAL medical
A PLUS BEAUTY HOUSE medical
;
run;

 

i am writing a program to group the similar merchant into one name 

 

here is my program but not able to solve it


data testing2;
set testing;
dif=compged(Name,lag(name));
dif2=compged(Category,lag(Category));
if dif<=70 and dif2 <=70 then match="ok";
run;

 

what I trying to do is 

 

if N+1 name and Category are both similar to N 

then assign the same name to N+1 as N

and keep comparing the data as follow logic: 

row_n to row_n+1

row_n+1 to row_n+2

row_n+2 to row_n+3

 

can some help or give me some insight?

 

what i expecting is 

 

A Cote              restaurant
A Cote              restaurant
7-ELEVEN        service
7-ELEVEN        service
A BIG BIG         restaurant
A BIG BIG         restaurant
A CUT ABOVEPARKSON PVL    restaurant
A CUT ABOVANGSAR VIL           restaurant
A DRIVING SCHOOL                    service
A DRIVING SCHOOL                    service
A LI SHAN XI DING JI                   shop
A LI SHAN                                     shop
A PLUS BEAUTY MEDICAL         medical
A PLUS BEAUTY HOUSE            medical

 

 

thanks in advance

 

Harry

1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

Hi @harrylui ,

 

Something like this:

data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A Cote\restaurant
A Cote\restaurant
7-ELEVEN\service
7ELEVEN\service
A BIG BIG\restaurant
A BIG\restaurant
A CUT ABOVEPARKSON PVL\restaurant
A CUT ABOVANGSAR VIL\restaurant
A DRIVING SCHOOL\service
A DRIVING\service
A LI SHAN XI DING JI\shop
A LI SHAN\shop
A PLUS BEAUTY MEDICAL\medical
A PLUS BEAUTY HOUSE\medical
;
run;

data testing2(drop=_:);
  set testing;
  _Name = lag(Name);
  _Category = lag(Category);
  dif=compged(Name, _Name);
  dif2=compged(Category, _Category);
  if dif<=70 and dif2 <=70 then 
    do;
      Name = _Name;
      Category = _Category;
      match = "ok";
    end;
run;
proc print;
run;

?

 

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



View solution in original post

4 REPLIES 4
ed_sas_member
Meteorite | Level 14

HI @harrylui 

 

I suggest that you proceed in two main steps:

1/ group similar categories and rename them accordingly (NB: in your entry dataset, I have introduced some misspelling in the categories to test the code)

2/ group similar names insides these 'clean' categories and rename them accordingly

 

I have indicated the assumption I made to determine what is the 'true' value to impute. You can change them for example by sorting data by length of string, etc.

 

Hope this helps.

 

My best,

 

data testing;
	infile datalines dlm="09"x;
	input Name :$80. Category :$40.;
	datalines;
A Cote	restaurant
A Cote	restaurant
7-ELEVEN	service 
7ELEVEN	service
A BIG BIG	restaurant
A BIG	restaurant
A CUT ABOVEPARKSON PVL	restaurat
A CUT ABOVANGSAR VIL	restaurant
A DRIVING SCHOOL	service
A DRIVING	services
A LI SHAN XI DING JI	shop
A LI SHAN	sho
A PLUS BEAUTY MEDICAL	medical
A PLUS BEAUTY HOUSE	medilal
;
run;

proc sort data=testing out=testing1;
	by category; /* Assumption 1: the 'true' category is the first in alphabetical order */
run;

/* Step1 : group similar categories */
data testing2;
	set testing1;
	length New_category $ 40.;
	retain New_category;
	_lag_c = lag(Category);
	dif = compged(Category,_lag_c);
	if dif > 100 then New_category = Category;
	drop _lag_c dif;
run;

/* Step2: Sort data by these 'clean' categories */
proc sort data=testing2 out=testing3;
	by category descending Name; /* Assumption 2: the 'true' name is the first in descending alphabetical order */
run;

/* Step3 : group similar names */
data testing4;
	set testing3;
	length New_Name $ 80.;
	by New_category;
	retain New_Name;
	_lag_n = lag(Name);
	dif = compged(Name,_lag_n);
	if first.New_category or dif > 100 then New_Name = Name;
	drop _lag_n dif;
run;
yabwon
Onyx | Level 15

Hi @harrylui ,

 

Something like this:

data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A Cote\restaurant
A Cote\restaurant
7-ELEVEN\service
7ELEVEN\service
A BIG BIG\restaurant
A BIG\restaurant
A CUT ABOVEPARKSON PVL\restaurant
A CUT ABOVANGSAR VIL\restaurant
A DRIVING SCHOOL\service
A DRIVING\service
A LI SHAN XI DING JI\shop
A LI SHAN\shop
A PLUS BEAUTY MEDICAL\medical
A PLUS BEAUTY HOUSE\medical
;
run;

data testing2(drop=_:);
  set testing;
  _Name = lag(Name);
  _Category = lag(Category);
  dif=compged(Name, _Name);
  dif2=compged(Category, _Category);
  if dif<=70 and dif2 <=70 then 
    do;
      Name = _Name;
      Category = _Category;
      match = "ok";
    end;
run;
proc print;
run;

?

 

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



harrylui
Obsidian | Level 7

hi all,

 

tried both of your program, still encounter an issue

if there are three or more observations pass the condition test,

 

nothing will be changed after the second observation.

 

i believe the program  set the lag(name) at the first initialization and when the program working, the lag(name) never change and cause to this problem. i build a stupid program but how can i enhance it to make it recycle?

 

hopefully, i can get all the observations =A DRIVING SCHOOL if i pass the dif test.

 


data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A DRIVING SCHOOL\service
A DRIVING\service
A DRIVINGA\service
A DRIVINGb\service
A DRIVINGc\service

;
run;
data testing2;
set testing;
_Name = lag(Name);
_Category = lag(Category);
dif=compged(Name, _Name);
dif2=compged(Category, _Category);
if dif<=100 and dif2 <=100 then
do;
Name = _Name;
Category = _Category;
match = "ok";

/**/
/* _Name = lag(Name);*/
/* _Category = lag(Category);*/
end;
run;
proc print;
run;

data testing3;
set testing2;
drop
_Name
_Category;
run;

data testing4;
set testing3;
_Name = lag(Name);
_Category = lag(Category);
run;


data testing5;
set testing4;
_Name = lag(Name);
_Category = lag(Category);
dif=compged(Name, _Name);
dif2=compged(Category, _Category);
if dif<=100 and dif2 <=100 then
do;
Name = _Name;
Category = _Category;
match = "ok";

/**/
/* _Name = lag(Name);*/
/* _Category = lag(Category);*/
end;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 482 views
  • 0 likes
  • 3 in conversation