Solved: Re: Comparing a row in a dataset with the next row

harrylui · Posted 02-17-2020 03:15 AM

Good day,

I have a data set like the following :

data testing;
infile datalines dlm="09"x;
input Name :$80. Category :$40.;
datalines;
A Cote restaurant
A Cote restaurant
7-ELEVEN service
7ELEVEN service
A BIG BIG restaurant
A BIG restaurant
A CUT ABOVEPARKSON PVL restaurant
A CUT ABOVANGSAR VIL restaurant
A DRIVING SCHOOL service
A DRIVING service
A LI SHAN XI DING JI shop
A LI SHAN shop
A PLUS BEAUTY MEDICAL medical
A PLUS BEAUTY HOUSE medical
;
run;

i am writing a program to group the similar merchant into one name

here is my program but not able to solve it

data testing2;
set testing;
dif=compged(Name,lag(name));
dif2=compged(Category,lag(Category));
if dif<=70 and dif2 <=70 then match="ok";
run;

what I trying to do is

if N+1 name and Category are both similar to N

then assign the same name to N+1 as N

and keep comparing the data as follow logic:

row_n to row_n+1

row_n+1 to row_n+2

row_n+2 to row_n+3

can some help or give me some insight?

what i expecting is

A Cote restaurant
A Cote restaurant
7-ELEVEN service
7-ELEVEN service
A BIG BIG restaurant
A BIG BIG restaurant
A CUT ABOVEPARKSON PVL restaurant
A CUT ABOVANGSAR VIL restaurant
A DRIVING SCHOOL service
A DRIVING SCHOOL service
A LI SHAN XI DING JI shop
A LI SHAN shop
A PLUS BEAUTY MEDICAL medical
A PLUS BEAUTY HOUSE medical

thanks in advance

Harry

yabwon · Posted 02-17-2020 04:16 AM

Hi @harrylui ,

Something like this:

data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A Cote\restaurant
A Cote\restaurant
7-ELEVEN\service
7ELEVEN\service
A BIG BIG\restaurant
A BIG\restaurant
A CUT ABOVEPARKSON PVL\restaurant
A CUT ABOVANGSAR VIL\restaurant
A DRIVING SCHOOL\service
A DRIVING\service
A LI SHAN XI DING JI\shop
A LI SHAN\shop
A PLUS BEAUTY MEDICAL\medical
A PLUS BEAUTY HOUSE\medical
;
run;

data testing2(drop=_:);
  set testing;
  _Name = lag(Name);
  _Category = lag(Category);
  dif=compged(Name, _Name);
  dif2=compged(Category, _Category);
  if dif<=70 and dif2 <=70 then 
    do;
      Name = _Name;
      Category = _Category;
      match = "ok";
    end;
run;
proc print;
run;

?

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

View solution in original post

ed_sas_member · Posted 02-17-2020 04:01 AM

HI @harrylui

I suggest that you proceed in two main steps:

1/ group similar categories and rename them accordingly (NB: in your entry dataset, I have introduced some misspelling in the categories to test the code)

2/ group similar names insides these 'clean' categories and rename them accordingly

I have indicated the assumption I made to determine what is the 'true' value to impute. You can change them for example by sorting data by length of string, etc.

Hope this helps.

My best,

data testing;
	infile datalines dlm="09"x;
	input Name :$80. Category :$40.;
	datalines;
A Cote	restaurant
A Cote	restaurant
7-ELEVEN	service 
7ELEVEN	service
A BIG BIG	restaurant
A BIG	restaurant
A CUT ABOVEPARKSON PVL	restaurat
A CUT ABOVANGSAR VIL	restaurant
A DRIVING SCHOOL	service
A DRIVING	services
A LI SHAN XI DING JI	shop
A LI SHAN	sho
A PLUS BEAUTY MEDICAL	medical
A PLUS BEAUTY HOUSE	medilal
;
run;

proc sort data=testing out=testing1;
	by category; /* Assumption 1: the 'true' category is the first in alphabetical order */
run;

/* Step1 : group similar categories */
data testing2;
	set testing1;
	length New_category $ 40.;
	retain New_category;
	_lag_c = lag(Category);
	dif = compged(Category,_lag_c);
	if dif > 100 then New_category = Category;
	drop _lag_c dif;
run;

/* Step2: Sort data by these 'clean' categories */
proc sort data=testing2 out=testing3;
	by category descending Name; /* Assumption 2: the 'true' name is the first in descending alphabetical order */
run;

/* Step3 : group similar names */
data testing4;
	set testing3;
	length New_Name $ 80.;
	by New_category;
	retain New_Name;
	_lag_n = lag(Name);
	dif = compged(Name,_lag_n);
	if first.New_category or dif > 100 then New_Name = Name;
	drop _lag_n dif;
run;

harrylui · Posted 02-19-2020 02:33 AM

thank you

yabwon · Posted 02-17-2020 04:16 AM

Hi @harrylui ,

Something like this:

data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A Cote\restaurant
A Cote\restaurant
7-ELEVEN\service
7ELEVEN\service
A BIG BIG\restaurant
A BIG\restaurant
A CUT ABOVEPARKSON PVL\restaurant
A CUT ABOVANGSAR VIL\restaurant
A DRIVING SCHOOL\service
A DRIVING\service
A LI SHAN XI DING JI\shop
A LI SHAN\shop
A PLUS BEAUTY MEDICAL\medical
A PLUS BEAUTY HOUSE\medical
;
run;

data testing2(drop=_:);
  set testing;
  _Name = lag(Name);
  _Category = lag(Category);
  dif=compged(Name, _Name);
  dif2=compged(Category, _Category);
  if dif<=70 and dif2 <=70 then 
    do;
      Name = _Name;
      Category = _Category;
      match = "ok";
    end;
run;
proc print;
run;

?

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

harrylui · Posted 02-17-2020 11:11 PM

hi all,

tried both of your program, still encounter an issue

if there are three or more observations pass the condition test,

nothing will be changed after the second observation.

i believe the program set the lag(name) at the first initialization and when the program working, the lag(name) never change and cause to this problem. i build a stupid program but how can i enhance it to make it recycle?

hopefully, i can get all the observations =A DRIVING SCHOOL if i pass the dif test.

data testing;
infile datalines dlm="\";
input Name :$80. Category :$40.;
datalines;
A DRIVING SCHOOL\service
A DRIVING\service
A DRIVINGA\service
A DRIVINGb\service
A DRIVINGc\service

;
run;
data testing2;
set testing;
_Name = lag(Name);
_Category = lag(Category);
dif=compged(Name, _Name);
dif2=compged(Category, _Category);
if dif<=100 and dif2 <=100 then
do;
Name = _Name;
Category = _Category;
match = "ok";

/**/
/* _Name = lag(Name);*/
/* _Category = lag(Category);*/
end;
run;
proc print;
run;

data testing3;
set testing2;
drop
_Name
_Category;
run;

data testing4;
set testing3;
_Name = lag(Name);
_Category = lag(Category);
run;

data testing5;
set testing4;
_Name = lag(Name);
_Category = lag(Category);
dif=compged(Name, _Name);
dif2=compged(Category, _Category);
if dif<=100 and dif2 <=100 then
do;
Name = _Name;
Category = _Category;
match = "ok";

/**/
/* _Name = lag(Name);*/
/* _Category = lag(Category);*/
end;
run;

Comparing a row in a dataset with the next row

Re: Comparing a row in a dataset with the next row

Re: Comparing a row in a dataset with the next row

Re: Comparing a row in a dataset with the next row

Re: Comparing a row in a dataset with the next row

Re: Comparing a row in a dataset with the next row

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!