BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Ivan555
Quartz | Level 8

Hello!

 

Performing one of the tasks, I got stuck and can not find the best way to solve the problem.

 

I have a table.

 

 

proc sql;
create table tmp (num int, name char(200));
insert into tmp
	values (1,'281.3891.3891.281')
	values (2,'3891.281.281.3891')
	values (3,'1162.5645.5645.500835.500835.1162')
	values (4,'5645.500835.500835.1162.1162.5645')
	values (5,'500835.1162.1162.5645.5645.500835')
	values (6,'1349.1162.1162.5645.5645.500835.500835.1349')
	values (7,'1162.5645.5645.500835.500835.1349.1349.1162')
	values (8,'5645.500835.500835.1349.1349.1162.1162.5645')
	values (9,'500835.1349.1349.1162.1162.5645.5645.500835');
quit;

 

 

Each line is a chain, and taking this into account, it is clear that the lines {1;2}, {3;4;5}, {6,7,8,9} are duplicates.

 

The question is, how would it be most correct to filter the rows so that only one row remains inside each of the duplicate groups?

(for example the lowest value of the first number like rows 1, 3, 7)

 

I was thinking of creating an array for each row and sorting the values ​​by a common shift within each array. But I'm not sure if this is the smartest way ..

 

THX!

1 ACCEPTED SOLUTION

Accepted Solutions
hashman
Ammonite | Level 13

@Ivan555:

As I understand, your definition of duplicates is the names whose dot-delimited parts form the same circular queue. If so, we need to:

  1. find the largest (or smallest) part
  2. rotate the queue until this part ends up at the end (or beginning) of the name
  3. if a subsequent name formed in such a manner is the same as previous, ignore the record

(EDIT: @ChrisNZ has pointed out - correctly - that there's a hole in the original queue rotation logic in case the largest chunks ends up both at the top and the bottom. The queue should be rotated further and stop when the largest chunk is on the top but not on the bottom. Below, a provision is also made to stop the loop from rotating infinitely if all the chunks in the string are the same. The test string @ChrisNZ used to show the flaw is included in the sample data set as the 0 record.)

 

In other (SAS) words:

data have ;                                                                                                                             
  input NUM NAME :$200. ;                                                                                                               
  cards ;                                                                                                                               
 0  281.281.3891.3891                                                                                                                   
 1  281.3891.3891.281                                                                                                                   
 2  3891.281.281.3891                                                                                                                   
 3  1162.5645.5645.500835.500835.1162                                                                                                   
 4  5645.500835.500835.1162.1162.5645                                                                                                   
 5  500835.1162.1162.5645.5645.500835                                                                                                   
 6  1349.1162.1162.5645.5645.500835.500835.1349                                                                                         
 7  1162.5645.5645.500835.500835.1349.1349.1162                                                                                         
 8  5645.500835.500835.1349.1349.1162.1162.5645                                                                                         
 9  500835.1349.1349.1162.1162.5645.5645.500835                                                                                         
10  5645.1162.1162.500835.500835.5645                                                                                                   
11  1162.500835.500835.5645.5645.1162                                                                                                   
12  500835.5645.5645.1162.1162.500835                                                                                                   
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("_nm") ;                                                                                                               
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  length _tm _t $ 16 ;                                                                                                                  
  _nm = name ;                                                                                                                          
  do _n_ = 1 to countw (_nm) ;                                                                                                          
    _t = scan (_nm, _n_) ;                                                                                                              
    if _t > _tm then _tm = _t ;                                                                                                         
  end ;                                                                                                                                 
  do _n_ = 1 to countw (_nm) - 1 while (not (scan (_nm, 1) < _t = _tm)) ;                                                               
    _nm = catx (".", substr (_nm, findc (_nm, ".") + 1), scan (_nm, 1)) ;                                                               
    _t = scan (_nm, -1) ;                                                                                                               
  end ;                                                                                                                                 
  if h.check() ne 0 ;                                                                                                                   
  h.add() ;                                                                                                                             
run ;                                              

As a result, you get the records 0, 3, 6, 10.

 

Kind regards

Paul D. 

View solution in original post

41 REPLIES 41
ballardw
Super User

I am not exactly sure that I understand exactly what you want but see if this gets close.

proc sql;
create table tmp (num int, name char(200));
insert into tmp
	values (1,'281.3891.3891.281')
	values (2,'3891.281.281.3891')
	values (3,'1162.5645.5645.500835.500835.1162')
	values (4,'5645.500835.500835.1162.1162.5645')
	values (5,'500835.1162.1162.5645.5645.500835')
	values (6,'1349.1162.1162.5645.5645.500835.500835.1349')
	values (7,'1162.5645.5645.500835.500835.1349.1349.1162')
	values (8,'5645.500835.500835.1349.1349.1162.1162.5645')
	values (9,'500835.1349.1349.1162.1162.5645.5645.500835');
quit;

data tmp2;
   set tmp;
   array v (50) $ 6;
   length newname $200;
   do i= 1 to countw(name);
      v[i]= scan(name,i,'.');
   end;
   call sortc(of v(*));
   newname=catx('.',of v(*));
   drop v: i;
run;

proc sort data=tmp2 out=want nodupkey;
  by newname;
run;
   
PGStats
Opal | Level 21

Another approach using SQL:

 

proc sql;
create table want as
select *
from have as a 
where not exists (
	select * from have as b
	where a.num > b.num and length(a.name) = length(b.name) and
		index(cats(".", b.name, ".", b.name,"."), cats(".", a.name, ".")) > 0);
quit;
PG
hashman
Ammonite | Level 13

@PGStats:

Wow ... finally a way of doing this purely in terms of sets without procedural code. A clever way to identify the cycles, too!

 

Kind regards

Paul D. 

ChrisNZ
Tourmaline | Level 20

@PGStats Why the surrounding dots? Isn't this enough?

	where a.num > b.num and length(a.name) = length(b.name) and
		index(cats( b.name, ".", b.name), strip( a.name )) 

 

 

PGStats
Opal | Level 21

@ChrisNZ Adding delimiters might indeed not be necessary. It just seemed safer Smiley Wink

PG
hashman
Ammonite | Level 13

@ChrisNZ:

Keeps getting still better. Which makes me wonder why the condition equating the lengths is needed, as the query does fine without it, too:

proc sql ;                                                                                                                              
  create table want as                                                                                                                  
  select * from have a                                                                                                                  
  where not exists                                                                                                                      
 (select 1 from have b                                                                                                               
  where  a.num > b.num                                                                                                                  
  and    find (cats (b.name, ".", b.name), trim (a.name))                                                                              
  ) ;                                                                                                                                   
quit ;    

Kind regards

Paul D.

ChrisNZ
Tourmaline | Level 20

> why the condition equating the lengths is needed

For speed: it's much faster to check the length than the cat() result.

PGStats
Opal | Level 21

@hashman , @ChrisNZ ... requiring equal lengths is good for speed, yes, but mostly, it is essential to avoid matching substrings: 1.2.3.4.5.6 with 3.4.5

PG
FreelanceReinh
Jade | Level 19

@ChrisNZ wrote:

@PGStats Why the surrounding dots? Isn't this enough?

	where a.num > b.num and length(a.name) = length(b.name) and
		index(cats( b.name, ".", b.name), strip( a.name )) 

 


The surrounding delimiters are necessary. Try this input dataset:

data have;
input num name $;
cards;
1 111
2 1.1
;
Reeza
Super User
Does it have to be an exact match?
What if the 2nd row was missing one code but since the codes are duplicated?

ie

if 2 was
3891.281.3891 would it still be a duplicate?
Ivan555
Quartz | Level 8

@ballardw 

Thank you, I am checking

 

 

@Reeza  

Does it have to be an exact match?

Yes, exact match using parallel shift(can't catch how better to say).

3891.281.3891 would it still be a duplicate?

 

Perhars in start message there was placed not very good example.. groups {3;4;5} and {10;11;12} are different

 

proc sql;
create table tmp (num int, name char(200));
insert into tmp
	values (1,'281.3891.3891.281')
	values (2,'3891.281.281.3891')
	values (3,'1162.5645.5645.500835.500835.1162')
	values (4,'5645.500835.500835.1162.1162.5645')
	values (5,'500835.1162.1162.5645.5645.500835')
	values (6,'1349.1162.1162.5645.5645.500835.500835.1349')
	values (7,'1162.5645.5645.500835.500835.1349.1349.1162')
	values (8,'5645.500835.500835.1349.1349.1162.1162.5645')
	values (9,'500835.1349.1349.1162.1162.5645.5645.500835')
	values (10,'5645.1162.1162.500835.500835.5645')
	values (11,'1162.500835.500835.5645.5645.1162')
	values (12,'500835.5645.5645.1162.1162.500835');
quit;

 

p.s. While writing I am trying to do my best, but seems my English is rather unclear.. I am very Sorry for it..

Ivan555
Quartz | Level 8

@ballardw 

 

It works and it is close, Thank you!

Could you say is it possible to improve your code according to my second example? - groups {3;4;5} and {10;11;12} should differ.

Reeza
Super User
What makes them different?
Ivan555
Quartz | Level 8

Values in 'name' field are looped chains.

If sequence of the elements is different, we have different chains.

Under numbers 3, 4, 5 there is the same sequence of the elements with different starting point; the chain is the same - so we are having duplicates.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 41 replies
  • 2389 views
  • 52 likes
  • 8 in conversation