Solved: Re: non-trivial algorythm for char fields

Ivan555 · Posted 10-22-2019 11:21 AM

Hello!

Performing one of the tasks, I got stuck and can not find the best way to solve the problem.

I have a table.

proc sql;
create table tmp (num int, name char(200));
insert into tmp
	values (1,'281.3891.3891.281')
	values (2,'3891.281.281.3891')
	values (3,'1162.5645.5645.500835.500835.1162')
	values (4,'5645.500835.500835.1162.1162.5645')
	values (5,'500835.1162.1162.5645.5645.500835')
	values (6,'1349.1162.1162.5645.5645.500835.500835.1349')
	values (7,'1162.5645.5645.500835.500835.1349.1349.1162')
	values (8,'5645.500835.500835.1349.1349.1162.1162.5645')
	values (9,'500835.1349.1349.1162.1162.5645.5645.500835');
quit;

Each line is a chain, and taking this into account, it is clear that the lines {1;2}, {3;4;5}, {6,7,8,9} are duplicates.

The question is, how would it be most correct to filter the rows so that only one row remains inside each of the duplicate groups?

(for example the lowest value of the first number like rows 1, 3, 7)

I was thinking of creating an array for each row and sorting the values by a common shift within each array. But I'm not sure if this is the smartest way ..

THX!

hashman · Posted 10-22-2019 03:40 PM

@Ivan555:

As I understand, your definition of duplicates is the names whose dot-delimited parts form the same circular queue. If so, we need to:

find the largest (or smallest) part
rotate the queue until this part ends up at the end (or beginning) of the name
if a subsequent name formed in such a manner is the same as previous, ignore the record

(EDIT: @ChrisNZ has pointed out - correctly - that there's a hole in the original queue rotation logic in case the largest chunks ends up both at the top and the bottom. The queue should be rotated further and stop when the largest chunk is on the top but not on the bottom. Below, a provision is also made to stop the loop from rotating infinitely if all the chunks in the string are the same. The test string @ChrisNZ used to show the flaw is included in the sample data set as the 0 record.)

In other (SAS) words:

data have ;                                                                                                                             
  input NUM NAME :$200. ;                                                                                                               
  cards ;                                                                                                                               
 0  281.281.3891.3891                                                                                                                   
 1  281.3891.3891.281                                                                                                                   
 2  3891.281.281.3891                                                                                                                   
 3  1162.5645.5645.500835.500835.1162                                                                                                   
 4  5645.500835.500835.1162.1162.5645                                                                                                   
 5  500835.1162.1162.5645.5645.500835                                                                                                   
 6  1349.1162.1162.5645.5645.500835.500835.1349                                                                                         
 7  1162.5645.5645.500835.500835.1349.1349.1162                                                                                         
 8  5645.500835.500835.1349.1349.1162.1162.5645                                                                                         
 9  500835.1349.1349.1162.1162.5645.5645.500835                                                                                         
10  5645.1162.1162.500835.500835.5645                                                                                                   
11  1162.500835.500835.5645.5645.1162                                                                                                   
12  500835.5645.5645.1162.1162.500835                                                                                                   
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("_nm") ;                                                                                                               
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  length _tm _t $ 16 ;                                                                                                                  
  _nm = name ;                                                                                                                          
  do _n_ = 1 to countw (_nm) ;                                                                                                          
    _t = scan (_nm, _n_) ;                                                                                                              
    if _t > _tm then _tm = _t ;                                                                                                         
  end ;                                                                                                                                 
  do _n_ = 1 to countw (_nm) - 1 while (not (scan (_nm, 1) < _t = _tm)) ;                                                               
    _nm = catx (".", substr (_nm, findc (_nm, ".") + 1), scan (_nm, 1)) ;                                                               
    _t = scan (_nm, -1) ;                                                                                                               
  end ;                                                                                                                                 
  if h.check() ne 0 ;                                                                                                                   
  h.add() ;                                                                                                                             
run ;

As a result, you get the records 0, 3, 6, 10.

Kind regards

Paul D.

View solution in original post

ballardw · Posted 10-22-2019 11:56 AM

I am not exactly sure that I understand exactly what you want but see if this gets close.

proc sql;
create table tmp (num int, name char(200));
insert into tmp
	values (1,'281.3891.3891.281')
	values (2,'3891.281.281.3891')
	values (3,'1162.5645.5645.500835.500835.1162')
	values (4,'5645.500835.500835.1162.1162.5645')
	values (5,'500835.1162.1162.5645.5645.500835')
	values (6,'1349.1162.1162.5645.5645.500835.500835.1349')
	values (7,'1162.5645.5645.500835.500835.1349.1349.1162')
	values (8,'5645.500835.500835.1349.1349.1162.1162.5645')
	values (9,'500835.1349.1349.1162.1162.5645.5645.500835');
quit;

data tmp2;
   set tmp;
   array v (50) $ 6;
   length newname $200;
   do i= 1 to countw(name);
      v[i]= scan(name,i,'.');
   end;
   call sortc(of v(*));
   newname=catx('.',of v(*));
   drop v: i;
run;

proc sort data=tmp2 out=want nodupkey;
  by newname;
run;

PGStats · Posted 10-23-2019 02:14 PM

Another approach using SQL:

proc sql;
create table want as
select *
from have as a 
where not exists (
	select * from have as b
	where a.num > b.num and length(a.name) = length(b.name) and
		index(cats(".", b.name, ".", b.name,"."), cats(".", a.name, ".")) > 0);
quit;

PG

hashman · Posted 10-23-2019 02:48 PM

@PGStats:

Wow ... finally a way of doing this purely in terms of sets without procedural code. A clever way to identify the cycles, too!

Kind regards

Paul D.

ChrisNZ · Posted 10-23-2019 04:49 PM

@PGStats Why the surrounding dots? Isn't this enough?

	where a.num > b.num and length(a.name) = length(b.name) and
		index(cats( b.name, ".", b.name), strip( a.name ))

High-Performance SAS Coding - Third Edition

PGStats · Posted 10-23-2019 06:41 PM

@ChrisNZ Adding delimiters might indeed not be necessary. It just seemed safer

PG

hashman · Posted 10-23-2019 10:36 PM

@ChrisNZ:

Keeps getting still better. Which makes me wonder why the condition equating the lengths is needed, as the query does fine without it, too:

proc sql ;                                                                                                                              
  create table want as                                                                                                                  
  select * from have a                                                                                                                  
  where not exists                                                                                                                      
 (select 1 from have b                                                                                                               
  where  a.num > b.num                                                                                                                  
  and    find (cats (b.name, ".", b.name), trim (a.name))                                                                              
  ) ;                                                                                                                                   
quit ;

Kind regards

Paul D.

ChrisNZ · Posted 10-23-2019 10:49 PM

> why the condition equating the lengths is needed

For speed: it's much faster to check the length than the cat() result.

High-Performance SAS Coding - Third Edition

PGStats · Posted 10-23-2019 11:07 PM

@hashman , @ChrisNZ ... requiring equal lengths is good for speed, yes, but mostly, it is essential to avoid matching substrings: 1.2.3.4.5.6 with 3.4.5

PG

FreelanceReinh · Posted 10-24-2019 02:57 AM

@ChrisNZ wrote:

@PGStats Why the surrounding dots? Isn't this enough?
	where a.num > b.num and length(a.name) = length(b.name) and
		index(cats( b.name, ".", b.name), strip( a.name )) 

The surrounding delimiters are necessary. Try this input dataset:

data have;
input num name $;
cards;
1 111
2 1.1
;

Reeza · Posted 10-22-2019 12:04 PM

Does it have to be an exact match?
What if the 2nd row was missing one code but since the codes are duplicated?

ie

if 2 was
3891.281.3891 would it still be a duplicate?

Ivan555 · Posted 10-22-2019 12:40 PM

@ballardw

Thank you, I am checking

@Reeza

Does it have to be an exact match?

Yes, exact match using parallel shift(can't catch how better to say).

3891.281.3891 would it still be a duplicate?

Perhars in start message there was placed not very good example.. groups {3;4;5} and {10;11;12} are different

proc sql;
create table tmp (num int, name char(200));
insert into tmp
	values (1,'281.3891.3891.281')
	values (2,'3891.281.281.3891')
	values (3,'1162.5645.5645.500835.500835.1162')
	values (4,'5645.500835.500835.1162.1162.5645')
	values (5,'500835.1162.1162.5645.5645.500835')
	values (6,'1349.1162.1162.5645.5645.500835.500835.1349')
	values (7,'1162.5645.5645.500835.500835.1349.1349.1162')
	values (8,'5645.500835.500835.1349.1349.1162.1162.5645')
	values (9,'500835.1349.1349.1162.1162.5645.5645.500835')
	values (10,'5645.1162.1162.500835.500835.5645')
	values (11,'1162.500835.500835.5645.5645.1162')
	values (12,'500835.5645.5645.1162.1162.500835');
quit;

p.s. While writing I am trying to do my best, but seems my English is rather unclear.. I am very Sorry for it..

Ivan555 · Posted 10-22-2019 01:08 PM

@ballardw

It works and it is close, Thank you!

Could you say is it possible to improve your code according to my second example? - groups {3;4;5} and {10;11;12} should differ.

Reeza · Posted 10-22-2019 01:10 PM

What makes them different?

Ivan555 · Posted 10-22-2019 01:25 PM

Values in 'name' field are looped chains.

If sequence of the elements is different, we have different chains.

Under numbers 3, 4, 5 there is the same sequence of the elements with different starting point; the chain is the same - so we are having duplicates.

SAS Innovate 2025: Call for Content

Classroom Training Available!