Hello!
Performing one of the tasks, I got stuck and can not find the best way to solve the problem.
I have a table.
proc sql;
create table tmp (num int, name char(200));
insert into tmp
values (1,'281.3891.3891.281')
values (2,'3891.281.281.3891')
values (3,'1162.5645.5645.500835.500835.1162')
values (4,'5645.500835.500835.1162.1162.5645')
values (5,'500835.1162.1162.5645.5645.500835')
values (6,'1349.1162.1162.5645.5645.500835.500835.1349')
values (7,'1162.5645.5645.500835.500835.1349.1349.1162')
values (8,'5645.500835.500835.1349.1349.1162.1162.5645')
values (9,'500835.1349.1349.1162.1162.5645.5645.500835');
quit;
Each line is a chain, and taking this into account, it is clear that the lines {1;2}, {3;4;5}, {6,7,8,9} are duplicates.
The question is, how would it be most correct to filter the rows so that only one row remains inside each of the duplicate groups?
(for example the lowest value of the first number like rows 1, 3, 7)
I was thinking of creating an array for each row and sorting the values by a common shift within each array. But I'm not sure if this is the smartest way ..
THX!
As I understand, your definition of duplicates is the names whose dot-delimited parts form the same circular queue. If so, we need to:
(EDIT: @ChrisNZ has pointed out - correctly - that there's a hole in the original queue rotation logic in case the largest chunks ends up both at the top and the bottom. The queue should be rotated further and stop when the largest chunk is on the top but not on the bottom. Below, a provision is also made to stop the loop from rotating infinitely if all the chunks in the string are the same. The test string @ChrisNZ used to show the flaw is included in the sample data set as the 0 record.)
In other (SAS) words:
data have ;
input NUM NAME :$200. ;
cards ;
0 281.281.3891.3891
1 281.3891.3891.281
2 3891.281.281.3891
3 1162.5645.5645.500835.500835.1162
4 5645.500835.500835.1162.1162.5645
5 500835.1162.1162.5645.5645.500835
6 1349.1162.1162.5645.5645.500835.500835.1349
7 1162.5645.5645.500835.500835.1349.1349.1162
8 5645.500835.500835.1349.1349.1162.1162.5645
9 500835.1349.1349.1162.1162.5645.5645.500835
10 5645.1162.1162.500835.500835.5645
11 1162.500835.500835.5645.5645.1162
12 500835.5645.5645.1162.1162.500835
;
run ;
data want (drop = _:) ;
if _n_ = 1 then do ;
dcl hash h () ;
h.definekey ("_nm") ;
h.definedone () ;
end ;
set have ;
length _tm _t $ 16 ;
_nm = name ;
do _n_ = 1 to countw (_nm) ;
_t = scan (_nm, _n_) ;
if _t > _tm then _tm = _t ;
end ;
do _n_ = 1 to countw (_nm) - 1 while (not (scan (_nm, 1) < _t = _tm)) ;
_nm = catx (".", substr (_nm, findc (_nm, ".") + 1), scan (_nm, 1)) ;
_t = scan (_nm, -1) ;
end ;
if h.check() ne 0 ;
h.add() ;
run ;
As a result, you get the records 0, 3, 6, 10.
Kind regards
Paul D.
I am not exactly sure that I understand exactly what you want but see if this gets close.
proc sql; create table tmp (num int, name char(200)); insert into tmp values (1,'281.3891.3891.281') values (2,'3891.281.281.3891') values (3,'1162.5645.5645.500835.500835.1162') values (4,'5645.500835.500835.1162.1162.5645') values (5,'500835.1162.1162.5645.5645.500835') values (6,'1349.1162.1162.5645.5645.500835.500835.1349') values (7,'1162.5645.5645.500835.500835.1349.1349.1162') values (8,'5645.500835.500835.1349.1349.1162.1162.5645') values (9,'500835.1349.1349.1162.1162.5645.5645.500835'); quit; data tmp2; set tmp; array v (50) $ 6; length newname $200; do i= 1 to countw(name); v[i]= scan(name,i,'.'); end; call sortc(of v(*)); newname=catx('.',of v(*)); drop v: i; run; proc sort data=tmp2 out=want nodupkey; by newname; run;
Another approach using SQL:
proc sql;
create table want as
select *
from have as a
where not exists (
select * from have as b
where a.num > b.num and length(a.name) = length(b.name) and
index(cats(".", b.name, ".", b.name,"."), cats(".", a.name, ".")) > 0);
quit;
Wow ... finally a way of doing this purely in terms of sets without procedural code. A clever way to identify the cycles, too!
Kind regards
Paul D.
@PGStats Why the surrounding dots? Isn't this enough?
where a.num > b.num and length(a.name) = length(b.name) and
index(cats( b.name, ".", b.name), strip( a.name ))
@ChrisNZ Adding delimiters might indeed not be necessary. It just seemed safer
Keeps getting still better. Which makes me wonder why the condition equating the lengths is needed, as the query does fine without it, too:
proc sql ;
create table want as
select * from have a
where not exists
(select 1 from have b
where a.num > b.num
and find (cats (b.name, ".", b.name), trim (a.name))
) ;
quit ;
Kind regards
Paul D.
> why the condition equating the lengths is needed
For speed: it's much faster to check the length than the cat() result.
Thank you, I am checking
Does it have to be an exact match?
Yes, exact match using parallel shift(can't catch how better to say).
3891.281.3891 would it still be a duplicate?
Perhars in start message there was placed not very good example.. groups {3;4;5} and {10;11;12} are different
proc sql;
create table tmp (num int, name char(200));
insert into tmp
values (1,'281.3891.3891.281')
values (2,'3891.281.281.3891')
values (3,'1162.5645.5645.500835.500835.1162')
values (4,'5645.500835.500835.1162.1162.5645')
values (5,'500835.1162.1162.5645.5645.500835')
values (6,'1349.1162.1162.5645.5645.500835.500835.1349')
values (7,'1162.5645.5645.500835.500835.1349.1349.1162')
values (8,'5645.500835.500835.1349.1349.1162.1162.5645')
values (9,'500835.1349.1349.1162.1162.5645.5645.500835')
values (10,'5645.1162.1162.500835.500835.5645')
values (11,'1162.500835.500835.5645.5645.1162')
values (12,'500835.5645.5645.1162.1162.500835');
quit;
p.s. While writing I am trying to do my best, but seems my English is rather unclear.. I am very Sorry for it..
It works and it is close, Thank you!
Could you say is it possible to improve your code according to my second example? - groups {3;4;5} and {10;11;12} should differ.
Values in 'name' field are looped chains.
If sequence of the elements is different, we have different chains.
Under numbers 3, 4, 5 there is the same sequence of the elements with different starting point; the chain is the same - so we are having duplicates.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.