BookmarkSubscribeRSS Feed
afsand
Calcite | Level 5

Hi,

 

I have a SAS table column which contains some SAS files names extracted from a directory (see below some examples). I want to keep only one row for similar names, I am not sure how to do that, for example the first three rows must be replaced by only one row which can be the first row 0-2004-editcacacon.sas.

 

0-2004-editcacacon.sas
0-2005-editcacacon.sas

0-2112-editcacacon.sas

1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
4 REPLIES 4
mkeintz
PROC Star

What are your criteria for "similar names"?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
andreas_lds
Jade | Level 19

Start by writing down all rules that identify similar values.

PGStats
Opal | Level 21

A simple way would be to extract a long string of letters (at least 8 long for example) from the file names and only keep unique ones. Using a regular expression, for example :

 

data test;
length txt $60;
input txt &;
datalines;
0-2004-editcacacon.sas
0-2005-editcacacon.sas
0-2112-editcacacon.sas
1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
9.9 2002 pouet-pouet.sas
;

data temp;
if _n_=1 then id + prxparse("/[a-z]{8,}/"); /* minimum word length = 8 */
set test;
call prxsubstr(id, txt, pos, len);
length w $60;
if pos then w = substr(txt, pos, len);
else w = catx(" ", "No key", _n_);
drop id len pos;
run;

proc sort data=temp out=want(drop=w) nodupkey; by w; run;

proc print data=want noobs; run;

PGStats_0-1657142680673.png

 

PG
Ksharp
Super User

It is really not easy. Here is the code you could start with .

 

data test;
length txt $60;
input txt &;
datalines;
0-2004-editcacacon.sas
0-2005-editcacacon.sas
0-2112-editcacacon.sas
1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
9.9 2002 pouet-pouet.sas
;

proc sql;
create table have as
select a.txt as a_txt,b.txt as b_txt ,spedis(compress(a.txt,,'pds'),compress(b.txt,,'pds')) as distance
 from test as a,test as b
  where calculated distance<50
  order by a.txt,distance ;
quit;

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1134 views
  • 0 likes
  • 5 in conversation