BookmarkSubscribeRSS Feed
afsand
Calcite | Level 5

Hi,

 

I have a SAS table column which contains some SAS files names extracted from a directory (see below some examples). I want to keep only one row for similar names, I am not sure how to do that, for example the first three rows must be replaced by only one row which can be the first row 0-2004-editcacacon.sas.

 

0-2004-editcacacon.sas
0-2005-editcacacon.sas

0-2112-editcacacon.sas

1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
4 REPLIES 4
mkeintz
PROC Star

What are your criteria for "similar names"?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
andreas_lds
Jade | Level 19

Start by writing down all rules that identify similar values.

PGStats
Opal | Level 21

A simple way would be to extract a long string of letters (at least 8 long for example) from the file names and only keep unique ones. Using a regular expression, for example :

 

data test;
length txt $60;
input txt &;
datalines;
0-2004-editcacacon.sas
0-2005-editcacacon.sas
0-2112-editcacacon.sas
1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
9.9 2002 pouet-pouet.sas
;

data temp;
if _n_=1 then id + prxparse("/[a-z]{8,}/"); /* minimum word length = 8 */
set test;
call prxsubstr(id, txt, pos, len);
length w $60;
if pos then w = substr(txt, pos, len);
else w = catx(" ", "No key", _n_);
drop id len pos;
run;

proc sort data=temp out=want(drop=w) nodupkey; by w; run;

proc print data=want noobs; run;

PGStats_0-1657142680673.png

 

PG
Ksharp
Super User

It is really not easy. Here is the code you could start with .

 

data test;
length txt $60;
input txt &;
datalines;
0-2004-editcacacon.sas
0-2005-editcacacon.sas
0-2112-editcacacon.sas
1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
9.9 2002 pouet-pouet.sas
;

proc sql;
create table have as
select a.txt as a_txt,b.txt as b_txt ,spedis(compress(a.txt,,'pds'),compress(b.txt,,'pds')) as distance
 from test as a,test as b
  where calculated distance<50
  order by a.txt,distance ;
quit;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 628 views
  • 0 likes
  • 5 in conversation