Hi,
I have a SAS table column which contains some SAS files names extracted from a directory (see below some examples). I want to keep only one row for similar names, I am not sure how to do that, for example the first three rows must be replaced by only one row which can be the first row 0-2004-editcacacon.sas.
0-2004-editcacacon.sas |
0-2005-editcacacon.sas |
0-2112-editcacacon.sas |
1.1 sald-etudettsald_2018.sas |
1.1 sald-etudettsald_2020.sas |
1.1 sald-etudettsald_2020-base 2018.sas |
1.1 vie-etudettvie_2018.sas |
1.1 vie-etudettvie_2020.sas |
1.1 vie-etudettvie_2020-qx_anc.sas |
1.2 sald-kkldmajtab_2018.sas |
1.2 sald-kkldmajtab_2020.sas |
1.2 vie- kkviemajtab_2018.sas |
1.2 vie-kkviemajtab_2020.sas |
1.2 vie-kkviemajtab_2020-qx_anc.sas |
What are your criteria for "similar names"?
Start by writing down all rules that identify similar values.
A simple way would be to extract a long string of letters (at least 8 long for example) from the file names and only keep unique ones. Using a regular expression, for example :
data test;
length txt $60;
input txt &;
datalines;
0-2004-editcacacon.sas
0-2005-editcacacon.sas
0-2112-editcacacon.sas
1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
9.9 2002 pouet-pouet.sas
;
data temp;
if _n_=1 then id + prxparse("/[a-z]{8,}/"); /* minimum word length = 8 */
set test;
call prxsubstr(id, txt, pos, len);
length w $60;
if pos then w = substr(txt, pos, len);
else w = catx(" ", "No key", _n_);
drop id len pos;
run;
proc sort data=temp out=want(drop=w) nodupkey; by w; run;
proc print data=want noobs; run;
It is really not easy. Here is the code you could start with .
data test;
length txt $60;
input txt &;
datalines;
0-2004-editcacacon.sas
0-2005-editcacacon.sas
0-2112-editcacacon.sas
1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
9.9 2002 pouet-pouet.sas
;
proc sql;
create table have as
select a.txt as a_txt,b.txt as b_txt ,spedis(compress(a.txt,,'pds'),compress(b.txt,,'pds')) as distance
from test as a,test as b
where calculated distance<50
order by a.txt,distance ;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.