Hi,
I have a SAS table column which contains some SAS files names extracted from a directory (see below some examples). I want to keep only one row for similar names, I am not sure how to do that, for example the first three rows must be replaced by only one row which can be the first row 0-2004-editcacacon.sas.
0-2004-editcacacon.sas |
0-2005-editcacacon.sas |
0-2112-editcacacon.sas |
1.1 sald-etudettsald_2018.sas |
1.1 sald-etudettsald_2020.sas |
1.1 sald-etudettsald_2020-base 2018.sas |
1.1 vie-etudettvie_2018.sas |
1.1 vie-etudettvie_2020.sas |
1.1 vie-etudettvie_2020-qx_anc.sas |
1.2 sald-kkldmajtab_2018.sas |
1.2 sald-kkldmajtab_2020.sas |
1.2 vie- kkviemajtab_2018.sas |
1.2 vie-kkviemajtab_2020.sas |
1.2 vie-kkviemajtab_2020-qx_anc.sas |
What are your criteria for "similar names"?
Start by writing down all rules that identify similar values.
A simple way would be to extract a long string of letters (at least 8 long for example) from the file names and only keep unique ones. Using a regular expression, for example :
data test;
length txt $60;
input txt &;
datalines;
0-2004-editcacacon.sas
0-2005-editcacacon.sas
0-2112-editcacacon.sas
1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
9.9 2002 pouet-pouet.sas
;
data temp;
if _n_=1 then id + prxparse("/[a-z]{8,}/"); /* minimum word length = 8 */
set test;
call prxsubstr(id, txt, pos, len);
length w $60;
if pos then w = substr(txt, pos, len);
else w = catx(" ", "No key", _n_);
drop id len pos;
run;
proc sort data=temp out=want(drop=w) nodupkey; by w; run;
proc print data=want noobs; run;
It is really not easy. Here is the code you could start with .
data test;
length txt $60;
input txt &;
datalines;
0-2004-editcacacon.sas
0-2005-editcacacon.sas
0-2112-editcacacon.sas
1.1 sald-etudettsald_2018.sas
1.1 sald-etudettsald_2020.sas
1.1 sald-etudettsald_2020-base 2018.sas
1.1 vie-etudettvie_2018.sas
1.1 vie-etudettvie_2020.sas
1.1 vie-etudettvie_2020-qx_anc.sas
1.2 sald-kkldmajtab_2018.sas
1.2 sald-kkldmajtab_2020.sas
1.2 vie- kkviemajtab_2018.sas
1.2 vie-kkviemajtab_2020.sas
1.2 vie-kkviemajtab_2020-qx_anc.sas
9.9 2002 pouet-pouet.sas
;
proc sql;
create table have as
select a.txt as a_txt,b.txt as b_txt ,spedis(compress(a.txt,,'pds'),compress(b.txt,,'pds')) as distance
from test as a,test as b
where calculated distance<50
order by a.txt,distance ;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.