Walternate
Obsidian | Level 7

Hi all,

I have a dataset which is a directory listing produced using a pipe statement. It has folder and file names.

What I'm trying to get at specifically is that there are some cases where the file name is close, but not identical.

Think:

clinic_file_clin1.csv
clinic_file_clin1_c.csv

So clinic_file_ is the beginning of all the file names, clin# is the clinic name (should only be one per directory). The difference will be something at the end like _c, _copy, _fixed, or something like that.

I'm trying to identify the files that occur in the directory more than once.

What is the best approach for this? I assume fuzzy matching of some kind, but from poking around it seems like that is usually across two columns/while matching.

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

If this is only at the end, then something like this:

data have;
input file_name :$50.;
cards;
clinic_file_clin1.csv
clinic_file_clin1_c.csv
clinic_file_clin2.csv
clinic_file_clin2_fixed.csv
;

proc sql;
create table want as
select t1.file_name, t2.file_name as possible_duplicate
from have as t1
cross join have as t2
where find(scan(t1.file_name, 1, "."), scan(t2.file_name, 1, "."), 'it') >0 and t1.file_name ne t2.file_name;
quit;


3 REPLIES
Reeza
Super User
You can self-join the dataset. If you know the names will always be the same apart from a suffix, you can use LIKE.
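
A minimal sketch of the self-join with LIKE, assuming the file names are in a dataset called have with a character variable file_name (as in the accepted solution) and that a variant always adds an underscore plus some text before the extension. The output table name want_like is made up for illustration.

```sas
proc sql;
   create table want_like as
   select t1.file_name,
          t2.file_name as possible_duplicate
   from have as t1, have as t2
   /* Build the pattern <stem>_% from t1, e.g. clinic_file_clin1_%      */
   /* Caveat: in LIKE, "_" is itself a one-character wildcard, so the   */
   /* underscores in the stem match any single character, not just "_". */
   where t2.file_name like cats(scan(t1.file_name, 1, '.'), '_%')
   ;
quit;
```

Because of the wildcard caveat in the comment, a name such as clinic_file_clin10.csv would also be picked up as a variant of clinic_file_clin1.csv, so the result is a list of candidates to review rather than confirmed duplicates.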
ballardw
Super User

"I'm trying to identify the files that occur in the directory more than once. "  Easy: none.

SIMILARLY named is a valid question.

 

The functions COMPGED and similar would be a place to start.

Here is a small example generating three different scores. Lower scores are "more similar" within any one of the comparisons.

data have;
   input str $;
datalines;
abc
abc1
abc_1
abc*
pdq
pdc1
;

proc sql;
   create table example as
   select a.str as stringa,b.str as stringb
         ,compged(a.str,b.str) as compgedscore         
         ,complev(a.str,b.str) as complevscore
         ,spedis(a.str,b.str) as spedisscore
   from have as a, have as b
   where a.str ne b.str
   ;
quit;

Look at some of your values and you could set a threshold for a specific scoring method to select likely "similar" names.
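
As a rough sketch of the threshold idea, the query below keeps only pairs whose COMPGED score falls at or under a cutoff. The cutoff of 100 and the table name likely_similar are placeholders chosen for illustration, not recommended values; pick the cutoff after inspecting the scores on your own names.

```sas
proc sql;
   create table likely_similar as
   select a.str as stringa,
          b.str as stringb,
          compged(a.str, b.str) as compgedscore
   from have as a, have as b
   where a.str ne b.str
     /* 100 is an arbitrary example cutoff; tune it to your data */
     and calculated compgedscore <= 100
   order by compgedscore
   ;
quit;
```

The same pattern works with COMPLEV or SPEDIS, but the scores are on different scales, so each method needs its own cutoff.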

Read the documentation for the functions. I'm not going to repeat paragraphs of details.

Note that you can create custom scoring rules for COMPGED using the CALL COMPCOST routine, but it is likely not worth the effort unless you have hundreds of similarity patterns to look for.
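
For completeness, a hedged sketch of what custom costs could look like. CALL COMPCOST is a DATA step routine, so the pairs are built first and scored afterwards. The cost values of 5 and the names pairs, custom_scores and custom_ged are assumptions for illustration; the idea is only to make differences at the end of a name cheaper than differences elsewhere.

```sas
/* Pair every string with every other string */
proc sql;
   create table pairs as
   select a.str as stringa, b.str as stringb
   from have as a, have as b
   where a.str ne b.str
   ;
quit;

data custom_scores;
   set pairs;
   /* Set the costs once, before the first COMPGED call.          */
   /* 5 is an arbitrary example value for both operations.        */
   if _n_ = 1 then call compcost('append=', 5, 'truncate=', 5);
   custom_ged = compged(stringa, stringb);
run;
```

Operations that are not listed in the call keep their default costs, so check the COMPGED documentation for those defaults before choosing values.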

