Re: How to find similar character values in a variable

knveraraju91 · Posted 05-17-2018 06:09 PM

Dear,

I am validating a data set that may have values similar to the data set one. When I produce a table,
the number of subjects by 'term' variable value-it should show 4 for "bilirubin decreased' and 3 for vitamin B1 deficiency" for this data set. But because the values are upper and lower case i will one for each 'term' value.

Is there any way I can find these similar character values to make a correction before i output table.

Thank you

data one;
input id 1 term $3-40;
DATALINES;
1 bilirubin decreased
2 Bilirubin decreased
3 bilirubindecreased
4 BILIRUBIN DECREASED
5 Vitamin B1 deficiency
6 vitamin b1 deficiency
7 Vitamin B1 Deficiency
;
PROC SQL;
CREATE TABLE TWO AS
SELECT COUNT(DISTINCT ID) AS NOBS,TERM
FROM ONE
GROUP BY TERM;
QUIT;

Reeza · Posted 05-17-2018 06:13 PM

fuzzy matching 😞

I came across this paper last week, which had some really good tips and instructions on this topic:

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2886-2018.pdf

Basically, COMPGED and some other functions exist to check the 'difference' measure between text strings and then you can limit the results by filtering on the similarity scores.

art297 · Posted 05-17-2018 06:19 PM

Most of your problem could be solved by simply using the upcase function. If you actually doe have some cases like ID3 (where the difference also incorporates the using or not using embedded spaces, then compged would be an excellent choice.

Art, CEO, AnalystFinder.com

art297 · Posted 05-17-2018 06:25 PM

Of course, an alternative to using compged in the case of your examples, would be to just get rid of the blanks. e.g.:

PROC SQL;
  CREATE TABLE TWO AS
    SELECT COUNT(DISTINCT ID) AS NOBS,upcase(compress(TERM,' '))
      FROM ONE
        GROUP BY upcase(compress(TERM))
  ;
QUIT;

However, if there might be differences in spelling as well, then the paper @Reeza mentioned provides an excellent overview of alternatives.

Art, CEO, AnalystFinder.com

novinosrin · Posted 05-17-2018 06:32 PM

Hi and Good evening @art297 I'm afraid your sql will unfortunately remerge back with original to return 7 as you need an assignment to in the select clause to use in the group by clause. Just my 2 cents

adding distinct corrects it though, however it still will remerge and eliminate duplicates:

PROC SQL;
  CREATE TABLE TWO1 AS
    SELECT  distinct COUNT(DISTINCT ID) AS NOBS,upcase(compress(TERM,' '))
      FROM ONE
        GROUP BY upcase(compress(TERM))
  ;
QUIT;

NOTE: The query requires remerging summary statistics back with the
original data.
NOTE: Table WORK.TWO1 created, with 2 rows and 2 columns.

novinosrin · Posted 05-17-2018 06:25 PM

data one;
input id 1 term $3-40;
DATALINES;
1 bilirubin decreased
2 Bilirubin decreased
3 bilirubin decreased
4 BILIRUBIN DECREASED
5 Vitamin B1 deficiency
6 vitamin b1 deficiency
7 Vitamin B1 Deficiency
;

PROC SQL;
CREATE TABLE TWO AS
SELECT  upcase(term) as term,COUNT(DISTINCT ID) AS NOBS
FROM ONE
GROUP BY 1;
QUIT;

Reeza · Posted 05-17-2018 06:35 PM

@novinosrin your third record does not match the original data. If the space is a typo that will work, but if the record is missing the space it will not match correctly.

novinosrin · Posted 05-17-2018 06:37 PM

@Reeza Nice catch, here is a small tweak

data one;
input id 1 term $3-40;
DATALINES;
1 bilirubin decreased
2 Bilirubin decreased
3 bilirubindecreased
4 BILIRUBIN DECREASED
5 Vitamin B1 deficiency
6 vitamin b1 deficiency
7 Vitamin B1 Deficiency
;

PROC SQL;
CREATE TABLE TWO AS
SELECT  upcase(compress(term)) as term,COUNT(DISTINCT ID) AS NOBS
FROM ONE
GROUP BY 1;
QUIT;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away