Dear,
I am validating a data set that may have values similar to the data set one. When I produce a table,
the number of subjects by 'term' variable value-it should show 4 for "bilirubin decreased' and 3 for vitamin B1 deficiency" for this data set. But because the values are upper and lower case i will one for each 'term' value.
Is there any way I can find these similar character values to make a correction before i output table.
Thank you
data one;
input id 1 term $3-40;
DATALINES;
1 bilirubin decreased
2 Bilirubin decreased
3 bilirubindecreased
4 BILIRUBIN DECREASED
5 Vitamin B1 deficiency
6 vitamin b1 deficiency
7 Vitamin B1 Deficiency
;
PROC SQL;
CREATE TABLE TWO AS
SELECT COUNT(DISTINCT ID) AS NOBS,TERM
FROM ONE
GROUP BY TERM;
QUIT;
fuzzy matching 😞
I came across this paper last week, which had some really good tips and instructions on this topic:
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2886-2018.pdf
Basically, COMPGED and some other functions exist to check the 'difference' measure between text strings and then you can limit the results by filtering on the similarity scores.
Most of your problem could be solved by simply using the upcase function. If you actually doe have some cases like ID3 (where the difference also incorporates the using or not using embedded spaces, then compged would be an excellent choice.
Art, CEO, AnalystFinder.com
Of course, an alternative to using compged in the case of your examples, would be to just get rid of the blanks. e.g.:
PROC SQL;
CREATE TABLE TWO AS
SELECT COUNT(DISTINCT ID) AS NOBS,upcase(compress(TERM,' '))
FROM ONE
GROUP BY upcase(compress(TERM))
;
QUIT;
However, if there might be differences in spelling as well, then the paper @Reeza mentioned provides an excellent overview of alternatives.
Art, CEO, AnalystFinder.com
Hi and Good evening @art297 I'm afraid your sql will unfortunately remerge back with original to return 7 as you need an assignment to in the select clause to use in the group by clause. Just my 2 cents
adding distinct corrects it though, however it still will remerge and eliminate duplicates:
PROC SQL;
CREATE TABLE TWO1 AS
SELECT distinct COUNT(DISTINCT ID) AS NOBS,upcase(compress(TERM,' '))
FROM ONE
GROUP BY upcase(compress(TERM))
;
QUIT;
NOTE: The query requires remerging summary statistics back with the
original data.
NOTE: Table WORK.TWO1 created, with 2 rows and 2 columns.
data one;
input id 1 term $3-40;
DATALINES;
1 bilirubin decreased
2 Bilirubin decreased
3 bilirubin decreased
4 BILIRUBIN DECREASED
5 Vitamin B1 deficiency
6 vitamin b1 deficiency
7 Vitamin B1 Deficiency
;
PROC SQL;
CREATE TABLE TWO AS
SELECT upcase(term) as term,COUNT(DISTINCT ID) AS NOBS
FROM ONE
GROUP BY 1;
QUIT;
@novinosrin your third record does not match the original data. If the space is a typo that will work, but if the record is missing the space it will not match correctly.
@Reeza Nice catch, here is a small tweak
data one;
input id 1 term $3-40;
DATALINES;
1 bilirubin decreased
2 Bilirubin decreased
3 bilirubindecreased
4 BILIRUBIN DECREASED
5 Vitamin B1 deficiency
6 vitamin b1 deficiency
7 Vitamin B1 Deficiency
;
PROC SQL;
CREATE TABLE TWO AS
SELECT upcase(compress(term)) as term,COUNT(DISTINCT ID) AS NOBS
FROM ONE
GROUP BY 1;
QUIT;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.