Solved: How to compare delimited text in a column

binhle50 · Posted 09-15-2022 10:31 AM

Hi all,

I have a diagnosis_combine variable that contains multiple diagnoses separated by " | " . I need to create the variable "want" to scan over the diagnosis_combine to see if all diagnoses are the same or different as below:

Diagnosis_combine	want
Ehlers-Danlos syndrome	same
Storage Pool Disease \| Storage Pool Disease \| Factor I, hereditary \| Storage Pool Disease \| Storage Pool Disease	different
Factor I, hereditary \| Factor I, hereditary \| Factor I, hereditary \| Factor I, hereditary	same
vWD, type 2A \| Factor I, hereditary	different
Factor I, hereditary \| Factor I, hereditary \| Factor I, hereditary \| Factor I, hereditary \| Factor I, hereditary \| Factor I, hereditary	same
Factor XIII, hereditary \| Factor VIII, hereditary	same
vWD, type 1 \| Storage Pool Disease

I would highly appreciate if you can help with the best SAS procedures to do it without having to separate diagnosis_combine into multiple columns and compare all columns together.

Best,

Le

ballardw · Posted 09-15-2022 11:20 AM

Is there a known maximum number of diagnoses in that column?

Very likely a solution will involve creating additional variables and if there is not a known maximum then that will be a problem.

With an array of variables it is not that hard to compare all of the values.

One example:

data have;
   infile datalines truncover;
   input Diagnosis_combine $100.;
datalines;
Ehlers-Danlos syndrome 	
Storage Pool Disease | Storage Pool Disease | Factor I, hereditary | Storage Pool Disease | Storage Pool Disease 	
Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary 	
vWD, type 2A | Factor I, hereditary 	
Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary 	
Factor XIII, hereditary | Factor VIII, hereditary 	
vWD, type 1 |  Storage Pool Disease 	
;

data want;
   set have;
   array d (10) $ 25;
   do i=1 to countw(Diagnosis_combine,'|');
      d[i] = strip(scan(Diagnosis_combine,i,'|'));
   end;
   length want $ 9;
   want='same';
   do i=1 to (countw(Diagnosis_combine,'|')-1);
      do j=(i+1) to (countw(Diagnosis_combine,'|'));
         if d[i] ne d[j] then do;
             want='different';
             leave;
         end;
      end;
   end;
   /* after verifying things are working then uncomment 
   the following line*/
/*   drop i j d1-d10;*/
run;

The data step is to have something to work with.

The array size should be the largest known (or expected) number of elements. Replace 10 with that number or "guess", you know your data better than we do.

The first do loop pulls each value into a separate variable. The strip is make strings more consistent. It is likely that your actual data may not have a leading space for the value in the first position but does, if your paste is accurate, for those in other positions which would make them "different".

The nested Do i and Do j compares the first value with the "remaining". In the case of only one value the first limit means no comparison is actually attempted.

The LEAVE instruction says to leave the loops when the first not equal value is found.

Drop the variables you don't need when done.

View solution in original post

ballardw · Posted 09-15-2022 11:20 AM

Is there a known maximum number of diagnoses in that column?

Very likely a solution will involve creating additional variables and if there is not a known maximum then that will be a problem.

With an array of variables it is not that hard to compare all of the values.

One example:

data have;
   infile datalines truncover;
   input Diagnosis_combine $100.;
datalines;
Ehlers-Danlos syndrome 	
Storage Pool Disease | Storage Pool Disease | Factor I, hereditary | Storage Pool Disease | Storage Pool Disease 	
Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary 	
vWD, type 2A | Factor I, hereditary 	
Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary 	
Factor XIII, hereditary | Factor VIII, hereditary 	
vWD, type 1 |  Storage Pool Disease 	
;

data want;
   set have;
   array d (10) $ 25;
   do i=1 to countw(Diagnosis_combine,'|');
      d[i] = strip(scan(Diagnosis_combine,i,'|'));
   end;
   length want $ 9;
   want='same';
   do i=1 to (countw(Diagnosis_combine,'|')-1);
      do j=(i+1) to (countw(Diagnosis_combine,'|'));
         if d[i] ne d[j] then do;
             want='different';
             leave;
         end;
      end;
   end;
   /* after verifying things are working then uncomment 
   the following line*/
/*   drop i j d1-d10;*/
run;

The data step is to have something to work with.

The array size should be the largest known (or expected) number of elements. Replace 10 with that number or "guess", you know your data better than we do.

The first do loop pulls each value into a separate variable. The strip is make strings more consistent. It is likely that your actual data may not have a leading space for the value in the first position but does, if your paste is accurate, for those in other positions which would make them "different".

The nested Do i and Do j compares the first value with the "remaining". In the case of only one value the first limit means no comparison is actually attempted.

The LEAVE instruction says to leave the loops when the first not equal value is found.

Drop the variables you don't need when done.

binhle50 · Posted 09-15-2022 01:32 PM

Thanks so much! That is exactly what I am looking for. I replaced 10 with the largest known number from my data and it works as I expected. Thanks so much for your help!

Best,
Le

Ksharp · Posted 09-16-2022 07:56 AM

data have;
   infile datalines truncover;
   input Diagnosis_combine $100.;
datalines;
Ehlers-Danlos syndrome  
Storage Pool Disease | Storage Pool Disease | Factor I, hereditary | Storage Pool Disease | Storage Pool Disease  
Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary  
vWD, type 2A | Factor I, hereditary  
Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary | Factor I, hereditary  
Factor XIII, hereditary | Factor VIII, hereditary  
vWD, type 1 |  Storage Pool Disease  
;
data want;
 set have;
 want='different';
 if missing(compress(tranwrd(Diagnosis_combine,scan(Diagnosis_combine,1,'|'),''),'|')) then want='same';
run;

binhle50 · Posted 09-16-2022 09:20 AM

Thanks so much!!!
Yes it works great and it should be the most simple method.
Best,
Le

How to compare delimited text in a column

Re: How to compare delimited text in a column

Re: How to compare delimited text in a column

Re: How to compare delimited text in a column

Re: How to compare delimited text in a column

Re: How to compare delimited text in a column

How to compare delimited text in a column

Re: How to compare delimited text in a column

Re: How to compare delimited text in a column

Re: How to compare delimited text in a column

Re: How to compare delimited text in a column

Re: How to compare delimited text in a column

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!