BookmarkSubscribeRSS Feed
newsas007
Quartz | Level 8

Hi there, 

I have 2 keywords that i need to search for and create new variables. The Keywords Med_A and Med_B are somewhere in free text in ColA and ColB. I need two create 2 new variables Med_A_Yes and Med_B_Yes. Basically these variables say if the ID has Med_A or Med_B.

data have;
   infile datalines dlm=':';
   input ID ColA :$100. ColB :$100.;
 datalines;
1:	aam,Med_A,fsdsd:		fdfd,Med_B
2:	asd,Med_A,dasde:		Med_A,dgdgd
3:	dsd,Med_B,hrte:			Med_B,dddfd
4:	Med_A,dsfsf,rwrr++:		dgdgd,Med_A
5:	fdfdf,gdfgdg,Med_B:		sfdfs,Med_A
6:	Med_B,fssff,rrtwr,:		Med_B,gsgsgs
7:	ffdfd,Med_A,gdgdgg:		Med_A,gsgse
8:	Med_A,fsfgghe,fsgyt:	Med_B,gdfsrer
9:	dsdsf,Med_B,fghryhr:	Med_B,++fdfs
10:	fsfs,Med_B,gdghyyrr:	.
11:	ffsfs,tetetyeyy,Med_B:	.
12:	fsfsg,Med_Ahfhfhfhf:	.	
;
Run;

Data Have:

ID ColA ColB
1 aam,Med_A,fsdsd fdfd,Med_B
2 asd,Med_A,dasde Med_A,dgdgd
3 dsd,Med_B,hrte Med_B,dddfd
4 Med_A,dsfsf,rwrr++ dgdgd,Med_A
5 fdfdf,gdfgdg,Med_B sfdfs,Med_A
6 Med_B,fssff,rrtwr, Med_B,gsgsgs
7 ffdfd,Med_A,gdgdgg Med_A,gsgse
8 Med_A,fsfgghe,fsgyt Med_B,gdfsrer
9 dsdsf,Med_B,fghryhr Med_B,++fdfs
10 fsfs,Med_B,gdghyyrr  
11 ffsfs,tetetyeyy,Med_B  
12 fsfsg,Med_Ahfhfhfhf  

 

Data Want:

ID Med_A_Yes Med_B_Yes
1 Med_A Med_B
2 Med_A  
3   Med_B
4 Med_A  
5 Med_A Med_B
6   Med_B
7 Med_A  
8 Med_A Med_B
9   Med_B
10   Med_B
11   Med_B
12 Med_A  
3 REPLIES 3
ballardw
Super User

@newsas007 wrote:

Hi there, 

I have 2 keywords that i need to search for and create new variables. The Keywords Med_A and Med_B are somewhere in free text in ColA and ColB. I need two create 2 new variables Med_A_Yes and Med_B_Yes. Basically these variables say if the ID has Med_A or Med_B.

data have;
   infile datalines dlm=':';
   input ID ColA :$100. ColB :$100.;
 datalines;
1:	aam,Med_A,fsdsd:		fdfd,Med_B
2:	asd,Med_A,dasde:		Med_A,dgdgd
3:	dsd,Med_B,hrte:			Med_B,dddfd
4:	Med_A,dsfsf,rwrr++:		dgdgd,Med_A
5:	fdfdf,gdfgdg,Med_B:		sfdfs,Med_A
6:	Med_B,fssff,rrtwr,:		Med_B,gsgsgs
7:	ffdfd,Med_A,gdgdgg:		Med_A,gsgse
8:	Med_A,fsfgghe,fsgyt:	Med_B,gdfsrer
9:	dsdsf,Med_B,fghryhr:	Med_B,++fdfs
10:	fsfs,Med_B,gdghyyrr:	.
11:	ffsfs,tetetyeyy,Med_B:	.
12:	fsfsg,Med_Ahfhfhfhf:	.	
;
Run;

Data Have:

ID ColA ColB
1 aam,Med_A,fsdsd fdfd,Med_B
2 asd,Med_A,dasde Med_A,dgdgd
3 dsd,Med_B,hrte Med_B,dddfd
4 Med_A,dsfsf,rwrr++ dgdgd,Med_A
5 fdfdf,gdfgdg,Med_B sfdfs,Med_A
6 Med_B,fssff,rrtwr, Med_B,gsgsgs
7 ffdfd,Med_A,gdgdgg Med_A,gsgse
8 Med_A,fsfgghe,fsgyt Med_B,gdfsrer
9 dsdsf,Med_B,fghryhr Med_B,++fdfs
10 fsfs,Med_B,gdghyyrr  
11 ffsfs,tetetyeyy,Med_B  
12 fsfsg,Med_Ahfhfhfhf  

 

Data Want:

ID Med_A_Yes Med_B_Yes
1 Med_A Med_B
2 Med_A  
3   Med_B
4 Med_A  
5 Med_A Med_B
6   Med_B
7 Med_A  
8 Med_A Med_B
9   Med_B
10   Med_B
11   Med_B
12 Med_A  

This creates numeric flags of 1 and 0 for "true", Med_a or Med_b is found in either of the two column variables.

There are some advantages to 1/0 codes instead of text.

 

HOWEVER there are a couple of things that your real data may have that should be considered.

First is capitalization, such as does "MED_A" or "mad_a" match the desired "Med_A"? If so you can add the option 'i' to the Findw function to ignore case.

Trickier, and with an example like "Med_a" which might a medicine, is how exact do you expect the values to be to match. Consider compound words like "applesauce". Would that be match for a desired value of "apple"?

newsas007
Quartz | Level 8

@ballardw :

The numeric flags 0 or 1 should work. The capitalization doesn't matter: med_A or Med_A is still Med_A. 

Finally, yes, the "applesauce" would that be match for a desired value of "apple"

ballardw
Super User

To make the comparison non-case sensitive and match embedded phrases you could use . Look closely, I have changed the case of some of the values and added a compound word. FIND instead of FINDW matches compound words, the 'i' option ignores case.

 

data have;
   infile datalines dlm=':';
   input ID ColA :$100. ColB :$100.;
 datalines;
1:	aam,Med_A,fsdsd:		fdfd,MED_B
2:	asd,Med_A,dasde:		Med_A,dgdgd
3:	dsd,Med_B,hrte:			Med_B,dddfd
4:	med_a,dsfsf,rwrr++:		dgdgd,Med_A
5:	fdfdf,gdfgdg,Med_B:		sfdfs,Med_A
6:	Med_b,fssff,rrtwr,:		Med_B,gsgsgs
7:	ffdfd,Med_A,gdgdgg:		MeD_A,gsgse
8:	Med_A,fsfgghe,fsgyt:	Med_B,gdfsrer
9:	dsdsf,Med_B,fghryhr:	Med_B,++fdfs
10:	fsfs,Med_B,gdghyyrr:	med_b_compound
11:	ffsfs,tetetyeyy,Med_B:	.
12:	fsfsg,Med_Ahfhfhfhf:	.	
;
Run;

data want;
   set have;
   Med_A_Yes = find(catx(',',colA,Colb),'Med_A','i')>0;
   Med_B_Yes = find(catx(',',colA,Colb),'Med_B','i')>0;
run;

 Reasons to use the 1/0 coding for flags: If you SUM the Med_A_Yes variable you get the number of records where it was true or found, the mean would the percentage of records where it is found. 

If your Id was repeated, such as representing a doctor, clinic or something you want to report on as a group then any of Proc Report, tabulate, means or summary with the Sum and Mean give you simple reports of how many A and B by the group overall. Additional statistics like Range, if 0 tell you all the records have the value present; If the MIN is 0 at least one record did not have the Med value, if the MIN is 1 then ALL the records have the Med. If the MAX is 1 at least one record had the med, if the max is 0 then none of the records have the med (within the report group(s)).

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 617 views
  • 0 likes
  • 2 in conversation