- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi there,
I have 2 keywords that i need to search for and create new variables. The Keywords Med_A and Med_B are somewhere in free text in ColA and ColB. I need two create 2 new variables Med_A_Yes and Med_B_Yes. Basically these variables say if the ID has Med_A or Med_B.
data have;
infile datalines dlm=':';
input ID ColA :$100. ColB :$100.;
datalines;
1: aam,Med_A,fsdsd: fdfd,Med_B
2: asd,Med_A,dasde: Med_A,dgdgd
3: dsd,Med_B,hrte: Med_B,dddfd
4: Med_A,dsfsf,rwrr++: dgdgd,Med_A
5: fdfdf,gdfgdg,Med_B: sfdfs,Med_A
6: Med_B,fssff,rrtwr,: Med_B,gsgsgs
7: ffdfd,Med_A,gdgdgg: Med_A,gsgse
8: Med_A,fsfgghe,fsgyt: Med_B,gdfsrer
9: dsdsf,Med_B,fghryhr: Med_B,++fdfs
10: fsfs,Med_B,gdghyyrr: .
11: ffsfs,tetetyeyy,Med_B: .
12: fsfsg,Med_Ahfhfhfhf: .
;
Run;
Data Have:
ID | ColA | ColB |
1 | aam,Med_A,fsdsd | fdfd,Med_B |
2 | asd,Med_A,dasde | Med_A,dgdgd |
3 | dsd,Med_B,hrte | Med_B,dddfd |
4 | Med_A,dsfsf,rwrr++ | dgdgd,Med_A |
5 | fdfdf,gdfgdg,Med_B | sfdfs,Med_A |
6 | Med_B,fssff,rrtwr, | Med_B,gsgsgs |
7 | ffdfd,Med_A,gdgdgg | Med_A,gsgse |
8 | Med_A,fsfgghe,fsgyt | Med_B,gdfsrer |
9 | dsdsf,Med_B,fghryhr | Med_B,++fdfs |
10 | fsfs,Med_B,gdghyyrr | |
11 | ffsfs,tetetyeyy,Med_B | |
12 | fsfsg,Med_Ahfhfhfhf |
Data Want:
ID | Med_A_Yes | Med_B_Yes |
1 | Med_A | Med_B |
2 | Med_A | |
3 | Med_B | |
4 | Med_A | |
5 | Med_A | Med_B |
6 | Med_B | |
7 | Med_A | |
8 | Med_A | Med_B |
9 | Med_B | |
10 | Med_B | |
11 | Med_B | |
12 | Med_A |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@newsas007 wrote:
Hi there,
I have 2 keywords that i need to search for and create new variables. The Keywords Med_A and Med_B are somewhere in free text in ColA and ColB. I need two create 2 new variables Med_A_Yes and Med_B_Yes. Basically these variables say if the ID has Med_A or Med_B.
data have; infile datalines dlm=':'; input ID ColA :$100. ColB :$100.; datalines; 1: aam,Med_A,fsdsd: fdfd,Med_B 2: asd,Med_A,dasde: Med_A,dgdgd 3: dsd,Med_B,hrte: Med_B,dddfd 4: Med_A,dsfsf,rwrr++: dgdgd,Med_A 5: fdfdf,gdfgdg,Med_B: sfdfs,Med_A 6: Med_B,fssff,rrtwr,: Med_B,gsgsgs 7: ffdfd,Med_A,gdgdgg: Med_A,gsgse 8: Med_A,fsfgghe,fsgyt: Med_B,gdfsrer 9: dsdsf,Med_B,fghryhr: Med_B,++fdfs 10: fsfs,Med_B,gdghyyrr: . 11: ffsfs,tetetyeyy,Med_B: . 12: fsfsg,Med_Ahfhfhfhf: . ; Run;
Data Have:
ID ColA ColB 1 aam,Med_A,fsdsd fdfd,Med_B 2 asd,Med_A,dasde Med_A,dgdgd 3 dsd,Med_B,hrte Med_B,dddfd 4 Med_A,dsfsf,rwrr++ dgdgd,Med_A 5 fdfdf,gdfgdg,Med_B sfdfs,Med_A 6 Med_B,fssff,rrtwr, Med_B,gsgsgs 7 ffdfd,Med_A,gdgdgg Med_A,gsgse 8 Med_A,fsfgghe,fsgyt Med_B,gdfsrer 9 dsdsf,Med_B,fghryhr Med_B,++fdfs 10 fsfs,Med_B,gdghyyrr 11 ffsfs,tetetyeyy,Med_B 12 fsfsg,Med_Ahfhfhfhf
Data Want:
ID Med_A_Yes Med_B_Yes 1 Med_A Med_B 2 Med_A 3 Med_B 4 Med_A 5 Med_A Med_B 6 Med_B 7 Med_A 8 Med_A Med_B 9 Med_B 10 Med_B 11 Med_B 12 Med_A
This creates numeric flags of 1 and 0 for "true", Med_a or Med_b is found in either of the two column variables.
There are some advantages to 1/0 codes instead of text.
HOWEVER there are a couple of things that your real data may have that should be considered.
First is capitalization, such as does "MED_A" or "mad_a" match the desired "Med_A"? If so you can add the option 'i' to the Findw function to ignore case.
Trickier, and with an example like "Med_a" which might a medicine, is how exact do you expect the values to be to match. Consider compound words like "applesauce". Would that be match for a desired value of "apple"?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The numeric flags 0 or 1 should work. The capitalization doesn't matter: med_A or Med_A is still Med_A.
Finally, yes, the "applesauce" would that be match for a desired value of "apple"
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To make the comparison non-case sensitive and match embedded phrases you could use . Look closely, I have changed the case of some of the values and added a compound word. FIND instead of FINDW matches compound words, the 'i' option ignores case.
data have; infile datalines dlm=':'; input ID ColA :$100. ColB :$100.; datalines; 1: aam,Med_A,fsdsd: fdfd,MED_B 2: asd,Med_A,dasde: Med_A,dgdgd 3: dsd,Med_B,hrte: Med_B,dddfd 4: med_a,dsfsf,rwrr++: dgdgd,Med_A 5: fdfdf,gdfgdg,Med_B: sfdfs,Med_A 6: Med_b,fssff,rrtwr,: Med_B,gsgsgs 7: ffdfd,Med_A,gdgdgg: MeD_A,gsgse 8: Med_A,fsfgghe,fsgyt: Med_B,gdfsrer 9: dsdsf,Med_B,fghryhr: Med_B,++fdfs 10: fsfs,Med_B,gdghyyrr: med_b_compound 11: ffsfs,tetetyeyy,Med_B: . 12: fsfsg,Med_Ahfhfhfhf: . ; Run; data want; set have; Med_A_Yes = find(catx(',',colA,Colb),'Med_A','i')>0; Med_B_Yes = find(catx(',',colA,Colb),'Med_B','i')>0; run;
Reasons to use the 1/0 coding for flags: If you SUM the Med_A_Yes variable you get the number of records where it was true or found, the mean would the percentage of records where it is found.
If your Id was repeated, such as representing a doctor, clinic or something you want to report on as a group then any of Proc Report, tabulate, means or summary with the Sum and Mean give you simple reports of how many A and B by the group overall. Additional statistics like Range, if 0 tell you all the records have the value present; If the MIN is 0 at least one record did not have the Med value, if the MIN is 1 then ALL the records have the Med. If the MAX is 1 at least one record had the med, if the max is 0 then none of the records have the med (within the report group(s)).