Hi,
I have two data sets. The first data set (named 1A) includes a code associated with key words.
The second data set (1B) includes patients' ID with a diagnosis description. See below :
Dataset 1A:
CODE | YES | NO |
q123 | "neck pain" "head pain" | "No" "Not" "No sign of" "Not known for" |
q456 | "back pain", "lombar pain" "lower back pain" | "No" "Not" "No sign of" "Not known for" |
q789 | "Knee pain" "Hip pain" "Leg pain" | "No" "Not" "No sign of" "Not known for" |
Dataset 1B:
ID | DESCRIPTION |
1 | Neck pain |
2 | No sign of lombar pain. Normal exam. |
3 | Lower back pain |
4 | Not known for knee pain |
I am looking for a way to tell SAS that "If the word from the "DESCRIPTION" variable in the 1B data set matches one of the word in the "YES" variable AND that words from the "NO" variable from the 1A data set are not found in the 1B data set, assign the matching code".
To illustrate by thoughts, here is a table of what I am looking for :
ID | DESCRIPTION | CODE |
1 | Neck pain | q123 |
2 | No sign of lombar pain. Normal exam. | |
3 | Lower back pain | q456 |
4 | Not known for knee pain |
Hope it's clear enough. Thanks in advance for your help.
Cheers
Hi,
I would try following code, maybe it is not the most sophisticated but it seems to solve your problem.
all the best
Bart
data codes;
infile cards4 dlm = "#" dsd;
input CODE : $ 10. YES : $ 100. NO : $ 100.;
cards4;
q123#"neck pain" "head pain"#"No" "Not" "No sign of" "Not known for"
q456#"back pain" "lombar pain" "lower back pain"#"No" "Not" "No sign of" "Not known for"
q789#"Knee pain" "Hip pain" "Leg pain"#"No" "Not" "No sign of" "Not known for"
;;;;
run;
data subject;
infile cards4 dlm = "#";
input ID DESCRIPTION : $ 50.;
cards4;
1#Neck pain
2#No sign of lombar pain. Normal exam.
3#Lower back pain
4#Not known for knee pain
;;;;
run;
data want(keep = ID DESCRIPTION CD rename=(CD=CODE));
/* for each observation from SUBJECT... */
set subject; length CD $ 100;
/* ... you are looping through all observations from CODES */
do point=1 to nobs;
set codes nobs=nobs point=point;
drop yes no;
put _N_= point=;
/* then you are parsing the YES list of "pains" for a given code
and traverse through that list
*/
i=1; length Y $ 50;
Y = dequote(scan(YES, i, " ", "q"));
do while (Y ne "");
put i= Y=;
if FIND(DESCRIPTION, Y , 'i') then do; /* if you find "pain" in Description ...*/
NEGATIVE=0; /* ... you look for posible negations in NO list */
j=1; length N $ 50;
N = dequote(scan(NO, j, " ", "q"));
do while (N ne "");
if FIND(DESCRIPTION, N , 'i') then NEGATIVE=1;
put j= N= NEGATIVE=;
j=j+1;
N = dequote(scan(NO, j, " ", "q"));
end;
if not NEGATIVE then CD = catx(" ", CD, CODE); /* if there is no negation you adding a code to the subject */
end;
i=i+1;
y = dequote(scan(yes, i, " ", "q"));
end;
end;
output;
run;
Yes, but it doesn't really make sense going that way. For instance,
Neck pain would not be the same as neck pain, pain in neck, neck ache etc. All of which would likely code to the same code.
What you need to do is some sort of coding exercise as you have not put the rigors into the data capture part (i.e. used free text). To this end you can do some text matching as you present here, but it would need someone to thoroughly look at this to ensure things are coded correctly. For instance if the text is "cervical stenosis", should this be classified as Neck pain? Its not just a matter of pattern matching.
As for the how, well thats simple, use the code list and call execute, something like;
data _null_; set dataset1a end=last; if _n_=1 then call execute('data want; set dataset1b;'); call execute(cat('if index(',upcase(yes),',upcase(description) and not(index(',upcase(no),',upcase(description)) then code="',code,'";'));
if last then call execute(';run;');
run;
That will generate an if for each row in dataset1a. It won't work of course, as you have multiple parts to each item, but demonstrates the method.
Hi,
I would try following code, maybe it is not the most sophisticated but it seems to solve your problem.
all the best
Bart
data codes;
infile cards4 dlm = "#" dsd;
input CODE : $ 10. YES : $ 100. NO : $ 100.;
cards4;
q123#"neck pain" "head pain"#"No" "Not" "No sign of" "Not known for"
q456#"back pain" "lombar pain" "lower back pain"#"No" "Not" "No sign of" "Not known for"
q789#"Knee pain" "Hip pain" "Leg pain"#"No" "Not" "No sign of" "Not known for"
;;;;
run;
data subject;
infile cards4 dlm = "#";
input ID DESCRIPTION : $ 50.;
cards4;
1#Neck pain
2#No sign of lombar pain. Normal exam.
3#Lower back pain
4#Not known for knee pain
;;;;
run;
data want(keep = ID DESCRIPTION CD rename=(CD=CODE));
/* for each observation from SUBJECT... */
set subject; length CD $ 100;
/* ... you are looping through all observations from CODES */
do point=1 to nobs;
set codes nobs=nobs point=point;
drop yes no;
put _N_= point=;
/* then you are parsing the YES list of "pains" for a given code
and traverse through that list
*/
i=1; length Y $ 50;
Y = dequote(scan(YES, i, " ", "q"));
do while (Y ne "");
put i= Y=;
if FIND(DESCRIPTION, Y , 'i') then do; /* if you find "pain" in Description ...*/
NEGATIVE=0; /* ... you look for posible negations in NO list */
j=1; length N $ 50;
N = dequote(scan(NO, j, " ", "q"));
do while (N ne "");
if FIND(DESCRIPTION, N , 'i') then NEGATIVE=1;
put j= N= NEGATIVE=;
j=j+1;
N = dequote(scan(NO, j, " ", "q"));
end;
if not NEGATIVE then CD = catx(" ", CD, CODE); /* if there is no negation you adding a code to the subject */
end;
i=i+1;
y = dequote(scan(yes, i, " ", "q"));
end;
end;
output;
run;
That works perfectly! Thanks Yabwon.
Cheers
One more version which is a little bit more I/O efficient,
Bart
data _null_;
if 0 then set codes nobs=nobs;
call symputx('_NOBS_', nobs, "G");
stop;
run;
data want2(keep = ID DESCRIPTION CD rename=(CD=CODE));
/* load data into memory in temporary arrays */
array _CODE[&_NOBS_.] $ 10 _temporary_;
array _YES[&_NOBS_.] $ 100 _temporary_;
array _NO[&_NOBS_.] $ 100 _temporary_;
do until(eof);
set codes end=eof curobs=curobs;
_CODE[curobs] = CODE;
_YES[curobs] = YES;
_NO[curobs] = NO;
end;
/* loop to get each observation from SUBJECT and... */
do until(_EOF_);
set subject end=_EOF_ curobs=_N_; length CD $ 100;
CD = "";
/* ... you are looping through all observations from CODES,
but now you use data stored in arrays instead of dataset on disk */
do point = 1 to &_NOBS_.;
CODE = _CODE[point];
YES = _YES[point];
NO = _NO[point];
drop yes no;
put _N_= point=;
/* then you are parsing the YES list of "pains" for a given code
and traverse through that list
*/
i=1; length Y $ 50;
Y = dequote(scan(YES, i, " ", "q"));
do while (Y ne "");
put i= Y=;
if FIND(DESCRIPTION, Y , 'i') then do; /* if you find "pain" in Description ...*/
NEGATIVE=0; /* ... you look for posible negations in NO list */
j=1; length N $ 50;
N = dequote(scan(NO, j, " ", "q"));
do while (N ne "");
if FIND(DESCRIPTION, N , 'i') then NEGATIVE=1;
put j= N= NEGATIVE=;
j=j+1;
N = dequote(scan(NO, j, " ", "q"));
end;
if not NEGATIVE then CD = catx(" ", CD, CODE); /* if there is no negation you adding a code to the subject */
end;
i=i+1;
y = dequote(scan(yes, i, " ", "q"));
end;
end;
output;
end;
stop;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.