Hello,
I have a sample dataset called 'Have'. In the 'Test_name' column, there are multiple pathogens with either 'IgG' or 'IgM' indicated after the pathogen name. I would like to create a new column called 'Sub_Analyte' that captures each pathogen and assigns the corresponding 'IgG' or 'IgM' suffix, as shown in the desired outcome in the second dataset 'Want'. Could you please advise on the best way to approach this? Thank you.
data Test;
length Test_Name $2000;
infile datalines delimiter='#';
input Test_Name;
datalines;
Encephalitis Antibody Panel, CSF Includes: Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF #
Herpes Simplex Type 1 and Type 2 Glycoprotein G-Specific Antibodies, IgG by CIA#
;
data Want;
length Sub_Analyte $1000;
infile datalines delimiter='#';
input Sub_Analyte;
datalines;
HSV_IgG, LCM_IgG, LCM_IgM, Measles_IgG,IgM, VZV_IgM, WNV_IgG, WNV_IgM#
HSV_IgG#
;
Apologies for any confusion. In the long text from row 1,
'Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF'
I would like to use 'IgG', 'IgM', or 'IgG, IgM' as labels. If a pathogen appears before the label, I will add an underscore between the pathogen and the antibody type to create a new pathogen_antibody
label. For example:
HSV_IgG
.Measles_IgG, IgM
. (BTW, either Meals or Rebeola is fine cause they are in the same group)Ideally, I would like each result to be listed in a separate row. However, if that's not possible, listing them all in one row would be acceptable.
Also, as the text shows, not all pathogen names have abbreviations or are enclosed in parentheses. Using the full names for the results would be more accurate. Additionally, I noticed that 'CSF' is the last word associated with each pathogen. To improve the search, it might be helpful to use 'CSF' as a separator to distinguish the next pathogen.
data Want_Final;
length Sub_Analyte $1000;
infile datalines delimiter='#';
input Sub_Analyte;
datalines;
Herpes Simplex_IgG#
Lymphocytic Choriomeningitis_IgG,IgM#
Measles_IgG,IgM#
Varicella_IgM#
West Nile Virus_IgG,IgM#
Herpes Simplex_IgG#
;
@ybz12003 If you can describe the logic how to identify and combine the desired terms in the text then I'm sure someone can help with a solution. I can't figure out such logic especially for you row two. How do you end-up with HSV_IgG#?
There are too many scenarios you need to take into account of.
And as Patrick pointed out:
1)Why you end up with "HSV_IgG#" , there are NOT '()' in obs.
2)Should not "Measles_IgG" be "Rubeola_IgG" and "Rubeola_IgM"?
data Test;
length Test_Name $2000;
infile datalines delimiter='#';
input Test_Name;
datalines;
Encephalitis Antibody Panel, CSF Includes: Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF #
Herpes Simplex Type 1 and Type 2 Glycoprotein G-Specific Antibodies, IgG by CIA#
;
data temp;
set test;
id+1;
pid=prxparse('/\(.+?\)/');
start=1;end=length(test_name);
call prxnext(pid,start,end,test_name,p,l);
do while(p>0);
temp=compress(substr(test_name,p,l),'()');output;
call prxnext(pid,start,end,test_name,p,l);
end;
keep id temp;
run;
data temp;
set temp;
if not prxmatch('/igg|igm/i',temp) or id ne lag(id) then group+1;
run;
data temp2;
do until(last.group);
set temp;
by id group;
length x $ 200;
x=catx('|',x,temp);
end;
drop temp;
run;
data want;
set temp2;
pre=scan(x,1,'|');
post=scan(x,-1,'|');
do i=1 to countw(post,' ,');
want=cats(pre,'_',scan(post,i,' ,'));
if prxmatch('/igg|igm/i',want) then output;
end;
keep id group want;
run;
Apologies for any confusion. In the long text from row 1,
'Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF'
I would like to use 'IgG', 'IgM', or 'IgG, IgM' as labels. If a pathogen appears before the label, I will add an underscore between the pathogen and the antibody type to create a new pathogen_antibody
label. For example:
HSV_IgG
.Measles_IgG, IgM
. (BTW, either Meals or Rebeola is fine cause they are in the same group)Ideally, I would like each result to be listed in a separate row. However, if that's not possible, listing them all in one row would be acceptable.
Also, as the text shows, not all pathogen names have abbreviations or are enclosed in parentheses. Using the full names for the results would be more accurate. Additionally, I noticed that 'CSF' is the last word associated with each pathogen. To improve the search, it might be helpful to use 'CSF' as a separator to distinguish the next pathogen.
data Want_Final;
length Sub_Analyte $1000;
infile datalines delimiter='#';
input Sub_Analyte;
datalines;
Herpes Simplex_IgG#
Lymphocytic Choriomeningitis_IgG,IgM#
Measles_IgG,IgM#
Varicella_IgM#
West Nile Virus_IgG,IgM#
Herpes Simplex_IgG#
;
As others have pointed out, you never specified what rules you want to execute. So you have left the job of interpreting the objective to us. Some folks (including me) have, after some study, deduced that each entry in the new variable is supposed to be extracted from successive parenthetical expressions.
At least that is what seems to be the case, even though your sample result does not completely honor that rule. That's why respondents have posted the question about row 2, and about MEASLES vs Rubella in line 1.
To help us help you, please be complete in describing the task.
Here is my suggestion, based on capturing consecutive pairs of parenthetical expressions - I call the first expresssion _CONDITION, and the second as _CODES. Then it checks _CODES for "IgM" and "IgG" to decide whether to add content to the new variable:
data Test;
length Test_Name $2000;
infile datalines delimiter='#';
input Test_Name;
datalines;
Encephalitis Antibody Panel, CSF Includes: Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF #
Herpes Simplex Type 1 and Type 2 Glycoprotein G-Specific Antibodies, IgG by CIA#
run;
data want (drop=_: L R); /*L & R for left & right parentheses*/
set test;
length Sub_Analyte $1000;
do L=findc(test_name,'(') by 0 while(L^=0);
R=findc(test_name,')',,L);
if R=0 then leave;
_condition=substr(test_name,L+1,R-L-1);
L=findc(test_name,'(',,R);
R=findc(test_name,')',,L);
if L=0 or R=0 then leave;
_codes=substr(test_name,L+1,R-L-1);
do _txt='IgG','IgM';
if findw(_codes,trim(_txt),', ')>0 then sub_analyte=catx(', ',sub_analyte,cats(_condition,'_',_txt));
end;
L=findc(test_name,'(',,R);
end;
put sub_analyte=;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.