Solved: Re: Create a new 'Sub_Analyte' column?

ybz12003 · Posted 11-22-2024 12:57 PM

Hello,

I have a sample dataset called 'Have'. In the 'Test_name' column, there are multiple pathogens with either 'IgG' or 'IgM' indicated after the pathogen name. I would like to create a new column called 'Sub_Analyte' that captures each pathogen and assigns the corresponding 'IgG' or 'IgM' suffix, as shown in the desired outcome in the second dataset 'Want'. Could you please advise on the best way to approach this? Thank you.

data Test;  
	length Test_Name $2000; 
	infile datalines delimiter='#'; 
	input Test_Name;  
	datalines;                     
	Encephalitis Antibody Panel, CSF Includes: Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF #
	Herpes Simplex Type 1 and Type 2 Glycoprotein G-Specific Antibodies, IgG by CIA#
;   

data Want;  
	length Sub_Analyte $1000; 
	infile datalines delimiter='#'; 
	input Sub_Analyte;  
	datalines;
	HSV_IgG, LCM_IgG, LCM_IgM, Measles_IgG,IgM, VZV_IgM, WNV_IgG, WNV_IgM#
	HSV_IgG#
;

ybz12003 · Posted 11-23-2024 08:32 PM

Apologies for any confusion. In the long text from row 1,

'Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF'

I would like to use 'IgG', 'IgM', or 'IgG, IgM' as labels. If a pathogen appears before the label, I will add an underscore between the pathogen and the antibody type to create a new pathogen_antibody label. For example:

Herpes Simplex Virus (HSV) appears before 'IgG', so the first label will be HSV_IgG.
The second label is 'IgG, IgM' and appears after Measles, so the second label is Measles_IgG, IgM. (BTW, either Meals or Rebeola is fine cause they are in the same group)
Mumps does not have a label associated with it, so it will not be included in the results.

Ideally, I would like each result to be listed in a separate row. However, if that's not possible, listing them all in one row would be acceptable.

Also, as the text shows, not all pathogen names have abbreviations or are enclosed in parentheses. Using the full names for the results would be more accurate. Additionally, I noticed that 'CSF' is the last word associated with each pathogen. To improve the search, it might be helpful to use 'CSF' as a separator to distinguish the next pathogen.

data Want_Final;  
	length Sub_Analyte $1000; 
	infile datalines delimiter='#'; 
	input Sub_Analyte;  
	datalines;
	Herpes Simplex_IgG# 
Lymphocytic Choriomeningitis_IgG,IgM# 
Measles_IgG,IgM# 
Varicella_IgM# 
West Nile Virus_IgG,IgM#
Herpes Simplex_IgG#
;

View solution in original post

Patrick · Posted 11-22-2024 09:18 PM

@ybz12003 If you can describe the logic how to identify and combine the desired terms in the text then I'm sure someone can help with a solution. I can't figure out such logic especially for you row two. How do you end-up with HSV_IgG#?

Ksharp · Posted 11-23-2024 03:01 AM

There are too many scenarios you need to take into account of.

And as Patrick pointed out:

1)Why you end up with "HSV_IgG#" , there are NOT '()' in obs.

2)Should not "Measles_IgG" be "Rubeola_IgG" and "Rubeola_IgM"?

data Test;  
	length Test_Name $2000; 
	infile datalines delimiter='#'; 
	input Test_Name;  
	datalines;                     
	Encephalitis Antibody Panel, CSF Includes: Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF #
	Herpes Simplex Type 1 and Type 2 Glycoprotein G-Specific Antibodies, IgG by CIA#
;   
data temp;
 set test;
 id+1;
 pid=prxparse('/\(.+?\)/');
 start=1;end=length(test_name);
 call prxnext(pid,start,end,test_name,p,l);
 do while(p>0);
   temp=compress(substr(test_name,p,l),'()');output;
   call prxnext(pid,start,end,test_name,p,l);
 end;
 keep id temp;
 run;
 data temp;
  set temp;
  if not prxmatch('/igg|igm/i',temp) or id ne lag(id) then group+1;
run;
data temp2;
do until(last.group);
 set temp;
 by id group;
 length x $ 200;
 x=catx('|',x,temp);
end;
drop temp;
run;
data want;
 set temp2;
 pre=scan(x,1,'|');
 post=scan(x,-1,'|');
 do i=1 to countw(post,' ,');
  want=cats(pre,'_',scan(post,i,' ,'));
  if prxmatch('/igg|igm/i',want) then output;
 end;
keep id group want;
run;

ybz12003 · Posted 11-23-2024 08:32 PM

Apologies for any confusion. In the long text from row 1,

'Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF'

I would like to use 'IgG', 'IgM', or 'IgG, IgM' as labels. If a pathogen appears before the label, I will add an underscore between the pathogen and the antibody type to create a new pathogen_antibody label. For example:

Herpes Simplex Virus (HSV) appears before 'IgG', so the first label will be HSV_IgG.
The second label is 'IgG, IgM' and appears after Measles, so the second label is Measles_IgG, IgM. (BTW, either Meals or Rebeola is fine cause they are in the same group)
Mumps does not have a label associated with it, so it will not be included in the results.

Ideally, I would like each result to be listed in a separate row. However, if that's not possible, listing them all in one row would be acceptable.

Also, as the text shows, not all pathogen names have abbreviations or are enclosed in parentheses. Using the full names for the results would be more accurate. Additionally, I noticed that 'CSF' is the last word associated with each pathogen. To improve the search, it might be helpful to use 'CSF' as a separator to distinguish the next pathogen.

data Want_Final;  
	length Sub_Analyte $1000; 
	infile datalines delimiter='#'; 
	input Sub_Analyte;  
	datalines;
	Herpes Simplex_IgG# 
Lymphocytic Choriomeningitis_IgG,IgM# 
Measles_IgG,IgM# 
Varicella_IgM# 
West Nile Virus_IgG,IgM#
Herpes Simplex_IgG#
;

mkeintz · Posted 11-23-2024 08:48 PM

As others have pointed out, you never specified what rules you want to execute. So you have left the job of interpreting the objective to us. Some folks (including me) have, after some study, deduced that each entry in the new variable is supposed to be extracted from successive parenthetical expressions.

At least that is what seems to be the case, even though your sample result does not completely honor that rule. That's why respondents have posted the question about row 2, and about MEASLES vs Rubella in line 1.

To help us help you, please be complete in describing the task.

Here is my suggestion, based on capturing consecutive pairs of parenthetical expressions - I call the first expresssion _CONDITION, and the second as _CODES. Then it checks _CODES for "IgM" and "IgG" to decide whether to add content to the new variable:

data Test;  
  length Test_Name $2000; 
  infile datalines delimiter='#'; 
  input Test_Name;  
datalines;                     
Encephalitis Antibody Panel, CSF Includes: Herpes Simplex Virus (HSV) 1/2 (IgG) Type Antibody, CSF Lymphocytic Choriomeningitis (LCM) Virus (IgG, IgM) Antibody, IFA, CSF Measles (Rubeola) (IgG, IgM) Antibody Panel, IFA, CSF Mumps Antibody Panel, IFA, CSF Varicella-Zoster Virus (VZV) Antibodies (Total, IgM), ACIF/IFA, CSF West Nile Virus (WNV) Antibodies (IgG, IgM), CSF #
Herpes Simplex Type 1 and Type 2 Glycoprotein G-Specific Antibodies, IgG by CIA#
run;   

data want (drop=_: L R);  /*L & R for left & right parentheses*/
  set test;
  length Sub_Analyte $1000; 

  do L=findc(test_name,'(') by 0 while(L^=0);
    R=findc(test_name,')',,L); 
    if R=0 then leave;
    _condition=substr(test_name,L+1,R-L-1);

    L=findc(test_name,'(',,R);    
    R=findc(test_name,')',,L);    
    if L=0 or R=0 then leave;
    _codes=substr(test_name,L+1,R-L-1);

    do _txt='IgG','IgM';
      if findw(_codes,trim(_txt),', ')>0 then sub_analyte=catx(', ',sub_analyte,cats(_condition,'_',_txt));
    end;
    L=findc(test_name,'(',,R);
  end;
  put sub_analyte=;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Create a new 'Sub_Analyte' column?

Re: Create a new 'Sub_Analyte' column?

Re: Create a new 'Sub_Analyte' column?

Re: Create a new 'Sub_Analyte' column?

Re: Create a new 'Sub_Analyte' column?

Re: Create a new 'Sub_Analyte' column?

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away