Hi SAS Community, I need you help in the following problem.
I have a dataset that looks like this
NDC | PHARM_CLASSES |
21434 | GLP-1 Receptor Agonist [EPC],Glucagon-Like Peptide 1 [Chemical/Ingredient],Glucagon-like Peptide-1 (GLP-1) Agonists [MoA] |
21445 | . |
21975 | Androgen [EPC],Androgen Receptor Agonists [MoA],Androstanes [Chemical/Ingredient] |
21977 | Androgen [EPC],Androgen Receptor Agonists [MoA],Androstanes [Chemical/Ingredient] |
23004 | . |
23227 | Norepinephrine Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA] |
and I am hoping to get an output like this
NDC | EPC | MoA | Chemical Ingredient |
21434 | GLP-1 Receptor Agonist | Glucagon-like Peptide-1 (GLP-1) Agonists | Glucagon-Like Peptide 1 |
21445 | . | . | . |
21975 | Androgen | Androgen Receptor Agonists | Androstanes |
21977 | Androgen | Androgen Receptor Agonists | Androstanes |
23004 | . | . | . |
23227 | Norepinephrine Reuptake Inhibitor | Norepinephrine Uptake Inhibitors | . |
I have got help for the same problem previously but, I get a WARNING: 2761 observations omitted due to missing values in the ID variables due to PROC TRANSPOSE. I don't want to loose any observation. Can anyone help me with this problem.
Thank you in advance
data middle;
set NDCdata(Keep= PHARM_CLASSES NDC);
length term $150 tag $20 ;
do i=1 by 1 until (term=' ');
term=strip(scan(PHARM_CLASSES,i,','));
tag=scan(term,2,'[]');
if term ne ' ' then output;
end;
run;
proc sort data=middle ;
by NDC tag ;
run;
data middle2 ;
set middle ;
by NDC tag ;
rep+1;
if first.tag then rep=1;
run;
proc transpose data=middle2 out=want(drop=_name_) ;
by NDC;
id tag rep ;
var term ;
run;
data cc;
set want (Keep = NDC Chemical_Ingredient1 EPC1 MoA1);
run;
proc sort data = b out = bsort Tagsort nodupkey;
by NDC;
run;
proc sort data = cc out = ccsort Tagsort nodupkey;
by NDC;
run;
data dd ;
merge bsort ccsort ;
by NDC ;
where NDC ^= '.' ;
drop productID;
run
Sure, that's enough to simplify. This is untested, so might need minor tweaking:
data want;
set have;
length EPC MoA Chemical_Ingredient term $ 150 tag $ 20;
do j=1 by 1 until (term=' ');
term = scan(PHARM_CLASSES, j, ',');
if term > ' ' then do;
tag = scan(term, 2, '][');
if upcase(tag) = 'MOA' and MoA=' ' then MoA = scan(term, 1, '][');
else if upcase(tag) = 'EPC' and EPC=' ' then EPC = scan(term, 1, '][');
else if upcase(tag) = 'CHEMICAL/INGREDIENT' and Chemical_Ingredient = ' ' then Chemical_Ingredient = scan(term, 1, '][');
end;
end;
drop j term tag PHARM_CLASSES;
run;
You might want to keep PHARM_CLASSSES at first, until you are confident that the results are correct.
It is possible that the original code you are working with is overkill and could be simplified considerably. The right route depends on how well you know what is in your data. For a single observation, is it possible that there is more than one MoA compound listed? (That question applies to EPC and Chemical/Ingredient as well.) If not, extreme simplifications are in order.
Sure, that's enough to simplify. This is untested, so might need minor tweaking:
data want;
set have;
length EPC MoA Chemical_Ingredient term $ 150 tag $ 20;
do j=1 by 1 until (term=' ');
term = scan(PHARM_CLASSES, j, ',');
if term > ' ' then do;
tag = scan(term, 2, '][');
if upcase(tag) = 'MOA' and MoA=' ' then MoA = scan(term, 1, '][');
else if upcase(tag) = 'EPC' and EPC=' ' then EPC = scan(term, 1, '][');
else if upcase(tag) = 'CHEMICAL/INGREDIENT' and Chemical_Ingredient = ' ' then Chemical_Ingredient = scan(term, 1, '][');
end;
end;
drop j term tag PHARM_CLASSES;
run;
You might want to keep PHARM_CLASSSES at first, until you are confident that the results are correct.
This is a late response, but is a little easier to generalize to several/dozens of variables.
data have;
input NDC PHARM_CLASSES :$&200. ;
datalines;
21434 GLP-1 Receptor Agonist [EPC],Glucagon-Like Peptide 1 [Chemical/Ingredient],Glucagon-like Peptide-1 (GLP-1) Agonists [MoA]
21445 .
21975 Androgen [EPC],Androgen Receptor Agonists [MoA],Androstanes [Chemical/Ingredient]
21977 Androgen [EPC],Androgen Receptor Agonists [MoA],Androstanes [Chemical/Ingredient]
23004 .
23227 Norepinephrine Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]
run;
data want (drop=_:);
set have;
array vars{3} $60 epc chemical_ingredient moa;
length _tag_list $40 ;
retain _tag_list '[EPC] [Chemical/Ingredient] [MoA]';
do _P=1 by 1 while(scan(pharm_classes,_P,',')^='');
_value_tag_pair=scan(pharm_classes,_P,',');
_v=findw(_tag_list,trim(scan(_value_tag_pair,-1,' ')),' ','e');
if _v^=0 then vars{_v}=scan(_value_tag_pair,1,'[');
end;
run;
This works by searching the tag-in-hand against the tag_list, yielding the position of the current tag in the list. This position (_V) is used to assign a value to the appropriate element in character-array VARS.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.