Hello SAS users,
I have a table with many lab variables. A sample dataset is pasted below along with the code I developed. Each lab variables has a character variable type. The whole idea is to truncate lab values and assign separate variables for them. Currently, I have lab variables ranging from lab1- lab40 (I only developed code for lab1- lab4). If I develop similar code for 40 lab variables it would be much lengthier. I thought the code could be simplified further.
Can you please suggest any better program. My output should look like the "final" data set created from proc sql statement and should have ability to query 40 lab variables.
data have;
input id lab1 $ lab2 $lab3 $ lab4 $;
datalines;
1 2 5 10 124
2 25 305 4054 8
3 29 34 40 92
4 301 5004 2015 20
;run;
%macro want (infile= , var= , outfile= );
Data &outfile;
set &infile;
array vars &var.var1-&var.var4;
start = 1;
do i = 1 to 4;
pos = anydigit(&var, start);
if pos then do;
if pos = start then vars{i} = input(substr(&var,start,1),1.);
else vars{i}= input(substr(&var,start,2),2.);
start = pos+1;
end;
end;
drop i start pos;
keep ID &var &var.var1 &var.var2 &var.var3 &var.var4;
run;
%mend;
%want (infile= have, var= lab1, outfile= want1);
%want (infile= have, var= lab2, outfile= want2);
%want (infile= have, var= lab3, outfile= want3);
%want (infile= have, var= lab4, outfile= want4);
proc sql;
create table final as
select a.id, b.lab1, b.lab1var1, b.lab1var2, b.lab1var3, b.lab1var4,
c.lab2, c.lab2var1, c.lab2var2, c.lab2var3, c.lab2var4,
d.lab3, d.lab3var1, d.lab3var2, d.lab3var3, d.lab3var4,
e.lab4, e.lab4var1, e.lab4var2, e.lab4var3, e.lab4var4
from have a
left join want1 b
on a.id= b.id
left join want2 c
on a.id= c.id
left join want3 d
on a.id= d.id
left join want4 e
on a.id= e.id;
quit;
Thank you,
Chriscoarsen
I'm curious why you want to break down a character string one digit at a time and then store the digits. Seems like extra work, when you could just keep the character strings as character strings, and extract values as needed.
It's also hard to see why you would want the data arranged in the output table as it is. While you are asking how to simplify the creation of such a table, once you have such a table then the coding to make use of it is not simplified, and so requires lots more looping. Perhaps a re-thinking of the layout of your data might make your task easier.
Nevertheless, I think this solves your problem and is easily generalized to more than 4 lab values.
data want;
set have;
array lab lab1-lab4;
do i=1 to dim(lab);
thislab=lab(i);
do pos=1 to 4;
digit=put(substr(lab(i),pos,pos),1.);
output;
end;
end;
drop lab: i;
run;
Hi Miller,
Many thanks for the solution. I would like to break the character string because in these lab values each digit has to be evaluated to see if a patient had a positive or negative attribute. I would like to look at frequencies of these attributes. For example some labs have a negative value which I would like to evaluate.
I just noticed that the sample I provided has only positive values. Here's how the data exactly looks like below. The longest length of both positive and negative values is 4 digits. Meaning, a positive value can be recorded as 4054 (for ID2, LAB3) which should be broken down as (4, 0, 5, 4) or negative value can be recorded as -2-3-4-1 (for ID5, LAB1) which should be broken down as (-2, -3, -4, -1).
data have;
input id lab1 $ lab2 $lab3 $ lab4 $;
datalines;
1 2 5 10 124
2 25 305 4054 8
3 29 34 40 92
4 301 5004 2015 20
5 -2-3-4-1 364 -4-5 -2-2
6 -1-4 -9-4-3 12 -1-1
;run;
I am still fine with the data layout you suggested. But, the program did not work for negative values (even mine did not work I just checked). As you can see, the ID-5 has lab1 value -2-3-4-1 which means the results should show three negative values for that ID (-2, -3, -4,-1). From the code you suggested, its breaking the values as ( -, 2, -, 3).
Thank you!
Why exactly is that going to be useful?
Do you only want to look at the first 4 characters?
You can have the macro generate one data step to do what your complicated code is doing.
data have;
input id lab1 $ lab2 $lab3 $ lab4 $;
datalines;
1 2 5 10 124
2 25 305 4054 8
3 29 34 40 92
4 301 5004 2015 20
;
%macro want (infile= , varlist= , outfile= );
%local n i var ;
%let n=%sysfunc(countw(&varlist,%str( )));
data &outfile;
set &infile;
array in &varlist ;
array out [&n,4] $1
%do i=1 %to &n;
%let var=%scan(&varlist,&i,%str( ));
&var.var1-&var.var4
%end;
;
do i=1 to dim(in);
do j=1 to 4;
out[i,j] = char(in[i],j);
end;
end;
run;
%mend want;
%want(infile=have,outfile=want,varlist=lab1 lab2 lab3 lab4);
proc print;
run;
If you know that the longest length is 4 digits, you could do this for your sample data set:
data have;
input id lab1 $ lab2 $ lab3 $ lab4 $;
datalines;
1 2 5 10 124
2 25 305 4054 8
3 29 34 40 92
4 301 5004 2015 20
;run;
data final (drop=v c);
set have;
array labvars {4} $ lab: ;
array labvals {4,4}
lab1_val1-lab1_val4 lab2_val1-lab2_val4
lab3_val1-lab3_val4 lab4_val1-lab4_val4 ;
do v=1 to 4;
do c=1 to 4;
labvals{v,c}=input(char(labvars{v},c),1.);
end;
end;
run;
Of course, there's a lot more typing if your have 40 LAB variables, so you could use the PROC SQL DICTIONARIES facility to help generate all those names, as in:
data have;
input id lab1 $ lab2 $ lab3 $ lab4 $;
datalines;
1 2 5 10 124
2 25 305 4054 8
3 29 34 40 92
4 301 5004 2015 20
;run;
proc sql noprint;
select cats('NEED (keep=',name,':)'),cats(name,'_VAL1-',name,'_val4')
into :mrg_list separated by ' ', :newvar_list separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE' and upcase(name) like 'LAB%';
%let nvars=&sqlobs;
quit;
%put &=mrg_list;
%put &=newvar_list;
data need (drop=v c);
set have;
array labvars {&nvars} $ lab: ;
array labvals {&nvars,4} &newvar_list ;
do v=1 to dim(labvars);
do c=1 to dim(labvals,2);
labvals{v,c} = input(char(labvars{v},c),1.);
end;
end;
run;
data final;
merge need (drop=lab:) &mrg_list;
run;
Dataset NEED already has all the data you want. I made the DATA FINAL step just to re-ordered variables names (which in turn is why the MRG_LIST macrovars was needed from the PROC SQL step).
You can split the digits out, outputting one digit per row into a tall data set. Transpose the tall data set into your very wide layout. However, you might want to simply work with the tall data set and use BY, CLASS and WHERE statements as you analyze the results.
* wide - lots of integers; data have(keep=id lab:); do id = 1 to 100; length lab1-lab40 $6; array labs lab1-lab40; call streaminit(123); do over labs; x = rand('uniform'); select; when (x < .7) labs = cats(floor(rand ('uniform', 1e3))); when (x < .8) labs = cats(floor(rand ('uniform', 1e4))); when (x < .9) labs = cats(floor(rand ('uniform', 1e5))); otherwise labs = cats(floor(rand ('uniform', 1e6))); end; end; output; end; run; * create tall data set suitable for transpose; data want(keep=id lab var result); set have; array labs lab:; do i = 1 to dim(labs); lab = vname(labs(i)); do j = 1 to lengthm(labs(i)); var = cats('var',j); result = input(substr(labs(i),j,1),1.); output; end; end; format result 1.; run; * really wide - one variable per original digit; proc transpose data=want out=wide(drop=_name_); by id; id lab var; var result; run;
Just for giggles, what do those individual digits mean? And if they need to be treated individually why and how were they read/entered/created into a single value?
@mkeintz wrote:
Maybe these are 4-digit codes, where each digit is meant to evaluate a particular attribute. But sometime only the first attribute (or first-2 or first-3) attributes are recorded. An analyst may want frequencies of these attributes of all LABs.
I've dealt with such data, actually up to 40 characters worth of two digit codes. But I generally read the values as individual 2 digit, in my case, codes to begin with. The reason was that was the way a paper form was scanned for checked boxes and someone thought they "had" to place all check boxes in a region of the form into a single variable. Bad idea as the maximum length of the variable won't hold all of the values if for some reason most of the boxes were checked..(42 possible values would require 84 characters)
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.