Re: SAS programming: Do loop/array help

chriscoarsen · Posted 04-16-2020 02:25 PM

Hello SAS users,

I have a table with many lab variables. A sample dataset is pasted below along with the code I developed. Each lab variables has a character variable type. The whole idea is to truncate lab values and assign separate variables for them. Currently, I have lab variables ranging from lab1- lab40 (I only developed code for lab1- lab4). If I develop similar code for 40 lab variables it would be much lengthier. I thought the code could be simplified further.

Can you please suggest any better program. My output should look like the "final" data set created from proc sql statement and should have ability to query 40 lab variables.

data have;
input id lab1 $ lab2 $lab3 $ lab4 $;
datalines;
1 2   5    10  124
2 25  305  4054  8
3 29  34   40   92
4 301 5004 2015 20
;run;

%macro want (infile= , var= , outfile= );
Data &outfile;
set &infile;
array vars &var.var1-&var.var4;
start = 1;
do i = 1 to 4;
pos = anydigit(&var, start);
if pos then do;
if pos = start then vars{i} = input(substr(&var,start,1),1.);
else vars{i}= input(substr(&var,start,2),2.);
start = pos+1;
end;
end;
drop i start pos;
keep ID &var &var.var1 &var.var2 &var.var3 &var.var4;
run;
%mend;
%want (infile= have, var= lab1,  outfile= want1);
%want (infile= have, var= lab2,  outfile= want2);
%want (infile= have, var= lab3,  outfile= want3);
%want (infile= have, var= lab4,  outfile= want4);

proc sql;
create table final as 
select a.id, b.lab1, b.lab1var1, b.lab1var2, b.lab1var3, b.lab1var4,
	   c.lab2, c.lab2var1, c.lab2var2, c.lab2var3, c.lab2var4,
       d.lab3, d.lab3var1, d.lab3var2, d.lab3var3, d.lab3var4,
       e.lab4, e.lab4var1, e.lab4var2, e.lab4var3, e.lab4var4
 from have a
left join want1 b
on a.id= b.id 
left join want2 c
on a.id= c.id 
left join want3 d
on a.id= d.id 
left join want4 e 
on a.id= e.id;
quit;

Thank you,

Chriscoarsen

PaigeMiller · Posted 04-16-2020 02:52 PM

I'm curious why you want to break down a character string one digit at a time and then store the digits. Seems like extra work, when you could just keep the character strings as character strings, and extract values as needed.

It's also hard to see why you would want the data arranged in the output table as it is. While you are asking how to simplify the creation of such a table, once you have such a table then the coding to make use of it is not simplified, and so requires lots more looping. Perhaps a re-thinking of the layout of your data might make your task easier.

Nevertheless, I think this solves your problem and is easily generalized to more than 4 lab values.

data want;
	set have;
	array lab lab1-lab4;
	do i=1 to dim(lab);
		thislab=lab(i);
		do pos=1 to 4;
			digit=put(substr(lab(i),pos,pos),1.);
			output;
		end;
	end;
	drop lab: i;
run;

--
Paige Miller

chriscoarsen · Posted 04-16-2020 04:47 PM

Hi Miller,
Many thanks for the solution. I would like to break the character string because in these lab values each digit has to be evaluated to see if a patient had a positive or negative attribute. I would like to look at frequencies of these attributes. For example some labs have a negative value which I would like to evaluate.

I just noticed that the sample I provided has only positive values. Here's how the data exactly looks like below. The longest length of both positive and negative values is 4 digits. Meaning, a positive value can be recorded as 4054 (for ID2, LAB3) which should be broken down as (4, 0, 5, 4) or negative value can be recorded as -2-3-4-1 (for ID5, LAB1) which should be broken down as (-2, -3, -4, -1).

data have;
input id lab1 $ lab2 $lab3 $ lab4 $;
datalines;
1 2 5 10 124
2 25 305 4054 8
3 29 34 40 92
4 301 5004 2015 20
5 -2-3-4-1 364 -4-5 -2-2
6 -1-4 -9-4-3 12 -1-1
;run;

I am still fine with the data layout you suggested. But, the program did not work for negative values (even mine did not work I just checked). As you can see, the ID-5 has lab1 value -2-3-4-1 which means the results should show three negative values for that ID (-2, -3, -4,-1). From the code you suggested, its breaking the values as ( -, 2, -, 3).

Thank you!

Tom · Posted 04-16-2020 03:06 PM

Why exactly is that going to be useful?

Do you only want to look at the first 4 characters?

You can have the macro generate one data step to do what your complicated code is doing.

data have;
input id lab1 $ lab2 $lab3 $ lab4 $;
datalines;
1 2   5    10  124
2 25  305  4054  8
3 29  34   40   92
4 301 5004 2015 20
;

%macro want (infile= , varlist= , outfile= );
%local n i var ;
%let n=%sysfunc(countw(&varlist,%str( )));
data &outfile;
  set &infile;
  array in &varlist ;
  array out [&n,4] $1
%do i=1 %to &n;
  %let var=%scan(&varlist,&i,%str( ));
  &var.var1-&var.var4
%end;
  ;
  do i=1 to dim(in);
    do j=1 to 4;
      out[i,j] = char(in[i],j);
    end;
  end;
run;
%mend want;

%want(infile=have,outfile=want,varlist=lab1 lab2 lab3 lab4);
proc print;
run;

mkeintz · Posted 04-16-2020 03:39 PM

If you know that the longest length is 4 digits, you could do this for your sample data set:


data have;
input id lab1 $ lab2 $ lab3 $ lab4 $;
datalines;
1 2   5    10  124
2 25  305  4054  8
3 29  34   40   92
4 301 5004 2015 20
;run;
data final (drop=v c);
  set have;
  array labvars {4} $ lab: ;
  array labvals {4,4} 
     lab1_val1-lab1_val4  lab2_val1-lab2_val4
     lab3_val1-lab3_val4  lab4_val1-lab4_val4 ;
  do v=1 to 4;
    do c=1 to 4;
      labvals{v,c}=input(char(labvars{v},c),1.);
    end;
  end;
run;

Of course, there's a lot more typing if your have 40 LAB variables, so you could use the PROC SQL DICTIONARIES facility to help generate all those names, as in:

data have;
input id lab1 $ lab2 $ lab3 $ lab4 $;
datalines;
1 2   5    10  124
2 25  305  4054  8
3 29  34   40   92
4 301 5004 2015 20
;run;

proc sql noprint;
  select cats('NEED (keep=',name,':)'),cats(name,'_VAL1-',name,'_val4') 
  into :mrg_list separated by ' ',  :newvar_list separated by ' '
  from dictionary.columns
  where libname='WORK' and memname='HAVE' and upcase(name) like 'LAB%';
  %let nvars=&sqlobs;
quit;

%put &=mrg_list;
%put &=newvar_list;

data need (drop=v c);
  set have;
  array labvars {&nvars} $ lab: ;
  array labvals {&nvars,4}  &newvar_list ;
  do v=1 to dim(labvars);
    do c=1 to dim(labvals,2);
      labvals{v,c} = input(char(labvars{v},c),1.);
    end;
  end;
run;

data final;
  merge need (drop=lab:) &mrg_list;
run;

Dataset NEED already has all the data you want. I made the DATA FINAL step just to re-ordered variables names (which in turn is why the MRG_LIST macrovars was needed from the PROC SQL step).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

chriscoarsen · Posted 04-16-2020 05:06 PM

Hi Tom,
Thank you for the code. Your program generates exact output as I need. But I figured that the data also has negative values (my bad, I provided the sample with only positive values). Can this be modified to also check for negative values.

I would like to look at frequencies of these these broken down attributes. The longest length of the lab variables are 4 digits only. Meaning, a positive value can be recorded as 4054 (for ID2, LAB3) which should be broken down as (4, 0, 5, 4) or negative value can be recorded as -2-3-4-1 (for ID5, LAB1) which should be broken down as (-2, -3, -4, -1).

Heres the sample with both positive and negative values.

data have;
input id lab1 $ lab2 $lab3 $ lab4 $;
datalines;
1 2 5 10 124
2 25 305 4054 8
3 29 34 40 92
4 301 5004 2015 20
5 -2-3-4-1 364 -4-5 -2-2
6 -1-4 -9-4-3 12 -1-1
;run;

RichardDeVen · Posted 04-16-2020 03:18 PM

You can split the digits out, outputting one digit per row into a tall data set. Transpose the tall data set into your very wide layout. However, you might want to simply work with the tall data set and use BY, CLASS and WHERE statements as you analyze the results.

* wide - lots of integers;
data have(keep=id lab:);
  do id = 1 to 100;
    length lab1-lab40 $6;
    array labs lab1-lab40;
    call streaminit(123);
    do over labs;
      x = rand('uniform');
      select;
        when (x < .7) labs = cats(floor(rand ('uniform', 1e3)));
        when (x < .8) labs = cats(floor(rand ('uniform', 1e4)));
        when (x < .9) labs = cats(floor(rand ('uniform', 1e5)));
        otherwise     labs = cats(floor(rand ('uniform', 1e6)));
      end;
    end;
    output;
  end;
run;

* create tall data set suitable for transpose;
data want(keep=id lab var result);
  set have;
  array labs lab:;
  do i = 1 to dim(labs);
    lab = vname(labs(i));
    do j = 1 to lengthm(labs(i));
      var = cats('var',j);
      result = input(substr(labs(i),j,1),1.);
      output;
    end;
  end;
  format result 1.;
run;

* really wide - one variable per original digit;
proc transpose data=want out=wide(drop=_name_);
  by id;
  id lab var;
  var result;
run;

ballardw · Posted 04-16-2020 03:40 PM

Just for giggles, what do those individual digits mean? And if they need to be treated individually why and how were they read/entered/created into a single value?

mkeintz · Posted 04-16-2020 03:50 PM

Maybe these are 4-digit codes, where each digit is meant to evaluate a particular attribute. But sometime only the first attribute (or first-2 or first-3) attributes are recorded. An analyst may want frequencies of these attributes of all LABs.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 04-16-2020 04:13 PM

@mkeintz wrote:
Maybe these are 4-digit codes, where each digit is meant to evaluate a particular attribute. But sometime only the first attribute (or first-2 or first-3) attributes are recorded. An analyst may want frequencies of these attributes of all LABs.

I've dealt with such data, actually up to 40 characters worth of two digit codes. But I generally read the values as individual 2 digit, in my case, codes to begin with. The reason was that was the way a paper form was scanned for checked boxes and someone thought they "had" to place all check boxes in a region of the form into a single variable. Bad idea as the maximum length of the variable won't hold all of the values if for some reason most of the boxes were checked..(42 possible values would require 84 characters)

Register Today!

SAS Training: Just a Click Away