Creating a large amount of new variables

Reply
Contributor
Posts: 49

Creating a large amount of new variables

[ Edited ]

I am looking to create a large amount of new variables for my dataset. In the dataset each observation has up to 15 diagnostic codes (dx1-dx15). I want to scan through dx1-dx15 and create a new yes/no variable for a couple hundred specific ICD-9 codes. I am attempting to use an array and do loop but it doesn't appear to be working. Any sugestions to get the code to work or better methods?

 

This was the tutorial I followed http://support.sas.com/resources/papers/proceedings10/158-2010.pdf

 

data all;
	set year;
	array DX_freq {294} (033.9 038.0 041.4 042 057.9 079.99 112.0 151.0 152.9 196.2 197.6 215.4 216.1 216.3 216.5 216.6 216.7 228.00 228.01 228.04 228.09 228.1 
237.70 238.9 250.00 250.11 270.6 272.4 275.2 275.3 275.41 276.5 278.01 284.8 285.1 285.29 285.9 287.0 287.5 288.01 288.63 296.33 299.80 300.4 303.00 305.02 
305.1 305.50 323.9 342.90 345.90 348.39 365.89 365.9 372.72 375.55 378.9 379.93 382.9 401.9 425.4 427.1 427.89 428.0 432.1 436 438.20 448.1 453.8 458.9 465.9 
466.11 466.19 473.0 478.1 478.22 478.24 486 493.02 493.90 493.92 518.82 522.5 523.8 524.12 528.4 529.9 530.81 536.41 553.1 562.11 571.1 571.8 572.3 579.8 585.9 
599.0 599.7 603.9 605 607.89 608.2 608.3 623.8 648.91 654.21 682.0 682.2 682.6 682.8 685.1 686.9 691.0 706.1 706.2 707.14 707.15 709.00 709.09 709.9 719.65 
724.2 729.89 736.72 742.2 742.3 742.4 742.8 743.65 743.8 744.1 744.46 744.9 745.4 745.5 747.0 747.5 747.64 747.81 747.83 748.0 750.0 750.15 750.19 750.26 750.5 
751.5 751.69 752.40 752.51 752.61 752.62 752.63 752.69 753.0 753.12 753.29 754.1 754.2 754.30 754.31 754.51 754.60 754.70 754.79 754.89 755.01 755.13 755.66 
756.0 757.32 757.33 757.39 757.6 758.0 759.6 759.89 759.9 760.0 760.2 760.79 760.8 761.1 761.3 761.5 761.7 762.5 762.6 763.0 763.1 763.3 763.82 763.83 764.08 
764.97 764.98 765.03 765.17 765.18 765.19 765.26 765.27 765.28 765.29 766.0 766.1 766.2 766.21 767.1 767.19 767.2 767.3 767.6 768.9 769 770.12 770.2 770.6 
770.7 770.8 770.81 770.82 770.83 770.88 770.89 771.4 771.6 771.8 771.81 771.89 772.6 772.8 773.0 773.1 773.2 774.2 774.39 774.6 775.0 775.5 775.6 776.1 776.4 
776.5 777.1 777.8 778.3 778.4 778.6 778.8 779.0 779.3 779.5 779.8 779.81 779.82 779.84 779.89 780.39 780.6 781.0 781.3 782.0 782.1 783.40 783.41 784.0 784.2 
784.5 784.69 785.1 785.2 785.59 786.52 786.9 787.91 788.20 788.8 790.4 790.99 794.15 794.31 796.1 796.2 799.02 824.8 844.2 852.21 890.0 910.0 921.0 965.4 980.0 
998.3);
	do i=1 to 294;
		DX_freq[i]=dx1=DX_freq[i] or dx2=DX_freq[i] or dx3=DX_freq[i] or dx4=DX_freq[i] or dx5=DX_freq[i] or dx6=DX_freq[i] or dx7=DX_freq[i] 
	or dx8=DX_freq[i] or dx9=DX_freq[i] or dx10=DX_freq[i] or dx11=DX_freq[i] or dx12=DX_freq[i] or dx13=DX_freq[i] or dx14=DX_freq[i] or dx15=DX_freq[i];
end;
data all;
	set all;
	array DX_freq_char {51} [$] (E879.8 E884.4 E884.5 E885.9 E928.9 E950.9 V01.1 V01.6 V01.89 V02.51 V03.81 V05.3 V05.8 V05.9 V06.3 V08 V15.01 V15.81 V15.82 V17.4 
V17.49 V17.5 V18.0 V18.1 V18.8 V19.8 V20.1 V20.2 V27.0 V29.0 V29.2 V29.3 V29.8 V30.00 V30.01 V30.1 V31.00 V31.01 V33.00 V45.89 V50.2 V58.69 V62.84 V64.05 V64.1 
V64.3 V65.2 V70.7 V71.81 V72.1 V72.19);
	do i=1 to 51;
		DX_freq_char[i]=dx1=DX_freq_char[i] or dx2=DX_freq_char[i] or dx3=DX_freq_char[i] or dx4=DX_freq_char[i] or dx5=DX_freq_char[i] or dx6=DX_freq_char[i] or dx7=DX_freq_char[i] 
		or dx8=DX_freq_char[i] or dx9=DX_freq_char[i] or dx10=DX_freq_char[i] or dx11=DX_freq_char[i] or dx12=DX_freq_char[i] or dx13=DX_freq_char[i] or dx14=DX_freq_char[i] or dx15=DX_freq_char[i];
end;
Super User
Super User
Posts: 7,997

Re: Creating a large amount of new variables

Posted in reply to michelconn

I would suggest that you would find it easier to use a long dataset rather than a wide dataset.  I you post some test data (just a few observations) (in a datastep), and what the output should look like then it would be easier to work with.  

Contributor
Posts: 49

Re: Creating a large amount of new variables

data quant.year2;
	set year (obs=100);
	dx_freq_example=dx1=774.6 or dx2=774.6 or dx3=774.6 or dx4=774.6 or dx5=774.6 or dx6=774.6 or dx7=774.6 
	or dx8=774.6 or dx9=774.6 or dx10=774.6 or dx11=774.6 or dx12=774.6 or dx13=774.6 or dx14=774.6 or dx15=774.6;
run;

Here is an example code of what I would like to do. But I want to avoid creating 300+ line of code for each ICD-9 code. I was hoping an array and do loop would be less typing.

Super User
Posts: 19,877

Re: Creating a large amount of new variables

Posted in reply to michelconn

Why do you think this structure will be useful? What are you trying to calculate overall?

Contributor
Posts: 49

Re: Creating a large amount of new variables

I am taking a genome wide association approach to finding comorbidities for a specific disease. I plan on running all variables into a logistic regression and applying an alpha correction of 0.05/345.
Super User
Posts: 5,518

Re: Creating a large amount of new variables

Posted in reply to michelconn

Here are a few issues to settle, in order to proceed.

 

Your existing variables dx1-dx15 ... are they numeric or character?  I would guess character, but that's an important issue so I wanted to ask.

 

Are you trying to create 289 new variables (Y/N:  was a specific diagnosis code found in any of the 15 variables), or are you trying to create 15 x 289 new variables (Y/N:  was a specific diagnosis code found in dx1, ... Y/N:  was a specific diagnosis code found in dx15).  Or perhaps you are only searching for 1 new variable:  did any of the diagnoses (dx1-dx15) match any of the 289 diagnosis codes?

 

Would you be open to putting the 289 diagnosis codes into a data set, or would you prefer to hard-code their values in your program?

 

There are many ways to skin this cat, but your answers will help direct us down a smooth pathway.

Contributor
Posts: 49

Re: Creating a large amount of new variables

[ Edited ]
Posted in reply to Astounding

dx1-dx15 type is character but they have a mix of numeric and character data points (ex. 747.7 and V30.0).

 

Yes "create 289 new variables (Y/N:  was a specific diagnosis code found in any of the 15 variables)"

 

Some of my original number are off but there are 294 numeric variables (ex. 747.7) and 51 character variables (ex. V30.0) that I want to create a new yes/no variable for.

 

Esssentially want I want to do is the code below but avoid typing the same code 345 times for each ICD-9 code. 

dx_freq_example= dx1=774.6 or dx2=774.6 or dx3=774.6 or dx4=774.6 or dx5=774.6 or dx6=774.6 or dx7=774.6
    or dx8=774.6 or dx9=774.6 or dx10=774.6 or dx11=774.6 or dx12=774.6 or dx13=774.6 or dx14=774.6 or dx15=774.6;

 

Trusted Advisor
Posts: 1,118

Re: Creating a large amount of new variables

Posted in reply to michelconn

The long sequence of conditions could be avoided by using the IN operator:

data test;
array dx[15] (101:114 774.6); /* dummy values 101, ..., 114 added just for example */
icd=774.6;
dx_freq_example=(icd in dx); /* =1, because 774.6 is contained in {101, 102, ..., 114, 774.6} */
run;
Super User
Posts: 19,877

Re: Creating a large amount of new variables

Posted in reply to michelconn

You're going to have to tweak this to meet your requirements but I think it's a starting point of a way to do this, using GLMSELECT instead of array/loops.  You have to be careful of the names, especially when you're using the diagnosis as the names.  The first step is to transpose the data so dx is in one column and then use the design matrix created by PROC GLMSELECT.  

 

data have;
array dx(15) dx1-dx15;
do k=1 to 100;
do i=1 to 15;
dx(i)=round(rand('normal', 500, 100), 0.1);
end; 
output;
end;
run;

proc transpose data=have out=temp;
by k;
var dx1-dx15;
run;


proc glmselect data=temp namelen=50
               OUTDESIGN(ADDINPUTVARS FULLMODEL)=work.DesignVariables;
class        col1/ param=reference ShowCoding split;
model k=col1 / /* STEPS=1 */ STOP=1;
output out=new;
run;
QUIT;
Trusted Advisor
Posts: 1,118

Re: Creating a large amount of new variables

Posted in reply to michelconn

You are referring to codes like 747.7 as "numeric variables", but aren't they just codes which happen to look like numbers with decimals? Or would you say that the code 747.699 is "almost the same as" 747.7, because the "difference" of 0.001 is relatively small? Or is 747.7=0747.70 etc.?

 

Please note that 80 % of the numbers with up to one decimal (like 747.7) cannot be represented exactly in the computer's memory (in numeric variables)! The exact number stored instead is either something like 747.699999...5 or 747.700000...5 (in this specific case it's the latter). If you obtain some of these "numeric" codes from a database or from a SAS dataset which was created on a different platform or from some type of calculation, it can easily happen that the "same" number 747.7 is stored differently by this database system or the other platform or as the result of a calculation. Now, a "naive" comparison such as "if dx=747.7 then ..." would yield a wrong result!

 

Therefore, I would strongly recommend to store codes involving "decimals" always in character variables.

Super User
Posts: 5,518

Re: Creating a large amount of new variables

Posted in reply to michelconn

nce you will be creating 294 variables, you will need a line or two of code to create each one.  Your program won't be short, but it doesn't have to be excessively tedious either.  Here is one way:

 

data want;

set have;

flag_0339 = whichc('033.9', of dx1-dx15);

if flag_0339 > 0 then flag_0339=1;

run;

 

Note a few things. 

 

  • These two lines create FLAG_0339 as a numeric variable.  I would assume that 1/0 is more useful than Y/N. 
  • For each of the 294 variables to be created, you will need to add two lines of code to the program.
  • You must know what is in your data, and have an exact match.  "033.9" will not match "33.9" or "0033.9".

Macro language could be used to reduce the chance of typographical errors.  But it will not cut down on the program size.  In the simplest variation, you would get the two lines of code working for a single new variable, turn that code into a macro, and call the macro 294 times.  There are more complex approaches that might use arrays and more heavily use macro language, but they would probably disguise what the program is doing in exchange for making it shorter.

 

Does this sound like we are moving in the right direction? 

Trusted Advisor
Posts: 1,118

Re: Creating a large amount of new variables

Posted in reply to michelconn

How about this?

/* Create dummy data, just for demonstration */

%let rand_icd=%cmpres(cats(byte(choosen(int(rand('table',.852,.074,.074)),32,69,86)),
                      put(round(rand('normal', 500, 111), 0.01), best.)));

data have_ref; /* the list of 345 ICD codes of interest */
call streaminit(20160106);
length icd $7;
do _n_=1 to 345;
  icd=&rand_icd;
  output;
end;
run;

data have_pat; /* 100 patients with up to 15 diagnosis codes DX1-DX15 */
call streaminit(314159);
array dx[15] $7;
do patno=1 to 100;
  call missing(of dx[*]);
  do _n_=1 to 1+int(15*rand('uniform'));
    dx[_n_]=&rand_icd;
  end;
  output;
end;
run;

/* Create dataset by adding 0/1 flag variables to HAVE_PAT, one flag per ICD code in HAVE_REF */

data want;
retain patno;
set have_pat;
array dx[15];
array flag[345];
do i=1 to 345;
  set have_ref point=i;
  flag[i]=(icd in dx);
end;
drop icd;
run; 

/* Attach labels to the flag variables */

data _null_;
set have_ref end=last;
if _n_=1 then call execute('proc datasets lib=work nolist; modify want; label');
call execute(cats('flag',_n_)||'="Was ICD code '||strip(icd)||' found? 1=Yes, 0=No"');
if last then call execute('; quit;');
run;
Contributor ndp
Contributor
Posts: 61

Re: Creating a large amount of new variables

Posted in reply to michelconn

Since dx1-dx15 variables are type=character (even though some values are pure numeric) array for dx_freq should be character. This way all you need is one datastep. However you do need an extra array dx_fl for new variables dx_flag1-dx_flag345; but you have to be aware that 033.9 will not be same as 33.9 etc. Values in the array DX_freq have to match exactly character to character in dx1-dx15 variables.

data all;
 set year;
 array DX_freq {345} $ ('033.9' '038.0' ...etc ); *** include all values numeric + character within quotation marks;

array DX_fl {345} DX_flag1-DX_flag345;
 do i=1 to 345;
 DX_fl[i]=(dx1=DX_freq[i] or dx2=DX_freq[i] or dx3=DX_freq[i] or dx4=DX_freq[i] or dx5=DX_freq[i] or dx6=DX_freq[i] or dx7=DX_freq[i] or dx8=DX_freq[i] or dx9=DX_freq[i] or dx10=DX_freq[i] or dx11=DX_freq[i] or dx12=DX_freq[i] or dx13=DX_freq[i] or dx14=DX_freq[i] or dx15=DX_freq[i]);
end;

run;

 

 

Ask a Question
Discussion stats
  • 12 replies
  • 918 views
  • 1 like
  • 6 in conversation