Create new variables and make code more efficient

Regular Contributor
Posts: 150

Create new variables and make code more efficient

I am working with large datasets that include information about traveling. I have code that I inherited and have modified slightly but it is not the most efficient so I want to simplify the code and make it more efficient. Basically what I have is dates that people traveled, so the dits are the day they left and the dots are the day they returned, one person can have a maximum of 42 instance of travel so dit1-dit42 and if they have no dits or dots they did not travel. I also have survey dates for 4 interviews indicated by surveydate_w1-surveydate_w4. First, the code determines whether a person traveled before the survey date. Then I need to determine whether they went on a sales travel, which can be indicated by 5 different variables at each wave: var1_w1-var5_w1. SO basically what I need from there was is a categorical variable that indicated whether they didn't travel, traveled but no sales, or traveled with sales at each wave, variables names:  travsal_w1- travsal_w4. Separately from that, and what this code does not have, what I need is an indicator of whether they were traveling when they compled the survey, so if a dit and dot overlap with the survey date. Any help with this would be greatly appreciated. Thank you!

data travel;
set data;
array dits (42) dit1-dit42;
array dots (42) dot1-dot42;
array travels (42) travel1-travel42;

if dot1=. then trav=0;
else trav=1;
do i=1 to 42;
if . < dots(i) <= surveydate_w1 then travels(i)=1;
else if surveydate_w1 < dots(i) and dits(i) <= surveydate_w2 then travel(i)=2;
else if surveydate_w2 < dits(i) <= surveydate_w3 then travels(i)=3;
else if surveydate_w3 < dits(i) <= surveydate_w4 then travels(i)=4;
else if surveydate_w4 < dits(i) then travels(i)=5;
else travels(i)=99;
end;

travw1=0;
travw2=0;
travw3=0;
travw4=0;
do i=1 to 42;
if travels(i)=1 then travw1=travw1+1;
else if travels(i)=2 then travw2=travw2+1;
else if travels(i)=3 then travw3=travw3+1;
else if travels(i)=4 then travw4=travw4+1;
end;

*sales travel;
*w1;
if var1_w1 in (2,3) or var2_w1 in (2,3) or var3_w1 in (2,3) or var4_w1 in (2,3) or var5_w1 in (2,3) then sales_w1=1;
else if var1_w1=. or var2_w1=. or var3_w1=. or var4_w1=. or var5_w1=. then sales_w1=.;
else if var1_w1=1 and var2_w1=1 and var3_w1=1 and var4_w1=1 and var5_w1=1 then sales_w1=0;
*w2;
if var1_w2 in (2,3) or var2_w2 in (2,3) or var3_w2 in (2,3) or var4_w2 in (2,3) or var5_w2 in (2,3) then sales_w2=1;
else if var1_w2=. or var2_w2=. or var3_w2=. or var4_w2=. or var5_w2=. then sales_w2=.;
else if var1_w2=1 and var2_w2=1 and var3_w2=1 and var4_w2=1 and var5_w2=1 then sales_w2=0;
*w3;
if var1_w3 in (2,3) or var2_w3 in (2,3) or var3_w3 in (2,3) or var4_w3 in (2,3) or var5_w3 in (2,3) then sales_w3=1;
else if var1_w3=. or var2_w3=. or var3_w3=. or var4_w3=. or var5_w3=. then sales_w3=.;
else if var1_w3=1 and var2_w3=1 and var3_w3=1 and var4_w3=1 and var5_w3=1 then sales_w3=0;
*w4;
if var1_w4 in (2,3) or var2_w4 in (2,3) or var3_w4 in (2,3) or var4_w4 in (2,3) or var5_w4 in (2,3) then sales_w4=1;
else if var1_w4=. or var2_w4=. or var3_w4=. or var4_w4=. or var5_w4=. then sales_w4=.;
else if var1_w4=1 and var2_w4=1 and var3_w4=1 and var4_w4=4 and var5_w4=1 then sales_w4=0;

*Travel w/sales at each wave;
*0 = 'Did no travel' 1 = 'Traveled no sales' 2 = 'Traveled with sales';
if travw2=0 then travsal_w2=0;
if travw3=0 then travsal_w3=0;
if travw4=0 then travsal_w4=0;

if travw2>0 and sales_w2=0 then travsal_w2=1;
else if travw2>0 and sales_w2=1 then travsal_w2=2;

if travw3>0 and sales_w3=0 then travsal_w3=1;
else if travw3>0 and sales_w3=1 then travsal_w3=2;

if travw4>0 and sales_w4=0 then travsal_w4=1;
else if travw4>0 and sales_w4=1 then travsal_w4=2;

Super User
Posts: 5,878

Re: Create new variables and make code more efficient

Before we get started, we need to know from what angle. The code awkward, but do you mean from a CPU or maintenance perspective.

On small tip: you could use arrays for your survey and trecw variables, which could reduce no of program line a bit.

Data never sleeps
Regular Contributor
Posts: 150

Re: Create new variables and make code more efficient

Thank you for your response. Mostly from a CPU angle since I have about 200,000 observations the code runs very slow. However, having something that is less awkward and easier to maintain is preferred as well. I can try the array for the other variables as well. Thank you!

Super User
Posts: 23,724

Re: Create new variables and make code more efficient

I highly echo the suggestion of arrays. Based on the logic I see it may be difficult to simplify further and keep it understandable.
Super User
Posts: 13,542

Re: Create new variables and make code more efficient

How slow is "very slow"? 200,000 records isn't that many any you aren't really doing anything that looks stressful.

If your var1_w1 var1_w2 var1_w3 var1_w4 only take values of ., 1, 2, and 3 it may be quicker to execute with

var1_w1 > 1

then using the IN .

I think this part:

do i=1 to 42;
if travels(i)=1 then travw1=travw1+1;
else if travels(i)=2 then travw2=travw2+1;
else if travels(i)=3 then travw3=travw3+1;
else if travels(i)=4 then travw4=travw4+1;
end;

could be moved into the previous loop after the

else travels(i)=99;

Just initialize the variables before the first loop.

Regular Contributor
Posts: 150

Re: Create new variables and make code more efficient

Thank you. Those are helpful suggestions. It is not necessarily slow because of the amount of data, but partially because of low processing power of the computer. I've had it take an hour to subset a dataset by two variables, a process that should be straight forward.

Super User
Posts: 23,724

Re: Create new variables and make code more efficient

Is your data stored locally or is it on a network drive?
Posts: 4,736

Re: Create new variables and make code more efficient

[ Edited ]

200T rows are not that much and I haven't seen anything in the code you've posted which should be very resource intensive.

The run-times you mention sound "unreasonable" even when using "old" hardware. There must be another reason why things take so long.

Can you please execute the following code:

``````options fullstimer;
data travel;
set data;
array dits (42) dit1-dit42;
array dots (42) dot1-dot42;
array travels (42) travel1-travel42;
run;
``````

And then post from the log the part which looks similar to below:

``````NOTE: There were 19 observations read from the data set WORK.DATA.
NOTE: The data set WORK.TRAVEL has 19 observations and 131 variables.
NOTE: DATA statement used (Total process time):
real time           0.01 seconds
user cpu time       0.00 seconds
system cpu time     0.01 seconds
memory              930.81k
OS Memory           19036.00k
Timestamp           21/11/2015 11:12:21 AM``````

Besides of network how full is your disk where SAS WORK resides?

Below code will give you the path to your WORK area:

``````proc options option=work;
run;``````

And here your code a bit "stripped down". This won't make a difference in regards of run-times though.

``````options fullstimer;

data travel;
set data;
array dits (42) dit1-dit42;
array dots (42) dot1-dot42;
array travels (42) travel1-travel42;
array var_w (4,5)
var1_w1 var2_w1 var3_w1 var4_w1 var5_w1
var1_w2 var2_w2 var3_w2 var4_w2 var5_w2
var1_w3 var2_w3 var3_w3 var4_w3 var5_w3
var1_w4 var2_w4 var3_w4 var4_w4 var5_w4
;
array sales_w (4) sales_w1 sales_w2 sales_w3 sales_w4;

if dot1=. then trav=0;
else trav=1;

travw1=0;
travw2=0;
travw3=0;
travw4=0;

do i=1 to dim(dots);
if . <
dots(i) <= surveydate_w1 then
do;
travels(i)=1;
travw1=travw1+1;
end;
else if surveydate_w1 < dots(i) and dits(i) <= surveydate_w2 then
do;
travels(i)=2;
travw2=travw2+1;
end;
else if surveydate_w2 < dits(i) <= surveydate_w3 then
do;
travels(i)=3;
travw3=travw3+1;
end;
else if surveydate_w3 < dits(i) <= surveydate_w4 then
do;
travels(i)=4;
travw4=travw4+1;
end;
else if surveydate_w4 < dits(i) then
do;
travels(i)=5;
end;
else
do;
travels(i)=99;
end;
end;

*sales travel;
*w1 to w4;
do i=1 to 4;
if var_w[i,1] in (2,3) or var_w[i,2] in (2,3) or var_w[i,3] in (2,3) or var_w[i,4] in (2,3) or var_w[i,5] in (2,3) then sales_w[i]=1;
else if var_w[i,1]=. or var_w[i,2]=. or var_w[i,3]=. or var_w[i,4]=. or var_w[i,5]=. then sales_w[i]=.;
else if var_w[i,1]=1 and var_w[i,2]=1 and var_w[i,3]=1 and var_w[i,4]=1 and var_w[i,5]=1 then sales_w[i]=0;
end;

*Travel w/sales at each wave;
*0 = 'Did no travel' 1 = 'Traveled no sales' 2 = 'Traveled with sales';
if travw2=0 then travsal_w2=0;
else if sales_w2=0 then travsal_w2=1;
else if sales_w2=1 then travsal_w2=2;

if travw3=0 then travsal_w3=0;
else if sales_w3=0 then travsal_w3=1;
else if sales_w3=1 then travsal_w3=2;

if travw4=0 then travsal_w4=0;
else if sales_w4=0 then travsal_w4=1;
else if sales_w4=1 then travsal_w4=2;

run;``````

Regular Contributor
Posts: 150

Re: Create new variables and make code more efficient

Thank you for the response.

I ran that code and it executed quickly, below it what I got:

NOTE: There were 201495 observations read from the data set WORK.LIFESTRESSORS.
NOTE: The data set WORK.TRAVEL has 201495 observations and 423 variables.
NOTE: DATA statement used (Total process time):
real time 6.29 seconds
user cpu time 0.03 seconds
system cpu time 0.68 seconds
memory 925.09k
OS Memory 21992.00k
Timestamp 11/25/2015 01:45:06 PM

I think my slow speed have to do with working from a server so I just have to keep that in mind. Others at my work have the same issue where some things will just take an hour or two to run. Thank you for your help. I have just been trying to run on the local disk and stipped down datasets so that has helped tremendously.

Discussion stats
• 8 replies
• 436 views
• 0 likes
• 5 in conversation