Do-loop question

Occasional Contributor
Posts: 5
I have a dataset like this:

 

ACCT  Month  Delin_day  Balance
1     1      0          100
1     2      0          200
1     3      4          200
1     4      2          300
2     1      0          600
2     2      1          400
2     3      0          300
3     1      0          300
3     2      5          200
3     3      0          200
3     4      4          300

 

What I am trying to do is calculate the sum of the delinquent balance for each month, i.e. the sum of Balance where Delin_day > 0, by month. For example, for month 1 the sum should be 0, since no account shows delinquency. For month 2 the sum should be 600, since accounts 2 and 3 are both delinquent.

 

Since my real dataset is really large (about 6000 accounts and 50,000 records), I am trying to use a do loop to do this, but my do loop doesn't work. Can anyone help me with this, please?

 

BTW, the maximum number of months in the dataset is 12, if that's useful. Thanks!


Accepted Solutions
Solution
‎09-13-2017 09:22 AM
Super User
Posts: 7,762

Re: Do-loop question


Linmuxi wrote:

Since my real dataset is really large(about 6000 accounts and 50,000 records),

 


For SAS, 50,000 records is (almost) negligibly close to nothing. Standard data/sort/SQL steps will take considerably less than a second.

Example:

data have;
do account = 1 to 6000;
  do month = 1 to 12;
    balance = int(rand('uniform') * 1000);
    delin_day = round(rand('uniform'));
    output;
  end;
end;
run;

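data want (keep=month balance);
set have end=done;
array m {12} m1-m12;                       /* one accumulator per month */
if delin_day > 0 then m{month} + balance;  /* sum statement: m1-m12 start at 0 and are retained */
if done                                    /* after the last observation ... */
then do month = 1 to 12;                   /* ... write one row per month */
  balance = m{month};
  output;
end;
run;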

The first step creates a random-filled dataset like the one you have, with 72K records. The next one is @Astounding's suggested solution.

This is the log:

41         data have;
42         do account = 1 to 6000;
43           do month = 1 to 12;
44             balance = int(rand('uniform') * 1000);
45             delin_day = round(rand('uniform'));
46             output;
47           end;
48         end;
49         run;

NOTE: The data set WORK.HAVE has 72000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.02 seconds

50         
51         data want (keep=month balance);
52         set have end=done;
53         array m {12} m1-m12;
54         if delin_day > 0 then m{month} + balance;
55         if done
56         then do month = 1 to 12;
57           balance = m{month};
58           output;
59         end;
60         run;

NOTE: There were 72000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 12 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds

(run on a 2-core pSeries)

You see that the total time for creating and analysing the data is 0.05 seconds (!).

Even if your 50k dataset has lots of columns, you'll stay in the seconds range.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers


All Replies
Frequent Contributor
Posts: 89

Re: Do-loop question

You could do this with PROC SORT and PROC MEANS.

 

data have;
  length acct $ 5 month $2;
  infile datalines dlm='09'x missover;
  input acct month delin_day balance;
  datalines;
1	1	0	100
1	2	0	200
1	3	4	200
1	4	2	300
2	1	0	600
2	2	1	400
2	3	0	300
3	1	0	300
3	2	5	200
3	3	0	200
3	4	4	300
;
run;

proc sort data=have;
  by month;
run;

proc means data=have sum noprint;
  where delin_day gt 0;
  var balance;
  by month;
  output out=want sum=delinquent_balance;
run;

proc print data=want noobs;
  var month delinquent_balance;
run;
PROC Star
Posts: 1,759

Re: Do-loop question

Posted in reply to SuzanneDorinski

@SuzanneDorinski It is much better not to sort before PROC MEANS and to use a CLASS statement instead. This improves both legibility (less code) and speed (less processing; sorts are very expensive).
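
For reference, a minimal sketch of the CLASS version, assuming the same `have` dataset and variable names as in the post above (`want` and `delinquent_balance` are simply carried over from there). NWAY keeps only the per-month rows, and no PROC SORT is needed:

```sas
/* Sketch only: same summary as the PROC SORT + BY version, using CLASS. */
proc means data=have nway noprint;
  where delin_day gt 0;          /* only delinquent records */
  class month;                   /* grouping without a prior sort */
  var balance;
  output out=want (drop=_type_ _freq_) sum=delinquent_balance;
run;

proc print data=want noobs;
  var month delinquent_balance;
run;
```

One caveat with the WHERE filter: a month with no delinquent accounts at all (month 1 in the sample data) will be missing from `want` rather than showing a sum of 0.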

Super User
Posts: 7,762

Re: Do-loop question


ChrisNZ wrote:

@SuzanneDorinski It is much better not to sort before PROC MEANS and to use a CLASS statement instead. This improves both legibility (less code) and speed (less processing; sorts are very expensive).


But be aware that proc means may fail with memory problems if the variables in the class statement have high cardinality.

Super User
Posts: 19,770

Re: Do-loop question

Show your full input and expected output. 

 

YOU DO NOT USE A DO LOOP TO LOOP THROUGH OBSERVATIONS. 

 

SAS data step loops automatically. In SAS, a DO LOOP is used to loop through variables in a row.
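
To illustrate the distinction, here is a small sketch with made-up data (the dataset `row_totals` and the variables `q1-q4`, `total` and `i` are hypothetical and not from the original question). The data step itself moves from one observation to the next; the DO loop only walks across the variables within the current row:

```sas
/* Sketch only: the implicit data step loop reads one row per iteration; */
/* the explicit DO loop iterates over variables within that row.         */
data row_totals;
  input q1-q4;
  array q {4} q1-q4;
  total = 0;                     /* reset for every observation */
  do i = 1 to dim(q);            /* loop over the four variables in this row */
    total = total + q{i};
  end;
  drop i;
  datalines;
1 2 3 4
5 6 7 8
;
run;
```

No loop over the two input rows is written anywhere; SAS runs the step once per row on its own.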

 

Yes, there are exceptions to the above comment, but since I think you're a beginner I wouldn't go down that route at all right now. Honestly, I'm still not good at using DO LOOPS that do this and I've been programming with SAS for over a decade. 

 

Depending on what exactly you need, i.e. a running total or a plain sum, PROC EXPAND may be an option, or a simple data step may suffice.
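
If a plain per-month sum is all that is needed, PROC SQL is one more option. This is a sketch, assuming the dataset is called `have` and has the variables from the original post (`month`, `delin_day`, `balance`); the names `want` and `delinquent_balance` are hypothetical:

```sas
/* Sketch only: conditional sum per month, no loop and no separate sort step. */
proc sql;
  create table want as
  select month,
         sum(case when delin_day > 0 then balance else 0 end)
           as delinquent_balance
  from have
  group by month
  order by month;
quit;
```

Because the condition sits inside the CASE expression rather than in a WHERE clause, months without any delinquency are kept with a sum of 0, which matches the month 1 result asked for in the question.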

PROC Star
Posts: 283

Re: Do-loop question

Hi, the answer has already been provided with PROC MEANS. However, I am curious what you meant by using a do loop when the objective seems to be merely to summarize. Did you mean one full pass of the dataset as a single implicit loop, without a DO statement, or did you want to experiment with Ian's DOW loop, do until(last.by_variable)?
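
For reference, a DOW loop for this problem could look roughly like the sketch below. It assumes the data are first sorted by `month`; the names `have_sorted` and `delinquent_balance` are hypothetical:

```sas
/* Sketch only: a DOW loop, i.e. do until(last.by_variable) wrapped around SET. */
proc sort data=have out=have_sorted;
  by month;
run;

data want (keep=month delinquent_balance);
  delinquent_balance = 0;              /* reset at the start of each month */
  do until (last.month);
    set have_sorted;
    by month;
    if delin_day > 0 then
      delinquent_balance = sum(delinquent_balance, balance);
  end;
  /* implicit OUTPUT here: one observation per month */
run;
```

Each iteration of the data step consumes one whole BY group, so the accumulator needs no RETAIN statement and the output happens once per month.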

Super User
Posts: 19,770

Re: Do-loop question

Posted in reply to novinosrin

With 50K observations you'd spend more time implementing a DOW loop than you would ever get back from any gain in run-time efficiency.

 

It's a perfectly valid solution, but when searching for an answer I try to balance efficiency of code and my time.  I tend to lean towards my time, since that's more of an issue. 

 

automation

 

https://imgs.xkcd.com/comics/automation.png

 

 

Super User
Posts: 5,497

Re: Do-loop question

By most standards, your data set isn't really large.  But if it were actually 1000 times larger, you might consider a process that would only take one step instead of two:

 

data want;
set have end=done;
array m {12} m1-m12;
if delin_day > 0 then m{month} + balance;
if done;

 

At this point, you have a couple of choices. You could simply add:

 

keep m1-m12;
run;

 

That would give you a single observation, with total delinquent balances for each month.  Alternatively, you could transform the data as follows:

 

do month = 1 to 12;
   delin_balance = m{month};
   output;
end;
keep month delin_balance;
run;

 

That would give you 12 observations, a separate observation for each month.
