Solved: Do-loop question

Linmuxi · Posted 07-30-2017 06:35 PM

I have a dataset like this:

ACCT	Month	Delin_day	Balance
1	1	0	100
1	2	0	200
1	3	4	200
1	4	2	300
2	1	0	600
2	2	1	400
2	3	0	300
3	1	0	300
3	2	5	200
3	3	0	200
3	4	4	300

What I am trying to do is calculate the total delinquent balance for each month, that is, the sum of Balance over the records where Delin_day > 0. For example, for month 1 the sum should be 0, since no account is delinquent. For month 2 the sum should be 600, since accounts 2 and 3 are both delinquent (400 + 200).

Since my real dataset is really large (about 6000 accounts and 50,000 records), I am trying to use a do-loop to do this, but my do-loop doesn't work. Can anyone help me with this, please?

BTW, the max months in the dataset is 12, if it's useful. Thanks!

Kurt_Bremser · Posted 07-31-2017 04:27 AM

Linmuxi wrote:

Since my real dataset is really large (about 6000 accounts and 50,000 records),

For SAS, 50,000 records is next to nothing. Standard data/sort/SQL steps will take considerably less than a second.

Example:

data have;
do account = 1 to 6000;
  do month = 1 to 12;
    balance = int(rand('uniform') * 1000);
    delin_day = round(rand('uniform'));
    output;
  end;
end;
run;

data want (keep=month balance);
set have end=done;
array m {12} m1-m12;
if delin_day > 0 then m{month} + balance;
if done
then do month = 1 to 12;
  balance = m{month};
  output;
end;
run;

The first step creates a random-filled dataset like the one you have, with 72K records. The next one is @Astounding's suggested solution.

This is the log:

41         data have;
42         do account = 1 to 6000;
43           do month = 1 to 12;
44             balance = int(rand('uniform') * 1000);
45             delin_day = round(rand('uniform'));
46             output;
47           end;
48         end;
49         run;

NOTE: The data set WORK.HAVE has 72000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.02 seconds

50         
51         data want (keep=month balance);
52         set have end=done;
53         array m {12} m1-m12;
54         if delin_day > 0 then m{month} + balance;
55         if done
56         then do month = 1 to 12;
57           balance = m{month};
58           output;
59         end;
60         run;

NOTE: There were 72000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 12 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds

(run on a 2-core pSeries)

You see that the total time for creating and analysing the data is 0.05 seconds (!).

Even if your 50k dataset has lots of columns, you'll stay in the seconds range.
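
For comparison, the same summary can be sketched as a single PROC SQL step, SQL being one of the standard step types mentioned at the start of this post. This is only an illustration: it assumes the variable names of the generated HAVE dataset (month, delin_day, balance), and the names want_sql and delin_balance are made up for the example.

```sas
/* Sketch only: total delinquent balance per month in one SQL step. */
/* want_sql and delin_balance are hypothetical names.               */
proc sql;
  create table want_sql as
  select month,
         /* non-delinquent rows contribute 0 to the monthly total */
         sum(case when delin_day > 0 then balance else 0 end) as delin_balance
  from have
  group by month
  order by month;
quit;
```

Because the CASE expression contributes 0 for non-delinquent rows, a month in which no account is delinquent still appears in the result, with a total of 0.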


SuzanneDorinski · Posted 07-30-2017 07:05 PM

You could do this with PROC SORT and PROC MEANS.

data have;
  length acct $ 5 month $2;
  infile datalines dlm='09'x missover;
  input acct month delin_day balance;
  datalines;
1	1	0	100
1	2	0	200
1	3	4	200
1	4	2	300
2	1	0	600
2	2	1	400
2	3	0	300
3	1	0	300
3	2	5	200
3	3	0	200
3	4	4	300
;
run;

proc sort data=have;
  by month;
run;

proc means data=have sum noprint;
  where delin_day gt 0;
  var balance;
  by month;
  output out=want sum=delinquent_balance;
run;

proc print data=want noobs;
  var month delinquent_balance;
run;

ChrisNZ · Posted 08-01-2017 12:51 AM

@SuzanneDorinski It is much better not to sort before PROC MEANS and to use a CLASS statement instead. This improves both legibility (less code) and speed (less processing; sorts are very expensive).
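
A minimal sketch of the CLASS-statement variant, assuming the HAVE dataset from the earlier post (month, delin_day, balance). The NWAY option and the DROP= of the automatic _TYPE_ and _FREQ_ variables are additions for this sketch, there to keep only the per-month rows.

```sas
/* Sketch only: no PROC SORT needed, because CLASS does not require sorted input. */
proc means data=have nway noprint;
  where delin_day gt 0;          /* keep delinquent records only */
  class month;                   /* group by month without sorting */
  var balance;
  output out=want (drop=_type_ _freq_) sum=delinquent_balance;
run;
```

Without NWAY the output would also contain an overall total row (_TYPE_=0). Note that, as with the BY version, a month in which no record passes the WHERE filter does not appear in WANT at all.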


Kurt_Bremser · Posted 08-01-2017 02:27 AM

@ChrisNZ wrote:

@SuzanneDorinski It is much better not to sort before PROC MEANS and to use a CLASS statement instead. This improves both legibility (less code) and speed (less processing; sorts are very expensive).

But be aware that proc means may fail with memory problems if the variables in the class statement have high cardinality.


Reeza · Posted 07-30-2017 07:26 PM

Show your full input and expected output.

YOU DO NOT USE A DO LOOP TO LOOP THROUGH OBSERVATIONS.

SAS data step loops automatically. In SAS, a DO LOOP is used to loop through variables in a row.

Yes, there are exceptions to the above comment, but since I think you're a beginner I wouldn't go down that route at all right now. Honestly, I'm still not good at using DO LOOPS that do this and I've been programming with SAS for over a decade.

Depending on what exactly you need, i.e. a running total or a plain sum, PROC EXPAND may be an option, or a simple data step may suffice.
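
To illustrate the difference between the implicit loop and a DO loop, here is a rough sketch. The first step relies on the data step's automatic loop over observations. The second uses an explicit DO loop over variables within one row. The datasets FLAGGED, WIDE and TOTAL and the variables delinquent, m1-m12 and year_total are hypothetical, chosen only for the example.

```sas
/* Implicit loop: these statements execute once for every observation in HAVE. */
data flagged;
  set have;
  delinquent = (delin_day > 0);   /* 1 if delinquent, 0 otherwise */
run;

/* Explicit DO loop: iterate over variables within a single observation.   */
/* WIDE is a hypothetical dataset with one row holding monthly totals m1-m12. */
data total;
  set wide;
  array m {12} m1-m12;
  year_total = 0;
  do i = 1 to 12;
    year_total = sum(year_total, m{i});   /* SUM() ignores missing values */
  end;
  drop i;
run;
```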

novinosrin · Posted 07-30-2017 07:29 PM

Hi, the answer has already been provided with PROC MEANS. However, I am curious what you meant by using a do-loop when the objective seems to be merely to summarize. Did you mean one full pass of the dataset as a single implicit loop without a DO statement, or did you want to experiment with Ian's DOW loop, do until(last.by_variable)?
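
For readers who have not seen it, a DOW loop for this task might look roughly like the sketch below. It assumes the variable names from the question (month, delin_day, balance); the dataset names sorted and want_dow and the variable delin_balance are made up for the example.

```sas
/* The BY statement in the data step needs the input ordered by month. */
proc sort data=have out=sorted;
  by month;
run;

data want_dow (keep=month delin_balance);
  delin_balance = 0;               /* reset the total at the start of each month */
  do until (last.month);           /* read one whole BY group per step iteration */
    set sorted;
    by month;
    if delin_day > 0 then delin_balance + balance;
  end;
  /* implicit OUTPUT here writes one observation per month */
run;
```

Since every record is read and only delinquent ones are added, a month with no delinquent accounts is still written, with a total of 0.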

Reeza · Posted 07-30-2017 07:33 PM

With 50K observations you'd spend more time implementing a DoW loop than you would ever get back from any possible gain in run-time efficiency.

It's a perfectly valid solution, but when searching for an answer I try to balance efficiency of code and my time. I tend to lean towards my time, since that's more of an issue.

https://imgs.xkcd.com/comics/automation.png

Astounding · Posted 07-30-2017 08:46 PM

By most standards, your data set isn't really large. But if it were actually 1000 times larger, you might consider a process that would only take one step instead of two:

data want;
set have end=done;
array m {12} m1-m12;
if delin_day > 0 then m{month} + balance;
if done;

At this point, you have a couple of choices. You could simply add:

keep m1-m12;
run;

That would give you a single observation, with total delinquent balances for each month. Alternatively, you could transform the data as follows:

do month = 1 to 12;
  delin_balance = m{month};
  output;
end;
keep month delin_balance;
run;

That would give you 12 observations, a separate observation for each month.
