Linmuxi
Calcite | Level 5

I have a dataset like this:

 

ACCT  Month  Delin_day  Balance
1     1      0          100
1     2      0          200
1     3      4          200
1     4      2          300
2     1      0          600
2     2      1          400
2     3      0          300
3     1      0          300
3     2      5          200
3     3      0          200
3     4      4          300

 

What I am trying to do is calculate the sum of the delinquent balance for each month, i.e. the sum of Balance where Delin_day > 0 for each month. For example, for Month 1 the sum should be 0, since no account is delinquent. For Month 2 the sum should be 600, since account 2 and account 3 are both delinquent.

 

Since my real dataset is really large (about 6000 accounts and 50,000 records), I am trying to use a do loop to do this, but my do loop doesn't work. Can anyone help me with this, please?

 

BTW, the maximum number of months in the dataset is 12, in case that is useful. Thanks!

1 ACCEPTED SOLUTION

Kurt_Bremser
Super User

Linmuxi wrote:

Since my real dataset is really large(about 6000 accounts and 50,000 records),

 


For SAS, 50,000 records is (almost) negligibly close to nothing. Standard data/sort/SQL steps will take considerably less than a second.

Example:

data have;
do account = 1 to 6000;
  do month = 1 to 12;
    balance = int(rand('uniform') * 1000);
    delin_day = round(rand('uniform'));
    output;
  end;
end;
run;

data want (keep=month balance);
set have end=done;
array m {12} m1-m12;
if delin_day > 0 then m{month} + balance;
if done
then do month = 1 to 12;
  balance = m{month};
  output;
end;
run;

The first step creates a random-filled dataset like the one you have, with 72K records. The next one is @Astounding's suggested solution.

This is the log:

41         data have;
42         do account = 1 to 6000;
43           do month = 1 to 12;
44             balance = int(rand('uniform') * 1000);
45             delin_day = round(rand('uniform'));
46             output;
47           end;
48         end;
49         run;

NOTE: The data set WORK.HAVE has 72000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.02 seconds

50         
51         data want (keep=month balance);
52         set have end=done;
53         array m {12} m1-m12;
54         if delin_day > 0 then m{month} + balance;
55         if done
56         then do month = 1 to 12;
57           balance = m{month};
58           output;
59         end;
60         run;

NOTE: There were 72000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 12 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds

(run on a 2-core pSeries)

You see that the total time for creating and analysing the data is 0.05 seconds (!).

Even if your 50k dataset has lots of columns, you'll stay in the seconds range.
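
Since SQL steps are mentioned above as an equally cheap alternative, here is a minimal PROC SQL sketch of the same summary. It assumes the dataset HAVE from the example above, with numeric month, delin_day and balance; the output column name delinquent_balance is only illustrative.

```sas
/* Sketch only: assumes WORK.HAVE with numeric MONTH, DELIN_DAY and BALANCE. */
proc sql;
  create table want as
  select month,
         /* count the balance only when the account is delinquent */
         sum(case when delin_day > 0 then balance else 0 end)
           as delinquent_balance
  from have
  group by month
  order by month;
quit;
```

Because the condition sits inside the SUM rather than in a WHERE clause, a month without any delinquent account still appears in the result, with a total of 0.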


8 REPLIES
SuzanneDorinski
Lapis Lazuli | Level 10

You could do this with PROC SORT and PROC MEANS.

 

data have;
  length acct $ 5;  /* month is left numeric so that months 10-12 sort after 9 */
  infile datalines dlm='09'x missover;
  input acct month delin_day balance;
  datalines;
1	1	0	100
1	2	0	200
1	3	4	200
1	4	2	300
2	1	0	600
2	2	1	400
2	3	0	300
3	1	0	300
3	2	5	200
3	3	0	200
3	4	4	300
;
run;

proc sort data=have;
  by month;
run;

proc means data=have sum noprint;
  where delin_day gt 0;
  var balance;
  by month;
  output out=want sum=delinquent_balance;
run;

proc print data=want noobs;
  var month delinquent_balance;
run;
ChrisNZ
Tourmaline | Level 20

@SuzanneDorinski It is much better not to sort before proc means and to use a class statement instead. This improves both legibility (less code) and speed (less processing; sorts are very expensive).
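
A minimal sketch of that variant, assuming the dataset HAVE from the post above and keeping the output variable name delinquent_balance used there:

```sas
/* Sketch only: no PROC SORT needed, CLASS groups the unsorted data. */
proc means data=have nway noprint;
  where delin_day gt 0;   /* keep delinquent records only */
  class month;
  var balance;
  /* NWAY keeps only the per-month rows and drops the overall total */
  output out=want (drop=_type_ _freq_) sum=delinquent_balance;
run;
```

As with the BY version, the WHERE statement removes non-delinquent records before summarizing, so a month without any delinquent account will not appear in WANT at all rather than showing a total of 0.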

Kurt_Bremser
Super User

@ChrisNZ wrote:

@SuzanneDorinski It is much better not to sort before proc means and to use a class statement instead. This improves both legibility (less code) and speed (less processing; sorts are very expensive).


But be aware that proc means may fail with memory problems if the variables in the class statement have high cardinality.

Reeza
Super User

Show your full input and expected output. 

 

YOU DO NOT USE A DO LOOP TO LOOP THROUGH OBSERVATIONS. 

 

SAS data step loops automatically. In SAS, a DO LOOP is used to loop through variables in a row.

 

Yes, there are exceptions to the above comment, but since I think you're a beginner I wouldn't go down that route at all right now. Honestly, I'm still not good at using DO LOOPS that do this and I've been programming with SAS for over a decade. 

 

Depending on what exactly you need, i.e. a running total or a sum, PROC EXPAND may be an option, or a simple data step may suffice.
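
As an illustration of the "simple data step" option, here is a sketch that relies on the data step's automatic loop over observations plus BY-group processing. It assumes a dataset HAVE with month, delin_day and balance as in the question; the names SORTED and delin_balance are made up for this example.

```sas
/* Sketch only: assumes WORK.HAVE with MONTH, DELIN_DAY and BALANCE. */
proc sort data=have out=sorted;
  by month;
run;

data want (keep=month delin_balance);
  set sorted;                 /* the data step reads one row per iteration */
  by month;
  if first.month then delin_balance = 0;             /* reset per month */
  if delin_day > 0 then delin_balance + balance;     /* sum statement retains */
  if last.month then output;                         /* one row per month */
run;
```

No DO loop is involved: the step itself iterates over the rows. For the sample in the question this should give 0 for month 1 and 600 for month 2, matching the expected values stated there.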

novinosrin
Tourmaline | Level 20

Hi, the answer has been provided to you with proc means. However, I am curious what you meant by using a do-loop when the objective seems to be merely to summarize. Did you mean one full pass of the dataset as one implicit loop without a "do" statement, or did you want to experiment with Ian's DOW loop, do until(last.by_variable)?
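
For reference, a DOW loop for this summary could look roughly like the sketch below. It assumes a dataset HAVE with month, delin_day and balance as in the question; SORTED and delin_balance are illustrative names.

```sas
/* Sketch only: the DOW loop needs the data ordered by the BY variable. */
proc sort data=have out=sorted;
  by month;
run;

data want (keep=month delin_balance);
  delin_balance = 0;               /* start each month at zero */
  do until (last.month);           /* read one whole month per step iteration */
    set sorted;
    by month;
    if delin_day > 0 then delin_balance = sum(delin_balance, balance);
  end;
  output;                          /* one observation per month */
run;
```

Each iteration of the data step consumes one complete BY group, so no RETAIN or first.month reset is needed.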

Reeza
Super User

With 50K observations you'd spend more time implementing a DoW loop than you would ever get back from a possible increase in run-time efficiency.

 

It's a perfectly valid solution, but when searching for an answer I try to balance efficiency of code and my time.  I tend to lean towards my time, since that's more of an issue. 

 

automation

 

https://imgs.xkcd.com/comics/automation.png

 

 

Astounding
PROC Star

By most standards, your data set isn't really large.  But if it were actually 1000 times larger, you might consider a process that would only take one step instead of two:

 

data want;
   set have end=done;
   array m {12} m1-m12;
   /* add the balance to the bucket for this month when delinquent */
   if delin_day > 0 then m{month} + balance;
   /* continue only once the last observation has been read */
   if done;

 

At this point, you have a couple of choices. You could simply add:

 

   keep m1-m12;
run;

 

That would give you a single observation, with total delinquent balances for each month.  Alternatively, you could transform the data as follows:

 

   do month = 1 to 12;
      delin_balance = m{month};
      output;
   end;
   keep month delin_balance;
run;

 

That would give you 12 observations, a separate observation for each month.


