Hello
I have a data set with 60 million rows and 850 variables.
In this data set each customerID has 24 rows (Each row represent information of specific month).
I want to select customers that meet the condition that in month Nov 2021 they belong to population 7.
Then ,for this customer list I want to check for each customer if he/she failed to pay loans in months Jul2022 till Oct2022.
Then I want to calculate how many customers failed to pay loans in months jul2022 till oct2022.( if customer fail at least in one month ).
I created sql query that calculate it.
Since the data set is very big then I created a small data set (1 million rows) and run the query and in 1 minute I got the result.
However, when I run it on full data set then I waited long time and finally got a message that it was terminated.
Then I run it again but didnt recieve any result .
Here is the code I run.
Any recommendation how to run code to receive the desired result?
proc sql;
create table want as
select count(*) as nr
from
(select lakoach
from csp.twoyrs_3
where VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7 ) as a
inner join
(select lakoach,
max(ind_shlili_6) as calc_ind_shlili_1_5
from csp.twoyrs_3
where VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd)
group by lakoach
having ind_shlili_6=1) as b
on a.lakoach=b.lakoach
;
quit;
Here is the Log when I run and waited long time and didnt get any result
1 The SAS System 13:07 Tuesday, September 3, 2024
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program';
4 %LET _CLIENTPROCESSFLOWNAME='IV dist';
5 %LET _CLIENTPROJECTPATH='K:\aתחום מודלים\פיתוח מודל CS\פיתוח מודלים\עוסקים_7\Project_Models_Oskim_Ron_Gibuy.egp';
6 %LET _CLIENTPROJECTPATHHOST='VSK1H103A2014';
7 %LET _CLIENTPROJECTNAME='Project_Models_Oskim_Ron_Gibuy.egp';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HTMLBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26 proc sql;
27 create table ttt2a as
28 select count(*) as nr
29 from
30 (select lakoach
31 from csp.twoyrs_3
32 where VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7 ) as a
33 inner join
34 (select lakoach,
35 max(ind_shlili_6) as calc_ind_shlili_1_5
36 from csp.twoyrs_3
37 where VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd)
38 group by lakoach
39 having ind_shlili_6=1) as b
40 on a.lakoach=b.lakoach
41 ;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Compression was disabled for data set WORK.TTT2A because compression overhead would increase the size of the data set.
NOTE: SAS threaded sort was used.
So you have two selection criteria.
I want to select customers that meet the condition that in month Nov 2021 they belong to population 7.
Then ,for this customer list I want to check for each customer if he/she failed to pay loans in months Jul2022 till Oct2022.
Then I want to calculate how many customers failed to pay loans in months jul2022 till oct2022.( if customer fail at least in one month ).
From the code you posted it looks like you have data that uses LAKOACH and VALID_FROM_DTTM as the key variables.
So assuming the data is already sorted (and if it is not sorted then that might help explain why the SQL code takes so long since SQL will have to sort the data first to do the grouping you asked for) then you could do this in one pass thru the data.
You can wrap the SET statement in a DO loop so that all of the observations for one customer are processed in one iteration of the data step. When you see the appropriate observations for the months in question you can then set some flags to indicate if the condition is met.
If you just want to count instead of the individual customer ids then you can use SUM statements to increment counters and output only one observation.
data want;
pop7=0;
failure=0;
do until(last.LAKOACH)
set csp.twoyrs_3 end=eof;
by LAKOACH;
if (VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7) then POP7=1;
if (VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd)
and ind_shlili_6 ne 1 then FAILURE=1;
end;
totalN+1;
totalpop7 + POP7;
totalpop7failure + (pop7 and failure);
keep total: ;
if eof;
run;
Is the data in SAS dataset(s)? Or are you connecting to a remote database? What type of libname engine did you use when defining CSP libref?
If the data is in a dataset then a data step solution would probably be much faster.
If the data is in a remote database then you probably do not what to write SQL code that takes advantage of SAS's ability to automatically remerge since that might cause SAS to not push the query into the database.
Also if the data is in a remote database then ask the admin for the database to analyze the query and see if there are tricks you can use in that database to make the query run more quickly.
Most likely the remerge is being triggered by the HAVING clause in this subquery.
select lakoach
, max(ind_shlili_6) as calc_ind_shlili_1_5
from csp.twoyrs_3
where VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd)
group by lakoach
having ind_shlili_6=1
I can change the proc sql to this one -maybe it is better because no remerge?
Anyway I would like to see data step/proc step way to do it ?
is it also one step or two steps?
proc sql;
create table want as
select sum(calc_ind_shlili_1_5) as nr
from
(select lakoach
from csp.twoyrs_3
where VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7 ) as a
inner join
(select lakoach,
max(ind_shlili_6) as calc_ind_shlili_1_5
from csp.twoyrs_3
where VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd)
group by lakoach) as b
on a.lakoach=b.lakoach
;
quit;
I removed the having statement but still long run and didnt finish run
So you have two selection criteria.
I want to select customers that meet the condition that in month Nov 2021 they belong to population 7.
Then ,for this customer list I want to check for each customer if he/she failed to pay loans in months Jul2022 till Oct2022.
Then I want to calculate how many customers failed to pay loans in months jul2022 till oct2022.( if customer fail at least in one month ).
From the code you posted it looks like you have data that uses LAKOACH and VALID_FROM_DTTM as the key variables.
So assuming the data is already sorted (and if it is not sorted then that might help explain why the SQL code takes so long since SQL will have to sort the data first to do the grouping you asked for) then you could do this in one pass thru the data.
You can wrap the SET statement in a DO loop so that all of the observations for one customer are processed in one iteration of the data step. When you see the appropriate observations for the months in question you can then set some flags to indicate if the condition is met.
If you just want to count instead of the individual customer ids then you can use SUM statements to increment counters and output only one observation.
data want;
pop7=0;
failure=0;
do until(last.LAKOACH)
set csp.twoyrs_3 end=eof;
by LAKOACH;
if (VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7) then POP7=1;
if (VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd)
and ind_shlili_6 ne 1 then FAILURE=1;
end;
totalN+1;
totalpop7 + POP7;
totalpop7failure + (pop7 and failure);
keep total: ;
if eof;
run;
I run the code and have an error
I also added ; after do until(last.LAKOACH)
data want;
pop7=0;
failure=0;
do until(last.LAKOACH);
set csp.twoyrs_3 end=eof;
by LAKOACH;
if (VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7) then POP7=1;
if (VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd) and ind_shlili_6=1 then FAILURE=1;
end;
totalN+1;
totalpop7 + POP7;
totalpop7failure + (pop7 and failure);
keep total: ;
if eof;
run;
1 The SAS System 13:07 Tuesday, September 3, 2024
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program';
4 %LET _CLIENTPROCESSFLOWNAME='IV dist';
5 %LET _CLIENTPROJECTPATH='K:\aתחום מודלים\פיתוח מודל CS\פיתוח מודלים\עוסקים_7\Project_Models_Oskim_Ron_Gibuy.egp';
6 %LET _CLIENTPROJECTPATHHOST='VSK1H103A2014';
7 %LET _CLIENTPROJECTNAME='Project_Models_Oskim_Ron_Gibuy.egp';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HTMLBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26 data want;
27 pop7=0;
28 failure=0;
29 do until(last.LAKOACH);
30 set csp.twoyrs_3 end=eof;
31 by LAKOACH;
32 if (VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7) then POP7=1;
33 if (VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd) and ind_shlili_6=1 then
____
_
388
79
202
33 ! FAILURE=1;
ERROR 388-185: Expecting an arithmetic operator.
ERROR 79-322: Expecting a ).
ERROR 202-322: The option or parameter is not recognized and will be ignored.
34 end;
35 totalN+1;
36 totalpop7 + POP7;
37 totalpop7failure + (pop7 and failure);
38 keep total: ;
39 if eof;
40 run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0 observations and 4 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
2 The SAS System 13:07 Tuesday, September 3, 2024
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 2237.40k
OS Memory 30892.00k
Timestamp 09/04/2024 07:42:12 PM
Step Count 2025 Switch Count 0
Page Faults 0
Page Reclaims 247
Page Swaps 0
Voluntary Context Switches 12
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 0
41
42 GOPTIONS NOACCESSIBLE;
43 %LET _CLIENTTASKLABEL=;
44 %LET _CLIENTPROCESSFLOWNAME=;
45 %LET _CLIENTPROJECTPATH=;
46 %LET _CLIENTPROJECTPATHHOST=;
47 %LET _CLIENTPROJECTNAME=;
48 %LET _SASPROGRAMFILE=;
49 %LET _SASPROGRAMFILEHOST=;
50
51 ;*';*";*/;quit;run;
52 ODS _ALL_ CLOSE;
53
54
55 QUIT; RUN;
56
You are missing a close parentheses in the second IF statement.
Hint: Do not type lines of code that are longer than 70 bytes. They are too long for humans to scan easily. There is a reason why newspapers have multiple columns.
IF a customer fail a few times during the 5 months period (JUN2022-OCT2022) then you count is as one or more?We need to count it as one.
totalpop7failure + (pop7 and failure);
Your code is working perfect 100%.
Can you please explain why your code totalpop7failure + (pop7 and failure) count only once when a customer fail a few times?
It is working perfect but I want to understand please
data have;
format ddate date9.;
input CustID ddate :date9. pop Ind_Fail;
cards;
1 '01APR2022'd 7 0
1 '01MAY2022'd 7 0
1 '01JUN2022'd 7 0
1 '01JUL2022'd 7 0
1 '01AUG2022'd 7 0
1 '01SEP2022'd 7 0
1 '01OCT2022'd 7 0
1 '01NOV2022'd 7 0
2 '01APR2022'd 7 0
2 '01MAY2022'd 7 0
2 '01JUN2022'd 7 1
2 '01JUL2022'd 7 0
2 '01AUG2022'd 7 0
2 '01SEP2022'd 7 0
2 '01OCT2022'd 7 0
2 '01NOV2022'd 7 0
3 '01APR2022'd 7 0
3 '01MAY2022'd 7 0
3 '01JUN2022'd 7 1
3 '01JUL2022'd 7 1
3 '01AUG2022'd 7 1
3 '01SEP2022'd 7 1
3 '01OCT2022'd 7 0
3 '01NOV2022'd 7 0
4 '01APR2022'd 8 0
4 '01MAY2022'd 8 0
4 '01JUN2022'd 8 1
4 '01JUL2022'd 8 0
4 '01AUG2022'd 8 0
4 '01SEP2022'd 8 0
4 '01OCT2022'd 8 0
4 '01NOV2022'd 8 0
5 '01APR2022'd 7 0
5 '01MAY2022'd 7 0
5 '01JUN2022'd 7 1
5 '01JUL2022'd 7 0
5 '01AUG2022'd 7 0
5 '01SEP2022'd 7 0
5 '01OCT2022'd 7 0
5 '01NOV2022'd 8 0
;
Run;
data want;
pop7=0;
failure=0;
do until(last.CustID);
set have end=eof;
by CustID;
if (ddate='01NOV2022'd and pop=7) then POP7=1;
if ddate in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd) and Ind_Fail=1 then FAILURE=1;
end;
totalN+1; /***Total customers----5 customers****/
totalpop7 + POP7; /****Base pop that is determined by belonging to pop 7 in month Nov 2022---3 customers***/
totalpop7failure + (pop7 and failure);/***number of customers from the base pop totalpop7 that fail at least in one onth----2 customers***/
keep total: ;
if eof;
run;
Because it can only execute that statement once per customer because of the DO loop.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.