BookmarkSubscribeRSS Feed
deleted_user
Not applicable
Dear all,

I have the following problem: I am running some analyses on a very large dataset (about 80,000 obs). In order to reduce running time, I drew a random sample from this data set, reducing the number of observations to about 12,000. Strangely, the run time was not shorter but far longer then. I can exclude other sources of error, the problem must be due to the modifying of the data set. I tried to create indexes and change the format of the variables but this did not solve the problem.

The macro I used for the random subset looks like this:

%macro age_selection(start_custage, end_custage);

%let agedif = %eval(&end_custage. - &start_custage.);

data ncs.ncs2_cusbase_dev_&country.&start_custage._&end_custage.;
set ncs.ncs2_cusbase_dev_&country.;
random_variate=ranuni(1234);
%do i = 1 %to (&agedif.+1);
if ((random_variate > ((&i.-1)/(&agedif.+1))) and (random_variate <= &i./(&agedif.+1))) then sel_age = &i.;
%end;

%do i = &start_custage. %to &end_custage.;
data ncs.ncs2_cusbase_dev_AGE&i._&country. (drop = random_variate sel_age);
set ncs.ncs2_cusbase_dev_&country.&start_custage._&end_custage.; if sel_age = &i.;
run;

proc sql;
connect to odbc (dsn=edw);
execute(
create index idx_&i. on ncs.ncs2_cusbase_dev_AGE&i._&country. (custid, first_order_date)
) by odbc;
quit;
%end;

%mend;

%age_selection(1, 4);



Any help would be highly appreciated. Thanks a lot in advance,
Holger
7 REPLIES 7
deleted_user
Not applicable
Sorry, here is the second part of the code:

data ncs.ncs2_cusbase_dev_&country.&start_custage._&end_custage.;
set ncs.ncs2_cusbase_dev_&country.;
random_variate=ranuni(1234);
%do i = 1 %to (&agedif.+1);
if ((random_variate > ((&i.-1)/(&agedif.+1))) and (random_variate <= &i./(&agedif.+1))) then sel_age = &i.;
%end;

%do i = &start_custage. %to &end_custage.;
data ncs.ncs2_cusbase_dev_AGE&i._&country. (drop = random_variate sel_age);
set ncs.ncs2_cusbase_dev_&country.&start_custage._&end_custage.; if sel_age = &i.;
run;

proc sql;
connect to odbc (dsn=edw);
execute(
create index idx_&i. on ncs.ncs2_cusbase_dev_AGE&i._&country. (custid, first_order_date)
) by odbc;
quit;
%end;


%mend;


%age_selection(1, 4);

Any help would be highly appreciated. Thanks a lot in advance,
Holger
Robert_Bardos
Fluorite | Level 6
Holger's code snippet is constantly being truncated, so most of it can only be seen by using the "Quote Original" button when composing a reply. (Hint: there's a connect to oracle involved).

@holgerfb: suggest you read http://support.sas.com/forums/thread.jspa?messageID=27609毙 and repost your code snippet in a more readable way. Do use the Preview window.
polingjw
Quartz | Level 8
Your method is still reading every observation in the dataset and then taking a subset of the data with an if statement. Some better methods of taking a random sample are described here: http://support.sas.com/kb/24/722.html
deleted_user
Not applicable
Thanks, just to clearify: My problem is not the sampling procedure but the work with the resulting datasets which are much smaller but take longer to process.
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
It would be more useful to see your SAS log, rather than a program code snippet. Suggest you share your entire SAS log with FULLSTATS option set along with:

OPTIONS SOURCE SOURCE2 MGEN SGEN MPRINT;

Scott Barry
SBBWorks, Inc.
deleted_user
Not applicable
Dear all,
please find below the corresponding log file. Does anyone generally know the problem that the same analysis with a smaller dataset takes longer than with a bigger dataset? Cheers!



MPRINT(AGE_SELECTION): * do random selection of customer ages: defining for
which customer age the customer will appear in developement dataset;
MLOGIC(AGE_SELECTION): %LET (variable name is AGEDIF)
SYMBOLGEN: Macro variable END_CUSTAGE resolves to 4
SYMBOLGEN: Macro variable START_CUSTAGE resolves to 1
SYMBOLGEN: Macro variable COUNTRY resolves to ge
SYMBOLGEN: Macro variable START_CUSTAGE resolves to 1
SYMBOLGEN: Macro variable END_CUSTAGE resolves to 4
MPRINT(AGE_SELECTION): data ncs.ncs2_cusbase_dev_ge1_4;
SYMBOLGEN: Macro variable COUNTRY resolves to ge
MPRINT(AGE_SELECTION): set ncs.ncs2_cusbase_dev_ge;
MPRINT(AGE_SELECTION): random_variate=ranuni(1234);
SYMBOLGEN: Macro variable AGEDIF resolves to 3
MLOGIC(AGE_SELECTION): %DO loop beginning; index variable I; start value is
1; stop value is 4; by value is 1.
SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable AGEDIF resolves to 3
SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable AGEDIF resolves to 3
SYMBOLGEN: Macro variable I resolves to 1
MPRINT(AGE_SELECTION): if ((random_variate > ((1-1)/(3+1))) and
(random_variate <= 1/(3+1))) then sel_age = 1;
MLOGIC(AGE_SELECTION): %DO loop index variable I is now 2; loop will
iterate again.
SYMBOLGEN: Macro variable I resolves to 2
SYMBOLGEN: Macro variable AGEDIF resolves to 3
SYMBOLGEN: Macro variable I resolves to 2
SYMBOLGEN: Macro variable AGEDIF resolves to 3
SYMBOLGEN: Macro variable I resolves to 2
MPRINT(AGE_SELECTION): if ((random_variate > ((2-1)/(3+1))) and
(random_variate <= 2/(3+1))) then sel_age = 2;
MLOGIC(AGE_SELECTION): %DO loop index variable I is now 3; loop will
iterate again.
SYMBOLGEN: Macro variable I resolves to 3
SYMBOLGEN: Macro variable AGEDIF resolves to 3
SYMBOLGEN: Macro variable I resolves to 3
SYMBOLGEN: Macro variable AGEDIF resolves to 3
SYMBOLGEN: Macro variable I resolves to 3
MPRINT(AGE_SELECTION): if ((random_variate > ((3-1)/(3+1))) and
(random_variate <= 3/(3+1))) then sel_age = 3;
MLOGIC(AGE_SELECTION): %DO loop index variable I is now 4; loop will
iterate again.
SYMBOLGEN: Macro variable I resolves to 4
SYMBOLGEN: Macro variable AGEDIF resolves to 3
SYMBOLGEN: Macro variable I resolves to 4
SYMBOLGEN: Macro variable AGEDIF resolves to 3
SYMBOLGEN: Macro variable I resolves to 4
MPRINT(AGE_SELECTION): if ((random_variate > ((4-1)/(3+1))) and
(random_variate <= 4/(3+1))) then sel_age = 4;
MLOGIC(AGE_SELECTION): %DO loop index variable I is now 5; loop will not
iterate again.
MPRINT(AGE_SELECTION): if random_variate = 0 then sel_age = 1;
MPRINT(AGE_SELECTION): * for few cases which are at the boarder;
SYMBOLGEN: Macro variable START_CUSTAGE resolves to 1
MPRINT(AGE_SELECTION): sel_age = sel_age + 1 - 1;
MPRINT(AGE_SELECTION): * calculation of final customer age (in weeks);
MPRINT(AGE_SELECTION): run;

NOTE: SAS variable labels, formats, and lengths are not written to DBMS
tables.
NOTE: There were 88940 observations read from the data set
NCS.NCS2_CUSBASE_DEV_GE.
NOTE: The data set NCS.NCS2_CUSBASE_DEV_GE1_4 has 88940 observations and 4
variables.
NOTE: DATA statement used (Total process time):
real time 6.90 seconds
cpu time 0.21 seconds


SYMBOLGEN: Macro variable START_CUSTAGE resolves to 1
SYMBOLGEN: Macro variable END_CUSTAGE resolves to 4
MLOGIC(AGE_SELECTION): %DO loop beginning; index variable I; start value is
1; stop value is 4; by value is 1.
SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable COUNTRY resolves to ge
MPRINT(AGE_SELECTION): proc delete data = ncs.ncs2_cusbase_dev_AGE1_ge_pre;
MPRINT(AGE_SELECTION): run;

WARNING: File NCS.NCS2_CUSBASE_DEV_AGE1_GE_PRE.DATA does not exist.
NOTE: PROCEDURE DELETE used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds


SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable COUNTRY resolves to ge
MPRINT(AGE_SELECTION): data ncs.ncs2_cusbase_dev_AGE1_ge_pre (drop =
random_variate sel_age);
SYMBOLGEN: Macro variable COUNTRY resolves to ge
SYMBOLGEN: Macro variable START_CUSTAGE resolves to 1
SYMBOLGEN: Macro variable END_CUSTAGE resolves to 4
MPRINT(AGE_SELECTION): set ncs.ncs2_cusbase_dev_ge1_4;
SYMBOLGEN: Macro variable I resolves to 1
MPRINT(AGE_SELECTION): if sel_age = 1;
MPRINT(AGE_SELECTION): run;

NOTE: SAS variable labels, formats, and lengths are not written to DBMS
tables.
NOTE: There were 88940 observations read from the data set
NCS.NCS2_CUSBASE_DEV_GE1_4.
NOTE: The data set NCS.NCS2_CUSBASE_DEV_AGE1_GE_PRE has 22074 observations
and 2 variables.
NOTE: DATA statement used (Total process time):
real time 2.62 seconds
cpu time 0.14 seconds


SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable COUNTRY resolves to ge
MPRINT(AGE_SELECTION): proc delete data = ncs.ncs2_cusbase_dev_AGE1_ge;
MPRINT(AGE_SELECTION): run;

WARNING: File NCS.NCS2_CUSBASE_DEV_AGE1_GE.DATA does not exist.
NOTE: PROCEDURE DELETE used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds


MPRINT(AGE_SELECTION): proc sql;
MPRINT(AGE_SELECTION): connect to odbc (dsn=edw);
SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable COUNTRY resolves to ge
MPRINT(AGE_SELECTION): execute( create table ncs.ncs2_cusbase_dev_AGE1_ge (
custid dec(9), first_order_date DATE ) ) by odbc;
WARNING: During execute: [IBM][iSeries Access ODBC Driver][DB2 UDB]SQL7905 -
Table NCS2_CUSBASE_DEV_AGE1_GE in NCS created but could not be
journaled.
SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable COUNTRY resolves to ge
SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable COUNTRY resolves to ge
MPRINT(AGE_SELECTION): execute( INSERT INTO ncs.ncs2_cusbase_dev_AGE1_ge
(custid, first_order_date) ( select custid, first_order_date from
ncs.ncs2_cusbase_dev_AGE1_ge_pre ) ) by odbc;
MPRINT(AGE_SELECTION): quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 1.14 seconds
cpu time 0.00 seconds


MPRINT(AGE_SELECTION): proc sql;
MPRINT(AGE_SELECTION): connect to odbc (dsn=edw);
SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable I resolves to 1
SYMBOLGEN: Macro variable COUNTRY resolves to ge
MPRINT(AGE_SELECTION): execute( create index idx_1 on
ncs.ncs2_cusbase_dev_AGE1_ge (custid, first_order_date) ) by odbc;
MPRINT(AGE_SELECTION): quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 1.12 seconds
cpu time 0.01 seconds
Robert_Bardos
Fluorite | Level 6
... but the work with the resulting datasets which are much smaller but take longer to process.

"The work with the resulting dataset" this is where we need more information. What does that mean? Your macro logic does create SAS code consisting of data steps and procedure invocations most certainly. Seeing the created code and seeing the SAS NOTE messages along with resource consumption related informations such as those that get reported when options like FULLSTIMER (it has some newer name but AFAIK it still works) are active would help us and ultimately you a big lot more.

So please set options SOURCE SOURCE2 NOTES FULLSTIMER and repost the log parts that relate to "the work with the resulting datasets".

Kind regards
Robert

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 714 views
  • 0 likes
  • 4 in conversation