Khaladdin
Quartz | Level 8

Hi all,

 

I want to ask a question related to the state space procedure. I have a huge data set that contains a million groups. I need to find the permanent and transitory components of each group by using a state space model. I ran the following code:

 

proc ucm data=work;
	model price;
	by group;
	irregular plot=smooth;
	level checkbreak plot=smooth;
	estimate plot=residual;
	forecast plot=forecasts lead=10 alpha=0.5;
run;

This code works well, but I have one issue: because I have a huge number of groups, the run takes a very long time (approximately 3 months). Do you know of any way to make it more efficient and reduce the run time?

Thanks in advance for your help.

1 ACCEPTED SOLUTION

Rick_SAS
SAS Super FREQ

If you have millions of BY groups, I question whether you need all those plots. How are you going to view 4 million plots?

 

If you don't need printed output for certain sections, suppress it. For example, use PRINT=NONE on the ESTIMATE statement or use the ODS EXCLUDE statement to suppress output.

 

My advice is to (1) use NOPRINT to turn off printing, (2) get rid of the plots, and (3) use the OUTEST= and OUTFOR= options to send the results to data sets.

 

Is this real data or a simulation? If a simulation, see Tips #4 through #8 in the article "Eight tips to make your simulation run faster."
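
As a rough sketch of how these three suggestions might be combined (not code from the original post): the data set names myEstimates and myForecasts are illustrative, and the model statements are copied from the question with the plot requests and CHECKBREAK removed. The OUTFOR= data set should hold the forecasts along with smoothed component estimates, but check the PROC UCM documentation for the exact variable names before relying on them.

```sas
/* Sketch only: myEstimates and myForecasts are hypothetical names */
proc ucm data=work noprint;            /* NOPRINT suppresses displayed output */
   by group;
   model price;
   irregular;                          /* no PLOT= requests anywhere */
   level;
   estimate outest=myEstimates;        /* parameter estimates per BY group */
   forecast lead=10 alpha=0.5
            outfor=myForecasts;        /* forecasts and component estimates */
run;
```

If only the parameter estimates are needed, the FORECAST statement can be dropped altogether.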


20 REPLIES
rselukar
SAS Employee

The only thing I can think of is to distribute the problem on several machines (each machine gets a different set of by-groups). 
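
One possible way to prepare such a split is sketched below, assuming the input data set is already sorted by group. The chunk count of 4 and the data set names chunk1 through chunk4 are hypothetical; in practice each chunk would probably be written to a library that the corresponding machine can read.

```sas
/* Sketch only: assign whole BY groups to 4 chunks in round-robin order */
%let nchunks = 4;

data chunk1 chunk2 chunk3 chunk4;
   set work;
   by group;
   if first.group then groupNum + 1;   /* running count of BY groups */
   select (mod(groupNum, &nchunks));   /* every row of a group goes to one chunk */
      when (1) output chunk1;
      when (2) output chunk2;
      when (3) output chunk3;
      otherwise output chunk4;         /* remainder 0 */
   end;
   drop groupNum;
run;
```

Each machine (or each SAS session on a multi-core machine) would then run the same PROC UCM step against its own chunk, and the output data sets can be concatenated afterwards.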

Ksharp
Super User
This is a time series analysis question.
Please post it in the Forecast forum.


Khaladdin
Quartz | Level 8
Thanks
Khaladdin
Quartz | Level 8

Hi Rick,

 

Thanks for your suggestions. Actually, I do not need the plots. I did not write my full code when I asked the question. My full code is:

 

ods trace on;

ods select ParameterEstimates;

ods output ParameterEstimates=myEstimates;

proc ucm data=work;
	model price;
	by group;
	irregular plot=smooth;
	level checkbreak plot=smooth;
	estimate plot=residual;
	forecast plot=forecasts lead=10 alpha=0.5;
run;

proc print data=myEstimates;
run;

proc transpose data=myEstimates(keep=group component estimate)
               out=transposedEstimates;
  by group;
  id component;
run;

So, I have already transferred my results to data sets, but nothing changes. It will still take a long time.

Rick_SAS
SAS Super FREQ

You might not realize that the procedure creates those millions of graphs but that ODS does not show them. Get rid of the graph requests. Also, it is much more efficient to use NOPRINT and OUTEST=myEstimates than to use the code you show.

 

Also, you don't need the PROC PRINT, which is probably trying to print a data set that has 10-15 million observations in it.

Khaladdin
Quartz | Level 8

Thanks again. So, the following code might be more efficient, right?

 

proc ucm data=work 
         outest=myEstimates
         noprint
         ;
    by group;
    model price;
    irregular;
    level checkbreak;
    estimate;
    forecast lead=10 alpha=0.5;
run;

 

Rick_SAS
SAS Super FREQ

You've got the right idea, but 

1) The syntax is wrong, so check the doc. The OUTEST= option goes on the ESTIMATE statement, not on the PROC UCM statement.

2) If all you want are the parameter estimates, why are you doing all the other computations? For example, the forecast and confidence limits are expensive, so get rid of the FORECAST statement if you aren't saving the results. Only keep the statements that are relevant to the results that you intend to use.

 

As mentioned in the "8 Tips" article, run and debug your new code on a small subset of the data (maybe 5-10 BY groups) before you run it against the full data.

Khaladdin
Quartz | Level 8
proc ucm data=work 
         noprint
         ;
    by group;
    model price;
    irregular;
    level checkbreak;
    estimate  outest=myEstimates;
run; 

What about this one?

 

rselukar
SAS Employee

Rick is correct. One more thing: since you are using the CHECKBREAK option in the LEVEL statement, I assume that you want to save the detected break points. Because the break points are produced only in an ODS table, NOPRINT may not be the way to go. Check all the tables produced by your UCM call, use ODS EXCLUDE to suppress them, and use ODS OUTPUT to save the outlier summary table. Something like this will work:

 

proc ucm data=work  plots=none;

    ods exclude DataSet EstimationSpan ForecastSpan
      InitialParameters FitSummary ConvergenceStatus
      ParameterEstimates FitStatistics ComponentSignificance
      TrendInformation OutlierSummary;

    ods output OutlierSummary = osummary;
    by group;
    model price;
    irregular;
    level checkbreak;
    estimate  outest=myEstimates;
run; 
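
A shorter variant of the same idea is sketched here, under the assumption that OutlierSummary is the only table worth keeping: ODS EXCLUDE ALL suppresses every displayed table without listing them by name, while ODS OUTPUT still captures the requested table as a data set. Test it on a few BY groups first; BY groups with no detected breaks may not produce the table at all, in which case the log may show a warning.

```sas
/* Sketch only: suppress all displayed tables, keep one as a data set */
ods exclude all;                          /* nothing is sent to the listing/HTML */
ods output OutlierSummary=osummary;       /* break points saved to a data set */

proc ucm data=work plots=none;
   by group;
   model price;
   irregular;
   level checkbreak;
   estimate outest=myEstimates;
run;

ods exclude none;                         /* restore normal output afterwards */
```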

Rick_SAS
SAS Super FREQ

Or delete that statement if you do not need it. NOPRINT is faster than ODS EXCLUDE.

 

This new code should go faster. How much faster depends on your data. As I said, try it for 10 BY groups to make sure it works as you expect. Then time how long it takes to compute for 100 or 1000 BY groups. From that you can estimate how long it will take for a million BY groups. 
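
The timing experiment described above could be sketched as follows, assuming the data are sorted by group. The names subset, testEstimates, nGroups, t0, and elapsed are hypothetical, and the extrapolation is a simple linear scaling that ignores differences in series length between groups.

```sas
/* Sketch only: time PROC UCM on the first &nGroups BY groups */
%let nGroups = 100;

data subset;
   set work;
   by group;
   if first.group then groupNum + 1;      /* running count of BY groups */
   if groupNum > &nGroups then stop;      /* stop reading after the first &nGroups groups */
   drop groupNum;
run;

%let t0 = %sysfunc(datetime());           /* start time in seconds */

proc ucm data=subset noprint;
   by group;
   model price;
   irregular;
   level checkbreak;
   estimate outest=testEstimates;
run;

%let elapsed = %sysevalf(%sysfunc(datetime()) - &t0);
%put NOTE: &nGroups BY groups took &elapsed seconds.;
%put NOTE: Linear estimate for 1,000,000 groups: %sysevalf(&elapsed * 1000000 / &nGroups / 86400) days.;
```

Repeating the run with nGroups set to 1000 gives a second data point and shows whether the cost really scales linearly.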

 

You never answered my question about whether this is a simulation. If it is, you almost surely can get by with fewer than 1 million simulated samples. I'd try 10,000 and see how large the Monte Carlo standard errors are.

Khaladdin
Quartz | Level 8
Sorry for not answering your question about the simulation. I missed it. This is real data, not a simulation.
Rick_SAS
SAS Super FREQ

Interesting. May I ask what the 1 million BY groups represent? 

