DATA Step, Macro, Functions and more

Fastest way to identify the next ID number in a dataset

Accepted Solution Solved
Reply
Contributor
Posts: 45
Accepted Solution

Fastest way to identify the next ID number in a dataset

Hello,

 

I need some help to find the fastest way to get the next ID number from a large dataset.

 

I am building a process to run several hundred forecast scenarios at a time in SAS and have a dataset acting as a dimension table which contains an ID number which is used to join the Dimension and Fact tables together in later processes.

When each individual process is run to create a new scenario I need to look up the next ID number from my DIM table to use when creating the new scenario then I use proc append to add the new DIM and FACT records to the appropriate tables.

 

To look up the next ID number I was using the following process.

proc sql noprint;
    select sum(max(SCENARIO_ID),1) 
    into :l_next_forecast_id 
    from LIB.DIM_TABLE;
quit;

 

When stress testing with more data(DIM table has over 5 million records) this query slowed down to over 2 seconds per scenario as it is having to read the entire table each time, so I started using the following process instead.

data _null_;
    set LIB.DIM_TABLE nobs=NOBS point=NOBS;
    call symputx('l_next_forecast_id',sum(SCENARIO_ID,1));
    stop;
run;

This is fine if I can assume that nothing will ever go wrong to make the last record not contain the max value or no data is ever deleted however paranoia tells me otherwise, so I would need to sort the data beforehand which again added too much time to the process.

This was surprising because, given that the data in the DIM table was already sorted and I am appending a new record(also sorted with proc sort) with a higher ID number I would expect from the documenation below that the sort indicator would be retained by the proc append, meaning that the sort step should only fire if something had gone wrong and flag that no sort required if everything was fine.

Proc Append Docs

The documented behaviour works as expected when dealing with work datasets but doesn't work when I am appending to a SAS/SHARE library so without the sort indicator retained the proc sort always fires and slows down the process.

I could use proc datasets to add the sort indicator back onto the table after the proc append but this really defies the point since if something has gone wrong and the append had happended incorrectly I don't want to override it so it stays broken for next time.

 

The last thing I tried was adding an index to the table and using proc contents with the centiles option to read the max key value however keeping the index on the dataset made the append steps too slow so i'm back to square 1.

 

For reference, an example of the process I am currently using where the sort indicator is lost is as follows. As i said before, this process works when the PLAYPEN library is assigned normally rather than using SAS/SHARE but the final library will need to use SAS/SHARE so changing that isn't an option.

Libname PLAYPEN '/data/dtf/playpen' server=sshare3 sapw=XXXXXXXX;

data PLAYPEN.SET1;
    set sashelp.class;
    ORDER_VARIABLE=1;
run;

proc sort data=PLAYPEN.SET1;
    by ORDER_VARIABLE;
run;

data WORK.SET2;
    set sashelp.class;
    ORDER_VARIABLE=2;
run;

proc sort data=WORK.SET2;
    by ORDER_VARIABLE;
run;

/* Before append Sort indicator present */
proc contents data=PLAYPEN.SET1;
run;

proc append base=PLAYPEN.SET1 data=WORK.SET2;
run;

/* After append Sort indicator lost */
proc contents data=PLAYPEN.SET1;
run;

 

Does anyone know of a more efficient way to get the next ID number or why my proc append steps would be dropping the sort indicator when using SAS/SHARE libraries?

 

Tim


Accepted Solutions
Solution
‎09-28-2015 04:14 AM
PROC Star
Posts: 1,570

Re: Fastest way to identify the next ID number in a dataset

1-Very odd and interesting that sasfile slows things down in your case. I have to look at this.

 

2-Re. step 2: you may want to use proc sort with the presorted flag, this should make the sort faster in your case. 

 

3-Why does step 6 worry you? By the way, step 6 does not set the validated sort flag. This probably does not matter in your case, but just so you know. From the documentation:

 SORTEDBY= Data Set Option 

 Details
 SAS determines whether a data set is already sorted by the key variable or variables in ascending order by checking the sort indicator. The sort indicator is stored in the data set descriptor information and is set from a previous sort. For more information about how the sort indicator is used and how it improves performance, see The Sort Indicator in SAS Language Reference:  Concepts and SORTVALIDATE System Option in SAS System Options: Reference.
 This example of the CONTENTS procedure indicates that the data set was sorted using the SORTEDBY= data set option.
  Sort Information
   Sortedby var1
   Validated NO
   Character Set ANSI

 

4-SAS/SHARE probably executes appends differently and does not implement some features (though if the whole table is locked there is less justification for this).  Ask tech support if no one here knows.

 

5-Since your process keeps the data in its place, you could have an alter password so nobody modifies it when your process isn't running.

 

6-On the whole issue you raise: At the end of day, either you know where/what your max value is or not. If you trust that you do, point to it, if not then you have no option but to scan the full table (it'd be nice if SAS indexes could find maxima).

View solution in original post


All Replies
Super User
Super User
Posts: 7,432

Re: Fastest way to identify the next ID number in a dataset

Hi,

 

Your post is a bit long and complicated, so I may have missed something here.  But the basic premise is you need to do XYZ on a repeat for each unique ID yes?  If so why not:

- Step 1: Create a small dataset with unique ID values

- Step 2: Use that small dataset to call out the code.

E.g.

proc sql;

  create table LOOP as 

  select distinct ID from HAVE;

quit;

data _null_;

  set loop;

  call execute(cats('....',id));  /* Replace .... with your code, or macro call */

run;

Contributor
Posts: 45

Re: Fastest way to identify the next ID number in a dataset

Hi RW9,

 

Sorry if the post was over complicated,

The basic process is that i need to run multiple scenarios and then save them with a unique ID each time.

Not do the same task for different IDs as you suggested.

 

Tim

Super User
Posts: 10,552

Re: Fastest way to identify the next ID number in a dataset

I suspect this looks like Call Execute may be an option. A data set with the ids AND parameters that describe the scenarios. If you have ALL of the scenarios as parameters then you can interate in a data step that uses Call Execute to concatenated the ID variable and either a counter or something involving the actual scenario parameters.

 

A  control data set might be a good idea to make sure you know what scenarios you have planned and that all of them get executed. You could also pretest the part for generating the unique identifiers without actually running models or whatever to identify possible issues.

PROC Star
Posts: 1,570

Re: Fastest way to identify the next ID number in a dataset

The getsort option is hardly useful and one wonders why SAS didn't bother to make it more powerful.

proc append would just have to check the sorted flags and compare the last record of the base and the first record of the data to make sorted appends a reality.

So easy. A missed opportunity really...

 

1- Have you tried making full scans fast loading and keeping the DIM table in memory using for example SASFILE (or MEMLIB or MEMCACHE*) ?

2- While I am as paranoiac as the next guy, for the point= method to fail, one would have to replace the last record's value or append a bad record. Could this happen in your scenario? If so can you lock the full table for your own use?

 

* MEMCACHE has to be used carefully. These options are examined in this book:

http://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490

 

Contributor
Posts: 45

Re: Fastest way to identify the next ID number in a dataset

Thanks Chris,

 

I've tried using sasfile to open the DIM table in memory and oddly enough the 'sql select max' query ran slower then when it wasn't in memory!

 

I do have table locking processes in place so it shouldn't be possible for a person or process to edit the table whilst my process is running and the process is robust enough to trap errors and escape before the append if something was to go wrong.

 

The process outline I have in place at the moment is this

 

1 - lock Dim and fact tables.

2 - Sort Dim table.

3 - use point= to get last ID number + 1 for new tables.

4 - create new dim and fact records.

5 - append new dim and fact records to full tables.

6 - add sort indicator to dim table using proc datasets modify.

7 - unlock tables.

 

If the table is only ever edited by this process then the records should always be in the correct order so step 6 is safe to add and means step 2 is instantaneous.

if anyone has affected the table in another way the sort indicator will be lost meaning step 2 should ensure the numbering stays correct.

 

Looks like this is the best I am going to get unless anyone can think of a better way to get the max ID out?

 

Also, still confusing as to why the Sort indicator is lost when appending to a SAS/SHARE library but not to a BASE library?

Because if it wasn't then I could take step 6 out, which is the only worry I have with the process.

 

Tim

Solution
‎09-28-2015 04:14 AM
PROC Star
Posts: 1,570

Re: Fastest way to identify the next ID number in a dataset

1-Very odd and interesting that sasfile slows things down in your case. I have to look at this.

 

2-Re. step 2: you may want to use proc sort with the presorted flag, this should make the sort faster in your case. 

 

3-Why does step 6 worry you? By the way, step 6 does not set the validated sort flag. This probably does not matter in your case, but just so you know. From the documentation:

 SORTEDBY= Data Set Option 

 Details
 SAS determines whether a data set is already sorted by the key variable or variables in ascending order by checking the sort indicator. The sort indicator is stored in the data set descriptor information and is set from a previous sort. For more information about how the sort indicator is used and how it improves performance, see The Sort Indicator in SAS Language Reference:  Concepts and SORTVALIDATE System Option in SAS System Options: Reference.
 This example of the CONTENTS procedure indicates that the data set was sorted using the SORTEDBY= data set option.
  Sort Information
   Sortedby var1
   Validated NO
   Character Set ANSI

 

4-SAS/SHARE probably executes appends differently and does not implement some features (though if the whole table is locked there is less justification for this).  Ask tech support if no one here knows.

 

5-Since your process keeps the data in its place, you could have an alter password so nobody modifies it when your process isn't running.

 

6-On the whole issue you raise: At the end of day, either you know where/what your max value is or not. If you trust that you do, point to it, if not then you have no option but to scan the full table (it'd be nice if SAS indexes could find maxima).

PROC Star
Posts: 1,570

Re: Fastest way to identify the next ID number in a dataset

About SASFILE: I ran a quick test. Surprisingly in-memory data wasn't faster for the select max(). At least on the first run.

 

Only SPDE divided the run time by 3.

 

data T; do I=1 to 3e7; H=ranuni(0); output; end; run;                    

proc sql; select max(H) from T; quit;

sasfile T load;
proc sql; select max(H) from T; quit;
sasfile T close;

libname MEMLIB  "%sysfunc(pathname(work))" memlib;
data MEMLIB.T; do I=1 to 3e7; H=ranuni(0); output; end; run;                    
proc sql; select max(H) from MEMLIB.T; quit;

libname SPEEDY spde "%sysfunc(pathname(work))";
data SPEEDY.T; do I=1 to 3e7; H=ranuni(0); output; end; run;   
proc sql; select max(H) from SPEEDY.T; quit;
SAS Employee
Posts: 24

Re: Fastest way to identify the next ID number in a dataset

You may want to give the hash table a try to get your maxID. You can load an entire dataset into memory with an ordered key.

The documentation:

https://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#titlepage.ht...

An example of creating an ordered hash:

http://support.sas.com/kb/24653

 

PROC Star
Posts: 1,570

Re: Fastest way to identify the next ID number in a dataset

Hash tables are not useful in this case if I understood the problem correctly as they are sadly only available within a data step. Here we need to load a table, run processes, update the table, and query it.

 

The real question is why in-memory SAS tables are as slow as disk here.

Contributor
Posts: 45

Re: Fastest way to identify the next ID number in a dataset

Thanks Chris,

 

I didn't know about the presorted flag so tried it and it did speed up the process for me so will definately be using that tactic at some point but it still leaves the sort step as the longest running process in this program so will stick to adding the sort indicator on after the append steps.

 

What has been most useful is putting alter and write passwords on the dataset as this removes my fear about users editing the table which was the only drawback I had left to deal with and gives me more trust that the last record will be the correct id I need.

 

Thanks for all your help.

 

Tim

 

 

PROC Star
Posts: 1,570

Re: Fastest way to identify the next ID number in a dataset

[ Edited ]

Yes, option presort is handy. More PROC SORT tips in my book

http://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490 Smiley Happy

 

If you lock the table and control the update process when it is unlocked, you no longer need to sort, do you? Pointing to the last obs should be enough shouldn't it?

Contributor
Posts: 45

Re: Fastest way to identify the next ID number in a dataset

Yeah,

I agree that with the dataset locks and password protection I should be in a situation where the data will always be ordered correctly so shouldn't need to add the sort indicator or use proc sort anymore.

However, given that the processing time for the proc sort is normally 0.00 with the sort indicator on it isn't impacting performance to leave it in at the moment just to be safe.

 

It's fun living in a permenant state of paranoia!

 

Tim

 

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 377 views
  • 1 like
  • 5 in conversation