Add identity column functionality to Base and SPDE datasets

ScottBass · ‎12-07-2016

Add identity column (surrogate key) functionality to Base and SPDE datasets, such as available in SQL Server.

If I have a 500M record SAS dataset, it would be great if I could efficiently retrieve the max value of the surrogate key (identity) column from the header of the table.

I can provide more details if needed.

Shmuel · ‎12-07-2016

You don't need keep a surrogate key in your data;

You always can get a specific observation using set <dataset name> point=<observation number> option.

To know how many observations are in data set you can do:

data _NULL_;

dsid = open(<dataset name>);

if dsid then nobs = attrn(dsid, 'NOBS');

dsid = close(dsid);

put NOBS=; /* number of observation printed to LOG */

run;

As for SPDE it should be tested;

ScottBass · ‎12-07-2016

How does this return the max surrogate key?

For example:

data test;
   do sk=10 to 1 by -2;
      output;
   end;
run;

Gaps can occur in sk's. The data may not be in ascending sk order.

Number of observations <> max sk!

Kurt_Bremser · ‎12-07-2016

Use the UUIDGEN() function to create a surrogate key.

In my case, I don't need to do that, as I import all the UUID's from the production database for reference and use.

Shmuel · ‎12-07-2016

Sorry I missunderstood the term "surrogate key";

If that key is a variable in the dataset, the only way I know that enables check its max value at any time

is by maintaining a variable - in the dataset or in a seperate file - programatically;

ScottBass · ‎12-07-2016

No worries Shmuel, I probably wasn't clear in my OP.

The way we've been doing it is something like:

proc sql noprint;
   select max(sk) into :maxsk separated by " "
   from some.dataset;
quit;

I've also seen PROC SUMMARY (not in my code) which gives similar results.

You then use &maxsk in the downstream code to programatically manage the incrementing sk. And perhaps an index or integrity constraint to ensure that the sk remains unique.

For a large SAS dataset, esp. if it's wide, that processing can take some time to run.

In SQL Server (and other RBDMS's???), AFAIK the IDENTITY column is maintained in the header of the table, with function(s) to retrieve the current max value. So, the performance is the same no matter the size of the table. Finally, the assignment of the IDENTITY column is managed by the database; I can simply insert new rows and the column is automatically incremented and assigned. If I truncate the table there's a function to reset the seed value for the IDENTITY column.

I get that SAS is not a RDBMS (and I think I read "somewhere" that SAS has no intentions of becoming one). But I still think this would be a useful feature (for me anyway, and I suspect for all those ETL developers managing a SAS-based data warehouse).

ChrisHemedinger · ‎12-07-2016

@ScottBass - you could use the Extended Attributes (XATTR in PROC DATASETS) to manage this and any other table metadata yourself. You would have to use PROC DATASETS each time the table changes to adjust the value.

Shmuel · ‎12-07-2016

The code:

proc sql noprint;
   select max(sk) into :maxsk separated by " "
   from some.dataset;
quit;

may be usefull to get the max (sk) value the first time, but as you said, it takes long time for bigdata;

By the way, the - seperated by " " - maybe removed as you have one value only.

Saving it first time is also a simple step and can be done in as simple as:

data any_lib.control_sk;

maxsk = &maxsk;

run;

For next run you just need:

data _NULL_;

call symput('maxsk',left(maxsk));

run;

SAS is not a complete RDBMS therefore you need to calculate the new maxsk programatically

in your update program(s) which makes it more difficult.

If you have transactions dataset for updating you can use it in order to update the maxsk in

the control_sk dataset as a second step after the bigdata have been updated.

@ChrisHemedinger mentioned the term "extended attribute";

Official documentation can be found in:

http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#p0wvxeb8936663n1f0x4...

It could be usefull if there will be a SAS function to retrieve and to update the extended attribute from the sas program itself

without the need to use a second step of proc datasets.

Even then - if last two SKs are N1 N2 - and you delete N2, you need to maintain an array of SKs in order to know the value of N1 (the previous maxsk); Such array is already maintained by SAS INDEX.

Does yout bigdata have an index on SK ?

If yes, I believe that your original code (the SQL) will retrieve the maxsk value from the index of bigdata relatively fast.

ScottBass · ‎12-07-2016

@Shmuel Re: separated by " "...

Compare:

data one;x=1;run;
proc sql noprint;
   select x into :mvar from one;
quit;
%put *&mvar*;

With:

data one;x=1;run;
proc sql noprint;
   select x into :mvar separated by " " from one;
quit;
%put *&mvar*;

It's a little "trick" I found on SAS-L a while ago. It compensates for (I think) the default best12. format when selecting into :mvar in proc sql.

Shmuel · ‎12-08-2016

When you do:

select x into : mvar

you get a list of values and you need define a seperator.

Having:

selec max(x) into mvar

you get just one value and the seperator is extra.

ScottBass · ‎12-08-2016

@Shmuel

Having:
   selec max(x) into mvar
 you get just one value and the seperator is extra

Wrong. Run the code I posted. Separated by " " trims ("strips") the value saved to the macro variable.

I think we're going around in circles here. I didn't post this as a support question. I know the approaches to coax SAS to do what I need. Those same approaches are baked in to SAS DI Studio. I already know these techniques; I'm suggesting these approaches aren't as efficient and take more coding effort than in SQL Server.

I'm making a suggestion that functionality analogous to an IDENTITY column in SQL Server would be useful in SAS.

Does yout bigdata have an index on SK ?
If yes, I believe that your original code (the SQL) will retrieve the maxsk value from the index of bigdata relatively fast.

You tell me. Did you test your assertion before making it?

I didn't notice any performance gain by indexing the sk when I tested it, only performance degradation in creating and maintaining the index that I may not want, only in the hope of better performance when deriving the max sk.

Run this code and let me know your results. Is it faster with the index?

options compress=no;

data test1;
 length line $10000;
 line=repeat("*",1000);
 do sk=1E8 to 1 by -2;
 output;
 end;
run;

data test2 (index=(sk));
 length line $10000;
 line=repeat("*",1000);
 do sk=1E8 to 1 by -2;
 output;
 end;
run;

proc sql;
 select max(sk) format=32. into :maxsk separated by " " from test1;
quit;
%put *&maxsk*;

proc sql;
 select max(sk) format=32. into :maxsk from test2;
quit;
%put *&maxsk*;