I would like to add a new field to a sas dataset that represents a key of multiple columns, so that it can be used individually as a key field.
For example, with this code I create the key field "pk" formed by the concatenation of column a and column b, but I would like a function that automatically calculates a unique key:
data t;
input a b;
pk=compress(a) || compress(b);
datalines;
1 1
1 2
2 1
2 2
;If you have a lot of columns or some very long ones that make-up your business key then creating a hash value is a common approach.
data have;
  input a b;
  datalines;
1 1
1 2
2 1
2 2
;
data want;
  set have;
  length pk_var $64;
  pk_var=hashing('sha256',catx('|',a,b));
run;
proc print data=want;
run;
Which algorithm to use - md5, sha<nnn> - depends on how many rows of source data you have (collision risk) and storage restrictions given that these hash values can take-up quite a bit of space.
@mariopellegrini wrote:
I would like to add a new field to a sas dataset that represents a key of multiple columns, so that it can be used individually as a key field.
For example, with this code I create the key field "pk" formed by the concatenation of column a and column b, but I would like a function that automatically calculates a unique key:
data t; input a b; pk=compress(a) || compress(b); datalines; 1 1 1 2 2 1 2 2 ;
What rules do you have regarding that "primary key" (not really a SAS data set concept)? How do you intend to use the resulting variable. Do you mean code that doesn't rely on knowing variables in a data set? The following code will add a sequence number to the
data t; input a b; pk=_n_; datalines; 1 1 1 2 2 1 2 2 ;
maybe I didn't explain myself well, I intend to find, if it exists, a function that creates a unique code represented by the value of other fields. in the example I simply created the union of 2 fields, but I was wondering if there is an ad hoc function
Unless your file is ginormous why not just number the records?
data want;
  set t ;
  pk + 1;
run;If you want to generate a unique integer from two existing integers then you can just use arithmetic. Remember that you can only store about 16 decimal digits precisely since SAS uses 64-bit binary floating point numbers.
So if the maximum value of B is less than 10,000 (4 digits) then you could use:
data want;
  set t ;
  pk = a*10000 + b ;
run;You are currently generating a character string. (and generating warning message about running the COMPRESS() function on numeric values). If you want to generate a unique string from two variable length strings you probably need to add something between them that cannot be part of either string. Otherwise you cannot tell the difference between 'A'||'BC' and 'AB'||'C'.
data want;
  set t ;
  length pk $17 ;
  pk = catx('|',a,b);
run;
thanks, but I would like to create a field that doesn't have the "|" character but possibly just a numeric value that can be used as a key corresponding to 2 or more set fields
How would such a function add any value? If your key is more than one variable then just use all of them. It is really not that different to write:
data want;
  merge a b ;
  by var1 var2;
run;Than to write
data want;
  merge a b ;
  by pk;
run;IF (and might be a big if) the set of keys is small enough to fit into memory you might be able to use a hash object to generate a dataset with the primary keys and assign a arbitrary number to each one.
Here is an example using hash iterator.
data class;
 set sashelp.class sashelp.class;
run;
data keys;
  length pk 8;
  if 0 then set class(keep=name age);
  declare hash h(dataset:"class(keep=name age)");
  _error_=h.definekey(all:'yes');
  _error_=h.definedata(all:'yes');
  _error_=h.definedone();
  declare hiter iter('h');
  if iter.first() then stop;
  do until(iter.next());
    pk+1;
    output;
  end;
  stop;
run;
proc print data=keys;
run;ok, thanks for the helpful tip.
more precisely, however, I was looking for a function similar to sql's "hashbytes" which creates a code of several digits giving as input the concatenation of several text variables, but I don't think it exists in sas:
https://learn.microsoft.com/it-it/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver16
Maybe the function sha256 can be used here, see docs: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p04sqiymw1a6unn1uvh943eudcvz.h....
@mariopellegrini wrote:
ok, thanks for the helpful tip.
more precisely, however, I was looking for a function similar to sql's "hashbytes" which creates a code of several digits giving as input the concatenation of several text variables, but I don't think it exists in sas:https://learn.microsoft.com/it-it/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver16
So you don't actually want a primary key. Note that a hashed value is not guaranteed to be unique. You can use the hashing functions that SAS has as @andreas_lds has shown. But you need to take care in the method you use to concatenate the values to a single string to pass to the hashing function to avoid the A,BC vs AB,C problem. You could use CATX() but then you cannot have missing values because A,B,. will result in the same string as A,.,B. So if any of the variables could be missing then perhaps it is better to use CAT() instead.
That's actually something which caused me already some "discomfort" multiple times in real life implementations. Either assess and accept the risk or then use less convenient code.
Do you think it would be worth to propose an extension the the catx() function (like some 3rd parameter) that allows to request adding a delimiter also for missing source variables?
Something like: CATX(delimiter, item-1 <, ... item-n>, modifiers )
@mariopellegrini wrote:
ok, thanks for the helpful tip.
more precisely, however, I was looking for a function similar to sql's "hashbytes" which creates a code of several digits giving as input the concatenation of several text variables, but I don't think it exists in sas:https://learn.microsoft.com/it-it/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver16
You need to call this something other than a "primary key". That term means it is unique and not repeated in a data set.
Unless it is a key in a table built outside of this data set to uniquely identify that combination of variables, in which case it would be a primary key in that table but not this one.
If you have a lot of columns or some very long ones that make-up your business key then creating a hash value is a common approach.
data have;
  input a b;
  datalines;
1 1
1 2
2 1
2 2
;
data want;
  set have;
  length pk_var $64;
  pk_var=hashing('sha256',catx('|',a,b));
run;
proc print data=want;
run;
Which algorithm to use - md5, sha<nnn> - depends on how many rows of source data you have (collision risk) and storage restrictions given that these hash values can take-up quite a bit of space.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
