BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.

I would like to add a new field to a sas dataset that represents a key of multiple columns, so that it can be used individually as a key field.
For example, with this code I create the key field "pk" formed by the concatenation of column a and column b, but I would like a function that automatically calculates a unique key:

 

data t;
input a b;
pk=compress(a) || compress(b);
datalines;
1 1
1 2
2 1
2 2
;
1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

If you have a lot of columns or some very long ones that make-up your business key then creating a hash value is a common approach.

data have;
  input a b;
  datalines;
1 1
1 2
2 1
2 2
;

data want;
  set have;
  length pk_var $64;
  pk_var=hashing('sha256',catx('|',a,b));
run;

proc print data=want;
run;

Patrick_0-1674477566026.png

 

Which algorithm to use - md5, sha<nnn> - depends on how many rows of source data you have (collision risk) and storage restrictions given that these hash values can take-up quite a bit of space.

View solution in original post

11 REPLIES 11
ballardw
Super User

@mariopellegrini wrote:

I would like to add a new field to a sas dataset that represents a key of multiple columns, so that it can be used individually as a key field.
For example, with this code I create the key field "pk" formed by the concatenation of column a and column b, but I would like a function that automatically calculates a unique key:

 

data t;
input a b;
pk=compress(a) || compress(b);
datalines;
1 1
1 2
2 1
2 2
;

What rules do you have regarding that "primary key" (not really a SAS data set concept)? How do you intend to use the resulting variable. Do you mean code that doesn't rely on knowing variables in a data set? The following code will add a sequence number to the

data t;
input a b;
pk=_n_;
datalines;
1 1
1 2
2 1
2 2
;

 

 

mariopellegrini
Pyrite | Level 9

maybe I didn't explain myself well, I intend to find, if it exists, a function that creates a unique code represented by the value of other fields. in the example I simply created the union of 2 fields, but I was wondering if there is an ad hoc function

Tom
Super User Tom
Super User

Unless your file is ginormous why not just number the records?

data want;
  set t ;
  pk + 1;
run;

If you want to generate a unique integer from two existing integers then you can just use arithmetic.  Remember that you can only store about 16 decimal digits precisely since SAS uses 64-bit binary floating point numbers.

So if the maximum value of B is less than 10,000 (4 digits) then you could use:

data want;
  set t ;
  pk = a*10000 + b ;
run;

You are currently generating a character string. (and generating warning message about running the COMPRESS() function on numeric values).  If you want to generate a unique string from two variable length strings you probably need to add something between them that cannot be part of either string.  Otherwise you cannot tell the difference between 'A'||'BC' and 'AB'||'C'.

data want;
  set t ;
  length pk $17 ;
  pk = catx('|',a,b);
run;

 

mariopellegrini
Pyrite | Level 9

thanks, but I would like to create a field that doesn't have the "|" character but possibly just a numeric value that can be used as a key corresponding to 2 or more set fields

Tom
Super User Tom
Super User

How would such a function add any value?  If your key is more than one variable then just use all of them. It is really not that different to write:

data want;
  merge a b ;
  by var1 var2;
run;

Than to write 

data want;
  merge a b ;
  by pk;
run;

IF (and might be a big if) the set of keys is small enough to fit into memory you might be able to use a hash object to generate a dataset with the primary keys and assign a arbitrary number to each one.

Here is an example using hash iterator.

data class;
 set sashelp.class sashelp.class;
run;

data keys;
  length pk 8;
  if 0 then set class(keep=name age);
  declare hash h(dataset:"class(keep=name age)");
  _error_=h.definekey(all:'yes');
  _error_=h.definedata(all:'yes');
  _error_=h.definedone();
  declare hiter iter('h');
  if iter.first() then stop;
  do until(iter.next());
    pk+1;
    output;
  end;
  stop;
run;

proc print data=keys;
run;
mariopellegrini
Pyrite | Level 9

ok, thanks for the helpful tip.
more precisely, however, I was looking for a function similar to sql's "hashbytes" which creates a code of several digits giving as input the concatenation of several text variables, but I don't think it exists in sas:

https://learn.microsoft.com/it-it/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver16

Tom
Super User Tom
Super User

@mariopellegrini wrote:

ok, thanks for the helpful tip.
more precisely, however, I was looking for a function similar to sql's "hashbytes" which creates a code of several digits giving as input the concatenation of several text variables, but I don't think it exists in sas:

https://learn.microsoft.com/it-it/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver16


So you don't actually want a primary key. Note that a hashed value is not guaranteed to be unique.  You can use the hashing functions that SAS has as @andreas_lds has shown.   But you need to take care in the method you use to concatenate the values to a single string to pass to the hashing function to avoid the A,BC vs AB,C problem.  You could use CATX() but then you cannot have missing values because A,B,. will result in the same string as A,.,B. So if any of the variables could be missing then perhaps it is better to use CAT() instead.

Patrick
Opal | Level 21

@Tom 

Patrick_0-1674634972394.png

That's actually something which caused me already some "discomfort" multiple times in real life implementations. Either assess and accept the risk or then use less convenient code. 

Do you think it would be worth to propose an extension the the catx() function (like some 3rd parameter) that allows to request adding a delimiter also for missing source variables?

Something like: CATX(delimiter, item-1 <, ... item-n>, modifiers )

ballardw
Super User

@mariopellegrini wrote:

ok, thanks for the helpful tip.
more precisely, however, I was looking for a function similar to sql's "hashbytes" which creates a code of several digits giving as input the concatenation of several text variables, but I don't think it exists in sas:

https://learn.microsoft.com/it-it/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver16


You need to call this something other than a "primary key". That term means it is unique and not repeated in a data set.

Unless it is a key in a table built outside of this data set to uniquely identify that combination of variables, in which case it would be a primary key in that table but not this one.

Patrick
Opal | Level 21

If you have a lot of columns or some very long ones that make-up your business key then creating a hash value is a common approach.

data have;
  input a b;
  datalines;
1 1
1 2
2 1
2 2
;

data want;
  set have;
  length pk_var $64;
  pk_var=hashing('sha256',catx('|',a,b));
run;

proc print data=want;
run;

Patrick_0-1674477566026.png

 

Which algorithm to use - md5, sha<nnn> - depends on how many rows of source data you have (collision risk) and storage restrictions given that these hash values can take-up quite a bit of space.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1598 views
  • 5 likes
  • 5 in conversation