<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Hash (hex64) to string: minimum string length in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720411#M223176</link>
    <description>&lt;P&gt;It is unnecessary.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;5 is sufficient for your demo data. However, as you said, &lt;SPAN&gt;records will be added in the future, and we cannot be sure that 5 will still be sufficient then.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 19 Feb 2021 09:15:34 GMT</pubDate>
    <dc:creator>whymath</dc:creator>
    <dc:date>2021-02-19T09:15:34Z</dc:date>
    <item>
      <title>Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720172#M223064</link>
      <description>&lt;P&gt;I have a set of data that need to be anonymised. I created a hash for each ID, which I then turn into a string. To be safe I set the string to length 200, but this was not very user friendly.&lt;/P&gt;&lt;P&gt;-EDIT-&lt;/P&gt;&lt;P&gt;The long string is not user-friendly because the first 5 columns containing anonymised data are either not completely visible, or are very wide when you make the content visible. It also becomes difficult to eyeball if two values are the same or not. In general, it is also not very appealing for the end-user to work with these daunting long strings. It is not so much an issue with the storage size of the dataset.&lt;/P&gt;&lt;P&gt;-EDIT-&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I then saw that even with length 10 I could get away with it and still have a unique string for each original value (around 5 million IDs). Except for one case where two different IDs resulted in the same string. Setting the string length to 15 solved this issue.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now I wonder if there is a way to know what I should set the minimum length to, to be safe, also when future records are added?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example, for the below data, a string length of 4 is too short, while 5 would be enough. Is there a way to determine this minimum of 5 based on the input?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/*Sample data*/
data have (keep=ID);
length ID $11.;
call streaminit(123);
Min = 10000000000; Max = 99999999999;
do i = 1 to 1000;
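   u = rand("Uniform");
   /* PUT writes all 11 digits; the implicit numeric-to-character conversion uses BEST12. and would be truncated in a $11 variable */
   ID = put(Min + floor((1+Max-Min)*u), 11.);
   output;
end;
run;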

/*Create hash and string*/
data want;
format ID $11. ID_hash $hex64. ID_short_string $4. ID_long_string $5. ID_not_user_friendly $200.;
	set have;
ID_hash = sha256(cat(ID));
ID_short_string = put(ID_hash,$hex64.);
ID_long_string = put(ID_hash,$hex64.);
ID_not_user_friendly = put(ID_hash,$hex64.);
run;

/*Check if equal unique values*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(ID_short_string)) as N_ID_Short
, count(distinct(ID_long_string)) as N_ID_Long
from want;
quit;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Feb 2021 08:37:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720172#M223064</guid>
      <dc:creator>SarahDew</dc:creator>
      <dc:date>2021-02-19T08:37:59Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720177#M223066</link>
      <description>&lt;P&gt;I would not do it. Set the length of the hash string to 32, format it with $hex64., and stay with it.&lt;/P&gt;
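&lt;P&gt;A minimal sketch of that setup, assuming the HAVE data set and its character variable ID from the question (the names WANT and ID_hash are only illustrative):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
  set have;
  /* SHA256 returns a 32-byte binary digest, so 32 bytes of storage are enough */
  length ID_hash $32;
  /* $HEX64. displays each stored byte as two hex characters, 64 in total */
  format ID_hash $hex64.;
  /* STRIP is an assumption here: it keeps padding blanks out of the hashed value */
  ID_hash = sha256(strip(ID));
run;&lt;/CODE&gt;&lt;/PRE&gt;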
&lt;P&gt;As soon as your method throws a duplicate, you have to search how many bytes you need, and have to recreate all your data with the longer hash. The few bytes you save are not worth the hassle.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Feb 2021 13:49:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720177#M223066</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-02-18T13:49:34Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720293#M223123</link>
      <description>&lt;P&gt;It might help to define exactly what "not user friendly" means and how it is impacting your work.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 18 Feb 2021 18:37:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720293#M223123</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2021-02-18T18:37:20Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720408#M223174</link>
      <description>Good point, see edit.</description>
      <pubDate>Fri, 19 Feb 2021 08:38:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720408#M223174</guid>
      <dc:creator>SarahDew</dc:creator>
      <dc:date>2021-02-19T08:38:38Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720410#M223175</link>
      <description>&lt;P&gt;If it's only the display width you are concerned about, use a LENGTH of 32, but a HEX format with a shorter display length.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Feb 2021 08:58:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720410#M223175</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-02-19T08:58:22Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720411#M223176</link>
      <description>&lt;P&gt;It is unnecessary.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
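&lt;P&gt;A rough way to see the risk is the birthday approximation. The sketch below assumes the hash behaves like a uniform random value and that only the first k hex characters are kept, so there are 16**k possible strings. The data set name and the values of n and k are illustrative, not taken from your real data.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data collision_estimate;
  n = 1000;                                    /* number of distinct IDs (size of the demo data) */
  do k = 4 to 16;                              /* number of hex characters kept */
    n_strings = 16**k;                         /* possible strings of k hex characters */
    expected_pairs = n*(n-1) / (2*n_strings);  /* expected number of colliding pairs */
    p_collision = 1 - exp(-expected_pairs);    /* approximate chance of at least one collision */
    output;
  end;
  format n_strings expected_pairs p_collision best12.;
run;

proc print data=collision_estimate noobs;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Under these assumptions, n = 1000 gives roughly 7.6 expected colliding pairs for k = 4 and roughly 0.48 for k = 5, which is still about a 38% chance of at least one collision. Setting n to the number of IDs you expect in the future shows how quickly a short length stops being safe.&lt;/P&gt;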
&lt;P&gt;5 is sufficient for your demo data. However, as you said, &lt;SPAN&gt;records will be added in the future, and we cannot be sure that 5 will still be sufficient then.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Feb 2021 09:15:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/720411#M223176</guid>
      <dc:creator>whymath</dc:creator>
      <dc:date>2021-02-19T09:15:34Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/721565#M223676</link>
      <description>&lt;P&gt;If I understand correctly, you propose to apply the following (given the example data above).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;That does look appealing from a user perspective, especially because I get only numbers in this string, which they are used to from the original ID. But do I not run into the same issue of how long the display length should minimally be, to make sure each unique ID is also displayed with a unique anonymised string?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/*use a LENGTH of 32, but a HEX format with a shorter display length*/
data want;
length ID_string $32;
format ID $11. ID_hash $hex64. ID_string $hex10.;
	set have;
ID_hash = sha256(cat(ID));
ID_string = put(ID_hash,$hex64.);
run;

/*Check if displayed values are unique*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(put(ID_string,$hex5.))) as N_ID_S
, count(distinct(put(ID_string,$hex10.))) as N_ID_L
from want;
quit;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Feb 2021 13:52:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/721565#M223676</guid>
      <dc:creator>SarahDew</dc:creator>
      <dc:date>2021-02-24T13:52:24Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/721585#M223683</link>
      <description>&lt;P&gt;How many distinct IDs do you need to encode? If I wanted to keep the displayed length of an encoded value to a minimum while guaranteeing a clear distinction, I would work with a simple running count, a Zx. format that is just long enough, and keep a lookup table that is updated with every new encoding:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
input ID $;
datalines;
ABC
DEF
GHI
;

data lookup;
length ID $8 enc 8;
stop;
run;

%macro encode(inds,outds);
data &amp;amp;outds.;
set &amp;amp;inds. end=done;
if 0 then set lookup nobs=maxnum;
retain maxkey;
if _N_ = 1
then do;
  maxkey = maxnum;
  declare hash l1 (dataset:"lookup");
  l1.definekey("ID");
  l1.definedata("ID","enc");
  l1.definedone();
  declare hash l2 (dataset:"lookup");
  l2.definekey("enc");
  l2.definedata("enc");
  l2.definedone();
end;
if l1.find() ne 0
then do; 
  do until (l2.check(key:maxkey) ne 0);
    maxkey + 1;
  end;
  enc = maxkey;
  rc = l1.add();
  rc = l2.add();
end;
if done then rc = l1.output(dataset:"lookup_new");
drop rc maxkey;
run;

proc sql noprint;
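select int(log10(nobs)) + 1 into :formlength trimmed /* decimal digits in nobs; LOG is the natural logarithm and would make the format wider than needed */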
from dictionary.tables
where libname = "WORK" and memname = upcase("&amp;amp;outds.");
quit;

proc datasets lib=work nolist;
delete lookup;
change lookup_new=lookup;
modify &amp;amp;outds.;
format enc z&amp;amp;formlength..;
quit;

%mend;

%encode(have,encoded1);

data have2;
set have end=done;
output;
if _n_ = 1 /* add a random observation */
then do;
  ID = "JKL";
  output;
end;
run;

%encode(have2,encoded2);&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 24 Feb 2021 14:41:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/721585#M223683</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-02-24T14:41:05Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/723413#M224480</link>
      <description>Thanks, that looks like a really nice approach. I see it also gives the same encoded value when the same ID is added again, which can happen in my data. I have around 5 million unique IDs for one of the ID columns, so there I will use Z7.&lt;BR /&gt;&lt;BR /&gt;I am just wondering whether it is as safe as a hash value when the counts are assigned to IDs in sorted order (alphabetically for strings, ascending for numbers). If you figure out who the first count belongs to, it might be easier to guess who the next count belongs to. Would it be best to make sure the original IDs are ordered randomly?</description>
      <pubDate>Thu, 04 Mar 2021 11:01:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/723413#M224480</guid>
      <dc:creator>SarahDew</dc:creator>
      <dc:date>2021-03-04T11:01:24Z</dc:date>
    </item>
    <item>
      <title>Re: Hash (hex64) to string: minimum string length</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/723421#M224485</link>
      <description>&lt;P&gt;You could modify my code to use a random integer value in the range, and derive the range from nobs.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Mar 2021 11:30:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-hex64-to-string-minimum-string-length/m-p/723421#M224485</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-03-04T11:30:38Z</dc:date>
    </item>
  </channel>
</rss>

