Hello,
I have a data set that looks something like this:
ID1
1
1
2
3
3
3
4
5
6
I want to create two new IDs, one would be a randomly generated ID that is one to one with ID1, and another one that is sequential, as follows:
ID1 ID2 ID3
1 456234 1
1 456234 2
2 128504 1
3 546745 1
3 546745 2
3 546745 3
4 387446 1
5 897655 1
6 342269 1
I am having problems with grouping these and the general coding as well as generating random numbers.
Thanks in advance! 🙂
How about this?
data have;
infile cards;
attrib id1 length=5 label='Initial ID';
input id1;
cards;
1
1
2
3
3
3
4
5
6
;
run;
data want;
set have;
attrib id2 length=8 label='Random ID';
attrib id3 length=5 label='Sequential ID';
by id1;
retain id2 id3;
if first.id1 then do;
id2 = int(ranuni(0) * 1e6) + 1;
id3 = 0;
end;
id3 + 1;
run;
Thanks LaurieF.
It somewhat works but not quite.
I get the ID2 only for the first observation in each group and the other ones are blank.
ID2 is also sometimes 5 and sometimes 6 digits, I want it to be 6 digits all the time.
Any help would be greatly appreciated!
EDIT: I also have units where ID1 is blank, I want those blanks to each have different ID2s, not the same.
And I would also like to be able to generate the same random numbers if I run this program again... Like using a seed value or something... I just don't seem to know how to do all this in SAS.
Thanks in advance
You didn't say that was what you wanted! But it's easy:
id2 = int(ranuni(0) * 9e5) + 1e5 + 1;
If you want to reproduce the results, replace 0 with any number. I often use a very old phone number.
This won't work consistently though: if you were to insert a new ID 3.5 (for example), the results would break down from that point. If you always want an ID to turn into the same random number, all the time, you need a hashing algorithm. I would start looking at the MD5 function.
Weak reversible encryption (set up for SSNs)
Using rounded random numbers
does not insure 1 to 1 or easy
reversibility and can be problematic.
If your ids are considerable larger than your
key and your ids are less than 9999999999
then this should work.
key has to be a power of 2 (shift bits)
HAVE
====
Up to 40 obs WORK.HAVE total obs=8
Obs ID1
1 104111111
2 567123456
3 311555555
4 442819299
5 224262144
6 591048576
7 661048111
8 999999999
WANT
====
Up to 40 obs WORK.WANT total obs=8
Obs ID1 ENCRYPT DECRYPT
1 104111111 5717080721 104111111
2 567123456 867082296 56713984
3 311555555 3065013083 311555555
4 442819299 9370055045 442819299
5 224262144 4416057372 224262144
6 591048576 8693094127 591048576
7 661048111 3682049608 661048111
8 999999999 95520070221 999999999
SOLUTION
========
* create some data;
data have;
infile cards;
input id1;
cards;
104111111
567123456
311555555
442819299
224262144
591048576
661048111
999999999
;
run;quit;
* encrypt and decrypt;
data want;
set have;
encrypt=encrypt(id1);
decrypt=decrypt(encrypt);
put id1 encrypt decrypt;
run;quit;
options cmplib = (work.functions /* add other libs here */);
proc fcmp outlib=work.functions.hashssn;
function encrypt(ssn);
length rev $16;
key=8192;
ssn_remainder = mod(ssn, key);
ssn_int = round((ssn - ssn_remainder)/key,1);
* 8192 ssn_remainder = 1 and ssn_int = 1 ie 1*key + 1 = original value;
ssn_big = ssn_int*100000 + ssn_remainder;
rev=reverse(putn(ssn_big,16.));
ssn_encrypt=input(rev,16.);
return(ssn_encrypt);
endsub;
run;quit;
proc fcmp outlib=work.functions.hashssn;
function decrypt(ssn);
length rev $16;
key=8192;
rev=reverse(putn(ssn,16.));
ssn_big=input(rev,16.);
ssn_decrypt = round(ssn_big/100000,1)*key + mod(ssn_big,10000);
return(ssn_decrypt);
endsub;
;run;quit;
@rogerjdeangelis I think it depends on what the specification is, and what the results are being used for, and whether it has to be reversible If it's just for a few hundred or thousand, it should be fine. But if there are more than that, that's why I suggested md5. I know it is possible to create collisions, but it's not easy (and it's good enough for SAS/DI's SCD-2 generator…)
But a proper spec would be nice.
This seems to work
* create some data;
data have;
infile cards;
input id1;
cards;
567123456
104111111
311555555
442819299
224262144
591048576
661048111
999999999
008111111
008110000
010101010
;
run;quit;
* encrypt and decrypt;
data want;
set have;
encrypt=encrypt(id1);
decrypt=decrypt(encrypt);
put id1 encrypt decrypt;
run;quit;
options cmplib = (work.functions /* add other libs here */);
proc fcmp outlib=work.functions.hashssn;
function encrypt(ssn);
length rev $17;
key=8192;
ssn_remainder = mod(ssn, key);
ssn_int = round((ssn - ssn_remainder)/key,1);
* 8192 ssn_remainder = 1 and ssn_int = 1 ie 1*key + 1 = original value;
ssn_big = ssn_int*100000 + ssn_remainder;
rev=reverse(put(ssn_big,17.));
if substr(rev,1,1)=0 then rev=cats('-1',substr(rev,2));
ssn_encrypt=input(rev,17.);
return(ssn_encrypt);
endsub;
run;quit;
proc fcmp outlib=work.functions.hashssn;
function decrypt(ssn);
length rev $17;
key=8192;
rev=strip(reverse(put(ssn,17.)));
if index(rev,'-')>0 then substr(rev,index(rev,'-')-1)='0 ';
ssn_big=input(rev,17.);
ssn_decrypt = round(ssn_big/100000,1)*key + mod(ssn_big,10000);
return(ssn_decrypt);
endsub;
;run;quit;
Up to 40 obs WORK.WANT total obs=11
Obs ID1 ENCRYPT DECRYPT
1 567123456 -1867082296 567123456
2 104111111 5717080721 104111111
3 311555555 3065013083 311555555
4 442819299 9370055045 442819299
5 224262144 4416057372 224262144
6 591048576 8693094127 591048576
7 661048111 3682049608 661048111
8 999999999 95520070221 999999999
9 8111111 13010099 8111111
10 8110000 21180989 8110000
11 10101010 472003321 10101010
Strong Encryption
sha256 is probably enough but if you are paranoid then do sha512
/* T1002220 SAS/WPS/R: De-identify PII(name) using irrevesible strong hashing
SAS/WPS/R: De-identify PII(name) using irrevesible strong hashing
Joining tables with de-identified names
sha256 and sha512 are 'theoretically' secure
You need SAS Secure for the SAS solution.
For WPS/R you don't;
HAVE Two sas datasets sashelp.class and sashelp.classfit
need to de-identify name
=========================================================
Up to 40 obs SD1.WPSCLASS total obs=4
Obs NAME SEX AGE HEIGHT WEIGHT
1 Roger F 14 62.8 102.5
2 Janet F 15 62.5 112.5
3 Judy F 14 64.3 90
4 Robert M 12 64.8 128
And
Up to 40 obs SD1.WPSCLASSFIT total obs=9
Obs NAME SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
1 Louise F 12 56.3 77 76.48 67.96
2 James M 12 57.3 83 80.38 72.66
3 John M 12 59 99.5 87.01 80.47
4 Janet F 15 62.5 112.5 100.66 95.22
5 Carol F 14 62.8 102.5 101.83 96.37
6 Judy F 14 64.3 90 107.68 101.84
7 Barbara F 13 65.3 98 111.57 105.26
8 William M 15 66.5 112 116.25 109.18
9 Alfred M 14 69 112.5 126.00 116.94
WANT ( Join the two datasets but with de-identified name )
==========================================================
Up to 40 obs SD1.WPSCLASSFIT total obs=9
Obs NAME SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
1 Janet F 15 62.5 112.5 100.66 95.22
2 Judy F 14 64.3 90 107.68 101.84
WITH DE-IDENTIFIED 'NAME' using 256bit hash
===========================================
Up to 40 obs from classwps total obs=4
Obs NAME SEX AGE HEIGHT WEIGHT
1 689ffebb9149498fb8189d6e3ad1625c8bc1e1c0f0849ca40f2df83eb2930c7b F 14 62.8 102.5
2 909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc F 15 62.5 112.5
3 f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45 F 14 64.3 90
4 7d359106f754613d9354353f2a403b88b76fb1569ea81524f54d479f4c0d610f M 12 64.8 128
Obs NAME SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
1 b68d4a840d7e218045d98247e00d79ed7bb1eff3b1fff117dc93db32f4699bfe F 12 56.3 77 76.48 67.96
2 7007b18f0d9cb87601ba50c83e761006dc185270a4cf2d55812e0863be8ff566 M 12 57.3 83 80.38 72.66
3 632dc5c5235dc0b5b411f8f5e4a6f619ce4155ef99d3d8adfcd026288c795187 M 12 59 99.5 87.01 80.47
*4 909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc F 15 62.5 112.5 100.66 95.22
5 f90982571d7ace45426530062c7008fc370be6a7faeb9ecb05733758960b0c39 F 14 62.8 102.5 101.83 96.37
*6 f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45 F 14 64.3 90 107.68 101.84
7 6615925a48704e803ee5e32ad0209bb348698ce575effe73da268d9e606c3208 F 13 65.3 98 111.57 105.26
8 2b008dfe3f79c53c59ab5392929c06c970d71b62c58527ef51d07a50616dbbd0 M 15 66.5 112 116.25 109.18
9 177d26a3ce597f633b7194aee7fa886111ab1f2edfd37854e0a179ea533154c6 M 14 69 112.5 126.00 116.94
OR
Up to 40 obs WORK.NTRSEC total obs=2
Obs NAME SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
1 909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc F 15 62.5 112.5 100.66 95.22
2 f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45 F 14 64.3 90 107.68 101.84
MAPPING THE HASH TO INTEGERS
============================
Up to 40 obs from CLASSWPSFIX total obs=4
NAME
CRYPTO SEX AGE HEIGHT WEIGHT
1 F 14 62.8 102.5
4 M 12 64.8 128
*2 F 15 62.5 112.5 ** Match 1
*3 F 14 64.3 90 ** Match 2
Up to 40 obs from CLASSWPS total obs=9
NAME
CRYPTO SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
5 F 12 56.3 77 76.48 67.96
6 M 12 57.3 83 80.38 72.66
7 M 12 59 99.5 87.01 80.47
*2 F 15 62.5 112.5 100.6 95.22 ** Match 1
9 F 14 62.8 102.5 101.8 96.37
*3 F 14 64.3 90 107.6 101.8 ** Match 2
11 F 13 65.3 98 111.5 105.2
12 M 15 66.5 112 116.2 109.1
13 M 14 69 112.5 126.0 116.9
AFTER MATCH ON NUM_CRYPTO
NAME
Obs CRYPTO SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
1 2 F 15 62.5 112.5 100.66 95.225
2 3 F 14 64.3 90 107.68 101.84
*____ _ ____
/ ___| / \ / ___|
\___ \ / _ \ \___ \
___) / ___ \ ___) |
|____/_/ \_\____/
;
libname sd1 "d:/sd1";
data sd1.sasclass;
length name $32;
format name $hex64.;
set sashelp..class;
name=sha256(name);
if mod(_n_,4) = 0 then do;
if _n_=4 then name=sha256('Roger');
output;
end;
run;quit;
data sd1.sasclassfit;
length name $32;
format name $hex64.;
set sashelp.classfit;
name=sha256(name);
if mod(_n_,2) = 0;
run;quit;
proc sql;
create
table ntrsec as
select
l.*
,r.predict
,r.lowermean
from
sd1.sasclass as l, sd1.sasclassfit as r
where
l.name = r.name
;quit;
Predicted Bound of
Value of 95% C.I.
NAME SEX AGE HEIGHT WEIGHT Weight for Mean
--------------------------------------------------------------------------------------------------------------------
D4B7A1817B10284BBC7135509F57659210CAEB70C06F91644CDB0D6B19054C47 F 15 62.5 112.5 100.6625 95.22579
E8EE6BCC3464E592F1DD07E0EE6D3FB93EC132AA4D867AA52818A641304475F5 F 14 64.3 90 107.6807 101.8416
#! WPS-R ;
*_ ______ ____ ____
\ \ / / _ \/ ___| | _ \
\ \ /\ / /| |_) \___ \ _____| |_) |
\ V V / | __/ ___) |_____| _ <
\_/\_/ |_| |____/ |_| \_\
;
proc datasets lib=work kill;
run;quit;
* need at aleast one dataset in work for WPS to switch to sas7bdats
instead of wpd datasets;
data makesas;
set sashelp.class;
run;quit;
options validvarname=upcase;
libname sd1 "d:/sd1";
data sd1.wpsclass;
set sashelp.class;
if mod(_n_,4) = 0 then do;
if _n_=4 then name='Roger';
output;
end;
run;quit;
data sd1.wpsclassfit;
set sashelp.classfit;
if mod(_n_,2) = 0;
run;quit;
* I think there is an issue with binary data and WPS import;
* so I switched to hex;
%utl_submit_wps64('
libname sd1 "d:/sd1";
options set=R_HOME "C:/Program Files/R/R-3.3.2";
libname wrk "%sysfunc(pathname(work))";
libname hlp "C:\Program Files\SASHome\SASFoundation\9.4\core\sashelp";
proc r;
submit;
source("C:/Program Files/R/R-3.3.2/etc/Rprofile.site", echo=T);
library(digest);
library(haven);
class <- read_sas("d:/sd1/wpsclass.sas7bdat");
shaclass <- as.character(lapply(class$NAME,function(x) paste0(digest(x, algo=c("sha256"),raw=TRUE),collapse="")));
classwps <- cbind(name_crypto=c(1:nrow(class)),SHACLASS=shaclass,class[,-1]);
classfit <- read_sas("d:/sd1/wpsclassfit.sas7bdat");
shaclassfit <- as.character(lapply(classfit$NAME,function(x) paste0(digest(x, algo=c("sha256"),raw=TRUE),collapse="")));
classfitwps <- cbind(name_crypto=c((nrow(class)+1):(nrow(class)+nrow(classfit))),SHACLASS=shaclassfit,classfit[,-1]);
endsubmit;
import r=classwps data=wrk.classwps;
import r=classfitwps data=wrk.classfitwps;
run;quit;
');
/*
The WPS System
name_crypto SHACLASS SEX AGE HEIGHT WEIGHT
1 1 689ffebb9149498fb8189d6e3ad1625c8bc1e1c0f0849ca40f2df83eb2930c7b F 14 62.8 102.5
2 2 909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc F 15 62.5 112.5
3 3 f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45 F 14 64.3 90.0
4 4 7d359106f754613d9354353f2a403b88b76fb1569ea81524f54d479f4c0d610f M 12 64.8 128.0
[1] 5 6 7 8 9 10 11 12 13
name_crypto SHACLASS SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
1 5 b68d4a840d7e218045d98247e00d79ed7bb1eff3b1fff117dc93db32f4699bfe F 12 56.3 77.0 76.48849 67.96005
2 6 7007b18f0d9cb87601ba50c83e761006dc185270a4cf2d55812e0863be8ff566 M 12 57.3 83.0 80.38752 72.66709
3 7 632dc5c5235dc0b5b411f8f5e4a6f619ce4155ef99d3d8adfcd026288c795187 M 12 59.0 99.5 87.01587 80.47925
4 8 909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc F 15 62.5 112.5 100.66247 95.22579
5 9 f90982571d7ace45426530062c7008fc370be6a7faeb9ecb05733758960b0c39 F 14 62.8 102.5 101.83218 96.37505
6 10 f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45 F 14 64.3 90.0 107.68073 101.84160
7 11 6615925a48704e803ee5e32ad0209bb348698ce575effe73da268d9e606c3208 F 13 65.3 98.0 111.57976 105.26025
8 12 2b008dfe3f79c53c59ab5392929c06c970d71b62c58527ef51d07a50616dbbd0 M 15 66.5 112.0 116.25859 109.18222
9 13 177d26a3ce597f633b7194aee7fa886111ab1f2edfd37854e0a179ea533154c6 M 14 69.0 112.5 126.00617 116.94168
*/
proc sql;
select
l.*
from
classfitwps as l, classwps as r
where
l.shaclass = r.shaclass
;quit;
SHACLASS SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
---------------------------------------------------------------------------------------------------------------------
909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc F 15 62.5 112.5 100.6625 95.22579
f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45 F 14 64.3 90 107.6807 101.8416
*__ __ _ ____ ___ _ _ _____
| \/ | / \ | _ \ |_ _| \ | |_ _|
| |\/| | / _ \ | |_) |____| || \| | | |
| | | |/ ___ \| __/_____| || |\ | | |
|_| |_/_/ \_\_| |___|_| \_| |_|
;
data class_classfit;
set classwps classfitwps;
run;quit;
/*
Up to 40 obs WORK.CLASS_CLASSFIT total obs=13
Obs NAME_CRYPTO SHACLASS
1 1 689ffebb9149498fb8189d6e3ad1625c8bc1e1c0f0849ca40f2df83eb2930c7b
2 2 909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc ***** DUP
3 3 f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45 ** DUP
4 4 7d359106f754613d9354353f2a403b88b76fb1569ea81524f54d479f4c0d610f
5 5 b68d4a840d7e218045d98247e00d79ed7bb1eff3b1fff117dc93db32f4699bfe
6 6 7007b18f0d9cb87601ba50c83e761006dc185270a4cf2d55812e0863be8ff566
7 7 632dc5c5235dc0b5b411f8f5e4a6f619ce4155ef99d3d8adfcd026288c795187
8 8 909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc ***** DUP
9 9 f90982571d7ace45426530062c7008fc370be6a7faeb9ecb05733758960b0c39
10 10 f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45 ** DUP
11 11 6615925a48704e803ee5e32ad0209bb348698ce575effe73da268d9e606c3208
12 12 2b008dfe3f79c53c59ab5392929c06c970d71b62c58527ef51d07a50616dbbd0
13 13 177d26a3ce597f633b7194aee7fa886111ab1f2edfd37854e0a179ea533154c6
*/
* remove dups we need 1 to 1
proc sort data=class_classfit
out=lookup(keep=name_crypto shaclass) nodupkey;
by shaclass;
run;quit;
/*
Up to 40 obs WORK.LOOKUP total obs=11
NAME
Obs CRYPTO SHACLASS
1 13 177d26a3ce597f633b7194aee7fa886111ab1f2edfd37854e0a179ea533154c6
2 12 2b008dfe3f79c53c59ab5392929c06c970d71b62c58527ef51d07a50616dbbd0
3 7 632dc5c5235dc0b5b411f8f5e4a6f619ce4155ef99d3d8adfcd026288c795187
4 11 6615925a48704e803ee5e32ad0209bb348698ce575effe73da268d9e606c3208
5 1 689ffebb9149498fb8189d6e3ad1625c8bc1e1c0f0849ca40f2df83eb2930c7b
6 6 7007b18f0d9cb87601ba50c83e761006dc185270a4cf2d55812e0863be8ff566
7 4 7d359106f754613d9354353f2a403b88b76fb1569ea81524f54d479f4c0d610f
8 2 909428873964ece00ba2d2fbca40035e7685a04687821c9fc7ea4cbb047b70fc
9 5 b68d4a840d7e218045d98247e00d79ed7bb1eff3b1fff117dc93db32f4699bfe
10 3 f43da163b5f48c69f909facc9a5880c5666c5ccffa4837890bb4519542f03e45
11 9 f90982571d7ace45426530062c7008fc370be6a7faeb9ecb05733758960b0c39
*/
* add mapped integers to each dataset;
data _null_;
do dsn="classfitwps","classwps";
call symputx('dsn',dsn);
rc=dosubl('
proc sql;
create
table &dsn.fix(drop=shaclass) as
select
l.name_crypto
,r.*
from
lookup as l, &dsn as r
where
l.shaclass = r.shaclass
;quit;
');
end;
run;quit;
* change sha256 to an unique integer for each datset ;
Up to 40 obs from CLASSWPSFIX total obs=4
NAME
Obs CRYPTO SEX AGE HEIGHT WEIGHT
1 1 F 14 62.8 102.5
2 4 M 12 64.8 128
3 2 F 15 62.5 112.5
4 3 F 14 64.3 90
NAME
Obs CRYPTO SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
1 5 F 12 56.3 77 76.48 67.960
2 6 M 12 57.3 83 80.38 72.667
3 7 M 12 59 99.5 87.01 80.4
4 2 F 15 62.5 112.5 100.6 95.225
5 9 F 14 62.8 102.5 101.8 96.375
6 3 F 14 64.3 90 107.6 101.84
7 11 F 13 65.3 98 111.5 105.26
8 12 M 15 66.5 112 116.2 109.18
9 13 M 14 69 112.5 126.0 116.94
proc sql;
create
table ntrsec as
select
l.*
,r.predict
,r.lowermean
from
classwpsfix as l, classfitwpsfix as r
where
l.name_crypto = r.name_crypto
;quit;
Up to 40 obs WORK.NTRSEC total obs=2
NAME
Obs CRYPTO SEX AGE HEIGHT WEIGHT PREDICT LOWERMEAN
1 2 F 15 62.5 112.5 100.66 95.225
2 3 F 14 64.3 90 107.68 101.84
Thank you for all the help.
I have almost figured it out but I seem to have a problem with duplicate IDs, when I do it this way... Because the random IDs are generated separately for those with blank ID1s and those with non blank ID1s, there seems to be duplicate IDs.
I just need to fix the problem with duplicate random P_IDs, everything else works perfectly it seems.
the duplication of my P_IDs isn't only beteween those with blank ID1s and those that have IDs. There are units which have 2 different, non blank IDs and have the same P_IDs. So I guess the ranuni doesn't ensure the IDs will be unique?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.