About SarahDew

SarahDew · ‎03-03-2023

I am using a count + group by to flag rows that are complete duplicates. The result removes the extra rows, while I would prefer to keep the original dataset with just an added column. I don't use select distinct, so I can't understand why rows are being deleted. Any clue why this happens and how to avoid it?

data t;
  input ID$ name$;
  cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;

proc sql;
  create table dup_flag as
  select *, count(*) as n
  from t
  group by ID, name;
quit;
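One possible explanation, offered as a sketch rather than a definitive answer: every column selected by `*` (ID and name) also appears in the GROUP BY clause, so PROC SQL returns one row per group. A way to keep every original row is to compute the counts in a subquery and join them back to the table. The aliases `a` and `b` below are only illustrative.

```sas
proc sql;
  create table dup_flag as
  select a.*, b.n
  from t as a
  inner join
       /* one row per ID/name combination, with its row count */
       (select ID, name, count(*) as n
        from t
        group by ID, name) as b
  on a.ID = b.ID and a.name = b.name;
quit;
```

With this layout, rows that are complete duplicates keep their place in the output and carry n greater than 1.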

SarahDew · ‎10-05-2022

Is there a way to use a SAS function in a proc sgplot statement? I would like to test many different transformations, and it would be nice not to have to add new columns to the data each time. I tried within a macro and with different versions of %sysfunc(nput(sqrt(Systolic,8.8))), but got different errors each time.

proc sgplot data=sashelp.heart;
  loess x=Diastolic y=sqrt(Systolic);
run;
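A minimal sketch of one workaround, assuming it is acceptable to compute the transformation outside the plot statement: a DATA step view derives the column on the fly, so nothing is added to the stored data. The view name `heart_v` and the variable name `sqrt_systolic` are hypothetical.

```sas
/* The view is evaluated only when it is read; sashelp.heart is unchanged */
data heart_v / view=heart_v;
  set sashelp.heart;
  sqrt_systolic = sqrt(Systolic);
run;

proc sgplot data=heart_v;
  loess x=Diastolic y=sqrt_systolic;
run;
```

Trying another transformation then only means editing the assignment inside the view.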

SarahDew · ‎03-04-2021

Thanks, that looks like a really nice approach. I see it also gives the same encoded value if the same value is added, which can be the case in my data. I have around 5 million unique IDs for one of the ID columns so there I will use Z7. Just wondering if it is equally safe as a hash value if the initial counts are given to an ordered ID (alphabetically for strings or ascending numbers). If you figure out the first count it might be easier to guess who the next count belongs to. It might be best to make sure the original IDs are ordered randomly?

SarahDew · ‎02-24-2021

If I understand correctly, you propose to apply the following (given the example data above). That does look appealing from a user perspective, especially because I get only numbers in this string, which they are used to from the original ID. But do I not run into the same issue of how long the display length should minimally be, to make sure each unique ID is also displayed with a unique anonymised string?

/*use a LENGTH of 32, but a HEX format with a shorter display length*/
data want;
  length ID_string $32;
  format ID $11. ID_hash $hex64. ID_string $hex10.;
  set have;
  ID_hash = sha256(cat(ID));
  ID_string = put(ID_hash,$hex64.);
run;

/*Check if displayed values are unique*/
proc sql;
  create table counts as
  select count(distinct(ID)) as n_ID
       , count(distinct(put(ID_string,$hex5.))) as N_ID_S
       , count(distinct(put(ID_string,$hex10.))) as N_ID_L
  from want;
quit;

SarahDew · ‎02-19-2021

Good point, see edit.

SarahDew · ‎02-18-2021

I have a set of data that needs to be anonymised. I created a hash for each ID, which I then turn into a string. To be safe I set the string to length 200, but this was not very user friendly.

-EDIT- The long string is not user-friendly because the first 5 columns containing anonymised data are either not completely visible, or are very wide when you make the content visible. It also becomes difficult to eyeball whether two values are the same or not. In general, it is also not very appealing for the end-user to work with these daunting long strings. It is not so much an issue with the storage size of the dataset. -EDIT-

I then saw that even with length 10 I could get away with it and still have a unique string for each original value (around 5 million IDs), except for one case where two different IDs resulted in the same string. Setting the string length to 15 solved this issue. Now I wonder if there is a way to know what I should set the minimum length to, to be safe, also when future records are added?

For example, for the data below, a string length of 4 is too short, while 5 would be enough. Is there a way to determine this minimum of 5 based on the input?

/*Sample data*/
data have (keep=ID);
  length ID $11.;
  call streaminit(123);
  Min = 10000000000;
  Max = 99999999999;
  do i = 1 to 1000;
    u = rand("Uniform");
    ID = min + floor((1+Max-Min)*u);
    output;
  end;
run;

/*Create hash and string*/
data want;
  format ID $11. ID_hash $hex64. ID_short_string $4. ID_long_string $5. ID_not_user_friendly $200.;
  set have;
  ID_hash = sha256(cat(ID));
  ID_short_string = put(ID_hash,$hex64.);
  ID_long_string = put(ID_hash,$hex64.);
  ID_not_user_friendly = put(ID_hash,$hex64.);
run;

/*Check if equal unique values*/
proc sql;
  create table counts as
  select count(distinct(ID)) as n_ID
       , count(distinct(ID_short_string)) as N_ID_Short
       , count(distinct(ID_long_string)) as N_ID_Long
  from want;
quit;
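As a rough way to reason about the minimum length, the sketch below uses the standard birthday approximation, assuming the leading hex digits of a SHA-256 hash behave like uniformly random values. For n distinct IDs and a string of k hex characters there are 16**k possible strings, and the chance of at least one collision is approximately 1 - exp(-n(n-1) / (2 * 16**k)). The dataset name `collision_risk` and the value of n (the roughly 5 million IDs mentioned in the post) are assumptions for illustration.

```sas
data collision_risk;
  n = 5e6;                        /* assumed number of distinct IDs */
  do k = 5 to 20;                 /* candidate string lengths in hex characters */
    n_strings   = 16**k;          /* number of possible k-character hex strings */
    p_collision = 1 - exp(-n*(n-1) / (2*n_strings));
    output;
  end;
run;

proc print data=collision_risk noobs;
  format p_collision e10.;
run;
```

This gives a probability, not a guarantee: a truncated hash can never be strictly collision-free, so k is chosen to make the risk acceptably small for the largest n expected once future records are added. Rerunning the uniqueness check after each load remains a sensible safeguard.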

SarahDew · ‎12-07-2020

Thanks, this answers all my questions! The externals will transfer the dataset to SAS VA (Viya), so I'm not sure how the ID's will be displayed there and if the format will be kept. I will ask them to give me some test ID's to see if I can still make the match using the log.

SarahDew · ‎12-04-2020

It works, but I see you use a different hash string in the where clause, one that I don't see in the lookup or output table. Also the actual ID_ano in find is different from the one you specified. Where did you get this value?

Me:  5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252
You: 5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252

SarahDew · ‎12-04-2020

Thanks, did it work for you? My find is still empty, not sure why though:

data find;
  set lookup;
  where put(ID_ano,$hex64.) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
run;

SarahDew · ‎12-04-2020

I need to anonymize some data and decided to use the sha256 function to generate anonymous IDs. I made a lookup table for myself, with the original ID and the hashed ID, and an output table with only the hashed ID for external use. The point is that only I would be able to de-anonymize the data when necessary, and the externals can give me the hash so I can find the match in the lookup table.

But now, when I search for one of the hashed IDs in the lookup table, it is not found, while I can see it is there. So I cannot make the match. I replicated this behaviour in a simple example:

data have;
  input ID;
  datalines;
62851
62852
62853
62854
;
run;

data lookup;
  set have;
  format ID_ano $hex64.;
  ID_ano = sha256(ID);
run;

data out (drop=ID);
  set have;
  format ID_ano $hex64.;
  ID_ano = sha256(ID);
run;

data find;
  set lookup;
  where strip(ID_ano) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
run;

SarahDew · ‎12-02-2020

I create a hash for my ID variable with sha256 and apply the $hex64. format for readability. But when the ID is missing, it gets a hash value of 20202020... The receiver of my data will only receive the hash_ID, and I want to prevent them from assuming that this ID was known and can be found by me, or that everyone with this hash is the same ID. Is there a way to get these missings to be displayed as missing in the $hex64. format as well?

data test;
  input file ID $5.;
  datalines;
1 02156
2 00369
3 45896
4
5 78954
6
7 78954
;
run;

proc sql;
  create table hash as
  select ID,
         (case when ID ne "" then sha256(cat("XXX",ID)) else "" end) as hash_ID format = $hex64.
  from test;
quit;
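The 2020... value appears because a missing character value is stored as blanks, and a blank is hex 20, so $hex64. prints it as a run of 20s. One possible workaround, sketched here as an assumption rather than a confirmed solution from the thread, is to store the hex text itself instead of attaching the $hex64. format to the raw hash. A missing ID then stays an ordinary blank string.

```sas
proc sql;
  create table hash as
  select ID,
         /* convert the raw hash to 64 hex characters; leave missing IDs blank */
         case when ID ne "" then put(sha256(cat("XXX", ID)), $hex64.)
              else ""
         end as hash_ID length=64
  from test;
quit;
```

The trade-off is that hash_ID becomes a 64-character text column rather than 32 raw bytes, which also makes it easier to compare against a value typed or pasted as text.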

SarahDew · ‎09-18-2020

By non-continuous I meant like person B who stays in France twice but with a move in between. Your solution accounts for this so works fine for my purpose. I can imagine there might be a situation where there is some missing data between the same country, but this does not occur in my current dataset. When I run into this I might post again 😉

SarahDew · ‎09-18-2020

Just figured I can use min and max to get the right startdate, just need to figure out how to distinguish non-continuous periods:

proc sql;
  create table combined as
  select distinct ID, Country,
         min(Startdate) as Startdate,
         max(Enddate) as Enddate,
         sum(Time) as time
  from sample
  group by ID, Country;
quit;

SarahDew · ‎09-18-2020

I have a list of the countries where people lived. If they changed address within a country, this appears on several lines; however, I would like to make it one line. Consider the following example:

data sample;
  input ID $ Country $ Startdate :date9. Enddate :date9. Time;
  format Startdate Enddate date9.;
  datalines;
A France 05NOV2006 03OCT2012 6.1
A France 04OCT2012 05SEP2015 3.0
A France 06SEP2015 01JUN2016 0.8
A US 02JUN2016 18SEP2019 3.4
B France 17DEC2006 09MAY2007 0.4
B France 10MAY2007 01FEB2014 6.9
B Germany 02FEB2014 02FEB2015 1.0
B Germany 03FEB2015 02JUL2017 2.5
B France 03JUL2017 05APR2018 0.8
B US 06APR2018 18SEP2019 1.5
;
run;

I want the result to have one line per block of continuous living in one country, with the sum of the time spent in that country:

data result;
  input ID $ Country $ Startdate :date9. Enddate :date9. Time;
  format Startdate Enddate date9.;
  datalines;
A France 05NOV2006 01JUN2016 9.9
A US 02JUN2016 18SEP2019 3.4
B France 17DEC2006 01FEB2014 7.3
B Germany 02FEB2014 02JUL2017 3.5
B France 03JUL2017 05APR2018 0.8
B US 06APR2018 18SEP2019 1.5
;
run;

This is my basic attempt, but I can't figure out how to get only the startdate from the first line and the enddate from the last line, and how to make sure I only combine continuous periods in one country, not when there was another country in between. Possibly something with "if first.startdate..." in a data step?

proc sql;
  create table combined as
  select distinct ID, Country, Startdate, Enddate, sum(Time)
  from sample
  group by ID, Country;
quit;
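One way to sketch the "first/last in a data step" idea is BY-group processing with the NOTSORTED option, which treats each consecutive run of the same ID and Country as its own group. This assumes the rows are already in chronological order within each ID, as in the sample, and it does not check for date gaps inside a run. The helper variables `block_start` and `total_time` are hypothetical names.

```sas
data combined;
  set sample;
  by ID Country notsorted;        /* a new group starts whenever ID or Country changes */
  retain block_start total_time;
  if first.Country then do;
    block_start = Startdate;      /* remember the start of this block */
    total_time  = 0;
  end;
  total_time = total_time + Time; /* accumulate time within the block */
  if last.Country then do;
    Startdate = block_start;      /* Enddate already holds the last row's value */
    Time      = total_time;
    output;                       /* one row per continuous block */
  end;
  drop block_start total_time;
run;
```

Because the grouping follows the row order rather than the sorted values, person B's two separate stays in France come out as two separate rows.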

SarahDew · ‎05-05-2020

This seems to do exactly as asked and in the most concise way. I like it. Thanks

Online Status	Offline
Date Last Visited	‎03-08-2023 10:59 AM

Group by and count removes duplicate rows

Using sas function in proc sgplot

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Hash (hex64) to string: minimum string length

Re: sha256 hash value not found in lookup table (while present)

Re: sha256 hash value not found in lookup table (while present)

Re: sha256 hash value not found in lookup table (while present)

sha256 hash value not found in lookup table (while present)

Re: Group by and count removes duplicate rows

Re: Using sas function in proc sgplot

Re: Using sas function in proc sgplot

Re: sha256 hash value not found in lookup table (while present)

Re: Apply $hex64 format but keep missing values empty

Apply $hex64 format but keep missing values empty

Re: Select first and last date from continuous series of adresses by g...

Re: Select first and last date from continuous series of adresses by g...

Select first and last date from continuous series of adresses by group

Re: Count distinct using case when