SAS: unique identifier for multiple columns with same values

kilo_foxtrot · Posted 12-10-2018 09:28 AM

Take the example code below containing customers and their devices:

data test;
    length cust dev $5;
    input cust $ dev $;
    datalines;
    a   1
    a   2
    a   7
    b   2
    b   3
    c   3
    d   4
    e   5
    h   2
    x   4
    ;
run;

I'm trying to group ALL related individuals by customer and device into unique IDs. For example, Customer A is related to Customer B through Device 2, and Customer B is related to Customer C through Device 3. Because A is related to B and B is related to C, these should have the same UID. If you think of customer as the 'parent' and device as the 'child' in node structure, all linked children and parents should be in the same UID. Any breaks should be separate UIDs.

Below is the desired output:

Customer|Device|UID
    a       1   UID1
    a       2   UID1
    a       7   UID1
    b       2   UID1
    b       3   UID1
    c       3   UID1
    h       2   UID1
    d       4   UID2
    x	    4	UID2
    e       5   UID3

I have tried transposing by Customer and doing a 'cascade' retain on customer and device but the code gets pretty hairy. Also, I do not have access to PROC BOM which I've read may help my case. Does anyone have any ideas on how this might be achieved?

Kurt_Bremser · Posted 12-10-2018 10:59 AM

You will need to keep two hash objects to keep track of already assigned uid's:

data have;
    length cust dev $5;
    input cust $ dev $;
    datalines;
    a   1
    a   2
    a   7
    b   2
    b   3
    c   3
    d   4
    e   5
    h   2
    x   4
    ;
run;

data want;
set have end=eof;
length uid $4;
retain counter 0;
if _n_ = 1
then do;
  declare hash devices();
  devices.definedata("uid",'dev');
  devices.definekey("dev");
  devices.definedone();
  declare hash customers();
  customers.definedata("uid",'cust');
  customers.definekey("cust");
  customers.definedone();
  call missing(uid);
end;
if customers.find()
then do;
  if devices.find()
  then do;
    counter + 1;
    uid = 'UID' !! put(counter,1.);
    rc = devices.add();
    rc = customers.add();
  end;
  else rc = customers.add();
end;
else do;
  if devices.find() then rc = devices.add();
end;
drop rc counter;
run;

proc sort data=want;
by uid cust;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

kilo_foxtrot · Posted 12-11-2018 09:42 AM

This is close but I'm seeing errors if I input additional rows to test. Take the following example where I've added Customer X Device 5:

data have;
    length cust dev $5;
    input cust $ dev $;
    datalines;
    a   1
    a   2
    a   7
    b   2
    b   3
    c   3
    d   4
    e   5
    h   2
    x   4
    x   5
    ;
run;

Your code provides the following results:

d	4	UID2
x	4	UID2
e	5	UID3
x	5	UID3

However, these should all have the same value of UID2 because Customer X is linked to Devices 4 and 5 which should also link Customer E and D.

novinosrin · Posted 12-10-2018 02:20 PM


data have;
    length cust dev $5;
    input cust $ dev $;
    datalines;
    a   1
    a   2
    a   7
    b   2
    b   3
    c   3
    d   4
    e   5
    h   2
    x   4
    ;
run;

data _null_;
if _n_=1 then do;
  dcl hash H (multidata:'yes') ;
   h.definekey  ("dev") ;
   h.definedata ("cust","dev","uid") ;
   h.definedone () ;
end;
do until(last.cust);
set have end=l;
by cust;
if _n_=1 then  do;
c=_n_;
uid =cats( 'UID', put(c,8. -l));
rc=h.add();
end;
else if not f and h.find()=0 then f=1;
end;
do until(last.cust);
set have ;
by cust;                             
if _n_>1 then do;
if f then rc=h.add();
else do;
if first.cust then do;c+1; uid =cats( 'UID', put(c,8. -l));end;
rc=h.add();
end;
end;
end;
if l then h.output(dataset:'want');
run;


proc sort data=want;
by uid cust dev;
run;

kilo_foxtrot · Posted 12-11-2018 09:44 AM

Your answer is also very close but provides inaccuracies in the last four rows in the new test data in my reply above.

novinosrin · Posted 12-11-2018 10:23 AM

I realize my mistake, a bit busy now but will try later

kilo_foxtrot · Posted 12-12-2018 11:39 AM

Thank you -- I appreciate your efforts! I apologize I do not have an in-depth knowledge of hashes in SAS so I am unable to troubleshoot your first answer.

novinosrin · Posted 12-12-2018 12:02 PM

@kilo_foxtrot Don't worry about technology. I am working on your thread by seeking help from my college mate(who is super smart and solved the same in python) in handling networking problems. Once he gives me the understanding i need, I will try to code in SAS. No need to apologize. I love learning.

novinosrin · Posted 12-12-2018 11:08 PM

Hi @kilo_foxtrot First off, Sorry for the delay as I was consumed with a lot of my own stuff to do. Please see if the following works and let me know. I can further simplify and eliminate some redundancies. Right now, it is 10:10PM at Chicago and I am leaning on my pillows laying with my laptop in my lap. Feeling so tired, I haven't made it a complete neat one. Yet, I wanted to keep up my word and here you go-

/*Your new updated dataset HAVE*/
data have;
    length cust dev $5;
    input cust $ dev $;
    datalines;
    a   1
    a   2
    a   7
    b   2
    b   3
    c   3
    d   4
    e   5
    h   2
    x   4
    x   5
    ;
run;
dm log 'clear';
/*Intial look up*/
data _null_;
if _n_=1 then do;
length uid __dev __cust $5;
if 0 then set have(rename=(dev=_dev cust=_cust));
  declare hash H (multidata:'yes') ;
   h.definekey  ("__cust") ;
   h.definedata ("__cust","__dev","uid") ;
   h.definedone () ;
   declare hash H1 (dataset:'have(rename=(dev=_dev cust=_cust))', multidata:'yes') ;
   h1.definekey  ("_dev") ;
   h1.definedata ("_cust","_dev") ;
   h1.definedone () ;
   declare hiter hi('h');
   declare hiter hh('h1');
   call missing(__cust,__dev);
end;
array t(999) $5;
n=0;
do until(last.cust);
	set have end=l;
	by cust;
	if first.cust then 
		do;
			if h.find(key:cust) = 0 then f1=1;
			do while(hi.next()=0);
			if dev=__dev	then do; f2=1;leave;end;
			end;
			if sum(f1, f2)=. then do;c+1;uid =cats( 'UID', put(c,8. -l)); end;
		end;
	if not f1 then do rc1=h1.find(key:dev) by 0 while(rc1=0);
		rc=h.add(key:_cust,data:_cust,data: _dev,data:uid);
		if _cust ne cust then do;
		n+1;
		t(n)=_cust;
		end;
		rc1=h1.find_next();
	end;
end;
if not f1 then do i=1 to dim(t);
	if not missing(t(i)) then 
	do while(hh.next()=0);
	if t(i)=_cust then rc=h.add(key:_cust,data:_cust,data: _dev,data:uid);
	end;
end;
if l then h.output(dataset:'w');
run;

/*Final Want	*/
proc sort data=w out=want nodupkey;
by uid __cust __dev;
run;

Please do let me know. And I will try to see to make some improvements when I feel fresh hopefully after a good sleep. Good night!

kilo_foxtrot · Posted 12-13-2018 12:59 PM

No need at all to apologize -- this is on your time! If I add Customer A Device 5, it should link every single record to be UID1. Your latest code doesn't quite get this right and it also adds two addition rows to the data:

/*Your new updated dataset HAVE -- new cust x dev 5*/
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 5
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
x 5
;
run;

__cust	__dev	uid
a	1	UID1
a	2	UID1
a	5	UID1
a	7	UID1
b	2	UID1
b	3	UID1
c	3	UID1
e	5	UID1
h	2	UID1
x	4	UID1
x	5	UID1
d	4	UID2
x	4	UID2
x	5	UID2

I can't thank you enough for your continued efforts on this -- it is getting very complex!

novinosrin · Posted 12-13-2018 01:28 PM

Hi @kilo_foxtrot Thank you. I request your patience(which you are, just saying again ) while we do this interactively and at the sametime I work on my stuff too. Trust me, yours is more interesting than mine and how i wish I devoted more time to test.

Nevetheless, I did a minor change and will wait for your feedback for further changes yet again

It's clear you are awesome in testing and that helps.

/*Your new updated dataset HAVE -- new cust x dev 5*/
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 5
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
x 5
;
run;





dm log 'clear';
/*Intial look up*/
data _null_;
if _n_=1 then do;
length uid __dev __cust $5;
if 0 then set have(rename=(dev=_dev cust=_cust));
  declare hash H (multidata:'yes') ;
   h.definekey  ("__cust") ;
   h.definedata ("__cust","__dev","uid") ;
   h.definedone () ;
   declare hash H1 (dataset:'have(rename=(dev=_dev cust=_cust))', multidata:'yes') ;
   h1.definekey  ("_dev") ;
   h1.definedata ("_cust","_dev") ;
   h1.definedone () ;
   declare hiter hi('h');
   declare hiter hh('h1');
   call missing(__cust,__dev);
end;
array t(999) $5;
n=0;
do until(last.cust);
	set have end=l;
	by cust;
	if first.cust then 
		do;
			if h.find(key:cust) = 0 then f1=1;
			else do while(hi.next()=0);
			if dev=__dev	then do; f2=1;leave;end;
			end;
			if sum(f1, f2)=. then do;c+1;uid =cats( 'UID', put(c,8. -l)); end;
		end;
	do rc1=h1.find(key:dev) by 0 while(rc1=0);
		rc=h.add(key:_cust,data:_cust,data: _dev,data:uid);
		if _cust ne cust then do;
		n+1;
		t(n)=_cust;
		end;
		rc1=h1.find_next();
	end;
end;
do i=1 to dim(t);
	if not missing(t(i)) then 
	do while(hh.next()=0);
	if t(i)=_cust then rc=h.add(key:_cust,data:_cust,data: _dev,data:uid);
	end;
end;
if l then h.output(dataset:'w');
run;

/*Final Want	*/
proc sort data=w out=want nodupkey;
by uid __cust __dev;
run;

kilo_foxtrot · Posted 12-13-2018 03:01 PM

I appreciate your devotion to solving this problem @novinosrin. In my testing I try to "break" your code to help debug by adding/removing all kinds of customers and devices while knowing what to expect ahead of time. Your latest answer works for the test data, but it does not work for the previous example (removing Customer A Device 5).

I went back to the previous example (removing Customer A Device 5) that resulted in UID1 and UID2. I then added a new customer (Y) that is linked to UID1 via Device 7 and also linked to UID2 via Device 5. This should create a link among all records (all UID1) but again it creates extra records and assigns multiple UIDs. N=13 records in ‘have’, N=18 records in ‘want’.

data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
x 5
y 7
y 5
;
run;

Again, your help is much appreciated! I'm sure it is tough to make robust since the test data provided probably does not offer enough data to thoroughly validate.

novinosrin · Posted 12-13-2018 05:13 PM

Hmm, i understand the bigger complexity now after staring at the example. Let me deal with simple piecemeal approach and respond.

Kurt_Bremser · Posted 12-13-2018 05:15 PM

I think this needs a kind of recursive approach, for which you will need to have the whole dataset in the hash.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

novinosrin · Posted 12-13-2018 05:36 PM

Absolutely right sir @Kurt_Bremser . The idea is in my mind, just the references/syntax causes my eyes more strain as the evening unfolds. Excellent observation!

SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Re: SAS: unique identifier for multiple columns with same values

Registration is open

Registration is open

SAS Training: Just a Click Away