Take the example code below containing customers and their devices:
data test;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
;
run;
I'm trying to group ALL related individuals by customer and device into unique IDs. For example, Customer A is related to Customer B through Device 2, and Customer B is related to Customer C through Device 3. Because A is related to B and B is related to C, these should have the same UID. If you think of customer as the 'parent' and device as the 'child' in node structure, all linked children and parents should be in the same UID. Any breaks should be separate UIDs.
Below is the desired output:
Customer|Device|UID
a 1 UID1
a 2 UID1
a 7 UID1
b 2 UID1
b 3 UID1
c 3 UID1
h 2 UID1
d 4 UID2
x 4 UID2
e 5 UID3
I have tried transposing by Customer and doing a 'cascade' retain on customer and device but the code gets pretty hairy. Also, I do not have access to PROC BOM which I've read may help my case. Does anyone have any ideas on how this might be achieved?
You will need to keep two hash objects to keep track of already assigned uid's:
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
;
run;
data want;
set have end=eof;
length uid $4;
retain counter 0;
if _n_ = 1
then do;
declare hash devices();
devices.definedata("uid",'dev');
devices.definekey("dev");
devices.definedone();
declare hash customers();
customers.definedata("uid",'cust');
customers.definekey("cust");
customers.definedone();
call missing(uid);
end;
if customers.find()
then do;
if devices.find()
then do;
counter + 1;
uid = 'UID' !! put(counter,1.);
rc = devices.add();
rc = customers.add();
end;
else rc = customers.add();
end;
else do;
if devices.find() then rc = devices.add();
end;
drop rc counter;
run;
proc sort data=want;
by uid cust;
run;
This is close but I'm seeing errors if I input additional rows to test. Take the following example where I've added Customer X Device 5:
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
x 5
;
run;
Your code provides the following results:
d 4 UID2
x 4 UID2
e 5 UID3
x 5 UID3
However, these should all have the same value of UID2 because Customer X is linked to Devices 4 and 5 which should also link Customer E and D.
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
;
run;
data _null_;
if _n_=1 then do;
dcl hash H (multidata:'yes') ;
h.definekey ("dev") ;
h.definedata ("cust","dev","uid") ;
h.definedone () ;
end;
do until(last.cust);
set have end=l;
by cust;
if _n_=1 then do;
c=_n_;
uid =cats( 'UID', put(c,8. -l));
rc=h.add();
end;
else if not f and h.find()=0 then f=1;
end;
do until(last.cust);
set have ;
by cust;
if _n_>1 then do;
if f then rc=h.add();
else do;
if first.cust then do;c+1; uid =cats( 'UID', put(c,8. -l));end;
rc=h.add();
end;
end;
end;
if l then h.output(dataset:'want');
run;
proc sort data=want;
by uid cust dev;
run;
I realize my mistake, a bit busy now but will try later
@kilo_foxtrot Don't worry about technology. I am working on your thread by seeking help from my college mate(who is super smart and solved the same in python) in handling networking problems. Once he gives me the understanding i need, I will try to code in SAS. No need to apologize. I love learning.
Hi @kilo_foxtrot First off, Sorry for the delay as I was consumed with a lot of my own stuff to do. Please see if the following works and let me know. I can further simplify and eliminate some redundancies. Right now, it is 10:10PM at Chicago and I am leaning on my pillows laying with my laptop in my lap. Feeling so tired, I haven't made it a complete neat one. Yet, I wanted to keep up my word and here you go-
/*Your new updated dataset HAVE*/
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
x 5
;
run;
dm log 'clear';
/*Intial look up*/
data _null_;
if _n_=1 then do;
length uid __dev __cust $5;
if 0 then set have(rename=(dev=_dev cust=_cust));
declare hash H (multidata:'yes') ;
h.definekey ("__cust") ;
h.definedata ("__cust","__dev","uid") ;
h.definedone () ;
declare hash H1 (dataset:'have(rename=(dev=_dev cust=_cust))', multidata:'yes') ;
h1.definekey ("_dev") ;
h1.definedata ("_cust","_dev") ;
h1.definedone () ;
declare hiter hi('h');
declare hiter hh('h1');
call missing(__cust,__dev);
end;
array t(999) $5;
n=0;
do until(last.cust);
set have end=l;
by cust;
if first.cust then
do;
if h.find(key:cust) = 0 then f1=1;
do while(hi.next()=0);
if dev=__dev then do; f2=1;leave;end;
end;
if sum(f1, f2)=. then do;c+1;uid =cats( 'UID', put(c,8. -l)); end;
end;
if not f1 then do rc1=h1.find(key:dev) by 0 while(rc1=0);
rc=h.add(key:_cust,data:_cust,data: _dev,data:uid);
if _cust ne cust then do;
n+1;
t(n)=_cust;
end;
rc1=h1.find_next();
end;
end;
if not f1 then do i=1 to dim(t);
if not missing(t(i)) then
do while(hh.next()=0);
if t(i)=_cust then rc=h.add(key:_cust,data:_cust,data: _dev,data:uid);
end;
end;
if l then h.output(dataset:'w');
run;
/*Final Want */
proc sort data=w out=want nodupkey;
by uid __cust __dev;
run;
Please do let me know. And I will try to see to make some improvements when I feel fresh hopefully after a good sleep. Good night!
No need at all to apologize -- this is on your time! If I add Customer A Device 5, it should link every single record to be UID1. Your latest code doesn't quite get this right and it also adds two addition rows to the data:
/*Your new updated dataset HAVE -- new cust x dev 5*/
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 5
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
x 5
;
run;
__cust __dev uid
a 1 UID1
a 2 UID1
a 5 UID1
a 7 UID1
b 2 UID1
b 3 UID1
c 3 UID1
e 5 UID1
h 2 UID1
x 4 UID1
x 5 UID1
d 4 UID2
x 4 UID2
x 5 UID2
I can't thank you enough for your continued efforts on this -- it is getting very complex!
Hi @kilo_foxtrot Thank you. I request your patience(which you are, just saying again ) while we do this interactively and at the sametime I work on my stuff too. Trust me, yours is more interesting than mine and how i wish I devoted more time to test.
Nevetheless, I did a minor change and will wait for your feedback for further changes yet again
It's clear you are awesome in testing and that helps.
/*Your new updated dataset HAVE -- new cust x dev 5*/
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 5
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
x 5
;
run;
dm log 'clear';
/*Intial look up*/
data _null_;
if _n_=1 then do;
length uid __dev __cust $5;
if 0 then set have(rename=(dev=_dev cust=_cust));
declare hash H (multidata:'yes') ;
h.definekey ("__cust") ;
h.definedata ("__cust","__dev","uid") ;
h.definedone () ;
declare hash H1 (dataset:'have(rename=(dev=_dev cust=_cust))', multidata:'yes') ;
h1.definekey ("_dev") ;
h1.definedata ("_cust","_dev") ;
h1.definedone () ;
declare hiter hi('h');
declare hiter hh('h1');
call missing(__cust,__dev);
end;
array t(999) $5;
n=0;
do until(last.cust);
set have end=l;
by cust;
if first.cust then
do;
if h.find(key:cust) = 0 then f1=1;
else do while(hi.next()=0);
if dev=__dev then do; f2=1;leave;end;
end;
if sum(f1, f2)=. then do;c+1;uid =cats( 'UID', put(c,8. -l)); end;
end;
do rc1=h1.find(key:dev) by 0 while(rc1=0);
rc=h.add(key:_cust,data:_cust,data: _dev,data:uid);
if _cust ne cust then do;
n+1;
t(n)=_cust;
end;
rc1=h1.find_next();
end;
end;
do i=1 to dim(t);
if not missing(t(i)) then
do while(hh.next()=0);
if t(i)=_cust then rc=h.add(key:_cust,data:_cust,data: _dev,data:uid);
end;
end;
if l then h.output(dataset:'w');
run;
/*Final Want */
proc sort data=w out=want nodupkey;
by uid __cust __dev;
run;
I appreciate your devotion to solving this problem @novinosrin. In my testing I try to "break" your code to help debug by adding/removing all kinds of customers and devices while knowing what to expect ahead of time. Your latest answer works for the test data, but it does not work for the previous example (removing Customer A Device 5).
I went back to the previous example (removing Customer A Device 5) that resulted in UID1 and UID2. I then added a new customer (Y) that is linked to UID1 via Device 7 and also linked to UID2 via Device 5. This should create a link among all records (all UID1) but again it creates extra records and assigns multiple UIDs. N=13 records in ‘have’, N=18 records in ‘want’.
data have;
length cust dev $5;
input cust $ dev $;
datalines;
a 1
a 2
a 7
b 2
b 3
c 3
d 4
e 5
h 2
x 4
x 5
y 7
y 5
;
run;
Again, your help is much appreciated! I'm sure it is tough to make robust since the test data provided probably does not offer enough data to thoroughly validate.
Hmm, i understand the bigger complexity now after staring at the example. Let me deal with simple piecemeal approach and respond.
I think this needs a kind of recursive approach, for which you will need to have the whole dataset in the hash.
Absolutely right sir @Kurt_Bremser . The idea is in my mind, just the references/syntax causes my eyes more strain as the evening unfolds. Excellent observation!
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.