I am trying to recreate a new variable in a dataset to be used for group by or class in proc means, that has to be built in function of other two variable. Here an example:
data test; input rapp $ 1-13 posiz $ 15-23; datalines; 0900028641002 000838931 0600700253462 000838937 0100701192000 000838931 0100701192000 000838930 0600700253462 000838937 ;
The idea is to add a new_id in the table that has the same value every time the records have the same value for rapp OR the same value for POSIZ, obtaining a thing like this:
data test2; input rapp $ 1-13 posiz $ 15-23 new_id $ 25-27; datalines; 0900028641002 000838931 001 0600700253462 000838937 002 0100701192000 000838931 001 0100701192000 000838930 001 0600700253462 000838937 002 ;
In this case, row3 has new_id=001 because it has the same posiz of row1. Row4 has 001 because it has the same rapp of row3. Row2 and row5 have 002 because they have both the same rapp and posiz. How could I obtain a result like this?
And what NEW_ID is assigned for a record in which both the RAPP and POSIZ values have already been seen, but are associated with different values of NEW_ID?
@mkeintz I believe new_id is the desired outcome for the sample data. Looks to me somehow like the clusterid for networks.
This is the key text of the OP for me (bold italics mine):
The idea is to add a new_id in the table that has the same value every time the records have the same value for rapp OR the same value for POSIZ
The "OR" is the problematic issue for me.
If the first two records are:
rapp posiz
1111 AAAA
2222 BBBB
I presume NEW_ID gets 001 and 002 respectively.
And if a subsequent record has rapp=1111 or posiz=AAAA it would get ID=001. Similarly subsequent rapp=2222 or posiz=BBBB records would get NEW_ID=002.
But what NEW_ID should be assigned if a record comes up with rapp=1111 and posiz=BBBB?
@Jack90_ wrote: (bold italics mine)
Your example is right. But in my dataset the exception that you showed can't verify. Every POSIZ is associated only to one RAPP, so to me the logic should be based on two conditions in every record:
- both POSIZ and RAPP of the record never appeared before--->new_id increase by 1
- POSIZ or RAPP already appeared--->you gave the same new_id of that past record.
I don't quite know what is meant by "exception that you showed can't verify".
But regardless, your example data (lines 1 and 3 below) shows that "every POSIZ is associated only to one RAPP" is not true.
data test;
input rapp $ 1-13 posiz $ 15-23;
datalines;
0900028641002 000838931
0600700253462 000838937
0100701192000 000838931
0100701192000 000838930
0600700253462 000838937
;
I think we need some more clarification.
@mkeintz I believe the OPs question was clear especially because the desired result had been provided.
@Jack90_ Can you please confirm that the following re-formulation of the problem is correct and what you need?
I've taken your sample data and just renamed the variables and replaced the values 1:1 with some easier to read strings.
The additional variable "person" would just be the observation in your original sample data.
Problem: For below sample data create clusters of persons (same cluster_id) that share at least one of the two attributes Household and Workplace.
data desired_result;
input person $ household $ workplace $ desired_id;
datalines;
A h01 w01 001
B h02 w02 002
C h03 w01 001
D h03 w03 001
E h02 w02 002
;
For above sample data person A and person D are in the same cluster because they are connected indirectly via person C.
Person A shares the workplace with person C and person D shares the household with person C.
...and that's why I believe this problem belongs somewhere in the realm of data prep for network/link analysis.
I've never written logic for clustering and I couldn't make things work (yet?) via docu for Proc Cluster and Proc Fastclus and any datastep logic I could think of so far feels way too complicated and inefficient.
@Patrick wrote:
I believe this problem belongs somewhere in the realm of data prep for network/link analysis.
I agree and so I think we can use one of the clustering tools provided by others in the past to do the heavy lifting, for example, the SubGraphs macro by PGStats. We just need to create the input dataset containing the information about links between different records.
/* Add record numbers to the data */
data have / view=have;
set test;
record=_n_;
run;
/* Create dataset describing the links between different records */
proc sql;
create table links as
select distinct a.record as from, b.record as to
from have a, have b
where (a.rapp=b.rapp | a.posiz=b.posiz) & a.record<b.record;
quit;
/* Identify connected components ("clusters") */
%SubGraphs(links)
/* Add cluster IDs to the original data */
proc sql;
create table want(drop=record) as
select have.*, clust as new_id format=z3.
from have, clusters
where record=node
order by record;
quit;
Yes @Patrick your re-formulation seems right to me. Honestly at the end I made my task grouping by another variable that could be considered a proxy of what I asked to build with this question, obtaining results that seems acceptable.
Anyway I add some other details about the problem if this could useful. The dataset that I am using collects banking accounts contracts. But the same banking account could be splitted in more than one record in the dataset. Essentially for two potential reasons (both could be present, just one of them, or none):
- the banking account is splitted in more than one record because you have the secured part of the drown amount in a record, and the unsecured part in another record: the two or more records share the same ID contract (the variable RAPP that I provided previously) but they have different POSITION code (the variable POSIZ that I provided previously)
- the banking account has a credit line with an undrown exposure (the credit not used by the counterpart). This undrown part insists on the credit line ID contract, not the banking account, so the two ore more records have different ID contract (again the variable RAPP). But the undrown part is always linked to only one banking account, sharing the same POSITION ID with one of its tranche (again the variable POSIZ).
My aim would be to reaggregate all the records, to see the total undrown and drown amount overall.
Removed because I found a use case where the proposed code didn't work. I'll share working code if I find the time to/can fix it.
data test;
input rapp $ 1-13 posiz $ 15-23;
datalines;
0900028641002 000838931
0600700253462 000838937
0100701192000 000838931
0100701192000 000838930
0600700253462 000838937
;
data have;
set test;
length from to $ 80;
from=rapp; to=posiz;
keep from to;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
data final_want;
if _n_=1 then do;
if 0 then set want(rename=(household=new_id));
declare hash h(dataset:'want(rename=(household=new_id))');
h.definekey('node');
h.definedata('new_id');
h.definedone();
end;
set test;
call missing(new_id);
rc=h.find(key:rapp);
drop rc node;
format new_id z8.;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.