The subject might be misleading, but what i'm trying to do is take a dataset with matched names, and create a new dataset that lists all matches based on the "best" name (based on highest value in the RECORDS1 column). I have provided both a visual of what i'm trying to do as well as the HAVE/WANT code.
If anyone needs additional information to help solve please let me know. Thanks in advance!
HAVE:
NAME1 | RECORDS1 | NAME2 | RECORDS2 |
TOM | 5243 | TOMMY | 4 |
BRAD | 873 | BRADLEY | 219 |
BRADLEY | 219 | BRAD | 873 |
JOHN | 61017 | JOHNNY | 905 |
JOHNNY | 905 | JOHN | 61017 |
JONATHAN | 500 | JOHNNY | 905 |
WANT:
NAME1 | NAME2 |
TOM | TOMMY |
BRAD | BRADLEY |
JOHN | JOHNNY |
JOHN | JONATHAN |
DATA have;
INFILE DATALINES DSD;
INPUT NAME1 $ RECORDS1 NAME2 $ RECORDS2;
DATALINES;
TOM,5243,TOMMY,4
BRAD,873,BRADLEY,219
BRADLEY,219,BRAD,873
JOHN,61017,JOHNNY,905
JOHNNY,905,JOHN,61017
JONATHAN,500,JOHNNY,905
;
run;
DATA want;
INFILE DATALINES DSD;
INPUT NAME1 $ NAME2 $ ;
DATALINES;
TOM,TOMMY
BRAD,BRADLEY
JOHN,JOHNNY
JOHN,JONATHAN
;
run;
Treat the name pairs as links in a network and get the connected components with proc optnet:
DATA have;
INFILE DATALINES DSD;
INPUT NAME1 :$12. RECORDS1 NAME2 :$12. RECORDS2;
DATALINES;
TOM,5243,TOMMY,4
BRAD,873,BRADLEY,219
BRADLEY,219,BRAD,873
JOHN,61017,JOHNNY,905
JOHNNY,905,JOHN,61017
JONATHAN,500,JOHNNY,905
;
proc optnet data_links=have GRAPH_DIRECTION=UNDIRECTED out_nodes=clusters;
data_links_var from=name1 to=name2;
concomp;
run;
proc sort data=clusters; by concomp node; run;
data want;
do until(last.concomp);
set clusters; by concomp;
if first.concomp then name1 = node;
else do;
name2 = node;
output;
end;
end;
keep name1 name2;
run;
proc print data=want noobs; run;
name1 name2 TOM TOMMY BRAD BRADLEY JOHN JOHNNY JOHN JONATHAN
Can you explain what happened to the JOHNATHAN JOHNNY pair ?
Why is the pair excluded from WANT ?
JOHN 61017 JOHNNY 905 JOHNNY 905 JOHN 61017 JONATHAN 500 JOHNNY 905
What should happen if the BRAD data was
BRAD 873 BRADLEY 219 BRADLEY 2500 BRAD 873
Should a NAME1 value be excluded from output because it appears in a different pair as NAME2 with a lower count of the pair ?
@triley This edit is a new question about the problem.
Would you ever have a single row like the TOM row, but the higher count is in RECORDS2.
So, would the data ever have a case such as the following?
RICKY 215 RICHARD 1618
I want everything to roll up to the highest record count name, so since JOHN matched JOHNNY, and JOHNNY matched JOHNOTHAN,JOHNOTHAN should also roll up under JOHN.
For your BRAD/BRADLEY scenario, the number of records will always be the same for each name, so it will never have a scenario like you mentioned.
Here's a basic solution that works for small "families" (3 or less observations that share a common "survivor"). If you have larger families, you would need to to make this a macro with a do loop and keep running the sort & merge until no survivor swaps were made. If you have a circular association in a family, I worry that this would run infinitely.
DATA have; INFILE DATALINES DSD; INPUT NAME1 $ RECORDS1 NAME2 $ RECORDS2; DATALINES; TOM,5243,TOMMY,4 BRAD,873,BRADLEY,219 BRADLEY,219,BRAD,873 JOHN,61017,JOHNNY,905 JOHNNY,905,JOHN,61017 JONATHAN,500,JOHNNY,905 ; run; /* name with highest count will become "survivor" and always stored in name1 */ /* move the more frequent name into name1 and less frequent name to name2 */ data have2; set have; if records2 > records1 then do; tempname = name2; temprecords = records2; name2 = name1; records2 = records1; name1 = tempname; records1 = temprecords; end; drop tempname temprecords; run; /* name2 is the less frequently used name; call it "nickname" */ /* drop out the redundant pairs */ proc sort data=have2 out=have3 nodupkey; by name1 records1 name2 records2; run; /* create merge key */ data have3; set have3; rename name1=namekey; rename records1=recordskey; run; /* make a copy of the file and sort it on the name2 ("nickname") */ proc sort data=have3 out=have4; by name2 records2 name1 records1; run; /* create a merge key */ data have4; set have4; rename name2=namekey; rename records2=recordskey; run; /* merge dataset onto itself. if nickname also appears in the table as a survivor name, assign real survivor to nickname as new survivor (where nickname appears as survivor). */ data have5; merge have3 (in=a) have4 (in=b); by namekey; if a and not b then do; output; end; if a and b then do; namekey = name3; recordskey = records3; output; end; rename namekey=name1; rename recordskey=records1; drop name3 records3; run;
Thank you for the response. As far as your original question, I only have ~600 total records.
For the code you provided. I think i understand what you are doing, but i wasn't able to reproduce the outcome i was looking for (even after trying to make some modifications). I was trying to modify the parts where you had a data step with a "data" and "set" statement with the same table name (e.g. have3, have4).
After i tried modifying, i was still getting a final dataset with a null value in the "name1" column. Is that expected?
Should I see the transformation from
JOHN 61017 JOHNNY 905 JOHNNY 905 JOHN 61017 JONATHAN 500 JOHNNY 905
to (conceptually) a count ordered chain
JOHN(61017) > JOHNNY(905) > JONATHAN(500)
to a root grouped name list
JOHN JOHNNY JOHN JONATHAN
or more generally
root(N0) > node-1(N1) > node-2(N2) > ... > node-m(Nm)
to
root node-1 root node-2 root ... root node-m
No i don't believe there should ever be a case like you mentioned. There could be a case though like i have below, Where it would be RICKY that i would like to "keep/retain" in NAME1, and have the association to RICHARD and RICK in NAME2
HAVE:
NAME1 | RECORDS1 | NAME2 | RECORDS2 |
RICHARD | 1539 | RICK | 9 |
RICHARD | 1539 | RICKY | 1681 |
RICKY | 1681 | RICK | 9 |
RICKY | 1681 | RICHARD | 1539 |
WANT:
NAME1 | RECORDS1 | NAME2 | RECORDS2 |
RICKY | 1681 | RICK | 9 |
RICKY | 1681 | RICHARD | 1539 |
Treat the name pairs as links in a network and get the connected components with proc optnet:
DATA have;
INFILE DATALINES DSD;
INPUT NAME1 :$12. RECORDS1 NAME2 :$12. RECORDS2;
DATALINES;
TOM,5243,TOMMY,4
BRAD,873,BRADLEY,219
BRADLEY,219,BRAD,873
JOHN,61017,JOHNNY,905
JOHNNY,905,JOHN,61017
JONATHAN,500,JOHNNY,905
;
proc optnet data_links=have GRAPH_DIRECTION=UNDIRECTED out_nodes=clusters;
data_links_var from=name1 to=name2;
concomp;
run;
proc sort data=clusters; by concomp node; run;
data want;
do until(last.concomp);
set clusters; by concomp;
if first.concomp then name1 = node;
else do;
name2 = node;
output;
end;
end;
keep name1 name2;
run;
proc print data=want noobs; run;
name1 name2 TOM TOMMY BRAD BRADLEY JOHN JOHNNY JOHN JONATHAN
Thank you for your response. It looks like the table provided the output i was looking for, but i'm not clear as to if the proc optnet procedure is taking into consideration the "RECORDS" variables. I am not familiar with the procedure so it may be inherently built in, but just wanted to clarify...
I believe i found a solution using a combination of the replies above. Thank you all for your time.
I basically used the "if records2 > records1 then do;" section to get the info in the order i needed, then used the proc optnet to get the clusters, then joined back to my original data to then sort by the appropriate "records" amount and took the one with the largest.
The sample code at https://www.devenezia.com/downloads/sas/samples/#groupbyeither shows how find the groups within data having dual keys and linked by key1 OR key2. Written for a 2004 SAS-L question "How to group people by their first name OR last name".
Here is the same processing rewritten for Proc DS2.
The core concept is using a HASH to map names to groups, and an multidata anti-map HASH of group to names. The anti-map must be used when combining separated groups that become linked. In particular, combining groups requires a traversal one of the groups in the anti-map using has_next/find_next.
* Given: * There is only one records number for each name, no matter how many records contain that name; data have; input NAME1: $10. RECORDS1 NAME2: $10. RECORDS2; format records: 6.; datalines; JONATHAN 500 JOHNNY 905 JOHNO 750 JOHNNY 905 JONNO 415 JOHNO 750 JOHHN 675 JOHNO 750 JOHNNY 905 JOHN 1017 JOHN 1017 JOHNNY 905 TOM 5243 TOMMY 4 BRAD 873 BRADLEY 219 BRADLEY 219 BRAD 873 JON 875 AJONO 775 JOHHNO 904 JON 875 ZIP 250 ZIPP 175 JOHNNY 905 AJONO 775 ; proc datasets nolist lib=work; delete want; run; proc ds2; data _null_; declare package hash name_group_map(); * a mapping from name to group. a name can belong to only one group; declare package hash group_name_map(); * an anti mapping of group to names. a group can have many names; declare char(25) name name1 name2; declare double records records1 records2 group group1 group2; method init(); name_group_map.ordered('ASCENDING'); name_group_map.keys([name]); name_group_map.data([name records group]); name_group_map.defineDone(); group_name_map.multidata('yes'); group_name_map.keys([group]); group_name_map.data([group name]); group_name_map.defineDone(); group = 0; end; method run(); declare double found1 found2 hold_group; declare double rc; set have; * each row represents a link. both ends are a name with a count.; * a name can appear in other rows, but its count will not be different; found1 = name_group_map.find([name1], [name1 records1 group1]) = 0; found2 = name_group_map.find([name2], [name2 records2 group2]) = 0; select; when ( ~found1 and ~found2) do; group + 1; *put 'NOTE: both new' group=; name_group_map.add([name1], [name1 records1 group]); name_group_map.add([name2], [name2 records2 group]); group_name_map.add([group], [group name1]); group_name_map.add([group], [group name2]); end; when ( found1 and ~found2) do; *put 'NOTE: add name2 to name1' group1=; name_group_map.add([name2], [name2 records2 group1]); group_name_map.add([group1], [group1 name2]); end; when ( ~found1 and found2) do; *put 'NOTE: add name1 to name2' group2=; name_group_map.add([name1], [name1 records1 group2]); group_name_map.add([group2], [group2 name1]); end; when ( found1 and found2) do; if group1 = group2 then return; * traverse the multidata of group2 key and migrate each data to the group1 key; hold_group = group; group = group2; rc = group_name_map.find(); do while (rc = 0); name_group_map.find(); name_group_map.replace([name], [name records group1]); group_name_map.add([group1], [group1 name]); if group_name_map.has_next() ne 0 then leave; rc = group_name_map.find_next(); end; group_name_map.removeall(); * remove the anti-map key; group = hold_group; end; otherwise; end; end; method term(); * DS2 does not overwrite existing tables; * DS2 does allow a table option (OVERWRITE=YES), or /OVERWRITE=YES; * However, DS2 Package HASH method OUTPUT does NOT provisio for such, * and will log an error if used, example: * ERROR: Malformed hash data source name <table-name>(overwrite=yes). * * Thus, the coder MUST pre-delete the target output table in an earlier step; name_group_map.output('work.want'); end; enddata; run; quit; options nosource nonotes; proc sort data=want; by group descending records name; run; ods html file='want.html' style=plateau; proc print data=have; proc print noobs data=want; run; ods _all_ close; options source notes;
Sample output
Thank you for providing another solution. As i am not very familiar with hash tables, I was wondering if it would be possible to modify this slightly and take it one step back and use it to come up with the "have" table. In other words, if I had NAME1 and RECORDS1 in a table, and NAME2 and RECORDS2 in a different table, could i use a similar approach to join the 2 to find the matches? I'm not sure if you can use HASH to find "similar" names based on some function (e.g. using COMPGED), but i think where this would be really helpful is in doing that step, since the SQL step I'm currently using takes over a day to produce that table.
Also, if you are able to provide a solution to this, i can submit a new question on SAS communities and will "Accept" your solution since I already accepted one that worked for this one.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.