Solved: Re: Creating a table based on prior records

triley · Posted 04-24-2020 11:04 AM

The subject might be misleading, but what i'm trying to do is take a dataset with matched names, and create a new dataset that lists all matches based on the "best" name (based on highest value in the RECORDS1 column). I have provided both a visual of what i'm trying to do as well as the HAVE/WANT code.

If anyone needs additional information to help solve please let me know. Thanks in advance!

HAVE:

NAME1	RECORDS1	NAME2	RECORDS2
TOM	5243	TOMMY	4
BRAD	873	BRADLEY	219
BRADLEY	219	BRAD	873
JOHN	61017	JOHNNY	905
JOHNNY	905	JOHN	61017
JONATHAN	500	JOHNNY	905

WANT:

NAME1	NAME2
TOM	TOMMY
BRAD	BRADLEY
JOHN	JOHNNY
JOHN	JONATHAN

DATA have;
INFILE DATALINES DSD;
INPUT NAME1 $ RECORDS1 NAME2 $ RECORDS2;
DATALINES;
TOM,5243,TOMMY,4
BRAD,873,BRADLEY,219
BRADLEY,219,BRAD,873
JOHN,61017,JOHNNY,905
JOHNNY,905,JOHN,61017
JONATHAN,500,JOHNNY,905

;
run;

DATA want;
INFILE DATALINES DSD;
INPUT NAME1 $ NAME2 $ ;
DATALINES;
TOM,TOMMY
BRAD,BRADLEY
JOHN,JOHNNY
JOHN,JONATHAN

;
run;

PGStats · Posted 04-27-2020 12:17 AM

Treat the name pairs as links in a network and get the connected components with proc optnet:

DATA have;
INFILE DATALINES DSD;
INPUT NAME1 :$12. RECORDS1 NAME2 :$12. RECORDS2;
DATALINES;
TOM,5243,TOMMY,4
BRAD,873,BRADLEY,219
BRADLEY,219,BRAD,873
JOHN,61017,JOHNNY,905
JOHNNY,905,JOHN,61017
JONATHAN,500,JOHNNY,905
; 

proc optnet data_links=have GRAPH_DIRECTION=UNDIRECTED out_nodes=clusters;
data_links_var from=name1 to=name2;
concomp;
run;

proc sort data=clusters; by concomp node; run;

data want;
do until(last.concomp);
	set clusters; by concomp;
	if first.concomp then name1 = node;
	else do;
		name2 = node;
		output;
		end;
	end;
keep name1 name2;
run;

proc print data=want noobs; run;

name1 	name2
TOM 	TOMMY
BRAD 	BRADLEY
JOHN 	JOHNNY
JOHN 	JONATHAN

PG

View solution in original post

RichardDeVen · Posted 04-24-2020 01:09 PM

Can you explain what happened to the JOHNATHAN JOHNNY pair ?

Why is the pair excluded from WANT ?

JOHN     61017  JOHNNY  905
JOHNNY   905    JOHN    61017
JONATHAN 500    JOHNNY  905

What should happen if the BRAD data was

BRAD	873	BRADLEY	219
BRADLEY	2500	BRAD	873

Should a NAME1 value be excluded from output because it appears in a different pair as NAME2 with a lower count of the pair ?

@triley This edit is a new question about the problem.

Would you ever have a single row like the TOM row, but the higher count is in RECORDS2.

So, would the data ever have a case such as the following?

RICKY	215	RICHARD	1618

triley · Posted 04-24-2020 03:45 PM

I want everything to roll up to the highest record count name, so since JOHN matched JOHNNY, and JOHNNY matched JOHNOTHAN,JOHNOTHAN should also roll up under JOHN.

For your BRAD/BRADLEY scenario, the number of records will always be the same for each name, so it will never have a scenario like you mentioned.

Daryl · Posted 04-24-2020 04:25 PM

How many rows of this do you have?

Is a circular reference ever possible?
JOHN 10 JOHNNY 5
JOHNNY 5 JON 1
JON 1 JOHN 10

Daryl · Posted 04-24-2020 09:49 PM

Here's a basic solution that works for small "families" (3 or less observations that share a common "survivor"). If you have larger families, you would need to to make this a macro with a do loop and keep running the sort & merge until no survivor swaps were made. If you have a circular association in a family, I worry that this would run infinitely.

DATA have;
INFILE DATALINES DSD;
INPUT NAME1 $ RECORDS1 NAME2 $ RECORDS2;
DATALINES;
TOM,5243,TOMMY,4
BRAD,873,BRADLEY,219
BRADLEY,219,BRAD,873
JOHN,61017,JOHNNY,905
JOHNNY,905,JOHN,61017
JONATHAN,500,JOHNNY,905
;
run;

/* name with highest count will become "survivor" and always stored in name1 */
/* move the more frequent name into name1 and less frequent name to name2 */
data have2;
 set have;
 if records2 > records1 then do;
  tempname = name2;
  temprecords = records2;
  name2 = name1;
  records2 = records1;
  name1 = tempname;
  records1 = temprecords;
 end;
 drop tempname temprecords;
run;

/* name2 is the less frequently used name; call it "nickname" */
/* drop out the redundant pairs */
proc sort data=have2 out=have3 nodupkey;
 by name1 records1 name2 records2;
run;

/* create merge key */
data have3; set have3;
rename name1=namekey;
rename records1=recordskey;
run;

/* make a copy of the file and sort it on the name2 ("nickname") */
proc sort data=have3 out=have4;
 by name2 records2 name1 records1;
run;

/* create a merge key */
data have4;
 set have4;
 rename name2=namekey;
 rename records2=recordskey;
run;

/* merge dataset onto itself.  
   if nickname also appears in the table as a survivor name,
   assign real survivor to nickname as new survivor (where nickname appears as survivor). */
data have5;
  merge have3 (in=a) have4 (in=b);
  by namekey;
  if a and not b then do;
   output;
  end;
  if a and b then do;
   namekey = name3;
   recordskey = records3;
  output;
  end; 
  rename namekey=name1;
  rename recordskey=records1;
  drop name3 records3;
run;

triley · Posted 04-27-2020 09:40 AM

Thank you for the response. As far as your original question, I only have ~600 total records.

For the code you provided. I think i understand what you are doing, but i wasn't able to reproduce the outcome i was looking for (even after trying to make some modifications). I was trying to modify the parts where you had a data step with a "data" and "set" statement with the same table name (e.g. have3, have4).

After i tried modifying, i was still getting a final dataset with a null value in the "name1" column. Is that expected?

RichardDeVen · Posted 04-26-2020 06:05 PM

Should I see the transformation from

JOHN     61017  JOHNNY  905
JOHNNY   905    JOHN    61017
JONATHAN 500    JOHNNY  905

to (conceptually) a count ordered chain

JOHN(61017)  >  JOHNNY(905)  >  JONATHAN(500)

to a root grouped name list

JOHN  JOHNNY
JOHN  JONATHAN

or more generally

root(N₀)  >  node-1(N₁)  >  node-2(N₂)  >  ...  >  node-m(N_m)

to

root  node-1
root  node-2
root  ...
root node-m

triley · Posted 04-27-2020 09:36 AM

No i don't believe there should ever be a case like you mentioned. There could be a case though like i have below, Where it would be RICKY that i would like to "keep/retain" in NAME1, and have the association to RICHARD and RICK in NAME2

HAVE:

NAME1	RECORDS1	NAME2	RECORDS2
RICHARD	1539	RICK	9
RICHARD	1539	RICKY	1681
RICKY	1681	RICK	9
RICKY	1681	RICHARD	1539

WANT:

NAME1	RECORDS1	NAME2	RECORDS2
RICKY	1681	RICK	9
RICKY	1681	RICHARD	1539

PGStats · Posted 04-27-2020 12:17 AM

Treat the name pairs as links in a network and get the connected components with proc optnet:

DATA have;
INFILE DATALINES DSD;
INPUT NAME1 :$12. RECORDS1 NAME2 :$12. RECORDS2;
DATALINES;
TOM,5243,TOMMY,4
BRAD,873,BRADLEY,219
BRADLEY,219,BRAD,873
JOHN,61017,JOHNNY,905
JOHNNY,905,JOHN,61017
JONATHAN,500,JOHNNY,905
; 

proc optnet data_links=have GRAPH_DIRECTION=UNDIRECTED out_nodes=clusters;
data_links_var from=name1 to=name2;
concomp;
run;

proc sort data=clusters; by concomp node; run;

data want;
do until(last.concomp);
	set clusters; by concomp;
	if first.concomp then name1 = node;
	else do;
		name2 = node;
		output;
		end;
	end;
keep name1 name2;
run;

proc print data=want noobs; run;

name1 	name2
TOM 	TOMMY
BRAD 	BRADLEY
JOHN 	JOHNNY
JOHN 	JONATHAN

PG

triley · Posted 04-27-2020 09:43 AM

Thank you for your response. It looks like the table provided the output i was looking for, but i'm not clear as to if the proc optnet procedure is taking into consideration the "RECORDS" variables. I am not familiar with the procedure so it may be inherently built in, but just wanted to clarify...

triley · Posted 04-27-2020 12:21 PM

I believe i found a solution using a combination of the replies above. Thank you all for your time.

I basically used the "if records2 > records1 then do;" section to get the info in the order i needed, then used the proc optnet to get the clusters, then joined back to my original data to then sort by the appropriate "records" amount and took the one with the largest.

RichardDeVen · Posted 04-29-2020 08:56 PM

The sample code at https://www.devenezia.com/downloads/sas/samples/#groupbyeither shows how find the groups within data having dual keys and linked by key1 OR key2. Written for a 2004 SAS-L question "How to group people by their first name OR last name".

Here is the same processing rewritten for Proc DS2.

The core concept is using a HASH to map names to groups, and an multidata anti-map HASH of group to names. The anti-map must be used when combining separated groups that become linked. In particular, combining groups requires a traversal one of the groups in the anti-map using has_next/find_next.

* Given:
* There is only one records number for each name, no matter how many records contain that name;
data have;
input 
NAME1: $10. RECORDS1 NAME2: $10. RECORDS2; 
format records: 6.;
datalines;
JONATHAN 500 JOHNNY   905
JOHNO    750 JOHNNY   905
JONNO    415 JOHNO    750
JOHHN    675 JOHNO    750
JOHNNY   905 JOHN    1017
JOHN    1017 JOHNNY   905
TOM     5243 TOMMY      4
BRAD     873 BRADLEY  219
BRADLEY  219 BRAD     873
JON      875 AJONO    775
JOHHNO   904 JON      875
ZIP      250 ZIPP     175
JOHNNY   905 AJONO    775
;

proc datasets nolist lib=work;
  delete want;
run;

proc ds2;
  data _null_;
    declare package hash name_group_map();     * a mapping from name to group.  a name can belong to only one group;
    declare package hash group_name_map();     * an anti mapping of group to names. a group can have many names;

    declare char(25) name name1 name2;
    declare double records records1 records2 group group1 group2;

    method init();
      name_group_map.ordered('ASCENDING');
      name_group_map.keys([name]);
      name_group_map.data([name records group]);
      name_group_map.defineDone();

      group_name_map.multidata('yes');
      group_name_map.keys([group]);
      group_name_map.data([group name]);
      group_name_map.defineDone();

      group = 0;
    end;

    method run();
      declare double found1 found2 hold_group;
      declare double rc;

      set have;

      * each row represents a link.  both ends are a name with a count.;
      * a name can appear in other rows, but its count will not be different;

      found1 = name_group_map.find([name1], [name1 records1 group1]) = 0;
      found2 = name_group_map.find([name2], [name2 records2 group2]) = 0;

      select;
        when ( ~found1 and ~found2) 
                do;
                    group + 1;
                    *put 'NOTE: both new' group=;
                    name_group_map.add([name1], [name1 records1 group]);
                    name_group_map.add([name2], [name2 records2 group]);
                    group_name_map.add([group], [group name1]);
                    group_name_map.add([group], [group name2]);
                end;
        when (  found1 and ~found2) 
                do;
                    *put 'NOTE: add name2 to name1' group1=;
                    name_group_map.add([name2], [name2 records2 group1]);
                    group_name_map.add([group1], [group1 name2]);
                end;
        when ( ~found1 and  found2) 
                do;
                    *put 'NOTE: add name1 to name2' group2=;
                    name_group_map.add([name1], [name1 records1 group2]);
                    group_name_map.add([group2], [group2 name1]);
                end;
        when (  found1 and  found2)
                do;
                  if group1 = group2 then return;
                  * traverse the multidata of group2 key and migrate each data to the group1 key; 
                  hold_group = group;
                  group = group2;
                  rc = group_name_map.find();
                  do while (rc = 0);
                    name_group_map.find();
                    name_group_map.replace([name], [name records group1]);
                    group_name_map.add([group1], [group1 name]);

                    if group_name_map.has_next() ne 0 then leave;
                    rc = group_name_map.find_next();
                  end;
                  group_name_map.removeall();  * remove the anti-map key;
                  group = hold_group;
                end;
        otherwise;
      end;
    end;

    method term();
      * DS2 does not overwrite existing tables;
      * DS2 does allow a table option (OVERWRITE=YES), or /OVERWRITE=YES;
      * However, DS2 Package HASH method OUTPUT does NOT provisio for such,
      * and will log an error if used, example:
      *   ERROR: Malformed hash data source name <table-name>(overwrite=yes).
      *
      * Thus, the coder MUST pre-delete the target output table in an earlier step;

      name_group_map.output('work.want');
    end;
  enddata;
run;
quit;
options nosource nonotes;

proc sort data=want;
  by group descending records name;
run;

ods html file='want.html' style=plateau;
proc print data=have;
proc print noobs data=want;
run;
ods _all_ close;

options source notes;

Sample output

triley · Posted 05-04-2020 10:17 AM

Thank you for providing another solution. As i am not very familiar with hash tables, I was wondering if it would be possible to modify this slightly and take it one step back and use it to come up with the "have" table. In other words, if I had NAME1 and RECORDS1 in a table, and NAME2 and RECORDS2 in a different table, could i use a similar approach to join the 2 to find the matches? I'm not sure if you can use HASH to find "similar" names based on some function (e.g. using COMPGED), but i think where this would be really helpful is in doing that step, since the SQL step I'm currently using takes over a day to produce that table.

Also, if you are able to provide a solution to this, i can submit a new question on SAS communities and will "Accept" your solution since I already accepted one that worked for this one.

RichardDeVen · Posted 05-04-2020 04:07 PM

Just create a new question. It sounds like an outer join of TABLE1 and TABLE2 would be needed, so in the new question specify information like how many names there are in each table, and what criteria are to be used to 'match' names which would then be your 'links' of this question. Once you have that joined table the answers in this question can be used to create the 'super-sets' from the outer join. The presumption would be that a name in either table can be matched to more than one row in the other table.

Classroom Training Available!