DATA Step, Macro, Functions and more

Comparison by group and output the number of IDs that match at a percent

Reply
Contributor
Posts: 22

Comparison by group and output the number of IDs that match at a percent

I have a large list of IDs and Fcodes that I am trying to compare. My end result is to be able to output by ID, a count of other IDs that have at least x% of the same Fcodes. Any help is appreciated. I can't wrap my head around how I can do this.

 

My goal is to have a data set that looks like this:

IDnumCount of Other IDs
45058952
45130053
45207454
45779851

 

Here's a sample data set

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
;;;;

 

Super User
Posts: 23,357

Re: Comparison by group and output the number of IDs that match at a percent

[ Edited ]
Posted in reply to chrishull

Use PROC FREQ to get the N’s and percentages.
Then use a SQL cross join and sort descending by percentages, with a WHERE clause to filter out records that are X% above the amount.

 

Spoiler

@chrishull wrote:

I have a large list of IDs and Fcodes that I am trying to compare. My end result is to be able to output by ID, a count of other IDs that have at least x% of the same Fcodes. Any help is appreciated. I can't wrap my head around how I can do this.

 

My goal is to have a data set that looks like this:

IDnum Count of Other IDs
4505895 2
4513005 3
4520745 4
4577985 1

 

Here's a sample data set

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
;;;;

 


Contributor
Posts: 22

Re: Comparison by group and output the number of IDs that match at a percent

Thank you for your comment. Would you be able to provide a sample of how you would write the PROC Freq statement? I haven't used proc freq before. 

Super User
Posts: 23,357

Re: Comparison by group and output the number of IDs that match at a percent

Posted in reply to chrishull

I'm not seeing the connection between your data and the expected summaries. 

 

For the first example, 

4505895,E001

Why does this have the ID=2 in the results? Can you explain the logic?

 

The percentages are via:

 

proc freq data=have noprint;
table idnum*fcode / out=summaries outpct;
run;

 

Contributor
Posts: 22

Re: Comparison by group and output the number of IDs that match at a percent

Sorry for the confusion and thank you for the help so far!

 

What I am trying to achieve is to get a count of the number of IDs that have 25% (x%) of the same codes by IDNum. The method that I can think of that would work would be to loop through each IDNum and count the number of matched fee codes, divide that by the IDNum's total codes and write out the "loop" IDnum  with every other IDNum that has 25% matching codes. Then I can count the number of distinct IDNums that match in a separate process. 

 

My data set has 10's of millions of records so I was looking for an efficient method to do the work. That's where I'm struggling....

 

Results I need are just the first 2 columns. The third hopefully illustrates the matching I'm trying to do. 

 

IDNum# of IDNums that have > 25% of the same CodesIDNums that Match
450589524520745, 4577985
452074534505895, 4513005, 4577985
451300524520745, 4577985
457798524520745, 4520745

 

Here's a pivoted version of a smaller data set which I hope explains it a little further.

 

IDNumE001E003E005E006E010E011
45058951  1 1
45207451 1 11
4513005 1  11
4577985 1 111

 

Trusted Advisor
Posts: 1,312

Re: Comparison by group and output the number of IDs that match at a percent

Posted in reply to chrishull

For each IDNUM, you apparently want to count the number of unique FCODE values, call it NFCODES.  Then you want to count the number of IDNUMs that overlap with the current IDNUM on at least 0.25*NFCODES.  And you want to include the IDNUM overlapping with itself in that count:

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
5000001,E102
;;;;

proc sql;
  create table need1 as 
    select a.idnum, 
           case 
             when a.idnum=b.idnum then ' '
             else b.idnum
           end as id2
    from 
      have as a   inner join   have as b
      on a.fcode=b.fcode
    order by a.idnum,id2;
quit;

data want (keep=idnum cutpoint n_ids);
  do n_ids=0 by 0 until (last.idnum);
    do nfcodes=1 by 1 until (last.id2);
      set need1;
      by idnum id2;
    end;
    if id2=' ' then cutpoint=0.25*nfcodes;
    else if nfcodes >=cutpoint then n_ids=n_ids+1;
  end;
run;
Super User
Posts: 13,358

Re: Comparison by group and output the number of IDs that match at a percent

Posted in reply to chrishull

How do we know what threshold "x%" represents? Percent usually implies a numerator and a denominator. What are the numerator and denominator for the percentages?

And then what value of "x"?

Occasional Contributor
Posts: 10

Re: Comparison by group and output the number of IDs that match at a percent

[ Edited ]
Posted in reply to chrishull

Hi, not sure specifically if you wanted each IDNum to have the count or only the smallest so there are two tables want which gives all and want2 which gives the min.

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
;;;;

PROC SQL;

CREATE table want AS
	SELECT IDNum, COUNT(fcode) AS Count_ID
	FROM have
	GROUP BY fcode
;
quit;

PROC SQL;

CREATE table want2 AS
	SELECT MIN(IDNum) AS IDNum, COUNT(fcode) AS Count_ID
	FROM have
	GROUP BY fcode
;
quit;
Contributor
Posts: 22

Re: Comparison by group and output the number of IDs that match at a percent

Posted in reply to chrishull

Thank you everyone for your help. I've managed to hack together a program that will work for my needs. I'm a little hesitant to run against my large data set because I don't think it's that efficient.. This is what I've come up with. Any comments/suggestions are welcome!

 

Thanks again!

Chris

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
5000001,E102
;;;;

/* generate distinct list of IDnums */
proc sql;
 create table work.dlist as
 select IDnum, count(fcode) as countcodes
 from have
 group by IDnum;
quit;

/* get the number of observations to loop through */
%let dsid=%sysfunc(open(work.dlist));
%let num=%sysfunc(attrn(&dsid, nobs));
%let rc=%sysfunc(close(&dsid));

/* build the IDnum comparison table */
%macro buildIDtable(IDnum=);
  data work.curIDnum;
   set work.have;
   where IDnum= "&IDnum";
  run;
%mend buildIDtable;

/* output the IDnum comparison results to the table */
%macro appendTable;
  proc sql;
   insert into work.need1
   select a.IDnum as REFIDnum, b.IDnum as matchIDnum, b.fcode as matchFcode
   from work.curIDnum a inner join work.have b
    on a.IDnum <> b.idnum and a.fcode = b.fcode;
  quit;
%mend appendTable;
/*copy values of IDnum to macro variables */
data _null_;
 set work.dlist;
 call symput('IDnum'|| put(IDnum,char(7)));
run;
/* call the dowork macro once for each IDnum */
%macro workloop;
 %local x;
 %do x=1 %to &num;
   %buildIDtable(IDnum=&&IDnum&x);
   %appendTable;
 %end;
%mend workloop;

/* start the working loop for each IDnum in the have dataset */
%workloop;

/* count the number of matches by IDnum */
proc sql;
 create table work.need2 as
 select REFIDnum, matchIDnum, count(matchFcode) as countMatch
 from work.need1
 group by refIDnum, matchIDnum;
quit;

/* get the percent match by IDnum */
proc sql;
 create table work.need3 as
 select a.REFIDnum, a.matchIDnum, a.countMatch, b.countCodes as totalCodesM, a.countMatch/b.countCodes as percentMatch
 from work.need2 a inner join work.dlist b
   on a.matchIDnum=b.IDnum;
quit;

/* summarize final results */
proc sql;
 create table work.want as
 select REFIDnum, count(matchIDnum) as totalSimilar
 from work.need3
 where percentmatch> 0.33
 group by REFIDnum;
quit;
Super User
Posts: 23,357

Re: Comparison by group and output the number of IDs that match at a percent

Posted in reply to chrishull

I would be curious but don't have time to test this. If you had a 0/1 matrix like in last post, how well do your results correlate with the correlation metric from PROC CORR?

 

 

Ask a Question
Discussion stats
  • 9 replies
  • 180 views
  • 0 likes
  • 5 in conversation