BookmarkSubscribeRSS Feed
chrishull
Obsidian | Level 7

I have a large list of IDs and Fcodes that I am trying to compare. My end result is to be able to output by ID, a count of other IDs that have at least x% of the same Fcodes. Any help is appreciated. I can't wrap my head around how I can do this.

 

My goal is to have a data set that looks like this:

IDnumCount of Other IDs
45058952
45130053
45207454
45779851

 

Here's a sample data set

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
;;;;

 

9 REPLIES 9
Reeza
Super User

Use PROC FREQ to get the N’s and percentages.
Then use a SQL cross join and sort descending by percentages, with a WHERE clause to filter out records that are X% above the amount.

 

Spoiler

@chrishull wrote:

I have a large list of IDs and Fcodes that I am trying to compare. My end result is to be able to output by ID, a count of other IDs that have at least x% of the same Fcodes. Any help is appreciated. I can't wrap my head around how I can do this.

 

My goal is to have a data set that looks like this:

IDnum Count of Other IDs
4505895 2
4513005 3
4520745 4
4577985 1

 

Here's a sample data set

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
;;;;

 


chrishull
Obsidian | Level 7

Thank you for your comment. Would you be able to provide a sample of how you would write the PROC Freq statement? I haven't used proc freq before. 

Reeza
Super User

I'm not seeing the connection between your data and the expected summaries. 

 

For the first example, 

4505895,E001

Why does this have the ID=2 in the results? Can you explain the logic?

 

The percentages are via:

 

proc freq data=have noprint;
table idnum*fcode / out=summaries outpct;
run;

 

chrishull
Obsidian | Level 7

Sorry for the confusion and thank you for the help so far!

 

What I am trying to achieve is to get a count of the number of IDs that have 25% (x%) of the same codes by IDNum. The method that I can think of that would work would be to loop through each IDNum and count the number of matched fee codes, divide that by the IDNum's total codes and write out the "loop" IDnum  with every other IDNum that has 25% matching codes. Then I can count the number of distinct IDNums that match in a separate process. 

 

My data set has 10's of millions of records so I was looking for an efficient method to do the work. That's where I'm struggling....

 

Results I need are just the first 2 columns. The third hopefully illustrates the matching I'm trying to do. 

 

IDNum# of IDNums that have > 25% of the same CodesIDNums that Match
450589524520745, 4577985
452074534505895, 4513005, 4577985
451300524520745, 4577985
457798524520745, 4520745

 

Here's a pivoted version of a smaller data set which I hope explains it a little further.

 

IDNumE001E003E005E006E010E011
45058951  1 1
45207451 1 11
4513005 1  11
4577985 1 111

 

mkeintz
PROC Star

For each IDNUM, you apparently want to count the number of unique FCODE values, call it NFCODES.  Then you want to count the number of IDNUMs that overlap with the current IDNUM on at least 0.25*NFCODES.  And you want to include the IDNUM overlapping with itself in that count:

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
5000001,E102
;;;;

proc sql;
  create table need1 as 
    select a.idnum, 
           case 
             when a.idnum=b.idnum then ' '
             else b.idnum
           end as id2
    from 
      have as a   inner join   have as b
      on a.fcode=b.fcode
    order by a.idnum,id2;
quit;

data want (keep=idnum cutpoint n_ids);
  do n_ids=0 by 0 until (last.idnum);
    do nfcodes=1 by 1 until (last.id2);
      set need1;
      by idnum id2;
    end;
    if id2=' ' then cutpoint=0.25*nfcodes;
    else if nfcodes >=cutpoint then n_ids=n_ids+1;
  end;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

How do we know what threshold "x%" represents? Percent usually implies a numerator and a denominator. What are the numerator and denominator for the percentages?

And then what value of "x"?

DanielLangley
Quartz | Level 8

Hi, not sure specifically if you wanted each IDNum to have the count or only the smallest so there are two tables want which gives all and want2 which gives the min.

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
;;;;

PROC SQL;

CREATE table want AS
	SELECT IDNum, COUNT(fcode) AS Count_ID
	FROM have
	GROUP BY fcode
;
quit;

PROC SQL;

CREATE table want2 AS
	SELECT MIN(IDNum) AS IDNum, COUNT(fcode) AS Count_ID
	FROM have
	GROUP BY fcode
;
quit;
chrishull
Obsidian | Level 7

Thank you everyone for your help. I've managed to hack together a program that will work for my needs. I'm a little hesitant to run against my large data set because I don't think it's that efficient.. This is what I've come up with. Any comments/suggestions are welcome!

 

Thanks again!

Chris

 

data have;
  infile datalines dsd truncover;
  input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
5000001,E102
;;;;

/* generate distinct list of IDnums */
proc sql;
 create table work.dlist as
 select IDnum, count(fcode) as countcodes
 from have
 group by IDnum;
quit;

/* get the number of observations to loop through */
%let dsid=%sysfunc(open(work.dlist));
%let num=%sysfunc(attrn(&dsid, nobs));
%let rc=%sysfunc(close(&dsid));

/* build the IDnum comparison table */
%macro buildIDtable(IDnum=);
  data work.curIDnum;
   set work.have;
   where IDnum= "&IDnum";
  run;
%mend buildIDtable;

/* output the IDnum comparison results to the table */
%macro appendTable;
  proc sql;
   insert into work.need1
   select a.IDnum as REFIDnum, b.IDnum as matchIDnum, b.fcode as matchFcode
   from work.curIDnum a inner join work.have b
    on a.IDnum <> b.idnum and a.fcode = b.fcode;
  quit;
%mend appendTable;
/*copy values of IDnum to macro variables */
data _null_;
 set work.dlist;
 call symput('IDnum'|| put(IDnum,char(7)));
run;
/* call the dowork macro once for each IDnum */
%macro workloop;
 %local x;
 %do x=1 %to &num;
   %buildIDtable(IDnum=&&IDnum&x);
   %appendTable;
 %end;
%mend workloop;

/* start the working loop for each IDnum in the have dataset */
%workloop;

/* count the number of matches by IDnum */
proc sql;
 create table work.need2 as
 select REFIDnum, matchIDnum, count(matchFcode) as countMatch
 from work.need1
 group by refIDnum, matchIDnum;
quit;

/* get the percent match by IDnum */
proc sql;
 create table work.need3 as
 select a.REFIDnum, a.matchIDnum, a.countMatch, b.countCodes as totalCodesM, a.countMatch/b.countCodes as percentMatch
 from work.need2 a inner join work.dlist b
   on a.matchIDnum=b.IDnum;
quit;

/* summarize final results */
proc sql;
 create table work.want as
 select REFIDnum, count(matchIDnum) as totalSimilar
 from work.need3
 where percentmatch> 0.33
 group by REFIDnum;
quit;
Reeza
Super User

I would be curious but don't have time to test this. If you had a 0/1 matrix like in last post, how well do your results correlate with the correlation metric from PROC CORR?

 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1092 views
  • 0 likes
  • 5 in conversation