I have a large list of IDs and Fcodes that I am trying to compare. My end result is to be able to output by ID, a count of other IDs that have at least x% of the same Fcodes. Any help is appreciated. I can't wrap my head around how I can do this.
My goal is to have a data set that looks like this:
IDnum | Count of Other IDs |
4505895 | 2 |
4513005 | 3 |
4520745 | 4 |
4577985 | 1 |
Here's a sample data set
data have;
infile datalines dsd truncover;
input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
;;;;
Use PROC FREQ to get the N’s and percentages.
Then use a SQL cross join and sort descending by percentages, with a WHERE clause to filter out records that are X% above the amount.
@chrishull wrote:
I have a large list of IDs and Fcodes that I am trying to compare. My end result is to be able to output by ID, a count of other IDs that have at least x% of the same Fcodes. Any help is appreciated. I can't wrap my head around how I can do this.
My goal is to have a data set that looks like this:
IDnum Count of Other IDs 4505895 2 4513005 3 4520745 4 4577985 1
Here's a sample data set
data have; infile datalines dsd truncover; input IDnum:$7. fcode:$4.; datalines4; 4505895,E001 4520745,E001 4513005,E003 4577985,E003 4520745,E005 4505895,E006 4577985,E006 4513005,E010 4520745,E010 4577985,E010 4505895,E011 4513005,E011 4520745,E011 4577985,E011 4505895,E012 4513005,E012 4520745,E012 4505895,E016 4520745,E016 4505895,E017 4520745,E017 4577985,E017 4505895,E020 4520745,E020 4577985,E020 4505895,E022 4513005,E022 4520745,E022 4577985,E022 4505895,E023 4513005,E023 4520745,E023 4577985,E023 4520745,E024 4505895,E025 4577985,E102 ;;;;
Thank you for your comment. Would you be able to provide a sample of how you would write the PROC Freq statement? I haven't used proc freq before.
I'm not seeing the connection between your data and the expected summaries.
For the first example,
4505895,E001
Why does this have the ID=2 in the results? Can you explain the logic?
The percentages are via:
proc freq data=have noprint;
table idnum*fcode / out=summaries outpct;
run;
Sorry for the confusion and thank you for the help so far!
What I am trying to achieve is to get a count of the number of IDs that have 25% (x%) of the same codes by IDNum. The method that I can think of that would work would be to loop through each IDNum and count the number of matched fee codes, divide that by the IDNum's total codes and write out the "loop" IDnum with every other IDNum that has 25% matching codes. Then I can count the number of distinct IDNums that match in a separate process.
My data set has 10's of millions of records so I was looking for an efficient method to do the work. That's where I'm struggling....
Results I need are just the first 2 columns. The third hopefully illustrates the matching I'm trying to do.
IDNum | # of IDNums that have > 25% of the same Codes | IDNums that Match |
4505895 | 2 | 4520745, 4577985 |
4520745 | 3 | 4505895, 4513005, 4577985 |
4513005 | 2 | 4520745, 4577985 |
4577985 | 2 | 4520745, 4520745 |
Here's a pivoted version of a smaller data set which I hope explains it a little further.
IDNum | E001 | E003 | E005 | E006 | E010 | E011 |
4505895 | 1 | 1 | 1 | |||
4520745 | 1 | 1 | 1 | 1 | ||
4513005 | 1 | 1 | 1 | |||
4577985 | 1 | 1 | 1 | 1 |
For each IDNUM, you apparently want to count the number of unique FCODE values, call it NFCODES. Then you want to count the number of IDNUMs that overlap with the current IDNUM on at least 0.25*NFCODES. And you want to include the IDNUM overlapping with itself in that count:
data have;
infile datalines dsd truncover;
input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
5000001,E102
;;;;
proc sql;
create table need1 as
select a.idnum,
case
when a.idnum=b.idnum then ' '
else b.idnum
end as id2
from
have as a inner join have as b
on a.fcode=b.fcode
order by a.idnum,id2;
quit;
data want (keep=idnum cutpoint n_ids);
do n_ids=0 by 0 until (last.idnum);
do nfcodes=1 by 1 until (last.id2);
set need1;
by idnum id2;
end;
if id2=' ' then cutpoint=0.25*nfcodes;
else if nfcodes >=cutpoint then n_ids=n_ids+1;
end;
run;
How do we know what threshold "x%" represents? Percent usually implies a numerator and a denominator. What are the numerator and denominator for the percentages?
And then what value of "x"?
Hi, not sure specifically if you wanted each IDNum to have the count or only the smallest so there are two tables want which gives all and want2 which gives the min.
data have;
infile datalines dsd truncover;
input IDnum:$7. fcode:$4.;
datalines4;
4505895,E001
4520745,E001
4513005,E003
4577985,E003
4520745,E005
4505895,E006
4577985,E006
4513005,E010
4520745,E010
4577985,E010
4505895,E011
4513005,E011
4520745,E011
4577985,E011
4505895,E012
4513005,E012
4520745,E012
4505895,E016
4520745,E016
4505895,E017
4520745,E017
4577985,E017
4505895,E020
4520745,E020
4577985,E020
4505895,E022
4513005,E022
4520745,E022
4577985,E022
4505895,E023
4513005,E023
4520745,E023
4577985,E023
4520745,E024
4505895,E025
4577985,E102
;;;;
PROC SQL;
CREATE table want AS
SELECT IDNum, COUNT(fcode) AS Count_ID
FROM have
GROUP BY fcode
;
quit;
PROC SQL;
CREATE table want2 AS
SELECT MIN(IDNum) AS IDNum, COUNT(fcode) AS Count_ID
FROM have
GROUP BY fcode
;
quit;
Thank you everyone for your help. I've managed to hack together a program that will work for my needs. I'm a little hesitant to run against my large data set because I don't think it's that efficient.. This is what I've come up with. Any comments/suggestions are welcome!
Thanks again!
Chris
data have; infile datalines dsd truncover; input IDnum:$7. fcode:$4.; datalines4; 4505895,E001 4520745,E001 4513005,E003 4577985,E003 4520745,E005 4505895,E006 4577985,E006 4513005,E010 4520745,E010 4577985,E010 4505895,E011 4513005,E011 4520745,E011 4577985,E011 4505895,E012 4513005,E012 4520745,E012 4505895,E016 4520745,E016 4505895,E017 4520745,E017 4577985,E017 4505895,E020 4520745,E020 4577985,E020 4505895,E022 4513005,E022 4520745,E022 4577985,E022 4505895,E023 4513005,E023 4520745,E023 4577985,E023 4520745,E024 4505895,E025 4577985,E102 5000001,E102 ;;;; /* generate distinct list of IDnums */ proc sql; create table work.dlist as select IDnum, count(fcode) as countcodes from have group by IDnum; quit; /* get the number of observations to loop through */ %let dsid=%sysfunc(open(work.dlist)); %let num=%sysfunc(attrn(&dsid, nobs)); %let rc=%sysfunc(close(&dsid)); /* build the IDnum comparison table */ %macro buildIDtable(IDnum=); data work.curIDnum; set work.have; where IDnum= "&IDnum"; run; %mend buildIDtable; /* output the IDnum comparison results to the table */ %macro appendTable; proc sql; insert into work.need1 select a.IDnum as REFIDnum, b.IDnum as matchIDnum, b.fcode as matchFcode from work.curIDnum a inner join work.have b on a.IDnum <> b.idnum and a.fcode = b.fcode; quit; %mend appendTable; /*copy values of IDnum to macro variables */ data _null_; set work.dlist; call symput('IDnum'|| put(IDnum,char(7))); run; /* call the dowork macro once for each IDnum */ %macro workloop; %local x; %do x=1 %to # %buildIDtable(IDnum=&&IDnum&x); %appendTable; %end; %mend workloop; /* start the working loop for each IDnum in the have dataset */ %workloop; /* count the number of matches by IDnum */ proc sql; create table work.need2 as select REFIDnum, matchIDnum, count(matchFcode) as countMatch from work.need1 group by refIDnum, matchIDnum; quit; /* get the percent match by IDnum */ proc sql; create table work.need3 as select a.REFIDnum, a.matchIDnum, a.countMatch, b.countCodes as totalCodesM, a.countMatch/b.countCodes as percentMatch from work.need2 a inner join work.dlist b on a.matchIDnum=b.IDnum; quit; /* summarize final results */ proc sql; create table work.want as select REFIDnum, count(matchIDnum) as totalSimilar from work.need3 where percentmatch> 0.33 group by REFIDnum; quit;
I would be curious but don't have time to test this. If you had a 0/1 matrix like in last post, how well do your results correlate with the correlation metric from PROC CORR?
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.