Counting the number of times values are together


08-03-2016 02:08 PM

Hi all,

I'm trying to count the number of times that one code is claimed at the same time another code is claimed. The original dataset is like this:

group1 code1
group1 code2
group1 code3
group2 code2
group2 code3
group3 code1
group3 code2
group3 code3
group3 code4
group3 code5

I've transposed it (for another purpose) like this:

group1 code1 code2 code3
group2 code2 code3
group3 code1 code2 code3 code4 code5

What I am trying to achieve is results like this:

first code | second code | count claimed together
code1 | code2 | 2
code1 | code3 | 1
code1 | code4 | 1
code1 | code5 | 1
code2 | code1 | 2
code2 | code3 | 3
code2 | code4 | 2
code2 | code5 | 1
code3 | code1 | 2
code3 | code2 | 3
code3 | code4 | 1
code3 | code5 | 1

I know the dataset will be large since I have more than 8,000 unique codes, but I am trying to figure out which codes are billed together most often.
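For illustration only, the pairwise counts can be computed from the long-format data with a short Python sketch (the `rows` list hard-codes the sample data above; this is not the SAS answer, just the logic):

```python
from collections import Counter
from itertools import combinations

# long-format input: one (group, code) row per claim, as in the original dataset
rows = [
    ("group1", "code1"), ("group1", "code2"), ("group1", "code3"),
    ("group2", "code2"), ("group2", "code3"),
    ("group3", "code1"), ("group3", "code2"), ("group3", "code3"),
    ("group3", "code4"), ("group3", "code5"),
]

# collect the set of codes claimed in each group
groups = {}
for group, code in rows:
    groups.setdefault(group, set()).add(code)

# count every unordered pair of codes that appears in the same group
pair_counts = Counter()
for codes in groups.values():
    for a, b in combinations(sorted(codes), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("code1", "code2")])  # prints 2
```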

I appreciate any input.

Thanks!

Accepted Solutions

Solution (08-08-2016 02:01 PM)

08-03-2016 03:27 PM

The transposed data will be more useful here. For illustration purposes, I'm assuming that you have no more than 8 codes per group but the program easily expands if you have more than that.

```
data pairs;
   set transposed_data;
   array codes {8} code1-code8;
   do _m_ = 1 to 7;
      do _n_ = _m_ + 1 to 8;
         if codes{_m_} > ' ' and codes{_n_} > ' ' then do;
            if codes{_m_} < codes{_n_} then code_pair = catx(' ', codes{_m_}, codes{_n_});
            else code_pair = catx(' ', codes{_n_}, codes{_m_});
            output;
         end;
      end;
   end;
run;

proc freq data=pairs;
   tables code_pair / out=pair_counts;
run;
```

The output data set is optional, but might come in handy.
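The data step loops over each row's code array and outputs one ordered pair per non-missing combination. The same logic as an illustrative Python sketch (the `wide` rows mirror the transposed sample data, with `None` for missing cells):

```python
from collections import Counter
from itertools import combinations

# wide (transposed) input: one row per group, up to 8 code columns,
# mirroring the code1-code8 array in the data step
wide = [
    ("code1", "code2", "code3", None, None, None, None, None),
    ("code2", "code3", None, None, None, None, None, None),
    ("code1", "code2", "code3", "code4", "code5", None, None, None),
]

pair_counts = Counter()
for row in wide:
    codes = [c for c in row if c]                # skip missing cells
    for a, b in combinations(codes, 2):          # every m < n pair in the row
        pair_counts[tuple(sorted((a, b)))] += 1  # order the pair, as the catx step does
```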

All Replies


08-04-2016 01:53 PM

Thank you for your help! I was able to process 300m+ observations in under 20 minutes. Impressive.

I have another dataset that's further aggregated (20m observations). Rather than counting each occurrence, could I sum the count?

Group | Count | claim1 | claim2 | claim3 | claim4 | claim5 | claim6 | claim7 | claim8
group1 | 46 | code1 | | | | | | |
group2 | 1 | code1 | code2 | code3 | code4 | code5 | | |
group3 | 1 | code1 | code3 | code4 | | | | |
group4 | 1 | code1 | code3 | code4 | code5 | code6 | code7 | |
group5 | 6 | code1 | code2 | code3 | | | | |
group6 | 25 | code1 | code4 | code5 | | | | |


08-04-2016 02:36 PM

Not quite sure what you're asking for here, but it's possible to do most anything. Do you want this:

code1 80 (sum of 46, 1, 1, 1, 6, 25)

code2 7 (sum of 6, 1)

code3 9 (sum of 1, 1, 1, 6)

Are you now processing individual codes instead of pairs? (If you're still processing pairs, how do you handle the first line that only has one code in it?)


08-04-2016 02:51 PM

Sorry I wasn't clear. I'm still looking for the same information, but rather than counting the occurrences of the pairs, is it possible to maintain a running total of the count for each pair? The output would be something like this:

Pair | Sum | Count |

code1 code2 | 77 | 3 |


08-04-2016 03:03 PM

I'm not seeing where the numeric values of 77 and 3 come from here.

But to answer your question in general, I think it would be necessary to use a hash table to make this happen ... not my area of strength. You could certainly define a hash table based on two keys (the minimum code in a pair and the maximum code in a pair) and maintain two data fields based on those keys as you move through the data. But someone else would have to program that for you ... it's a medium-complexity task for someone who knows hash tables.
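A sketch of the hash-table idea in Python, with a plain dict keyed by the ordered pair standing in for the hash object (the `rows` values are taken from three rows of the aggregated sample; this illustrates the two running totals, not SAS hash-object syntax):

```python
from collections import defaultdict
from itertools import combinations

# one entry per aggregated row: (Count, [codes claimed together])
rows = [
    (46, ["code1"]),
    (1,  ["code1", "code2", "code3", "code4", "code5"]),
    (6,  ["code1", "code2", "code3"]),
]

# per ordered pair: [running sum of Count, number of contributing rows]
totals = defaultdict(lambda: [0, 0])
for count, codes in rows:
    for pair in combinations(sorted(codes), 2):
        totals[pair][0] += count  # Sum field
        totals[pair][1] += 1      # Count-of-rows field
```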


08-03-2016 03:43 PM

Some SQL plus some data step should give what you need:

```
data have;
infile cards expandtabs;
input group$ code$;
cards;
group1 code1
group1 code2
group1 code3
group2 code2
group2 code3
group3 code1
group3 code2
group3 code3
group3 code4
group3 code5
;
/*generate pairs combination within the same group*/
proc sql;
create table _step1 as
select a.*, b.code as _code
from have a
left join have b
on a.group=b.group and a.code ne b.code
;
quit;
/*sort the pair for dedup purposes*/
data _step2;
set _step1;
array cd code _code;
call sortc(of cd(*));
_code_group = catx('-',code,_code);
run;
/*count the number of groups per unique pair; this is essentially the count you need*/
proc sql;
create table want as
select distinct code as first_code, _code as second_code, count (distinct group) as count from _step2
group by _code_group
;
quit;
```
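The SQL self-join has a close analogue in pandas; a sketch assuming the same sample data is loaded into a DataFrame (the names `have`, `step1`, and `want` mirror the table names above):

```python
import pandas as pd

have = pd.DataFrame({
    "group": ["group1"] * 3 + ["group2"] * 2 + ["group3"] * 5,
    "code":  ["code1", "code2", "code3", "code2", "code3",
              "code1", "code2", "code3", "code4", "code5"],
})

# self-join within each group, keeping only one direction of each pair
step1 = have.merge(have, on="group", suffixes=("", "_b"))
step1 = step1[step1["code"] < step1["code_b"]]

# count distinct groups per pair, like count(distinct group) in the SQL step
want = (step1.groupby(["code", "code_b"])["group"]
        .nunique().reset_index(name="count"))
```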


08-04-2016 08:14 AM

```
data have;
infile cards expandtabs;
input group$ code$;
cards;
group1 code1
group1 code2
group1 code3
group2 code2
group2 code3
group3 code1
group3 code2
group3 code3
group3 code4
group3 code5
;
run;
proc transpose data=have out=temp(drop=_:);
by group;
var code;
run;
data temp1;
set temp;
array x{*} col:;
do i=1 to dim(x)-1;
  do j=i+1 to dim(x);
    if not missing(x{i}) and not missing(x{j}) then do;
      first=x{i}; second=x{j}; output;
      first=x{j}; second=x{i}; output;
    end;
  end;
end;
drop i j;
run;
proc freq data=temp1 noprint;
table first*second / out=want list;
run;
```


08-04-2016 01:39 PM

Thank you for your help. Is there a method to limit the variables in the transpose step, say to 25? I know that if I cleaned up my data I wouldn't require the limit; I'm just curious. Thanks again!


08-04-2016 02:05 PM

"to limit the variables in the transpose step?" I don't understand what you mean. Do you want to control the output variables? proc transpose data=have out=temp(keep=id col1-col25)


08-04-2016 02:27 PM

Sorry. I was wondering if it was possible to limit the number of observations per group, maybe by putting in a counter that moves to the next group after 25 variables are read. If I drop anything past variable 25, the observations will still be read, which would increase processing time.

If it's not an option to add a counter by group then it's probably best that I clean my initial dataset first.

Thanks again for your input!


08-05-2016 03:08 AM

Make a count variable:

```
data want;
set have;
by id;
if first.id then count=0;
count+1;
if count le 25;
run;
```
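The same BY-group counter logic can be sketched in Python (a hypothetical `cap_per_group` helper; as with BY processing, the input must already be sorted by id):

```python
def cap_per_group(rows, limit=25):
    """Keep at most `limit` (id, value) rows per id, assuming rows sorted by id."""
    kept, count, prev = [], 0, None
    for id_, value in rows:
        if id_ != prev:           # first.id: reset the counter on a new group
            count, prev = 0, id_
        count += 1
        if count <= limit:        # subsetting IF: keep only the first `limit` rows
            kept.append((id_, value))
    return kept
```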