BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
chrishull
Obsidian | Level 7

HI All,

 

I'm trying to count the number of times that one code is claimed at the same time another code is claimed. The original dataset is like this:

 

 

group1	code1
group1	code2
group1	code3
group2	code2
group2	code3
group3	code1
group3	code2
group3	code3
group3	code4
group3	code5

I've transposed it (for another purpose) like this:

 

 

 

group1	code1	code2	code3		
group2	code2	code3			
group3	code1	code2	code3	code4	code5

 

What I am trying to acheive is results like this:

 

first codesecond codecount claimed together
code1code22
code1code31
code1code41
code1code51
code2code12
code2code33
code2code42
code2code51
code3code12
code3code23
code3code41
code3code51

 

I know that the dataset will be large since I have more than 8000 unique codes but I am trying to figure out a way to be able to know what codes are billed together the most.

 

I appreciate any input.

 

Thanks! 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

The transposed data will be more useful here.  For illustration purposes, I'm assuming that you have no more than 8 codes per group but the program easily expands if you have more than that.

 

data pairs;

set transposed_data;

array codes {8} code1-code8;

do _m_ = 1 to 7;

   do _n_=_m_+1 to 8;

      if codes{_m_} > ' ' and codes{_n_} > ' ' then do;

         if codes{_m_} < codes{_n_} then code_pair = catx(' ', codes{_m_}, codes{_n_});

         else code_pair = catx(' ', codes{_n_}, codes{_m_});

         output;

      end;

   end;

end;

run;

 

proc freq data=pairs;

   tables code_pair  / out=pair_counts;

run;

 

The output data set is optional, but might come in handy.

View solution in original post

11 REPLIES 11
Astounding
PROC Star

The transposed data will be more useful here.  For illustration purposes, I'm assuming that you have no more than 8 codes per group but the program easily expands if you have more than that.

 

data pairs;

set transposed_data;

array codes {8} code1-code8;

do _m_ = 1 to 7;

   do _n_=_m_+1 to 8;

      if codes{_m_} > ' ' and codes{_n_} > ' ' then do;

         if codes{_m_} < codes{_n_} then code_pair = catx(' ', codes{_m_}, codes{_n_});

         else code_pair = catx(' ', codes{_n_}, codes{_m_});

         output;

      end;

   end;

end;

run;

 

proc freq data=pairs;

   tables code_pair  / out=pair_counts;

run;

 

The output data set is optional, but might come in handy.

chrishull
Obsidian | Level 7

Thank you for your help! I was able to process 300m+ observations in under 20 minutes. Impressive.

 

I have another dataset that's further aggregated (20m observations). Rather than counting each occurence, could I sum the count?

 

GroupCountclaim1claim2claim3claim4claim5claim6claim7claim8
group146code1       
group21code1code2code3code4code5   
group31code1code3code4     
group41code1code3code4code5code6code7  
group56code1code2code3     
group625code1code4code5     

 

 

Astounding
PROC Star

Not quite sure what you're asking for here, but it's possible to do most anything.  Do you want this:

 

code1  80  (sum of 46, 1, 1, 1, 6, 25)

 

code2  7  (sum of 6, 1)

 

code3  9  (sum of 1, 1, 1, 6)

 

Are you now processing individual codes instead of pairs?  (If you're still processing pairs, how do you handle the first line that only has one code in it?)

chrishull
Obsidian | Level 7

Sorry I wasn't clear. I'm still looking to get the same information but rather than counting the occurrences of the pairs, is it possible to maintain a running total of the count for each pair. The output would be something like this:

 

PairSumCount
code1 code2773
Astounding
PROC Star

I'm not seeing where the numeric values of 77 and 3 come from here.

 

But to answer your question in general, I think it would be necessary to use a hash table to make this happen ... not my area of strength.  You code certainly could define a hash table that is based on two keys (the minimum code in a pair and the maximum code in a pair), and maintain two data fields based on those two keys as you move through the data.  But someone else would have to program that for you ... it's a medium complexity task for someone who knows hash tables.

Haikuo
Onyx | Level 15

Some SQL plus some data step should give what you need:

 

data have;
	infile cards expandtabs;
	input group$	code$;
	cards;
group1 code1
group1	code2
group1	code3
group2	code2
group2	code3
group3	code1
group3	code2
group3	code3
group3	code4
group3	code5
;

/*generate pairs combination within the same group*/
proc sql;
	create table _step1 as
		select a.*, b.code as _code
			from have a
				left join have b
					on a.group=b.group and a.code ne b.code
	;
quit;

/*sort the pair so for dedup purpose*/
data _step2;
	set _step1;
	array cd code _code;
	call sortc(of cd(*));
	_code_group = catx('-',code,_code);
run;

/*count the group number by unique pair, this is the essentially the count you need*/
proc sql;
	create table want  as 
		select distinct  code as first_code, _code as second_code, count (distinct group) as count from _step2 
			group by _code_group
	;
quit;
Ksharp
Super User
data have;
	infile cards expandtabs;
	input group$	code$;
	cards;
group1 code1
group1	code2
group1	code3
group2	code2
group2	code3
group3	code1
group3	code2
group3	code3
group3	code4
group3	code5
;
run;
proc transpose data=have out=temp(drop=_:);
 by group;
 var code;
run;
data temp1;
 set temp;
 array x{*} col:;
 do i=1 to dim(x)-1;
  do j=i+1 to dim(x);
   if not missing(x{i}) and not missing(x{j}) then do;
    first=x{i};second=x{j};output;
    first=x{j};second=x{i};output;
   end;
  end;
 end;
 drop i j;
run;
proc freq data=temp1 noprint;
 table first*second/out=want list;
run;


chrishull
Obsidian | Level 7
Thank you for your help. Is there a method to limit the variables in the transpose step? say to 25? I know that I could clean up my data would wouldn't require the limit. I'm just curious. Thanks again!
Ksharp
Super User
"o limit the variables in the transpose step? "
I don't understand what you mean.
You want control the output variable ?

proc transpose data=have out=temp(keep=id col1-col25)


chrishull
Obsidian | Level 7

Sorry. I was wondering if it was possible to limit the number of observations per group. Maybe by putting in a counter to move to the next group after 25 variables are read. If I drop anything past variable 25, the observations will still be read which would increase processing times.

 

If it's not an option to add a counter by group then it's probably best that I clean my initial dataset first.

 

Thanks again for your input!

Ksharp
Super User
Make a count variable:

data want;
 set have;
 by id;
 if first.id then count=0;
 count+1;
 if count le 25;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 11 replies
  • 2396 views
  • 3 likes
  • 4 in conversation