Solved: Re: Flag overlapping ranges

Cruise · Posted 06-11-2018 08:46 AM

I have a medical procedure codes with ranges overlapped at some occasions. I'd like to flag when two ranges overlap and create variable flag_overlap=0; else 1. Below approach doesn't seem to do the job. Any help is appreciate. Thanks.


data have; 
input type code1 $ code2 $;
cards;   
1   00100 00222
2   00300 00352
3   00400 00474
4   00500 00580
5   00600 00670
6   00700 00797
7   00800 00882
8   00902 00952
9   01112 01173
10  01200 01274
11  01320 01444
12  01462 01522
13  01610 01680
14  01710 01782
15  01810 01860
16  01916 01936
17  01951 01953
18  01958 01969
19  01990 01999
20  10021 10022
21  10030 19499
22  20005 29999
23  30000 32999
24  33010 37799
25  38100 38999
26  39000 39599
27  40490 49999
28  50010 53899
29  54000 55899
30  55920 55920
31  55970 55980
32  56405 58999
33  59000 59899
34  60000 60699
35  61000 64999
36  65091 68899
37  69000 69979
38  69990 69990
39  70010 76499
40  76506 76999
41  77001 77022
42  77053 77067
43  77071 77086
44  77261 77799
45  78012 79999
46  0001M 0013M
47  81490 81599
48  0001U 0034U
49  80047 80076
50  80150 80299
51  80305 80377
52  80400 80439
53  80500 80502
54  81000 81099
55  81105 81479
56  81410 81471
57  82009 84999
58  85002 85999
59  86000 86849
60  86850 86999
61  87003 87999
62  88000 88099
63  88104 88199
64  88230 88299
65  88300 88399
66  88720 88749
67  89049 89240
68  89250 89398
69  90281 90399
70  90460 90474
71  90476 90749
72  90785 90899
73  90901 90911
74  90935 90999
75  91010 91299 
76  92002 92499 
77  92502 92700 
78  92950 93799 
79  93880 93998 
80  94002 94799 
81  95004 95199 
82  95250 95251 
83  95803 96020 
84  96040 96040 
85  96101 96127 
86  96150 96161 
87  96360 96549 
88  96567 96574 
89  96900 96999 
90  97010 97799 
91  97802 97804 
92  97810 97814 
93  98925 98929 
94  98940 98943 
95  98960 98962 
96  98966 98969 
97  99000 99091 
98  99100 99140 
99  99151 99157 
100 99170 99199 
101 99500 99602 
102 99605 99607 
103 99201 99215 
104 99217 99226 
105 99221 99239 
106 99241 99255 
107 99281 99288 
108 99291 99292 
109 99304 99318 
110 99324 99337 
111 99339 99340 
112 99341 99350 
113 99354 99416 
114 99366 99368 
115 99374 99380 
116 99381 99429 
117 99441 99449 
118 99450 99456 
119 99460 99463 
120 99464 99465 
121 99466 99480 
122 99483 99483 
123 99484 99484 
124 99487 99490 
125 99492 99494 
126 99495 99496 
127 99497 99498 
128 99499 99499 
129 0001F 0015F 
130 0500F 0584F 
131 1000F 1505F 
132 2000F 2060F 
133 3006F 3776F 
134 4000F 4563F 
135 5005F 5250F 
136 6005F 6150F 
137 7010F 7025F 
138 9001F 9007F 
139 0042T 0042T 
140 0054T 0054T 
141 0055T 0055T 
142 0058T 0058T 
;

data want; 
set have(rename=(code1=code11 code2=code22));
retain code1 code2; code1=code11; code2=code22;
if code11 <= lag(code22) 
then flag_overlap=1; 
else flag_overlap=0;  
run;

Astounding · Posted 06-11-2018 09:07 AM

Can we assume that you meant to input CODE2 as character:

input type code1 $ code2 $ ;

It has to be character, since some values contain letters.

Possibly, the first problem is conceptual, understanding how SAS compares character values. For example:

"3006F" falls within the range "30000" to "32999"

Did you know that? Did you want to flag it?

View solution in original post

Cruise · Posted 06-11-2018 08:51 AM

I don't why why my codes posted flatten out to single line no matter how I try. I try not to use enter or any indents. Still flat!

PaigeMiller · Posted 06-11-2018 08:55 AM

You have a number of problems in your data and in your code.

You might want to read the data in as character instead of numeric, using

input type code1 $ code 2 $;

Furthermore, you might want to sort these somehow. Lastly, you can't use the LAG function inside an IF statement, it does not do what you think it should do. You want

prev_code22=lag(code22);

if code11<=prev_code22 then flag_overlap=1;

Lastly, I don't see any purpose in created code11 and code22, when you could use the original variables code1 and code2.

--
Paige Miller

Cruise · Posted 06-11-2018 09:59 AM

@PaigeMiller

Hi Paige, The final code reflecting your suggestions looked like this. Thanks for pointing out to a necessity to sort the data first!!!

proc sort data=have;
by code1 code2; 
run;

data want; set have;
prev_code2=lag(code2);
if code1<=prev_code2 then flag_overlap=1; else flag_overlap=0;  
run;

data want_nosort; set have_nosort;
prev_code2=lag(code2);
if code1<=prev_code2 then flag_overlap=1; else flag_overlap=0;  
run;

proc freq data=want; /*flagged 17*/
tables flag_overlap/nocol norow nocum nopercent list missing;
title "sorted";
run; 

proc freq data=want_nosort /*flagged only 8*/;
tables flag_overlap/nocol norow nocum nopercent list missing;
title "not sorted";
run;

s_lassen · Posted 06-12-2018 06:58 AM

Not sure if this is relevant by now. But sorting the data and using the LAG function to find the previous value of CODE2 may not be enough. Even if there is not an overlap with the previous record, one of the earlier records may have had a higher value of CODE2, generating an overlap(?).

Using this code on your data

Proc sort data=have;
  by code1 code2;
run;

data want;
  set have;
  retain last_code2 "        ";
  flag_overlap=last_code2>code1;
  drop last_code2;
  if code2>last_code2 then last_code2=code2;
run;

flagged 22 records as having overlaps.

Astounding · Posted 06-11-2018 09:07 AM

Can we assume that you meant to input CODE2 as character:

input type code1 $ code2 $ ;

It has to be character, since some values contain letters.

Possibly, the first problem is conceptual, understanding how SAS compares character values. For example:

"3006F" falls within the range "30000" to "32999"

Did you know that? Did you want to flag it?

Cruise · Posted 06-11-2018 10:10 AM

BIG conceptual problem

@Astounding

Yes, Big conceptual issue here. Thanks for pointing that out indeed. These are CPT medical billing codes. 5 digit codes belong to CPT category I, 4digitF to CPT category II while 4digitM codes to CPT category III codes. My point is that last letter at the end of the variables make variables mutually exclusive. They are different animals not alphabetical order expected as such 0001M less than 0001U as they're flagged out in the screen shot. I don't know what to do at this point. Gotta grab a coffee or something.

ballardw · Posted 06-11-2018 10:32 AM

@Cruise wrote:

BIG conceptual problem

@Astounding

Yes, Big conceptual issue here. Thanks for pointing that out indeed. These are CPT medical billing codes. 5 digit codes belong to CPT category I, 4digitF to CPT category II while 4digitM codes to CPT category III codes. My point is that last letter at the end of the variables make variables mutually exclusive. They are different animals not alphabetical order expected as such 0001M less than 0001U as they're flagged out in the screen shot. I don't know what to do at this point. Gotta grab a coffee or something.

That verbiage is not very clear. Post actual examples of values (note that your original example data include nothing with F or M or U or anything except digits).

Note that FORMATS might help as you can map character values such as 0001F and 0001M to the same 4 digit category for display /count purposes without changing the underlying value. Or to your CPT "category".

Cruise · Posted 06-11-2018 10:42 AM

Yes there are F, M, U values. I just double checked. proc format doesn't help. I need these codes categorized to adjust in the modeling eventually. there is no way to output from proc format to actual dataset, right?

ballardw · Posted 06-11-2018 01:29 PM

@Cruise wrote:
Yes there are F, M, U values. I just double checked. proc format doesn't help. I need these codes categorized to adjust in the modeling eventually. there is no way to output from proc format to actual dataset, right?

Depends on what you mean by "output from proc format to actual dataset". Proc format can create a data set of format definitions using the Cntlout option just as it can create a format from a data set using the Cnltin option. Informats or formats can be used to create additional variables that may (or may not) do what is desired or comparison of values with put(var,fmtname.) le put(var2, fmtname.)

However "range" of values with character values can be a problematic concept in general so "overlap" is has to be extremely carefully defined.

If the question is related to grouping things then formats may be one solution option. I haven't actually been able to figure out what this specific data/ code is actually about. So I did not attempt to provide any example of format usage.

I have some experience with multilabel formats that, depending on procedures use, allow treating all of 0001, 0001F, 0001U and 0001M as a single value for counting and/or group under 0001 for a count total.

The structure of ICD codes I am a little familiar with lends itself to multilabel formats as the values go like Vxx, then Vxxy or Vxxyz for subgroups.

Astounding · Posted 06-11-2018 11:54 AM

Here are a couple of ideas about how to approach the problem.

One possibility: create three sets of groupings (one for each "category").

Another possibility: for grouping purposes, move the "M" or the "U" to the beginning of the string. Store that as a separate variable.

Either of these would take care of the overlapping groups. Not knowing the data, I'm not really trying to favor one or the other.

Cruise · Posted 06-11-2018 12:04 PM

Earlier in my brainstorming processes, I asked the question more conceptually and KurtBremser had just suggested to delineate all the ranges to single lines all the way through. If CPT codes are listed out to one code per line I could simply merge the data to my patients file. I defined ICD9 and ICD10 categories for my patients using their one code per line reference code tables which doesn't exist for CPT codes due to AMA copyrights. However, KurBremser's approach appear to be helpful. I haven't tried his program yet but will do shortly upon some data cleaning.

https://communities.sas.com/t5/Base-SAS-Programming/Define-label-CPT-codes-using-look-up-table-with-...

Cruise · Posted 06-11-2018 12:07 PM

I had similar thinking. group_ind=Substr(codes, 5,1) and use that as group_ind and use the same code along with first. / last. functions in data step to account for mutually exclusiveness.

Classroom Training Available!