About chuakp

chuakp · 01-20-2016

I am working with a medical claims dataset with the variables ID (enrollee ID), claimID (a unique identifier for each claim), and Date (the date of service). For this example, assume that claimID 1-5 are all claims for the same medical service. I only want to count services *within an individual* that occurred during a 7-day period as one service, because there are sometimes duplicate claims with dates of service that are in close proximity.

In this example, claimID 1 would be kept and claimID 2 would be deleted because it's within 6 days of claimID 1. However, claimID 3 would be kept even though it's within 2 days of claimID 2, because claimID 2 will be deleted and claimID 3 is more than 7 days away from claimID 1. ClaimID 4 and claimID 5 would be deleted because they are within 7 days of claimID 3, which will be kept.

ID  ClaimID  Date
1   1        1/1/2015
1   2        1/7/2015
1   3        1/9/2015
1   4        1/10/2015
1   5        1/16/2015

I think the solution might be to create a loop that finds a claim that needs to be eliminated because it's within 7 days of the first claim; as soon as such a claim is detected, the loop exits and the claim is eliminated from the dataset. On the next iteration of the loop, SAS starts over from the first claim and checks whether there is another claim that needs to be eliminated because it's within 7 days; if not, SAS keeps the next claim and repeats the check from there, and so on. But I'm not sure exactly how to implement this. I'd appreciate any advice.
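One way to sketch the rule described in this post, without an explicit restart loop, is a single pass that remembers the date of the last claim that was kept. This is only an illustration under stated assumptions: the dates are read as SAS date values, all claims are for the same service (as the post assumes), and the names `want` and `_last_kept` are hypothetical. A claim is kept only when it is more than 7 days after the last kept claim, which matches the example where a gap of exactly 7 days (claimID 5) is still treated as a duplicate.

```sas
/* Sample data from the post; Date is read as a SAS date value */
data have;
    input ID ClaimID Date :mmddyy10.;
    format Date mmddyy10.;
    datalines;
1 1 1/1/2015
1 2 1/7/2015
1 3 1/9/2015
1 4 1/10/2015
1 5 1/16/2015
;
run;

/* Claims must be in date order within each enrollee */
proc sort data=have;
    by ID Date;
run;

data want;
    set have;
    by ID;
    retain _last_kept;              /* hypothetical helper: date of the last kept claim */
    if first.ID then do;
        _last_kept = Date;          /* the first claim for an enrollee is always kept */
        output;
    end;
    else if Date - _last_kept > 7 then do;
        _last_kept = Date;          /* more than 7 days after the last kept claim */
        output;
    end;
    drop _last_kept;
run;
```

With the sample data this keeps claimID 1 and claimID 3. If the real data contains several service types, the service code would presumably need to be added to both BY statements.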

chuakp · 12-26-2015

Thanks so much.

chuakp · 12-23-2015

Thank you. This works. One last question - how would I create a global macro variable that has the number of DXS diagnoses in the file?

chuakp · 12-23-2015

I have a large claims database with four diagnosis codes per claim. The data looks like this:

ID  CLAIM_ID  DATE      DX1    DX2    DX3    DX4
1   100       1/1/2015  7804
1   101       1/1/2015  30921  39021
1   102       2/1/2015  30943  902    01920
1   103       3/1/2015  011
2   104       4/1/2015  4530
2   105       5/1/2015  V9090
3   106       6/1/2015  7039
3   107       6/1/2015  7039
3   108       6/1/2015  E884   0930   1092   0930
3   109       7/1/2015  3094

    data have;
        infile datalines dlm="," missover;
        input id claim_id date $ dx1 $ dx2 $ dx3 $ dx4 $;
        datalines;
    1,100,1/1/2015,7804,,
    1,101,1/1/2015,30921,39021,,
    1,102,2/1/2015,30943,902,01920,
    1,103,3/1/2015,011,,,
    2,104,4/1/2015,4530,,,
    2,105,5/1/2015,V9090,,,
    3,106,6/1/2015,7039,,,
    3,107,6/1/2015,7039,,,
    3,108,6/1/2015,E884,0930,1092,0930
    3,109,7/1/2015,3094,,,
    ;
    run;

I want to create a dataset that has all of the *unique* diagnosis codes that occurred on claims during the same day for a given individual, as below (I don't care about whether it was DX1, DX2, DX3, or DX4). This would involve using proc transpose somehow to create a series of variables that I might call "DXS1-DXS5" (in this fake example, five variables would be created, but in reality it would be way more). I've been playing with the syntax of proc transpose and can't get this to work, though.

ID  CLAIM_ID  DATE      DX1    DX2    DX3    DX4   DXS1   DXS2   DXS3   DXS4  DXS5
1   100       1/1/2015  7804                       7804   30921  39021
1   101       1/1/2015  30921  39021               7804   30921  39021
1   102       2/1/2015  30943  902    01920        30943  902    01920
1   103       3/1/2015  011                        011
2   104       4/1/2015  4530                       4530
2   105       5/1/2015  V9090                      V9090
3   106       6/1/2015  7039                       7039   E884   0930   1092  0930
3   107       6/1/2015  7039                       7039   E884   0930   1092  0930
3   108       6/1/2015  E884   0930   1092   0930  7039   E884   0930   1092  0930
3   109       7/1/2015  3094                       3094

I'd appreciate suggestions on how to proceed. Thanks.
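A rough sketch of one way to get there with proc transpose is shown below: stack DX1-DX4 into one column, drop duplicate codes within each ID and date, transpose the remaining codes into DXS variables, and merge them back onto the claims. It is an illustration under assumptions, not a confirmed solution from the thread: DATE is read as a SAS date so that sorting is chronological, the dataset names `long`, `wide`, and `want` are hypothetical, and the codes come out in sorted order rather than in the order shown in the desired table.

```sas
/* Sample data from the post; DATE is read as a SAS date (an assumption) */
data have;
    infile datalines dlm="," missover;
    input id claim_id date :mmddyy10. dx1 $ dx2 $ dx3 $ dx4 $;
    format date mmddyy10.;
    datalines;
1,100,1/1/2015,7804,,
1,101,1/1/2015,30921,39021,,
1,102,2/1/2015,30943,902,01920,
1,103,3/1/2015,011,,,
2,104,4/1/2015,4530,,,
2,105,5/1/2015,V9090,,,
3,106,6/1/2015,7039,,,
3,107,6/1/2015,7039,,,
3,108,6/1/2015,E884,0930,1092,0930
3,109,7/1/2015,3094,,,
;
run;

/* Step 1: one row per non-missing diagnosis code */
data long (keep=id date dx);
    set have;
    array dx_in {4} $ dx1-dx4;
    do i = 1 to 4;
        if not missing(dx_in{i}) then do;
            dx = dx_in{i};
            output;
        end;
    end;
run;

/* Step 2: keep each code once per enrollee and day */
proc sort data=long nodupkey;
    by id date dx;
run;

/* Step 3: turn the unique codes into DXS1, DXS2, ... */
proc transpose data=long out=wide (drop=_name_) prefix=DXS;
    by id date;
    var dx;
run;

/* Step 4: attach the same-day codes to every claim on that day */
proc sort data=have;
    by id date;
run;

data want;
    merge have wide;
    by id date;
run;
```

With this sample data the transpose creates DXS1-DXS4, because the largest set of unique same-day codes (ID 3 on 6/1/2015) has four members.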

chuakp · 12-21-2015

Thanks. I do understand your point. I had tried it your way with proc transpose but was running into problems. Going back to the original structure of the claims (which has four diagnosis codes per claim), I have data like this:

    data have;
        infile datalines missover;
        input ID CLAIM_ID DATE $ DX1 $ DX2 $ DX3 $ DX4 $;
        datalines;
    1 100 1/1/2015 7804
    1 101 1/1/2015 30921 39021
    1 102 2/1/2015 30943 902 01920
    1 103 3/1/2015 011
    2 104 4/1/2015 4530
    2 105 5/1/2015 V9090
    3 106 6/1/2015 7039
    3 107 6/1/2015 7039
    3 108 6/1/2015 E884 0930 1092 0930
    3 109 7/1/2015 3094
    ;

I want to create a dataset that has all of the *unique* diagnosis codes that occurred on claims during the same day for a given individual (I don't care about whether it was DX1, DX2, DX3, or DX4). This would involve using proc transpose somehow to create a series of variables that I might call "DXSAMEDAY1-DXSAMEDAY5" (in this fake example, five variables would be created, but in reality it would be way more). I've been playing with the syntax of proc transpose and can't get this to work, though. I'd appreciate suggestions on how to proceed. Thanks.

chuakp · 12-21-2015

Thanks for the solution. My claims dataset has over 100 million observations and it's not clear to me how to choose a length for the DX_STRING_DAY variable. It's quite possible that this string could get really long if someone had something like 40 claims in a day. On the other hand, I don't want to waste valuable hard drive space. Any strategies on how to manage this situation? I guess you could use something like a length of $1000 to be overly generous, then trim it later. Thanks.

chuakp · 12-18-2015

Yes, thank you very much. I was doing something similar except without PROC SQL but I see how this is much more efficient.

chuakp · 12-18-2015

I completely agree that long form is easier for programming, but in this case I actually need to keep the claims in wide form to mesh with the rest of the program. Do you have any suggestions?

chuakp · 12-18-2015

I have a claims database with enrollee ID, unique claim ID, date, and a field called "DX_STRING", which I created by concatenating all of the five-digit ICD-9 diagnosis codes on the claim. It looks like this:

ID  CLAIM_ID  DATE      DX_STRING
1   100       1/1/2015  7804
1   101       1/1/2015  30921 39021
1   102       2/1/2015  30943 902 01920
1   103       3/1/2015  011
2   104       4/1/2015  4530
2   105       5/1/2015  V9090
3   106       6/1/2015  E884 3092
3   107       6/1/2015  7039
3   108       6/1/2015  800 0930 1092
3   109       7/1/2015  3094

    data have;
        input ID CLAIM_ID DATE $ DX_STRING $30.;
        datalines;
    1 100 1/1/2015 7804
    1 101 1/1/2015 30921 39021
    1 102 2/1/2015 30943 902 01920
    1 103 3/1/2015 011
    2 104 4/1/2015 4530
    2 105 5/1/2015 V9090
    3 106 6/1/2015 E884 3092
    3 107 6/1/2015 7039
    3 108 6/1/2015 800 0930 1092
    3 109 7/1/2015 3094
    ;
    run;

My goal is to concatenate all DX_STRING values that occur on the same date into a larger string called DX_STRING_DAY:

ID  CLAIM_ID  DATE      DX_STRING        DX_STRING_DAY
1   100       1/1/2015  7804             7804 30921 39021
1   101       1/1/2015  30921 39021      7804 30921 39021
1   102       2/1/2015  30943 902 01920  30943 902 01920
1   103       3/1/2015  011              011
2   104       4/1/2015  4530             4530
2   105       5/1/2015  V9090            V9090
3   106       6/1/2015  E884 3092        E884 3092 7039 800 0930 1092
3   107       6/1/2015  7039             E884 3092 7039 800 0930 1092
3   108       6/1/2015  800 0930 1092    E884 3092 7039 800 0930 1092
3   109       7/1/2015  3094             3094

I can make the variable called DX_STRING_DAY using this code, but then I can't figure out how to make all values of DX_STRING_DAY the same within an individual on a given day.

    data want;
        set have;
        by id date;
        retain dx_string_day;
        if first.date then dx_string_day = dx_string;
        if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);
    run;

Any suggestions? Thanks.
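One possible way to give every claim on a day the same finished string is a two-pass loop over each ID/DATE group (often called a double DOW loop): the first pass builds the concatenated string, and the second pass re-reads the same rows and writes them out with the finished value. The sketch below is an illustration, not necessarily the approach suggested in the thread. The length of $200 is an arbitrary assumption, and NOTSORTED is used only because DATE is a character variable here, so the rows just need to be grouped rather than sorted.

```sas
/* Sample data from the post */
data have;
    input ID CLAIM_ID DATE $ DX_STRING $30.;
    datalines;
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 E884 3092
3 107 6/1/2015 7039
3 108 6/1/2015 800 0930 1092
3 109 7/1/2015 3094
;
run;

data want;
    length dx_string_day $200;      /* assumed length; size it for the real data */

    /* Pass 1: build the full string for one ID/DATE group */
    do until (last.date);
        set have;
        by id date notsorted;
        dx_string_day = catx(" ", dx_string_day, dx_string);
    end;

    /* Pass 2: re-read the same group and output each row with the finished string */
    do until (last.date);
        set have;
        by id date notsorted;
        output;
    end;
run;
```

Because dx_string_day is not retained, it starts empty for each new group, so no explicit reset at first.date is needed.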

chuakp · 11-21-2015

There seems to be an error since the only three values of birthmonth from the output dataset are 2, 3, and 12. I think this has to do with the fact that the code uses lag2 instead of all possible values of lag like lag3, lag4, etc. Focusing on scenarios 1 and 2 only for the moment, I'm trying the following code that manually compares every age to January age; if they are not equivalent, birth month is set to that month and then the loop should exit. However, this code results in an infinite loop.

    data want;
        set have;
        by id;
        retain _jan_age;
        if month = 1 then _jan_age = age;
        birthmonth = .;
        do while(birthmonth = .);
            if age NE _jan_age then birthmonth = month - 1;
            if month = 12 and age = _jan_age then birthmonth = 12;
        end;
    run;

chuakp · 11-21-2015

Thanks for looking at this - I really appreciate it. Let me try it on my dataset.

chuakp · 11-20-2015

Thanks for these responses. I realized I didn't do a good job of illustrating the complexity of this problem. There are four types of people that require different programming approaches:

1) There are some individuals like ID 1 who are in the dataset for 12 months and whose AGE changes mid-year. Since AGE refers to age at the beginning of the month, ID 1 was born February 2-March 1, 2012, since he/she was born in 2012, was 0 years old on February 1, 2013, and was 1 year old on March 1, 2013. For the purposes of this analysis, I'm going to assign them to the birth month of February (want BIRTHMONTH = 2).

2) There are some individuals like ID 2 who are in the dataset for 12 months and whose AGE does not change mid-year. ID 2 was born December 2 - December 31, 2011, because he/she was born in 2011 and was 2 years old at the beginning of every month of 2013 (want BIRTHMONTH = 12).

3) Some people like ID 3 were clearly not births and were enrolled in the insurance plan for only 5 months. I'd actually like to kick these people out of the dataset since I'm interested in people who were continuously enrolled for all 12 months (excluding births).

4) ID 4 is a 0-year-old who entered the dataset in August 2013. While it's possible that this person was born earlier in the year and only appeared in the dataset because they switched to this insurance plan in August, I'm going to assume this person was born in August (want BIRTHMONTH = 8). I'll check this assumption later by looking for a birth-related claim.

If I just had date of birth or even month of birth, none of these contortions would be necessary, but unfortunately this is what I'm faced with. Thanks.

    data have;
        input ID MONTH AGE BIRTHYEAR CURRENTYEAR;
        cards;
    1 1 0 2012 2013
    1 2 0 2012 2013
    1 3 1 2012 2013
    1 4 1 2012 2013
    1 5 1 2012 2013
    1 6 1 2012 2013
    1 7 1 2012 2013
    1 8 1 2012 2013
    1 9 1 2012 2013
    1 10 1 2012 2013
    1 11 1 2012 2013
    1 12 1 2012 2013
    2 1 2 2011 2013
    2 2 2 2011 2013
    2 3 2 2011 2013
    2 4 2 2011 2013
    2 5 2 2011 2013
    2 6 2 2011 2013
    2 7 2 2011 2013
    2 8 2 2011 2013
    2 9 2 2011 2013
    2 10 2 2011 2013
    2 11 2 2011 2013
    2 12 2 2011 2013
    3 1 4 2010 2013
    3 2 4 2010 2013
    3 3 4 2010 2013
    3 4 4 2010 2013
    3 5 4 2010 2013
    4 8 0 2013 2013
    4 9 0 2013 2013
    4 10 0 2013 2013
    4 11 0 2013 2013
    4 12 0 2013 2013
    ;
    run;
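The four cases above could be sketched in a single DATA step that reads each ID's rows in one loop and then decides the birth month once the whole enrollment history has been seen. This is only an illustration of the rules as stated in the post, with assumptions: the output dataset `birthmonths` and the underscore-prefixed helper variables are hypothetical names, the result has one row per ID rather than one per month, and case 4 is simplified to "fewer than 12 months and age 0 in the first enrolled month". The sample data is rebuilt with loops to keep the block short; the ID, MONTH, and AGE values match the data in the post.

```sas
/* Rebuild the sample data from the post (same values, generated with loops) */
data have;
    CURRENTYEAR = 2013;
    ID = 1; BIRTHYEAR = 2012;
    do MONTH = 1 to 12; AGE = (MONTH >= 3); output; end;
    ID = 2; BIRTHYEAR = 2011;
    do MONTH = 1 to 12; AGE = 2; output; end;
    ID = 3; BIRTHYEAR = 2010;
    do MONTH = 1 to 5; AGE = 4; output; end;
    ID = 4; BIRTHYEAR = 2013;
    do MONTH = 8 to 12; AGE = 0; output; end;
run;

data birthmonths (keep=ID BIRTHMONTH);
    /* Read all rows for one ID before deciding anything */
    do until (last.ID);
        set have;
        by ID;
        if first.ID then do;
            _n = 0;                 /* months enrolled */
            _first_age = AGE;       /* age in the first enrolled month */
            _first_month = MONTH;   /* first enrolled month */
        end;
        _n = _n + 1;
        /* remember the first month in which AGE differs from the starting age */
        if missing(_change_month) and AGE ne _first_age then _change_month = MONTH;
    end;

    if _n = 12 and not missing(_change_month) then BIRTHMONTH = _change_month - 1; /* case 1 */
    else if _n = 12 then BIRTHMONTH = 12;                                          /* case 2 */
    else if _first_age = 0 then BIRTHMONTH = _first_month;                         /* case 4 */
    else delete;                                                                   /* case 3 */
run;
```

For the sample data this gives BIRTHMONTH = 2 for ID 1, 12 for ID 2, and 8 for ID 4, and drops ID 3. The result could then be merged back onto the monthly file by ID if the long format is needed.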

chuakp · 11-19-2015

I am trying to ascertain month of birth from an insurance claims enrollment file in long format. AGE refers to the age at the beginning of the month. The basic structure is like this:

ID  MONTH  AGE
1   1      0
1   2      0
1   3      1
1   4      1
1   5      1
1   6      1
1   7      1
1   8      1
1   9      1
1   10     1
1   11     1
1   12     1

This person was born in February because they were 0 years of age in February (month 2) and 1 year of age in March (month 3). I have been trying unsuccessfully to use some combination of "by id" and "retain" statements in a DATA step to get SAS to set the month of birth to the first instance in which AGE changes values. Any suggestions? Thanks.

chuakp · 06-26-2014

Thanks, Patrick - you've helped me a tremendous amount and I really appreciate it.

chuakp · 06-25-2014

Patrick, thanks for this solution - this is very helpful. I am able to use this code to correctly set the flag if the search string consists of candidate codes with four digits, but for some reason the flag does not set correctly if the search string contains any candidate code with five digits. So for example, if I change your code to the following (see bold), the global macro variable _search_string1 resolves to "0039","00301","0031".

    %LET CANDIDATECODE_5DIGIT1 = 00301;
    %LET CANDIDATECODE_5DIGIT2 = 0031;
    %LET CANDIDATECODE_5DIGIT3 = 0039;
    %LET CANDIDATECODE_5DIGIT4 = 0084;
    %LET CANDIDATECODE_5DIGIT5 = 0085;
    %LET CANDIDATECODE_5DIGIT6 = 0090;
    ....
    data Have;
        ...
        datalines;
    00301,,,003
    0032,,,003
    0032,0030,,003,003
    0030,0085,0091,003,008,009
    0030,0085,0090,003,008,009
    0031,0080,0085,003,008,008
    ;
    run;

flag1 sets to 0 for the first observation and sets to 1 for observations 3-6. However, flag1 should equal 1 for observation 1 (since 00301 is present) and observation 6 (since 0031 is present) but should equal zero for observations 3-5 (0030 is present and 0030 is not part of the search string). flag2 sets correctly to 1 (because the search string consists of two four-digit codes, "0084" and "0085"), as does flag3 (because the search string consists of a four-digit code, "0090").

It's a bit of a weird pattern - SAS appears to be ignoring the fact that 00301 (part of the search string) is present in observation 1 and treating 0030 (not part of the search string) in observations 3-5 as being equal to 00301. Any ideas on why this might be the case? Thanks.

__________

To address your other comment, I do have code that correctly sets the flags, but it is terribly inefficient. This code searches each of the 3 five-digit code variables for any of the 900 five-digit candidate codes and takes about 2 hours to run (compared to 1.5 minutes for your code).

    %let cc3d = candidatecode_3digit;
    %let cc5d = candidatecode_5digit;

    %macro flag;
        %do j = 1 %to 321; /* Counter for the 321 3-digit codes */
            data newflag&j;
                set maindatafile;
                array dx_5digit {*} $ dx_code_5digit1-dx_code_5digit3;
                array dx_3digit {*} $ dx_code_3digit1-dx_code_3digit3;
                flag&j = 0;
                %do k = 1 %to &nobs_codes_5digit; /* Counter for 900 5-digit codes */
                    %do i = 1 %to 3;
                        if dx_3digit(&i) = "&&&cc3d&j." then do;
                            if dx_5digit(&i) = "&&&cc5d&k." then flag&j = 1;
                        end;
                    %end;
                %end;
            run;
        %end;
    %mend flag;
    %flag;

Online Status: Offline
Date Last Visited: 10-26-2022 07:20 PM

Re: Adding overlapping dates to the end of a date range

Adding overlapping dates to the end of a date range

Re: Expanding a range of numbers

Expanding a range of numbers

Re: Reading in all GZ files in a folder

Re: Search string variable in one table with string variable from anot...

Search string variable in one table with string variable from another ...

Eliminating observations with datasets in long form

Re: Proc transpose

Proc transpose

Re: Question regarding "By" processing in data step

Question regarding "By" processing in data step

Re: "By" processing in a data step

"By" processing in a data step

Re: Searching multiple lists of global macro variables