I have a claims database with enrolle ID, unique claim ID, date, and a field called "Dx_string", which I created by concatatening all of the five-digit ICD-9 diagnosis codes on the claim. It looks like this:
ID CLAIM_ID DATE DX_STRING
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 E884 3092
3 107 6/1/2015 7039
3 108 6/1/2015 800 0930 1092
3 109 7/1/2015 3094
data have;
input ID CLAIM_ID DATE $ DX_STRING $30.;
datalines;
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 E884 3092
3 107 6/1/2015 7039
3 108 6/1/2015 800 0930 1092
3 109 7/1/2015 3094
;
run;
My goal is to concatenate all DX_STRING values that occur on the same date into a larger string called DX_STRING_DAY:
ID CLAIM_ID DATE DX_STRING DX_STRING_DAY
1 100 1/1/2015 7804 7804 30921 39021
1 101 1/1/2015 30921 39021 7804 30921 39021
1 102 2/1/2015 30943 902 01920 30943 902 01920
1 103 3/1/2015 011 011
2 104 4/1/2015 4530 4530
2 105 5/1/2015 V9090 V9090
3 106 6/1/2015 E884 3092 E844 3092 7039 800 0930 1092
3 107 6/1/2015 7039 E844 3092 7039 800 0930 1092
3 108 6/1/2015 800 0930 1092 E844 3092 7039 800 0930 1092
3 109 7/1/2015 3094 3094
I can make the variable called DX_STRING_DAY using this code, but then I can't figure out how to make all values of DX_STRING_DAY the same within an individual on a given day.
data want; set have;
by id date;
retain dx_string_day;
if first.date then dx_string_day = dx_string;
if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);
run;
Any suggestions? Thanks.
Hi..not too sure if this is what you want:
data want; set have;
by id date;
retain dx_string_day;
if first.date then dx_string_day = dx_string;
if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);
run;
PROC SORT DATA=WANT;
BY ID DATE;
RUN;
DATA WANT1;
SET WANT;
BY id date;
IF LAST.DATE;
RUN;
PROC SQL;
CREATE TABLE WANT2 AS
SELECT DISTINCT
WANT.ID,
WANT.CLAIM_ID,
WANT.DATE,
WANT.DX_STRING,
WANT1.dx_string_day
FROM WORK.WANT LEFT JOIN WORK.WANT1 ON (WANT.ID = WANT1.ID) AND (WANT.DATE = WANT1.DATE);
QUIT;
Totally agree with Reeza, long form is easier to program with. Take a simple example, you want to get counts of code by subject, long form you could simply use a proc freq, by id. With the concatenated verision you would have to do looping over to get counts and then sum those.
I completely agree that long form is easier for programming, but in this case I actually need to keep the claims in wide form to mesh with the rest of the program. Do you have any suggestions?
Hi..not too sure if this is what you want:
data want; set have;
by id date;
retain dx_string_day;
if first.date then dx_string_day = dx_string;
if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);
run;
PROC SORT DATA=WANT;
BY ID DATE;
RUN;
DATA WANT1;
SET WANT;
BY id date;
IF LAST.DATE;
RUN;
PROC SQL;
CREATE TABLE WANT2 AS
SELECT DISTINCT
WANT.ID,
WANT.CLAIM_ID,
WANT.DATE,
WANT.DX_STRING,
WANT1.dx_string_day
FROM WORK.WANT LEFT JOIN WORK.WANT1 ON (WANT.ID = WANT1.ID) AND (WANT.DATE = WANT1.DATE);
QUIT;
Yes, thank you very much. I was doing something similar except without PROC SQL but I see how this is much more efficient.
data have;
input ID CLAIM_ID DATE $ DX_STRING $30.;
datalines;
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 E884 3092
3 107 6/1/2015 7039
3 108 6/1/2015 800 0930 1092
3 109 7/1/2015 3094
;
run;
data want;
length DX_STRING_DAY $ 200;
do until(last.date);
set have;
by ID DATE ;
DX_STRING_DAY=catx(' ',DX_STRING_DAY ,DX_STRING );
end;
do until(last.date);
set have;
by ID DATE ;
output;
end;
run;
Thanks for the solution.
My claims dataset has over 100 million observations and it's not clear to me how to choose a length for the DX_STRING_DAY variable. It's quite possible that this string could get really long if someone had something like 40 claims in a day. On the other hand, I don't want to waste valuable hard drive space. Any strategies on how to manage this situation? I guess you could use something like a length of $1000 to be overly generous, then trim it later.
Thanks.
Thanks. I do understand your point. I had tried it your way with proc transpose but was running into prolems.
Going back to the original structure of the claims (which has four diagnosis codes per claim), I have data like this:
data have;
input ID CLAIM_ID DATE $ DX1 $ DX2 $ DX3 $ DX4;
datalines;
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 7039
3 107 6/1/2015 7039
3 108 6/1/2015 E884 0930 1092 0930
3 109 7/1/2015 3094
;
I want to create a dataset that has all of the *unique* diagnosis codes that occurred on claims during the same day for a given individual, as below (I don't care about whether it was DX1, DX2, DX3, or DX4). This would involve using proc transpose somehow to create a series of variables that I might call "DXSAMEDAY1-DXSAMEDAY5" (in this fake example, five variables would be created but it would in reality be way more). I've been playing with the syntax of proc transpose and can't get this work, though.
I'd appreciate suggestions on how to proceed. Thanks.
Your problms, and future ones, all lie in the method of storing your data. If you normalise your data, and work with the normalised data, you will find your code is simpler to work with. This doesn't prevent you, if you want a proc report with transposed data, transposing it at that point.
data have;
infile datalines dlm="," missover;
input id claim_id date $ dx1 $ dx2 $ dx3 $ dx4 $;
datalines;
1,100,1/1/2015,7804,,
1,101,1/1/2015,30921,39021,,
1,102,2/1/2015,30943,902,01920,
1,103,3/1/2015,011,,,
2,104,4/1/2015,4530,,,
2,105,5/1/2015,V9090,,,
3,106,6/1/2015,7039,,,
3,107,6/1/2015,7039,,,
3,108,6/1/2015,E884,0930,1092,0930
3,109,7/1/2015,3094,,,
;
run;
/* Normalise the data to have one DX per row */
proc transpose data=have out=want;
by id claim_id date;
var dx:; /* Note the use of dx: which means all with the prefix dx */
run;
/* Sort duplicates and missings out */
proc sort data=want (drop=_name_ rename=(col1=dx)) nodupkey;
by id claim_id date dx;
where dx ne "";
run;
You see from the above that finding duplicates, sorting etc. becomes so much easier with the different strcuture.
In SAS ,each variable has the same length unlike other Data Base like Oracle have VARCHAR() format.
If you don't want this kind of code, you can create many variables like DX_STRING1 ,DX_STRING2,DX_STRING3, DX_STRING4 ....... to hold these values.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.