Hello,
I'm trying to use a hash table and/or dow loops to concatenate values by rows of data representing separate dates a color was added to the database. First I created a counter to determine the number of days between the dates the same color was added. I need this later on so I do not concatenate if the number of days is greater than 30. Also another rule for contentation is that the colors need to be entered on the same day. I grouped by date which worked for the matched dates but I also need the dates to repeat until the next match.
Below I tried to create a sample dataset but had trouble with inputting dates (as usual) so if anyone can fix the input statement for the sample to work that would be great. The last variable, ExpectedCombo is what I want to end up with. The code I started with concatenates only the matched dates but needs to continue in other dates.
Can any one help me create code to match the "expectedCombo" field?
data have ;
infile datalines dsd ;
input id date MMDDYY10. color $10. expectedCombo $400.;
datalines
;
1321,07/10/2015,blue,blue
1321,07/11/2015,blue,blue
1321,07/12/2015,blue,blue
1321,07/13/2015,blue,blue
1321,08/18/2015,red,blue - red
1321,08/19/2015,blue,blue - red
1321,08/20/2015,blue,blue - red
1321,08/21/2015,blue,blue - red
1321,08/22/2015,blue,blue - red
1321,09/15/2015,red,blue - red
1321,09/16/2015,blue,blue - red
1321,09/17/2015,blue,blue - red
1321,09/18/2015,blue,blue - red
1321,01/12/2016,red,blue - red
1321,01/12/2016,blue,blue - red
1321,01/13/2016,blue,blue - red
1321,01/14/2016,blue,blue - red
1321,01/15/2016,blue,blue - red
1321,01/16/2016,blue,blue - red
1321,01/17/2016,blue,blue - red
1321,01/18/2016,blue,blue - red
1321,01/19/2016,green,blue - red - green
1321,01/19/2016,blue,blue - red - green
1321,01/20/2016,blue,blue - red - green
1321,01/21/2016,blue,blue - red - green
1321,01/22/2016,blue,blue - red - green
1321,01/23/2016,blue,blue - red - green
1321,01/24/2016,blue,blue - red - green
1321,01/25/2016,blue,blue - red - green
;
run;
proc sort data=have ;
by id date ;
run;
data revised1 ;
set have ;
by id date;
format redStart greenStart date9. redInterval greenInterval 8. ;
retain redStart greenStart redcounter 0 greencounter 0 redinterval greeninterval;
if color='red' then
do ;
if redcounter>0 then redInterval=date - redStart ;
redStart=date ;
redcounter+1 ;
end ;
if color='green' then
do ;
if greencounter>0 then greenInterval=date - greenStart ;
greenStart=date ;
greencounter+1 ;
end ;
if last.id then do ;
redStart=. ;
greenStart=. ;
end;
redInterval=redInterval ;
greenInterval=greenInterval ;
run;
proc print; run;
PROC SORT DATA=REVISED1 ;
BY ID Date ;
RUN;
data revised2 ;
length newCombo $400 ;
do until (last.date) ;
set revised1 ;
by id date ;
if newCombo="" then newCombo=color;
else newCombo = Catx(" - ",newcombo, color) ;
end;
do until (last.date) ;
set revised1 ;
by id date ;
output;
end;
run;
proc print; run;
This program uses hash tables to keep the most recent preceding date of each color, which allows the program below to removed any colors that are over 30 days. (Notice that, unlike your example, the 1/12/2016 NEWCOMBO is only RED, since other colors are over 30 days old).
A couple of notes:
Regards,
Mark
data have ;
informat date mmddyy10.;
input id $ date color $ expectedCombo $;
format date mmddyy10.;
datalines;
1321 07/10/2015 blue blue
1321 07/11/2015 blue blue
1321 07/12/2015 blue blue
1321 07/13/2015 blue blue
1321 08/18/2015 red blue-red
1321 08/19/2015 blue blue-red
1321 08/20/2015 blue blue-red
1321 08/21/2015 blue blue-red
1321 08/22/2015 blue blue-red
1321 09/15/2015 red blue-red
1321 09/16/2015 blue blue-red
1321 09/17/2015 blue blue-red
1321 09/18/2015 blue blue-red
1321 01/12/2016 red blue-red
1321 01/12/2016 blue blue-red
1321 01/13/2016 blue blue-red
1321 01/14/2016 blue blue-red
1321 01/15/2016 blue blue-red
1321 01/16/2016 blue blue-red
1321 01/17/2016 blue blue-red
1321 01/18/2016 blue blue-red
1321 01/19/2016 green blue-red-green
1321 01/19/2016 blue blue-red-green
1321 01/20/2016 blue blue-red-green
1321 01/21/2016 blue blue-red-green
1321 01/22/2016 blue blue-red-green
1321 01/23/2016 blue blue-red-green
1321 01/24/2016 blue blue-red-green
1321 01/25/2016 blue blue-red-green
;
data want;
if _n_=1 then do;
if 0 then set have;
attrib start format=mmddyy10. length=8;
declare hash history(ordered:'a');
history.definekey('color');
history.definedata('color','start');
history.definedone();
end;
array colors{20} $20 _temporary_;
attrib newcombo length=$400;
retain newcombo;
do until (last.date);
set have;
by id date;
/* Update this color item in the history table */
rc=history.find();
start=date;
rc=history.replace();
/* If it's a new color, add it to beginning of colors array, which will be blank */
if findw(trim(newcombo),trim(color),'-')=0 then colors{1}=color;
/* Use hash table to see if any colors are over 30 days and should be removed*/
/* Start at right-hand end of array because it is sorted (therefore left end */
/* is blank except for the new color) */
current_color=color;
do d=dim(colors) to 1 by -1 while(colors{d}^=' ');
rc=history.find(key:colors{d});
if date-start>30 then colors{d}=' ';
end;
color=current_color;
call sortc(of colors{*});
end;
newcombo=catx('-',of colors{*});
if last.id then do;
rc=history.clear();
call missing(of colors{*});
end;
run;
First: Is this suitable for your sample data?
data have ;
informat date mmddyy10.;
input id $ date color $ expectedCombo $;
format date mmddyy10.;
datalines;
1321 07/10/2015 blue blue
1321 07/11/2015 blue blue
1321 07/12/2015 blue blue
1321 07/13/2015 blue blue
1321 08/18/2015 red blue-red
1321 08/19/2015 blue blue-red
1321 08/20/2015 blue blue-red
1321 08/21/2015 blue blue-red
1321 08/22/2015 blue blue-red
1321 09/15/2015 red blue-red
1321 09/16/2015 blue blue-red
1321 09/17/2015 blue blue-red
1321 09/18/2015 blue blue-red
1321 01/12/2016 red blue-red
1321 01/12/2016 blue blue-red
1321 01/13/2016 blue blue-red
1321 01/14/2016 blue blue-red
1321 01/15/2016 blue blue-red
1321 01/16/2016 blue blue-red
1321 01/17/2016 blue blue-red
1321 01/18/2016 blue blue-red
1321 01/19/2016 green blue-red-green
1321 01/19/2016 blue blue-red-green
1321 01/20/2016 blue blue-red-green
1321 01/21/2016 blue blue-red-green
1321 01/22/2016 blue blue-red-green
1321 01/23/2016 blue blue-red-green
1321 01/24/2016 blue blue-red-green
1321 01/25/2016 blue blue-red-green
;
Thank you. Yes that helps with the date but I added informats to the other variables.
data have ;
informat id 8. date mmddyy10. color $10. expectedcombo $400.;
infile datalines dsd ;
input id date color $ expectedCombo $;
format date mmddyy10. color $10. expectedCombo $400.;
datalines
Also, if it will help understand what I want to do you can assume that the colors are actually drug names and the date is the date the drug was administered. If the drugs were taken on the same day then I want one field that shows all the drugs taken at the same time. Since the newly added drug/color will continue but is taken once every month or so then it should still be considered a concurrrent drug. This is why I want to extend the concatenation of the prior concatenation down. Once I can figure how to do that I will add a condition so that if it reaches the drug/color with an interval >30 then the concatenation stops since it is likely that drug is not longer being administered. Hopefully that helps and doesn't confuse the issue.
Thanks
The following code populates column expectedCombo_calc the same way than what you have given us in your sample data.
Given your narrative I feel this won't be the final solution for you.
"If the drugs were taken on the same day then I want one field that shows all the drugs taken at the same time"
Does that mean you need two columns? One for "same day" and one that accumulates over time as shown in your sample?
add a condition so that if it reaches the drug/color with an interval >30 then the concatenation stops
That would need some additional coding. Please provid the sample data.
Is the order of the concatenated string important or should it always be the same? So could you get "blue-red" once and another time "red-blue"? Is that what you want or would you rather always have the same sort order of the words in the string?
It probably would also help if you could explain what you need this concatenated string for. If this is about clustering for some analysis then may be some of the SAS Proc can do something like that for you (not my expertise but there are people in this forum who will know).
And here the code
data have;
informat id 8. date mmddyy10. color $10. expectedcombo $400.;
infile datalines dsd dlm=',' truncover;
input id date color $ expectedCombo $;
format date mmddyy10. color $10. expectedCombo $400.;
datalines;
1321,07/10/2015,blue,blue
1321,07/11/2015,blue,blue
1321,07/12/2015,blue,blue
1321,07/13/2015,blue,blue
1321,08/18/2015,red,blue-red
1321,08/19/2015,blue,blue-red
1321,08/20/2015,blue,blue-red
1321,08/21/2015,blue,blue-red
1321,08/22/2015,blue,blue-red
1321,09/15/2015,red,blue-red
1321,09/16/2015,blue,blue-red
1321,09/17/2015,blue,blue-red
1321,09/18/2015,blue,blue-red
1321,01/12/2016,red,blue-red
1321,01/12/2016,blue,blue-red
1321,01/13/2016,blue,blue-red
1321,01/14/2016,blue,blue-red
1321,01/15/2016,blue,blue-red
1321,01/16/2016,blue,blue-red
1321,01/17/2016,blue,blue-red
1321,01/18/2016,blue,blue-red
1321,01/19/2016,green,blue-red-green
1321,01/19/2016,blue,blue-red-green
1321,01/20/2016,blue,blue-red-green
1321,01/21/2016,blue,blue-red-green
1321,01/22/2016,blue,blue-red-green
1321,01/23/2016,blue,blue-red-green
1321,01/24/2016,blue,blue-red-green
1321,01/25/2016,blue,blue-red-green
;
run;
data want;
set have;
length expectedCombo_calc $ 400;
retain expectedCombo_calc;
if not findw(expectedCombo_calc,color,'-','it')
then expectedCombo_calc=catx('-',expectedCombo_calc,color);
run;
Thank you. The only data I have to work with are the id, drug name/color and the date of administration. I included the expected column in the sample data to show what I was looking for. I calculated the the days between entry of the red color (which could be a specific drug name) which I wanted to use to determine the length of time between drug/color administrations which might help to determine when to stop the concatenation.
I added to my code but still having problems. When the interval is >30 for a particular color then the color should not be concatenated. Also, I'd like a way to figure out how to concatenate without hardcoding in the colors because it is possible for the real dataset to have over 25 different colors (or drugs in the real database).
If you are willing to take another look, here is the code which might answer some of your questions and I have attached a larger sample file.
proc sort data=colors;
by id date ;
run;
data revised1 (drop=var8 octstart lanstart);
set colors;
by id date;
format redStart greenStart date9. redInterval greenInterval 8. ;
retain redStart greenStart redcounter 0 greencounter 0 redinterval greeninterval;
if color='red' then
do ;
if redcounter>0 then redInterval=date - redStart ;
redStart=date ;
redcounter+1 ;
end ;
if color='green' then
do ;
if greencounter>0 then greenInterval=date - greenStart ;
greenStart=date ;
greencounter+1 ;
end ;
if last.id then do ;
redStart=. ;
greenStart=. ;
end;
redInterval=redInterval ;
greenInterval=greenInterval ;
run;
proc print; run;
PROC SORT DATA=REVISED1 ;
BY ID Date ;
RUN;
/*concatenate for colors entered on the same date*/
data revised2 ;
length newCombo $400 ;
do until (last.date) ;
set revised1 ;
by id date ;
if newCombo="" then newCombo=color;
else newCombo = Catx(" - ",newcombo, color) ;
end;
do until (last.date) ;
set revised1 ;
by id date ;
/*CONTINUE newCombo BASED ON PRIOR CONCATENATED newCombo IF THE INTERVAL IS > 30 FOR TESTING PURPOSES (VALUE MAY CHANGE TO 45 )*/
/*PROBLEM WITH THIS IS THAT THE Color NEEDS TO BE HARDCODED IN FOR EACH DRUG and create a variable in order to incorporare all colors*/
/*Can this be done without knowing all the colors that might show up in the database? */
if index(newcombo, 'red') = 0 and (redcounter > 0 OR 0 < redinterval < 30) then newCombo2=Catx(" - ",newcombo, 'red') ;
else newcombo2=newcombo ;
if index(newCombo2, 'green') = 0 and (greencounter > 0 OR 0 < greeninterval < 30) then newCombo3=Catx(" - ",newCombo2, 'green') ;
else newcombo3=newCombo2;
output;
end;
run;
/*alphabetize combos*/
Data revised3 ;
set revised2;
length finalCombo $400;
/*create an array to store each separate drug*/
array colorlist[30] $50 _temporary_;
/*initialize all array elements to missing*/
call missing(of colorList[*]);
/*parse the colorList and place into array elements*/
do i = 1 to dim(colorList) until(p eq 0);
call scan(newcombo3,i,p,l,"-"); /*p is the position of word returned; l is the length of the word returned*/
/*account for legitimate space between color names such as light blue*/
if p=1 then colorList[i] = substrn(newcombo3,p,l-1);
else colorList[i] = substrn(newcombo3,p+1,l-1); /*substrn is like substr but can return a length of 0 needed for the until clause*/
end;
/*sort array elements*/
call sortc(of colorList[*]);
/*Remove duplicate colorList */
do w=1 to dim(colorList);
do x=1 to dim(colorList);
/*IF logic to determine if the values are the same and if so, set
the repeat value to missing*/
if w ne x and colorList(x)=colorList(w) then colorList(x)=' ';
end;
end;
/*this section of the code checks to see if a value is missing and if so,
shifts the non-missing values to the left*/
do y=1 to dim(colorList) ;
if colorList[y] = '' then do;
do z= y+1 to dim(colorList);
if colorList[z] ne '' then do;
colorList[y] = colorList[z];
colorList[z] = '';
leave;
end;
end;
end;
end;
finalCombo = catx(" - ",of colorList[*]);
drop i p l w x y z ;
run;
proc print ; run;
If I understand correctly what you're after then the following code should do. I went for an approach which has a bit more looping over the hash table in order to avoid re-building the combo string in every single iteration of the data step (as string operations are costly).
There is no limitation in the number of colors you can have other than by the length definition for combo.
The combo string will always be in sorted order (eg. blue before red) independent which color came first.
%let max_interval=30;
data want(drop=_:);
set have;
by id date;
length Combo $ 400;
retain Combo;
if _n_=1 then
do;
_color=color;
format _date date9.;
dcl hash h1(multidata:'n', ordered:'y');
dcl hiter hh1('h1');
h1.defineKey('id','color');
h1.defineData('_color','_date');
h1.defineDone();
end;
/* clear hash if new id starts */
if first.id then h1.clear();
/* _new_color_flg=1 if there is a new color */
_new_color_flg= (h1.check() ne 0);
/* keep hash up to date */
_color=color;
_date=date;
_rc=h1.replace();
/* do we have to refresh the values in combo? */
_rc = hh1.first();
do while (_rc = 0);
if _new_color_flg or intck('day',_date,date)>&max_interval then
do;
/* refresh combo */
call missing(combo);
_rc = hh1.first();
do while (_rc = 0);
if intck('day',_date,date)<=&max_interval then combo=catx('-',combo,_color);
_rc = hh1.next();
end;
leave;
end;
_rc = hh1.next();
end;
run;
Patrick:
I like the simplicity of your code ... but ...
... I suspect that looping through the hash object to build the combo can be expensive relative to looping through the old combo and examining the corresponding hash items. What if there are 100 colors over a person's history even though only, (say) 5 are current? If that's the case, then looping through 5 to see if they should be excluded will likely take less time than looping through 100 to see iwhich should be included.
After all, with each incoming record, the number of active colors can go up by only one, or down as far as 1.
regards,
Mark
I principally agree with what you say. Given that the "colors" are drugs taken by people I'd hope though that the numbers will in average remain a bit lower than 100.
This program uses hash tables to keep the most recent preceding date of each color, which allows the program below to removed any colors that are over 30 days. (Notice that, unlike your example, the 1/12/2016 NEWCOMBO is only RED, since other colors are over 30 days old).
A couple of notes:
Regards,
Mark
data have ;
informat date mmddyy10.;
input id $ date color $ expectedCombo $;
format date mmddyy10.;
datalines;
1321 07/10/2015 blue blue
1321 07/11/2015 blue blue
1321 07/12/2015 blue blue
1321 07/13/2015 blue blue
1321 08/18/2015 red blue-red
1321 08/19/2015 blue blue-red
1321 08/20/2015 blue blue-red
1321 08/21/2015 blue blue-red
1321 08/22/2015 blue blue-red
1321 09/15/2015 red blue-red
1321 09/16/2015 blue blue-red
1321 09/17/2015 blue blue-red
1321 09/18/2015 blue blue-red
1321 01/12/2016 red blue-red
1321 01/12/2016 blue blue-red
1321 01/13/2016 blue blue-red
1321 01/14/2016 blue blue-red
1321 01/15/2016 blue blue-red
1321 01/16/2016 blue blue-red
1321 01/17/2016 blue blue-red
1321 01/18/2016 blue blue-red
1321 01/19/2016 green blue-red-green
1321 01/19/2016 blue blue-red-green
1321 01/20/2016 blue blue-red-green
1321 01/21/2016 blue blue-red-green
1321 01/22/2016 blue blue-red-green
1321 01/23/2016 blue blue-red-green
1321 01/24/2016 blue blue-red-green
1321 01/25/2016 blue blue-red-green
;
data want;
if _n_=1 then do;
if 0 then set have;
attrib start format=mmddyy10. length=8;
declare hash history(ordered:'a');
history.definekey('color');
history.definedata('color','start');
history.definedone();
end;
array colors{20} $20 _temporary_;
attrib newcombo length=$400;
retain newcombo;
do until (last.date);
set have;
by id date;
/* Update this color item in the history table */
rc=history.find();
start=date;
rc=history.replace();
/* If it's a new color, add it to beginning of colors array, which will be blank */
if findw(trim(newcombo),trim(color),'-')=0 then colors{1}=color;
/* Use hash table to see if any colors are over 30 days and should be removed*/
/* Start at right-hand end of array because it is sorted (therefore left end */
/* is blank except for the new color) */
current_color=color;
do d=dim(colors) to 1 by -1 while(colors{d}^=' ');
rc=history.find(key:colors{d});
if date-start>30 then colors{d}=' ';
end;
color=current_color;
call sortc(of colors{*});
end;
newcombo=catx('-',of colors{*});
if last.id then do;
rc=history.clear();
call missing(of colors{*});
end;
run;
Thank you @Patrick and @mkeintz . This is very helpful. I have a few questions for @mkeintz :
1) what is the purpose of current_color? Right now the color variable gets cleared but replaced with current_color. I'd like to keep the original data.
2) I'd like to keep the same number of rows I started with even if the newcombo and date combination might be duplicate. This code seems to be giving me unique newcombo/date rows.
also observation 4 should be blue - red because both administered on the same day. That is 7/13/2015 should be blue-red in both instances.
oops - forget it. I was looking at the new 'start' variable in the print out. 1/12/2016 is giving the correct combination but it removed the 'red' row. As stated above I want to keep all rows and the original variable in tact.
regards,
markk
This should be my last question. why is it necessary to change the color =current color? This is what orignally made me think that the original data will be changed. It doesn't seem to do anything in my real drug data but I'm not sure if I'm missing something.
current_color=color;
do d=dim(colors) to 1 by -1 while(colors{d}^=' ');
rc=history.find(key:colors{d});
if date-start>30 then colors{d}=' ';
end;
color=current_color;
also, how would I add a counter to keep track of each time the color combination changes?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.