Hi
Edited @novinosrin
I am having trouble to retain the non-missing value in replace for the following missing value. The dataset with missings is as of the following:
session_id name time 1 pd_confirm_resp_for_quest_adr_yes_ind_time 22.7
1 pd_confirm_next_time 23.7 1 pd_load_time 13.3
1 pd_confirm_lang_entry_ind .
1 pd_confirm_lang_exit_ind .
1 pd_confirm_close_browser_time 24 1 pd_login_resp_user_id_time 2.1 1 pd_login_login_time 12.2
1 pd_login_lang_entry_ind .
1 pd_login_lang_exit_ind .
1 pd_login_close_browser_time 36.8
1 pd_respondent_p_first_name_time 40.3
1 pd_respondent_p_middle_name_time 41.6
1 pd_respondent_p_last_name_time 35.8
1 pd_respondent_load_time 35.8
1 pd_respondent_lang_entry_ind .
1 pd_respondent_lang_exit_ind .
1 pd_respondent_closemodal_time 137.6
1 pd_respondent_close_browser_time 140
1 pd_respondent_backmodal_time 75
1 pd_verifyaddress_next_time 34.6
1 pd_verifyaddress_load_time 24.7
1 pd_verifyaddress_lang_entry_ind .
1 pd_verifyaddress_lang_exit_ind .
1 pd_verifyaddress_h_occ_yes_ind_time 33.7
1 pd_verifyaddress_close_browser_time 35.1
For each name contains string "_sessionstart_time" or "_load_time" or "_xxx_time" there will always be a time value, and I hope to use only "_sessionstart_time" or "_load_time" time values to fill in the missing values. If "_load_time" and "_sessionstart_time" exist for the same "yyy_xxx" entries (ex: "entry_login") then we will take "_sessionstart_time".
Desired output:
As a reminder, for simplicity, the sample dataset only has one session_id, but in reality, there are many session_id each defines a respondent. And names end with _ind is not the only factor that determines missing time value. Missing time value is determined by name not ending with _time. For example, names end with _test or _xxx other than _time have missing time values.
Notice pd_login_lang_entry_ind and pd_login_lang_exit_ind should have missing time values because this group (login_lang) of names does not have _sessionstart_time or _load_time entries.
session_id name time 1 pd_confirm_resp_for_quest_adr_yes_ind_time 22.7
1 pd_confirm_next_time 23.7 1 pd_confirm_load_time 13.3
1 pd_confirm_lang_entry_ind 13.3
1 pd_confirm_lang_exit_ind 13.3
1 pd_confirm_close_browser_time 24 1 pd_login_resp_user_id_time 2.1 1 pd_login_login_time 12.2
1 pd_login_lang_entry_ind .
1 pd_login_lang_exit_ind .
1 pd_login_close_browser_time 36.8
1 pd_respondent_p_first_name_time 40.3
1 pd_respondent_p_middle_name_time 41.6
1 pd_respondent_p_last_name_time 35.8
1 pd_respondent_load_time 35.8
1 pd_respondent_lang_entry_ind 35.8
1 pd_respondent_lang_exit_ind 35.8
1 pd_respondent_closemodal_time 137.6
1 pd_respondent_close_browser_time 140
1 pd_respondent_backmodal_time 75
1 pd_verifyaddress_next_time 34.6
1 pd_verifyaddress_load_time 24.7
1 pd_verifyaddress_lang_entry_ind 24.7
1 pd_verifyaddress_lang_exit_ind 24.7
1 pd_verifyaddress_h_occ_yes_ind_time 33.7
1 pd_verifyaddress_close_browser_time 35.1
The code I have so far does not produce the desired output, but the outcome is very close:
data have;
set have;
t1 = scan(strip(name), 2, '_');
t3 = scan(strip(name), 3, '_');
run;
proc sort data = have; by session_id t1 descending t3;run;
data want;
drop temp;
set have;
by session_id t1 descending t3 value;
retain temp;
if first.session_id or first.t1 or first.t3 then temp=.;
if t3 = "load" or t3 = "sessionstart" and not missing(time) then temp=time;
else if missing(time) then time = temp;
run;
Code output:
session_id name time 1 pd_confirm_resp_for_quest_adr_yes_ind_time 22.7
1 pd_confirm_next_time 23.7 1 pd_confirm_load_time 13.3
1 pd_confirm_lang_entry_ind 13.3
1 pd_confirm_lang_exit_ind 13.3
1 pd_confirm_close_browser_time 24 1 pd_login_resp_user_id_time 2.1 1 pd_login_login_time 12.2
1 pd_login_lang_entry_ind 35.8
1 pd_login_lang_exit_ind 35.8
1 pd_login_close_browser_time 36.8
1 pd_respondent_p_first_name_time 40.3
1 pd_respondent_p_middle_name_time 41.6
1 pd_respondent_p_last_name_time 35.8
1 pd_respondent_load_time 35.8
1 pd_respondent_lang_entry_ind 35.8
1 pd_respondent_lang_exit_ind 35.8
1 pd_respondent_closemodal_time 137.6
1 pd_respondent_close_browser_time 140
1 pd_respondent_backmodal_time 75
1 pd_verifyaddress_next_time 34.6
1 pd_verifyaddress_load_time 24.7
1 pd_verifyaddress_lang_entry_ind 24.7
1 pd_verifyaddress_lang_exit_ind 24.7
1 pd_verifyaddress_h_occ_yes_ind_time 33.7
1 pd_verifyaddress_close_browser_time 35.1
The code does not work on pd_login_lang_entry_ind and pd_login_lang_exit_ind instead of being missing, they took the value of
pd_respondent_load_time.
Grateful for any help!
Hi @lydiawawa I followed your code to understand the split of the name that forms the group. I think t3 is perhaps not needed to accomplish what you want.
I made some minor adjustments to your code on noticing your comment "The code does not work on pd_login_lang_entry_ind and pd_login_lang_exit_ind instead of being missing, they took the value of
pd_respondent_load_time."
Please review your code that i adjusted below
data have;
input session_id name : $70. time;
rownum=_n_;
cards;
1 pd_confirm_resp_for_quest_adr_yes_ind_time 22.7
1 pd_confirm_next_time 23.7
1 pd_confirm_load_time 13.3
1 pd_confirm_lang_entry_ind .
1 pd_confirm_lang_exit_ind .
1 pd_confirm_close_browser_time 24
1 pd_login_resp_user_id_time 2.1
1 pd_login_login_time 12.2
1 pd_login_lang_entry_ind .
1 pd_login_lang_exit_ind .
1 pd_login_close_browser_time 36.8
1 pd_respondent_p_first_name_time 40.3
1 pd_respondent_p_middle_name_time 41.6
1 pd_respondent_p_last_name_time 35.8
1 pd_respondent_load_time 35.8
1 pd_respondent_lang_entry_ind .
1 pd_respondent_lang_exit_ind .
1 pd_respondent_closemodal_time 137.6
1 pd_respondent_close_browser_time 140
1 pd_respondent_backmodal_time 75
1 pd_verifyaddress_next_time 34.6
1 pd_verifyaddress_load_time 24.7
1 pd_verifyaddress_lang_entry_ind .
1 pd_verifyaddress_lang_exit_ind .
1 pd_verifyaddress_h_occ_yes_ind_time 33.7
1 pd_verifyaddress_close_browser_time 35.1
;
data have;
set have;
t1 = scan(strip(name), 2, '_');
t3 = scan(strip(name), 3, '_');
run;
proc sort data=have out=_have;
by session_id t1 descending time;
run;
data want;
set _have;
by session_id t1 descending time;
retain temp ;
if first.session_id or first.t1 then temp=.;
if (t3 = "load" or t3 = "sessionstart") and not missing(time) then temp=time;
else if missing(time) then time = temp;
run;
Looking at session_id=2, pd_device has time=11. But that value is only provided in a later observation whith pd_load and time=11. Your code would not achieve that. You use the term "retain"but that only allows for values to be carried foreward and not backward. Probably your way out is to do a double pass. First make a set of replacement values and then merge that with the original data.
I can't find time to do that exercise for you. Maybe you find it a challenge yourself?
Hope this helps,
-- Jan.
EDITED to use keep statement instead of drop
data have;
input session_id name : $12. time;
cards;
1 pd_load 12
1 entry_load 13
1 entry_start 12
1 pd_time 13
1 entry_time 9
1 entry_text .
1 pd_text .
1 pd_device .
2 pd_device .
2 pd_text .
2 pd_time 7
2 pd_load 11
3 pd_text .
3 pd_load 10
3 entry_start 4
3 entry_text .
;
proc sort data=have out=_have;
by session_id descending time;
run;
data want;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
h.definekey ("session_id","t1") ;
h.definedata ("t2","ti") ;
h.definedone () ;
end;
do until(last.session_id);
set _have end=lr;
by session_id;
t1=scan(name,1,'_');
t2=scan(name,2,'_');
ti=time;
if time>. then rc=h.add();
else do;
do rc=h.find() by 0 while(rc=0);
if t2='start' then start=ti;
else if t2='load' then load=ti;
rc=h.find_next();
end;
time=Coalesce(start,load);
end;
output;
end;
keep session_id name time;
run;
@novinosrinThe reason why I used index() was that the actual names have more than one underscore. Such as PD_LOGIN_LANG_ENTRY_IND or PD_CONFIRM_LOAD_TIME (_load) or PD_LANDING_SESSIONSTART_TIME (_start). How should I implement scan() in this case? PD_XXX_LOAD_TIME, XXX is not always the same value.
For In the actual dataset I should be subsetting by "_LOAD_TIME" and "_SESSIONSTART_TIME".
Hi @lydiawawa Can you post a more representative sample of your real i.e data have and data want?
That can help anybody have the right approavh
Hi @lydiawawa See if this works, I have modified to handle your updated HAVE
data have;
input session_id name : $35. time;
cards;
1 pd_login_load_time 12
1 entry_confirm_load_time 13
1 entry_confirm_start_time 12
1 pd_save_time 13
1 entry_login_time 9
1 entry_login_text .
1 pd_login_text .
1 pd_login_device .
2 pd_confirm_device .
2 pd_confirm_text .
2 pd_login_time 7
2 pd_confirm_load_time 11
3 pd_login_text .
3 pd_login_load_time 10
3 entry_confirm_start_time 4
3 entry_confirm_text .
;
proc sort data=have out=_have;
by session_id descending time;
run;
data want;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
h.definekey ("session_id","t1") ;
h.definedata ("t2","ti") ;
h.definedone () ;
end;
do until(last.session_id);
set _have end=lr;
by session_id;
t1=scan(name,1,'_');
k= findw(name,'load','_');
k1= findw(name,'start','_');
if k>0 then t2= substr(name,k,4) ;
else if k1>0 then t2= substr(name,k1,5);
ti=time;
if time>. and sum(k,k1)>0 then rc=h.add();
else if time=. then do;
do rc=h.find() by 0 while(rc=0);
if t2='start' then start=ti;
else if t2='load' then load=ti;
rc=h.find_next();
end;
time=Coalesce(start,load);
end;
output;
end;
keep session_id name time;
run;
@lydiawawa it's a bit too late here and Im half asleep. Let me take a look in the morning. In the mean time I'm sure somebody else in a different time zone might respond. Hang in there
Hi @lydiawawa
Ok, I think I need help from you in clarifying what makes it one unique set(group) .
For example., your example pd_HH_dash_lang_entry_ind and pd_HH_dash_load_time , what exacts would be classed as a group.
Since the discussion has been going back and forth, I would need you to thoroughly revisit the data samples in your original and clearly establish what constitutes a group. This is crucial as this forms the basis for key variables that I would want to be defined in the hash object or as look up/join keys for that matter be it sql or hash
Take your time and respond here with utmost clarity if you can.
Best!
@novinosrin I believe a group is defined by the term comes before "_load_time" or "_start_time" and after "pd_" therefore, in this case, the group will be "HH_dash" .
Hmm, Ok taking this example below i.e your latest sample
All records have the same prefix pd_
So obviously we need to segregate within. So can you manually show me the segregated groups with each groups load and start in place for the below sample? We really have to get over this roadblock
pd_login_load_time |
pd_entry_confirm_load_time |
pd_entry_confirm_start_time |
pd_save_time |
pd_entry_login_time |
pd_entry_confirm_text |
pd_login_text |
pd_login_device |
pd_confirm_device |
pd_confirm_text |
pd_login_time |
pd_confirm_load_time |
pd_login_text |
pd_login_load_time |
pd_entry_confirm_start_time |
pd_entry_confirm_text |
pd_HH_dash_lang_entry_ind |
pd_HH_dash_load_time |
pd_dob_close_browser_time |
pd_dob_lang_entry_ind |
pd_dob_lang_exit_ind |
pd_dob_load_time |
pd_dob_next_time |
pd_home_lang_entry_ind |
pd_home_lang_exit_ind |
pd_home_lang_load_time |
pd_home_lang_next_time |
Hope the following helps on clarification. For every session_id, entries without "_load_time", "_start_time" or "_time" have missing time value. We want to fill in missings by "_start_time" or "_load_time", "_start_time" comes in first because it is sorted after "_load_time".
pd_login_load_time (login with time value) |
pd_entry_confirm_load_time (entry_confirm with time value) |
pd_entry_confirm_start_time (entry_confirm with time value) |
pd_save_time ("_time " always have a time value, no need to fill in missing for this type of entries) |
pd_entry_login_time ("_time" time value not missing) |
pd_entry_confirm_text (entry_confirm) |
pd_login_text (login) |
pd_login_device (login) |
pd_confirm_device (confirm) |
pd_confirm_text (confirm) |
pd_login_time (login) |
pd_confirm_load_time (confirm with time value) |
pd_login_text (login) |
pd_login_load_time (login with time value) |
pd_entry_confirm_start_time (entry_confirm) |
pd_entry_confirm_text (entry_confirm) |
pd_HH_dash_lang_entry_ind (HH_dash) |
pd_HH_dash_load_time (HH_dash with time value) |
pd_dob_close_browser_time ("_time" time value not missing) |
pd_dob_lang_entry_ind (dob) |
pd_dob_lang_exit_ind (dob) |
pd_dob_load_time (dob with time value) |
pd_dob_next_time ("_time" time value not missing) |
pd_home_lang_entry_ind (home_lang) |
pd_home_lang_exit_ind (home_lang) |
pd_home_lang_load_time (home_lang with time value) |
pd_home_lang_next_time ("_time" time value not missing) |
@lydiawawa Sounds like pattern is really hard as in some the first 2 words extract makes it a group and in others it happens to be the 1st 3
For example
pd_login_text (login) /*first 2*/ |
pd_login_load_time (login with time value)/*first 2*/ |
pd_entry_confirm_start_time (entry_confirm) /*first 3*/ |
pd_entry_confirm_text (entry_confirm) /*first 3*/ |
pd_HH_dash_lang_entry_ind (HH_dash) /*first 3*/ |
pd_HH_dash_load_time (HH_dash with time value) ) /*first 3*/ |
pd_dob_close_browser_time ("_time" time value not missing) /*first 2*/ |
pd_dob_lang_entry_ind (dob) /*first 2*/ |
pd_dob_lang_exit_ind (dob) /*first 2*/ |
pd_dob_load_time (dob with time value) /*first 2*/ |
Isn't there any logical way to group? I mean a consistent pattern?
Ok @lydiawawa
I am gonna attempt with this
"For every session_id, entries without "_load_time", "_start_time" or "_time" have missing time value.
We want to fill in missings by "_start_time" or "_load_time", "_start_time" comes in first because it is sorted after "_load_time".
Is this consistent? Can you confirm there is only one occurence of entries with load_time, start_time respectively for a session_id? This is very important!
Also, for every session_id , are you sure anything at has missing values can be filled with the above logic
Awaiting your response. Once you confirm, i will jump to code
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.