Solved: Retain non-missing by specific string to fill in missings

lydiawawa · Posted 02-21-2019 05:09 PM

Hi

I am having trouble to retain the non-missing value in replace for the following missing value. The dataset with missings is as of the following:

session_id         name                                               time   
    1            pd_confirm_resp_for_quest_adr_yes_ind_time            22.7      
    1            pd_confirm_next_time                                  23.7
    1            pd_load_time                                         13.3
    1            pd_confirm_lang_entry_ind                             .
    1            pd_confirm_lang_exit_ind                              .
    1            pd_confirm_close_browser_time                         24
    1            pd_login_resp_user_id_time                           2.1
    1            pd_login_login_time                                  12.2
    1            pd_login_lang_entry_ind                              .
    1            pd_login_lang_exit_ind                               .
    1            pd_login_close_browser_time                          36.8
    1            pd_respondent_p_first_name_time                       40.3
    1            pd_respondent_p_middle_name_time                     41.6
    1            pd_respondent_p_last_name_time                       35.8
    1            pd_respondent_load_time                              35.8
    1            pd_respondent_lang_entry_ind                         .
    1            pd_respondent_lang_exit_ind                          .
    1            pd_respondent_closemodal_time                        137.6
    1            pd_respondent_close_browser_time                     140
    1            pd_respondent_backmodal_time                         75
    1            pd_verifyaddress_next_time                           34.6
    1            pd_verifyaddress_load_time                           24.7
    1            pd_verifyaddress_lang_entry_ind                      .
    1            pd_verifyaddress_lang_exit_ind                       .
    1            pd_verifyaddress_h_occ_yes_ind_time                  33.7
    1            pd_verifyaddress_close_browser_time                  35.1

For each name contains string "_sessionstart_time" or "_load_time" or "_xxx_time" there will always be a time value, and I hope to use only "_sessionstart_time" or "_load_time" time values to fill in the missing values. If "_load_time" and "_sessionstart_time" exist for the same "yyy_xxx" entries (ex: "entry_login") then we will take "_sessionstart_time".

Desired output:

As a reminder, for simplicity, the sample dataset only has one session_id, but in reality, there are many session_id each defines a respondent. And names end with _ind is not the only factor that determines missing time value. Missing time value is determined by name not ending with _time. For example, names end with _test or _xxx other than _time have missing time values.

Notice pd_login_lang_entry_ind and pd_login_lang_exit_ind should have missing time values because this group (login_lang) of names does not have _sessionstart_time or _load_time entries.

session_id         name                                               time   
    1            pd_confirm_resp_for_quest_adr_yes_ind_time            22.7      
    1            pd_confirm_next_time                                  23.7
    1            pd_confirm_load_time                                  13.3
    1            pd_confirm_lang_entry_ind                             13.3
    1            pd_confirm_lang_exit_ind                              13.3
    1            pd_confirm_close_browser_time                         24
    1            pd_login_resp_user_id_time                           2.1
    1            pd_login_login_time                                  12.2
    1            pd_login_lang_entry_ind                              .
    1            pd_login_lang_exit_ind                               .
    1            pd_login_close_browser_time                          36.8
    1            pd_respondent_p_first_name_time                       40.3
    1            pd_respondent_p_middle_name_time                     41.6
    1            pd_respondent_p_last_name_time                       35.8
    1            pd_respondent_load_time                              35.8
    1            pd_respondent_lang_entry_ind                         35.8
    1            pd_respondent_lang_exit_ind                          35.8
    1            pd_respondent_closemodal_time                        137.6
    1            pd_respondent_close_browser_time                     140
    1            pd_respondent_backmodal_time                         75
    1            pd_verifyaddress_next_time                           34.6
    1            pd_verifyaddress_load_time                           24.7
    1            pd_verifyaddress_lang_entry_ind                      24.7
    1            pd_verifyaddress_lang_exit_ind                       24.7
    1            pd_verifyaddress_h_occ_yes_ind_time                  33.7
    1            pd_verifyaddress_close_browser_time                  35.1

The code I have so far does not produce the desired output, but the outcome is very close:

data have; 
set have;
t1 = scan(strip(name), 2, '_');
t3 = scan(strip(name), 3, '_');
run;

proc sort data = have; by session_id t1 descending t3;run;

data want;
 drop temp;
  set have;
   by session_id t1 descending t3 value;
   retain temp;
    if first.session_id or first.t1 or first.t3 then temp=.;
    if t3 = "load" or t3 = "sessionstart" and not missing(time) then temp=time;
    else if missing(time) then time = temp;
run;

Code output:

session_id         name                                               time   
    1            pd_confirm_resp_for_quest_adr_yes_ind_time            22.7      
    1            pd_confirm_next_time                                  23.7
    1            pd_confirm_load_time                                  13.3
    1            pd_confirm_lang_entry_ind                             13.3
    1            pd_confirm_lang_exit_ind                              13.3
    1            pd_confirm_close_browser_time                         24
    1            pd_login_resp_user_id_time                           2.1
    1            pd_login_login_time                                  12.2
    1            pd_login_lang_entry_ind                              35.8
    1            pd_login_lang_exit_ind                               35.8
    1            pd_login_close_browser_time                          36.8
    1            pd_respondent_p_first_name_time                       40.3
    1            pd_respondent_p_middle_name_time                     41.6
    1            pd_respondent_p_last_name_time                       35.8
    1            pd_respondent_load_time                              35.8
    1            pd_respondent_lang_entry_ind                         35.8
    1            pd_respondent_lang_exit_ind                          35.8
    1            pd_respondent_closemodal_time                        137.6
    1            pd_respondent_close_browser_time                     140
    1            pd_respondent_backmodal_time                         75
    1            pd_verifyaddress_next_time                           34.6
    1            pd_verifyaddress_load_time                           24.7
    1            pd_verifyaddress_lang_entry_ind                      24.7
    1            pd_verifyaddress_lang_exit_ind                       24.7
    1            pd_verifyaddress_h_occ_yes_ind_time                  33.7
    1            pd_verifyaddress_close_browser_time                  35.1

The code does not work on pd_login_lang_entry_ind and pd_login_lang_exit_ind instead of being missing, they took the value of

pd_respondent_load_time.

Grateful for any help!

novinosrin · Posted 02-27-2019 11:01 AM

Hi @lydiawawa I followed your code to understand the split of the name that forms the group. I think t3 is perhaps not needed to accomplish what you want.

I made some minor adjustments to your code on noticing your comment "The code does not work on pd_login_lang_entry_ind and pd_login_lang_exit_ind instead of being missing, they took the value of

pd_respondent_load_time."

Please review your code that i adjusted below

data have;
input session_id         name  : $70.       time;
rownum=_n_;
cards;
1            pd_confirm_resp_for_quest_adr_yes_ind_time            22.7      
    1            pd_confirm_next_time                                  23.7
    1            pd_confirm_load_time                                         13.3
    1            pd_confirm_lang_entry_ind                             .
    1            pd_confirm_lang_exit_ind                              .
    1            pd_confirm_close_browser_time                         24
    1            pd_login_resp_user_id_time                           2.1
    1            pd_login_login_time                                  12.2
    1            pd_login_lang_entry_ind                              .
    1            pd_login_lang_exit_ind                               .
    1            pd_login_close_browser_time                          36.8
    1            pd_respondent_p_first_name_time                       40.3
    1            pd_respondent_p_middle_name_time                     41.6
    1            pd_respondent_p_last_name_time                       35.8
    1            pd_respondent_load_time                              35.8
    1            pd_respondent_lang_entry_ind                         .
    1            pd_respondent_lang_exit_ind                          .
    1            pd_respondent_closemodal_time                        137.6
    1            pd_respondent_close_browser_time                     140
    1            pd_respondent_backmodal_time                         75
    1            pd_verifyaddress_next_time                           34.6
    1            pd_verifyaddress_load_time                           24.7
    1            pd_verifyaddress_lang_entry_ind                      .
    1            pd_verifyaddress_lang_exit_ind                       .
    1            pd_verifyaddress_h_occ_yes_ind_time                  33.7
    1            pd_verifyaddress_close_browser_time                  35.1
	;

data have; 
set have;
t1 = scan(strip(name), 2, '_');
t3 = scan(strip(name), 3, '_');
run;

proc sort data=have out=_have;
by session_id t1  descending time;
run;

data want;
set _have;
by session_id t1  descending time;
   retain temp ;
    if first.session_id or first.t1 then  temp=.;
    if (t3 = "load" or t3 = "sessionstart") and not missing(time) then  temp=time;
    else if missing(time) then time = temp;
run;

View solution in original post

jklaverstijn · Posted 02-21-2019 05:45 PM

Looking at session_id=2, pd_device has time=11. But that value is only provided in a later observation whith pd_load and time=11. Your code would not achieve that. You use the term "retain"but that only allows for values to be carried foreward and not backward. Probably your way out is to do a double pass. First make a set of replacement values and then merge that with the original data.

I can't find time to do that exercise for you. Maybe you find it a challenge yourself?

Hope this helps,

-- Jan.

novinosrin · Posted 02-21-2019 05:57 PM

EDITED to use keep statement instead of drop



data have;
input session_id         name  : $12.       time;
cards;
    1            pd_load           12
    1            entry_load        13
    1            entry_start       12
    1            pd_time           13
    1            entry_time        9
    1            entry_text         .
    1            pd_text            .
    1            pd_device          .
    2            pd_device          .
    2            pd_text            .
    2            pd_time            7
    2            pd_load            11 
    3            pd_text            .
    3            pd_load            10
    3            entry_start        4
    3            entry_text         .
	;

	proc sort data=have out=_have;
	by session_id descending time;
	run;

data want;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
   h.definekey  ("session_id","t1") ;
   h.definedata ("t2","ti") ;
   h.definedone () ;
end;
do until(last.session_id);
set _have end=lr;
by session_id;
t1=scan(name,1,'_');
t2=scan(name,2,'_');
ti=time;
if time>. then rc=h.add();
else do;
do rc=h.find() by 0 while(rc=0);
if t2='start' then start=ti;
else if t2='load' then load=ti;
rc=h.find_next();
end;
time=Coalesce(start,load);
end;
output;
end;
keep session_id name time;
run;

lydiawawa · Posted 02-21-2019 09:42 PM

@novinosrinThe reason why I used index() was that the actual names have more than one underscore. Such as PD_LOGIN_LANG_ENTRY_IND or PD_CONFIRM_LOAD_TIME (_load) or PD_LANDING_SESSIONSTART_TIME (_start). How should I implement scan() in this case? PD_XXX_LOAD_TIME, XXX is not always the same value.

For In the actual dataset I should be subsetting by "_LOAD_TIME" and "_SESSIONSTART_TIME".

novinosrin · Posted 02-21-2019 10:04 PM

Hi @lydiawawa Can you post a more representative sample of your real i.e data have and data want?

That can help anybody have the right approavh

lydiawawa · Posted 02-21-2019 10:13 PM

@novinsorin I made another update to the name values. Now I think it is the closest to the actual dataset. Sorry about the delay.

novinosrin · Posted 02-21-2019 10:41 PM

Hi @lydiawawa See if this works, I have modified to handle your updated HAVE

data have;
input session_id         name  : $35.       time;
cards;
    1            pd_login_load_time             12
    1            entry_confirm_load_time        13
    1            entry_confirm_start_time       12
    1            pd_save_time                   13
    1            entry_login_time               9
    1            entry_login_text               .
    1            pd_login_text                  .
    1            pd_login_device                .
    2            pd_confirm_device              .
    2            pd_confirm_text                .
    2            pd_login_time                  7
    2            pd_confirm_load_time           11 
    3            pd_login_text                  .
    3            pd_login_load_time             10
    3            entry_confirm_start_time        4
    3            entry_confirm_text              .
	;


proc sort data=have out=_have;
by session_id descending time;
run;

data want;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
   h.definekey  ("session_id","t1") ;
   h.definedata ("t2","ti") ;
   h.definedone () ;
end;
do until(last.session_id);
set _have end=lr;
by session_id;
t1=scan(name,1,'_'); 
k= findw(name,'load','_');
k1=	findw(name,'start','_');
if k>0 then t2= substr(name,k,4) ;
else if k1>0 then t2= substr(name,k1,5);
ti=time;
if time>. and sum(k,k1)>0 then rc=h.add();
else if time=. then do;
    do rc=h.find() by 0 while(rc=0);
    if t2='start' then start=ti;
     else if t2='load' then load=ti;
      rc=h.find_next();
     end;
time=Coalesce(start,load);
end;
output;
end;
keep session_id name time;
run;

lydiawawa · Posted 02-21-2019 11:45 PM

@novinosrin I looked more into the dataset, and realized the name values all starts with “pd_” which will need to alter t1 in the program. It should be t1 = scan(name,2,'_'). However, just by extracting one word is not enough to define unique t1, some name such as pd_HH_dash_lang_entry_ind and pd_HH_dash_load_time will need more than one word. Sorry, the original dataset is very large, it was hard to find the exact pattern, but I think I finally have the finalized version of the sample dataset, please see my most recent updates on the original post. Thank you so much for your patience.

novinosrin · Posted 02-21-2019 11:49 PM

@lydiawawa it's a bit too late here and Im half asleep. Let me take a look in the morning. In the mean time I'm sure somebody else in a different time zone might respond. Hang in there

novinosrin · Posted 02-22-2019 11:06 AM

Hi @lydiawawa

Ok, I think I need help from you in clarifying what makes it one unique set(group) .

For example., your example pd_HH_dash_lang_entry_ind and pd_HH_dash_load_time , what exacts would be classed as a group.

Since the discussion has been going back and forth, I would need you to thoroughly revisit the data samples in your original and clearly establish what constitutes a group. This is crucial as this forms the basis for key variables that I would want to be defined in the hash object or as look up/join keys for that matter be it sql or hash

Take your time and respond here with utmost clarity if you can.

Best!

lydiawawa · Posted 02-22-2019 11:52 AM

@novinosrin I believe a group is defined by the term comes before "_load_time" or "_start_time" and after "pd_" therefore, in this case, the group will be "HH_dash" .

novinosrin · Posted 02-22-2019 12:01 PM

@lydiawawa

Hmm, Ok taking this example below i.e your latest sample

All records have the same prefix pd_

So obviously we need to segregate within. So can you manually show me the segregated groups with each groups load and start in place for the below sample? We really have to get over this roadblock

pd_login_load_time

pd_entry_confirm_load_time

pd_entry_confirm_start_time

pd_save_time

pd_entry_login_time

pd_entry_confirm_text

pd_login_text

pd_login_device

pd_confirm_device

pd_confirm_text

pd_login_time

pd_confirm_load_time

pd_login_text

pd_login_load_time

pd_entry_confirm_start_time

pd_entry_confirm_text

pd_HH_dash_lang_entry_ind

pd_HH_dash_load_time

pd_dob_close_browser_time

pd_dob_lang_entry_ind

pd_dob_lang_exit_ind

pd_dob_load_time

pd_dob_next_time

pd_home_lang_entry_ind

pd_home_lang_exit_ind

pd_home_lang_load_time

pd_home_lang_next_time

lydiawawa · Posted 02-22-2019 12:22 PM

Hope the following helps on clarification. For every session_id, entries without "_load_time", "_start_time" or "_time" have missing time value. We want to fill in missings by "_start_time" or "_load_time", "_start_time" comes in first because it is sorted after "_load_time".

pd_login_load_time (login with time value)

pd_entry_confirm_load_time (entry_confirm with time value)

pd_entry_confirm_start_time (entry_confirm with time value)

pd_save_time ("_time " always have a time value, no need to fill in missing for this type of entries)

pd_entry_login_time ("_time" time value not missing)

pd_entry_confirm_text (entry_confirm)

pd_login_text (login)

pd_login_device (login)

pd_confirm_device (confirm)

pd_confirm_text (confirm)

pd_login_time (login)

pd_confirm_load_time (confirm with time value)

pd_login_text (login)

pd_login_load_time (login with time value)

pd_entry_confirm_start_time (entry_confirm)

pd_entry_confirm_text (entry_confirm)

pd_HH_dash_lang_entry_ind (HH_dash)

pd_HH_dash_load_time (HH_dash with time value)

pd_dob_close_browser_time ("_time" time value not missing)

pd_dob_lang_entry_ind (dob)

pd_dob_lang_exit_ind (dob)

pd_dob_load_time (dob with time value)

pd_dob_next_time ("_time" time value not missing)

pd_home_lang_entry_ind (home_lang)

pd_home_lang_exit_ind (home_lang)

pd_home_lang_load_time (home_lang with time value)

pd_home_lang_next_time ("_time" time value not missing)

novinosrin · Posted 02-22-2019 12:39 PM

@lydiawawa Sounds like pattern is really hard as in some the first 2 words extract makes it a group and in others it happens to be the 1st 3

For example

pd_login_text (login) /*first 2*/

pd_login_load_time (login with time value)/*first 2*/

pd_entry_confirm_start_time (entry_confirm) /*first 3*/

pd_entry_confirm_text (entry_confirm) /*first 3*/

pd_HH_dash_lang_entry_ind (HH_dash) /*first 3*/

pd_HH_dash_load_time (HH_dash with time value) ) /*first 3*/

pd_dob_close_browser_time ("_time" time value not missing) /*first 2*/

pd_dob_lang_entry_ind (dob) /*first 2*/

pd_dob_lang_exit_ind (dob) /*first 2*/

pd_dob_load_time (dob with time value) /*first 2*/

Isn't there any logical way to group? I mean a consistent pattern?

novinosrin · Posted 02-22-2019 12:50 PM

Ok @lydiawawa

I am gonna attempt with this

"For every session_id, entries without "_load_time", "_start_time" or "_time" have missing time value.

We want to fill in missings by "_start_time" or "_load_time", "_start_time" comes in first because it is sorted after "_load_time".

Is this consistent? Can you confirm there is only one occurence of entries with load_time, start_time respectively for a session_id? This is very important!

Also, for every session_id , are you sure anything at has missing values can be filled with the above logic

Awaiting your response. Once you confirm, i will jump to code

Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Re: Retain non-missing by specific string to fill in missings

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away