BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lydiawawa
Lapis Lazuli | Level 10

Hi

Edited @novinosrin 

I am having trouble to retain the non-missing value in replace for the following missing value. The dataset with missings is as of the following:

 

session_id         name                                               time   
    1            pd_confirm_resp_for_quest_adr_yes_ind_time            22.7      
1 pd_confirm_next_time 23.7 1 pd_load_time 13.3
1 pd_confirm_lang_entry_ind .
1 pd_confirm_lang_exit_ind .
1 pd_confirm_close_browser_time 24 1 pd_login_resp_user_id_time 2.1 1 pd_login_login_time 12.2
1 pd_login_lang_entry_ind .
1 pd_login_lang_exit_ind .
1 pd_login_close_browser_time 36.8
1 pd_respondent_p_first_name_time 40.3
1 pd_respondent_p_middle_name_time 41.6
1 pd_respondent_p_last_name_time 35.8
1 pd_respondent_load_time 35.8
1 pd_respondent_lang_entry_ind .
1 pd_respondent_lang_exit_ind .
1 pd_respondent_closemodal_time 137.6
1 pd_respondent_close_browser_time 140
1 pd_respondent_backmodal_time 75
1 pd_verifyaddress_next_time 34.6
1 pd_verifyaddress_load_time 24.7
1 pd_verifyaddress_lang_entry_ind .
1 pd_verifyaddress_lang_exit_ind .
1 pd_verifyaddress_h_occ_yes_ind_time 33.7
1 pd_verifyaddress_close_browser_time 35.1


 

For each name contains string "_sessionstart_time" or "_load_time" or "_xxx_time" there will always be a time value, and I hope to use only "_sessionstart_time" or "_load_time" time values to fill in the missing values. If "_load_time" and "_sessionstart_time" exist for the same "yyy_xxx" entries (ex: "entry_login") then we will take "_sessionstart_time".

 

Desired output:

As a reminder, for simplicity, the sample dataset only has one session_id, but in reality, there are many session_id each defines a respondent. And names end with _ind is not the only factor that determines missing time value. Missing time value is determined by name not ending with _time. For example, names end with _test or _xxx other than _time have missing time values. 

 

Notice pd_login_lang_entry_ind and  pd_login_lang_exit_ind should have missing time values because this group (login_lang) of names does not have _sessionstart_time or _load_time entries.

 

session_id         name                                               time   
    1            pd_confirm_resp_for_quest_adr_yes_ind_time            22.7      
1 pd_confirm_next_time 23.7 1 pd_confirm_load_time 13.3
1 pd_confirm_lang_entry_ind 13.3
1 pd_confirm_lang_exit_ind 13.3
1 pd_confirm_close_browser_time 24 1 pd_login_resp_user_id_time 2.1 1 pd_login_login_time 12.2
1 pd_login_lang_entry_ind .
1 pd_login_lang_exit_ind .
1 pd_login_close_browser_time 36.8
1 pd_respondent_p_first_name_time 40.3
1 pd_respondent_p_middle_name_time 41.6
1 pd_respondent_p_last_name_time 35.8
1 pd_respondent_load_time 35.8
1 pd_respondent_lang_entry_ind 35.8
1 pd_respondent_lang_exit_ind 35.8
1 pd_respondent_closemodal_time 137.6
1 pd_respondent_close_browser_time 140
1 pd_respondent_backmodal_time 75
1 pd_verifyaddress_next_time 34.6
1 pd_verifyaddress_load_time 24.7
1 pd_verifyaddress_lang_entry_ind 24.7
1 pd_verifyaddress_lang_exit_ind 24.7
1 pd_verifyaddress_h_occ_yes_ind_time 33.7
1 pd_verifyaddress_close_browser_time 35.1

 

The code I have so far does not produce the desired output, but the outcome is very close:

 

data have; 
set have;
t1 = scan(strip(name), 2, '_');
t3 = scan(strip(name), 3, '_');
run;

proc sort data = have; by session_id t1 descending t3;run;

data want;
 drop temp;
  set have;
   by session_id t1 descending t3 value;
   retain temp;
    if first.session_id or first.t1 or first.t3 then temp=.;
    if t3 = "load" or t3 = "sessionstart" and not missing(time) then temp=time;
    else if missing(time) then time = temp;
run;

Code output:

 

session_id         name                                               time   
    1            pd_confirm_resp_for_quest_adr_yes_ind_time            22.7      
1 pd_confirm_next_time 23.7 1 pd_confirm_load_time 13.3
1 pd_confirm_lang_entry_ind 13.3
1 pd_confirm_lang_exit_ind 13.3
1 pd_confirm_close_browser_time 24 1 pd_login_resp_user_id_time 2.1 1 pd_login_login_time 12.2
1 pd_login_lang_entry_ind 35.8
1 pd_login_lang_exit_ind 35.8
1 pd_login_close_browser_time 36.8
1 pd_respondent_p_first_name_time 40.3
1 pd_respondent_p_middle_name_time 41.6
1 pd_respondent_p_last_name_time 35.8
1 pd_respondent_load_time 35.8
1 pd_respondent_lang_entry_ind 35.8
1 pd_respondent_lang_exit_ind 35.8
1 pd_respondent_closemodal_time 137.6
1 pd_respondent_close_browser_time 140
1 pd_respondent_backmodal_time 75
1 pd_verifyaddress_next_time 34.6
1 pd_verifyaddress_load_time 24.7
1 pd_verifyaddress_lang_entry_ind 24.7
1 pd_verifyaddress_lang_exit_ind 24.7
1 pd_verifyaddress_h_occ_yes_ind_time 33.7
1 pd_verifyaddress_close_browser_time 35.1

The code does not work on pd_login_lang_entry_ind and  pd_login_lang_exit_ind instead of being missing, they took the value of 

pd_respondent_load_time.

 

 

Grateful for any help!

 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @lydiawawa  I followed your code to understand the split of the name that forms the group. I think t3 is perhaps not needed to accomplish what you want.

 

I made some minor adjustments to your code on noticing your comment "The code does not work on pd_login_lang_entry_ind and  pd_login_lang_exit_ind instead of being missing, they took the value of 

pd_respondent_load_time."

 

Please review your code that i adjusted below

 

data have;
input session_id         name  : $70.       time;
rownum=_n_;
cards;
1            pd_confirm_resp_for_quest_adr_yes_ind_time            22.7      
    1            pd_confirm_next_time                                  23.7
    1            pd_confirm_load_time                                         13.3
    1            pd_confirm_lang_entry_ind                             .
    1            pd_confirm_lang_exit_ind                              .
    1            pd_confirm_close_browser_time                         24
    1            pd_login_resp_user_id_time                           2.1
    1            pd_login_login_time                                  12.2
    1            pd_login_lang_entry_ind                              .
    1            pd_login_lang_exit_ind                               .
    1            pd_login_close_browser_time                          36.8
    1            pd_respondent_p_first_name_time                       40.3
    1            pd_respondent_p_middle_name_time                     41.6
    1            pd_respondent_p_last_name_time                       35.8
    1            pd_respondent_load_time                              35.8
    1            pd_respondent_lang_entry_ind                         .
    1            pd_respondent_lang_exit_ind                          .
    1            pd_respondent_closemodal_time                        137.6
    1            pd_respondent_close_browser_time                     140
    1            pd_respondent_backmodal_time                         75
    1            pd_verifyaddress_next_time                           34.6
    1            pd_verifyaddress_load_time                           24.7
    1            pd_verifyaddress_lang_entry_ind                      .
    1            pd_verifyaddress_lang_exit_ind                       .
    1            pd_verifyaddress_h_occ_yes_ind_time                  33.7
    1            pd_verifyaddress_close_browser_time                  35.1
	;

data have; 
set have;
t1 = scan(strip(name), 2, '_');
t3 = scan(strip(name), 3, '_');
run;

proc sort data=have out=_have;
by session_id t1  descending time;
run;

data want;
set _have;
by session_id t1  descending time;
   retain temp ;
    if first.session_id or first.t1 then  temp=.;
    if (t3 = "load" or t3 = "sessionstart") and not missing(time) then  temp=time;
    else if missing(time) then time = temp;
run;

View solution in original post

24 REPLIES 24
jklaverstijn
Rhodochrosite | Level 12

Looking at session_id=2, pd_device has time=11. But that value is only provided in a later observation whith pd_load and time=11. Your code would not achieve that. You use the term "retain"but that only allows for values to be carried foreward and not backward. Probably your way out is to do a double pass. First make a set of replacement values and then merge that with the original data.

 

I can't find time to do that exercise for you. Maybe you find it a challenge yourself?

 

 

Hope this helps,

-- Jan.

novinosrin
Tourmaline | Level 20

EDITED to use keep statement instead of drop

 



data have;
input session_id         name  : $12.       time;
cards;
    1            pd_load           12
    1            entry_load        13
    1            entry_start       12
    1            pd_time           13
    1            entry_time        9
    1            entry_text         .
    1            pd_text            .
    1            pd_device          .
    2            pd_device          .
    2            pd_text            .
    2            pd_time            7
    2            pd_load            11 
    3            pd_text            .
    3            pd_load            10
    3            entry_start        4
    3            entry_text         .
	;

	proc sort data=have out=_have;
	by session_id descending time;
	run;

data want;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
   h.definekey  ("session_id","t1") ;
   h.definedata ("t2","ti") ;
   h.definedone () ;
end;
do until(last.session_id);
set _have end=lr;
by session_id;
t1=scan(name,1,'_');
t2=scan(name,2,'_');
ti=time;
if time>. then rc=h.add();
else do;
do rc=h.find() by 0 while(rc=0);
if t2='start' then start=ti;
else if t2='load' then load=ti;
rc=h.find_next();
end;
time=Coalesce(start,load);
end;
output;
end;
keep session_id name time;
run;
lydiawawa
Lapis Lazuli | Level 10

@novinosrinThe reason why I used index() was that the actual names have more than one underscore. Such as PD_LOGIN_LANG_ENTRY_IND or PD_CONFIRM_LOAD_TIME (_load) or PD_LANDING_SESSIONSTART_TIME (_start). How should I implement scan() in this case? PD_XXX_LOAD_TIME, XXX is not always the same value.

 

For In the actual dataset I should be subsetting by "_LOAD_TIME" and "_SESSIONSTART_TIME".

novinosrin
Tourmaline | Level 20

Hi @lydiawawa   Can you post a more representative sample of your real i.e data have and data want?

That can help anybody have the right approavh

lydiawawa
Lapis Lazuli | Level 10
@novinsorin I made another update to the name values. Now I think it is the closest to the actual dataset. Sorry about the delay.
novinosrin
Tourmaline | Level 20

Hi @lydiawawa   See if this works, I have modified to handle  your updated HAVE

 

data have;
input session_id         name  : $35.       time;
cards;
    1            pd_login_load_time             12
    1            entry_confirm_load_time        13
    1            entry_confirm_start_time       12
    1            pd_save_time                   13
    1            entry_login_time               9
    1            entry_login_text               .
    1            pd_login_text                  .
    1            pd_login_device                .
    2            pd_confirm_device              .
    2            pd_confirm_text                .
    2            pd_login_time                  7
    2            pd_confirm_load_time           11 
    3            pd_login_text                  .
    3            pd_login_load_time             10
    3            entry_confirm_start_time        4
    3            entry_confirm_text              .
	;


proc sort data=have out=_have;
by session_id descending time;
run;

data want;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
   h.definekey  ("session_id","t1") ;
   h.definedata ("t2","ti") ;
   h.definedone () ;
end;
do until(last.session_id);
set _have end=lr;
by session_id;
t1=scan(name,1,'_'); 
k= findw(name,'load','_');
k1=	findw(name,'start','_');
if k>0 then t2= substr(name,k,4) ;
else if k1>0 then t2= substr(name,k1,5);
ti=time;
if time>. and sum(k,k1)>0 then rc=h.add();
else if time=. then do;
    do rc=h.find() by 0 while(rc=0);
    if t2='start' then start=ti;
     else if t2='load' then load=ti;
      rc=h.find_next();
     end;
time=Coalesce(start,load);
end;
output;
end;
keep session_id name time;
run;
lydiawawa
Lapis Lazuli | Level 10
@novinosrin I looked more into the dataset, and realized the name values all starts with “pd_” which will need to alter t1 in the program. It should be t1 = scan(name,2,'_'). However, just by extracting one word is not enough to define unique t1, some name such as pd_HH_dash_lang_entry_ind and pd_HH_dash_load_time will need more than one word. Sorry, the original dataset is very large, it was hard to find the exact pattern, but I think I finally have the finalized version of the sample dataset, please see my most recent updates on the original post. Thank you so much for your patience.
novinosrin
Tourmaline | Level 20

@lydiawawa  it's a bit too late here and Im half asleep. Let me take a look in the morning. In the mean time I'm sure somebody else in a different time zone might respond. Hang in there

novinosrin
Tourmaline | Level 20

Hi @lydiawawa   

Ok, I think I need help from you in clarifying what makes it one unique set(group) .

For example., your example pd_HH_dash_lang_entry_ind and pd_HH_dash_load_time , what exacts would be classed as a group.

Since the discussion has been going back and forth, I would need you to thoroughly revisit the data samples in your original and clearly establish what constitutes a group.  This is crucial as this forms the basis for key variables that I would want to be defined in the hash object or as look up/join keys for that matter be it sql or hash

 

Take your time and respond here with utmost clarity if you can. 

 

Best!

 

 

lydiawawa
Lapis Lazuli | Level 10

@novinosrin I believe a group is defined by the term comes before "_load_time" or "_start_time" and after "pd_" therefore, in this case, the group will be "HH_dash" .

novinosrin
Tourmaline | Level 20

@lydiawawa  

 

Hmm, Ok taking this example below i.e your latest sample

 

All records have the same prefix pd_

So obviously we need to segregate within. So can you manually show me the segregated  groups with each groups load and start in place for the below sample? We really have to get over this roadblock

 

 

pd_login_load_time
pd_entry_confirm_load_time
pd_entry_confirm_start_time
pd_save_time
pd_entry_login_time
pd_entry_confirm_text
pd_login_text
pd_login_device
pd_confirm_device
pd_confirm_text
pd_login_time
pd_confirm_load_time
pd_login_text
pd_login_load_time
pd_entry_confirm_start_time
pd_entry_confirm_text
pd_HH_dash_lang_entry_ind
pd_HH_dash_load_time
pd_dob_close_browser_time
pd_dob_lang_entry_ind
pd_dob_lang_exit_ind
pd_dob_load_time
pd_dob_next_time
pd_home_lang_entry_ind
pd_home_lang_exit_ind
pd_home_lang_load_time
pd_home_lang_next_time

 

lydiawawa
Lapis Lazuli | Level 10

Hope the following helps on clarification. For every session_id, entries without "_load_time", "_start_time" or "_time" have missing time value. We want to fill in missings by "_start_time" or "_load_time", "_start_time" comes in first because it is sorted after "_load_time".

 

pd_login_load_time (login with time value)
pd_entry_confirm_load_time (entry_confirm with time value)
pd_entry_confirm_start_time (entry_confirm with time value)
pd_save_time ("_time " always have a time value, no need to fill in missing for this type of entries)
pd_entry_login_time ("_time" time value not missing)
pd_entry_confirm_text (entry_confirm)
pd_login_text (login)
pd_login_device (login)
pd_confirm_device (confirm)
pd_confirm_text (confirm)
pd_login_time (login)
pd_confirm_load_time (confirm with time value)
pd_login_text (login)
pd_login_load_time (login with time value)
pd_entry_confirm_start_time (entry_confirm)
pd_entry_confirm_text (entry_confirm)
pd_HH_dash_lang_entry_ind (HH_dash)
pd_HH_dash_load_time (HH_dash with time value)
pd_dob_close_browser_time ("_time" time value not missing)
pd_dob_lang_entry_ind (dob)
pd_dob_lang_exit_ind (dob)
pd_dob_load_time (dob with time value)
pd_dob_next_time ("_time" time value not missing)
pd_home_lang_entry_ind (home_lang)
pd_home_lang_exit_ind (home_lang)
pd_home_lang_load_time (home_lang with time value)
pd_home_lang_next_time ("_time" time value not missing)
novinosrin
Tourmaline | Level 20

@lydiawawa   Sounds like pattern is really hard as in some the first 2 words extract makes it a group and in others it happens to be the 1st 3

 

For example

pd_login_text (login)   /*first 2*/
pd_login_load_time (login with time value)/*first 2*/
pd_entry_confirm_start_time (entry_confirm) /*first 3*/
pd_entry_confirm_text (entry_confirm) /*first 3*/
pd_HH_dash_lang_entry_ind (HH_dash) /*first 3*/
pd_HH_dash_load_time (HH_dash with time value) ) /*first 3*/
pd_dob_close_browser_time ("_time" time value not missing) /*first 2*/
pd_dob_lang_entry_ind (dob) /*first 2*/
pd_dob_lang_exit_ind (dob) /*first 2*/
pd_dob_load_time (dob with time value) /*first 2*/

 

Isn't there any logical way to group? I mean a consistent pattern?

 

 

 

novinosrin
Tourmaline | Level 20

Ok @lydiawawa 

 

I am gonna attempt with this 

 

"For every session_id, entries without "_load_time", "_start_time" or "_time" have missing time value.

We want to fill in missings by "_start_time" or "_load_time", "_start_time" comes in first because it is sorted after "_load_time".  

 

Is this consistent? Can you confirm there is only one occurence of entries with load_time, start_time respectively for a session_id? This is very important!

Also, for every session_id , are you sure anything at has missing values can be filled with the above logic

 

Awaiting your response. Once you confirm, i will jump to code

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 24 replies
  • 1409 views
  • 13 likes
  • 3 in conversation