Determining flows in a survey

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 19
Accepted Solution

Determining flows in a survey

Hello SAS guru’s,
    I really need your help!  I have been struggling with this for a while and am starting to wonder if it’s even possible?  I’m using SAS 9.2.  Think of the following data set as a survey (collection of question_id’s) with different question flows (skip_id’s).
 
How can I determine the different possible flows?
 
I have the following data set:
data have;

input question_id $ skip_id $;

cards;

1.1.1   

1.1.2   

1.1.3   1.4.1

1.1.4   1.2.1

1.1.5   1.3.1

1.1.6   

1.1.7   

1.2.1   

1.2.2   

1.2.3   1.4.1

1.3.1   

1.3.2   

1.3.3   

1.4.1   

2.1.1 

2.1.2 2.2.3

2.1.3 2.2.4

2.2.1 

2.2.2

2.2.3

2.2.4

;
 
I know what I need to end up with:
FLOW 1                 
1.1.1  
1.1.2  
1.1.3   1.4.1
1.4.1  
2.1.1
2.1.2 2.2.3
2.2.3
2.2.4
 
FLOW 2
1.1.1  
1.1.2  
1.1.3   1.4.1
1.4.1  
2.1.1
2.1.3 2.2.4
2.2.4
 
FLOW 3
1.1.1  
1.1.2  
1.1.4   1.2.1
1.2.1  
1.2.2  
1.2.3   1.4.1
1.4.1  
2.1.1
2.1.2 2.2.3
2.2.3
2.2.4
 
FLOW 4
1.1.1  
1.1.2  
1.1.4   1.2.1
1.2.1  
1.2.2  
1.2.3   1.4.1
1.4.1  
2.1.1
2.1.3 2.2.4
2.2.4
 
FLOW 5
1.1.1  
1.1.2  
1.1.5   1.3.1
1.3.1  
1.3.2  
1.3.3  
1.4.1  
2.1.1
2.1.2 2.2.3
2.2.3
2.2.4
 
FLOW 6
1.1.1  
1.1.2  
1.1.5   1.3.1
1.3.1  
1.3.2  
1.3.3  
1.4.1  
2.1.1
2.1.3 2.2.4
2.2.4
 
FLOW 7
1.1.1  
1.1.2  
1.1.6  
1.1.7  
1.2.1  
1.2.2  
1.2.3   1.4.1
1.4.1  
2.1.1
2.1.2 2.2.3
2.2.3
2.2.4
 
FLOW 8
1.1.1  
1.1.2  
1.1.6  
1.1.7  
1.2.1  
1.2.2  
1.2.3   1.4.1
1.4.1  
2.1.1
2.1.3 2.2.4
2.2.4
 
 I just have no idea how to get there.  I know I need to:
  1. Identify if a question_id is valid for a flow.
  2. Create a table for each possible flow.
  3. Update the original table to identify question_id’s that should no longer be used (i.e. after flow 1 and flow 2 are created, question_id 1.1.3 is never used again).
Keeping in mind that the number of flows isn’t predetermined (survey questions can be added removed over time).
 
Any help you can give me to get this off the ground would be greatly appreciated.
 
Thanks in advance!
 
 

Accepted Solutions
Solution
‎04-11-2016 08:43 AM
Super User
Posts: 9,856

Re: Determining flows in a survey

I love this question.
Assuming 'question_id' has unique value.
Good Luck.





data have;
infile cards truncover;
input question_id $ skip_id $;
cards;
1.1.1   
1.1.2   
1.1.3   1.4.1
1.1.4   1.2.1
1.1.5   1.3.1
1.1.6   
1.1.7   
1.2.1   
1.2.2   
1.2.3   1.4.1
1.3.1   
1.3.2   
1.3.3   
1.4.1   
2.1.1 
2.1.2 2.2.3
2.1.3 2.2.4
2.2.1 
2.2.2
2.2.3
2.2.4
;
run;
data have;
 set have;
 idx+1;
 if not missing(lag(skip_id)) and missing(skip_id) then group+1;
run;
proc sql;
create table key as
 select question_id,skip_id
  from (select group,question_id from have where skip_id is missing) as a,
       (select group,skip_id from have where skip_id not is missing) as b
	where a.group=b.group
 union all
 select question_id,skip_id from have where skip_id not is missing ;
quit;

data want;
if _n_ eq 1 then do;
length path _path  $ 4000 _question_id $ 200;

if 0 then set have;
declare hash ha(dataset:'have',ordered:'y');
declare hiter hi('ha');
ha.definekey('idx');
ha.definedata('question_id','skip_id');
ha.definedone();

declare hash p(dataset:'have');
p.definekey('question_id');
p.definedata('idx');
p.definedone();

if 0 then set key;
declare hash k(hashexp:20,dataset:'key(where=(question_id is not missing and skip_id is not missing))',multidata:'y');
k.definekey('question_id');
k.definedata('skip_id');
k.definedone();
 
declare hash pa(ordered:'y');
declare hiter hi_path('pa');
pa.definekey('n');
pa.definedata('path');
pa.definedone();
 
end;
 
set have(where=(question_id is not missing and skip_id is not missing));
n=1;
path=catx(' ',question_id,skip_id);
pa.add();
do while(hi_path.next()=0);
_path=path;   
question_id=scan(path,-1,' ');
rc=k.find();  

if rc ne 0 then do;
 flow+1; 
 do i=1 to countw(path);
  question_id=scan(path,i,' ');
  rr=p.find();
  if rr=0 then do;
    if i=1 then do;
      _idx=0;
      do while(hi.next()=0);
	   _idx+1;
       if missing(skip_id) then output;
	   if idx=_idx then do;output;leave;end;
	  end;
    end;
    else do;
	 _question_id=scan(path,i+1,' ');
     rx=hi.setcur();
     do while(rx=0);
      if missing(skip_id) then output;
	  else do;       	
        if _question_id=skip_id then do;output;leave;end;
      end;
      rx=hi.next();
     end;
    end;
  end;
 end;
end;

do while(rc=0);
  if not findw(path,strip(skip_id)) then do;
   if length(path)+length(skip_id)+1 gt lengthc(path) then do; 
     putlog 'ERROR: The length of path and _path are set too short';
     stop;
   end;
   n+1;
   path=catx(' ',path,skip_id);
   pa.add();
   path=_path;
  end;
   
  rc=k.find_next();
end;
end;
pa.clear();
keep flow question_id skip_id ;
run;




View solution in original post


All Replies
Super User
Posts: 11,105

Re: Determining flows in a survey

Before attempting this I think you may want to consider what you mean by your requirement

2: Create a table for each possible flow.

 

Is this supposed to be each possible order of question response or for each possible value of each question?

And how many questions total do you expect to model? And then the ever popular just what will be done with the resulting set?

I did a rough analysis one one of my surveys and came up with over 52,000,000,000,000 paths and that was counting any of the continuous response variables (height or weight) as only one value. So printing all of the paths out was out of the question (printing at one page per second =>1.6 million years).

 

And please describe what the values of the variables mean:

1.1.3   1.4.1

What does the first position (1), second position (1) and third position (3) for the first variable indicate.

 

I would suggest first starting with how many conditions cause skips.

 

 

 

Occasional Contributor
Posts: 19

Re: Determining flows in a survey

Thanks for your reply ballardw.  When a skip_id or flow is stated, it relates to a question response.  So to explain the first 5 observations as an example:

1.1.1 relates to a question.  I.E.  What is the store name?

1.1.2 relates to a question.  I.E.  What is the store address?

1.1.3 relates to a question and possible response.  I.E.  Store status:  open?

1.1.4 relates to a question and possible response.  I.E.  Store status:  temporarily closed?

1.1.5 relates to a question and possible response.  I.E.  Store status:  closed permanently?

A user would have to answer 1.1.1, 1.1.2 and one of the following (1.1.3, 1.1.4 or 1.1.5).

 

So to answer your question, 2: Create a table for each possible flow relates to each possible value of each question and which question would follow based on the response.

 

Currently the real data set has 500 questions and 55 skips.  

 

There is also peripheral data that is used to remove certain questions when conditions are met (which I have programmed already) so a user will not have to answer hundreds of questions.

 

What will be done with the resulting data sets?  I need to write out test scenerios in blocks to test the survey any time changes are made.  I assume that a table is needed for each flow, but perhaps there is a different way to approach this.  I need to provide a tester with the question, which response to enter and then which question they should see next.  For example:  Enter survey, type "Store 1" into "What is the store name" question, type "123 Main St" into "What is the address" question, select "Open" in the "Store status" question.....    

 

The meaning of the values aren't really important, but say the first position is equivalent to a survey module with related questions, the second position is a subset of those questions and the third position represents the specific question and/or question response.

 

I hope I have provided enough information.  Please let me know if there is anything else I can do to help determine if this is a feasible request.

 

Thanks!

Super User
Posts: 11,105

Re: Determining flows in a survey

At only 2 options per question you start with about 3.3E150 complete paths (counting each response possible) through your survey.

Things like Store Name and Address add considerable additional items (unless you only have two stores..)

55 branches, ignoring potential for nesting gets to about 3.6E16 branch combinations.

 

I'll leave it to you figure how long testing a survey with the proposed approach would take. Note my previous bit on how many years to print at page per second. Testing will take longer.

 

A more practical approach is to identify all of the values that actually cause a skip and concentrate on those. Identify what value(s) for a question cause skip and the resulting first point in that path. Note that you are still in a place that it will take time to exhaust all 3.6E16 paths. Depending on complexity I would start by looking at the values involved in the most skips and involved in composite requirements for some questions (gender and age combination for instance).

 

Identifying the "range" of a skip may be more important then the specific skip. If you have skips like "If the answer to question 6 is x skip to question 8" (or other similar simple one response one result) those test quickly. The nasty part is someone inserting a question between 7 and 8. Should the skip now go to 7 or to 8 (and more difficult to catch if the range is 5 or 6 questions)?

 

Another thing is questions that allow a more open end (numerical) response is to test the range end points and ensuring values are in acceptable ranges.

 

I believe that some survey software can generate a flow chart of the code. Which may be an option. It may be easier to review that than to try to program in a another language.

 

I don't think there's any practical way to summarize 6 years of work into a couple of posts.

 

 

 

 

Respected Advisor
Posts: 4,804

Re: Determining flows in a survey

The underlying problem here is to list all paths in a network, going from the first question to the last. Here is an adaptation of a macro I wrote a while ago to solve this kind of problem.

 

data have;
length question_id skip_id $8;
infile datalines truncover;
input question_id skip_id;
datalines;
1.1.1   
1.1.2   
1.1.3   1.4.1
1.1.4   1.2.1
1.1.5   1.3.1
1.1.6   
1.1.7   
1.2.1   
1.2.2   
1.2.3   1.4.1
1.3.1   
1.3.2   
1.3.3   
1.4.1   
2.1.1 
2.1.2  2.2.3
2.1.3  2.2.4
2.2.1 
2.2.2
2.2.3
2.2.4
;

/* Create the network of questions */

data links;
set have;
prev_q = lag(question_id);
if not missing(prev_q) then do;
    from = prev_q;
    to = question_id;
    output;
    end;
if not missing(skip_id) then do;
    from = question_id;
    to = skip_id;
    output;
    end;
run;

/* Find the first and last questions */
data request;
length start end $8;
retain start;
set have end=done;
if _n_ = 1 then start = question_id;
if done then do;
    end = question_id;
    output;
    end;
keep start end;
run;

/* Add an index to the network */
proc sort data=links out=links(index=(from)); by from to; run;

/* Define the allPaths macro */
%macro allPaths(network, pairs, paths, maxPathLength=100);
%let nodeIDlength=8;
%let which=whichn;
data _null_;
set &network;
if vtype(from) = "C" then do;
    call symputx("nodeIDlength", cats("$",vlength(from)));
    call symputx("which", "whichc");
    end;
stop;
run;

data &paths(keep=start end pathNo nodeOrder from to);
set &pairs;
array arc[&maxPathLength];
array node[0:&maxPathLength] &nodeIDlength;
pathNo = 0;
len = 0;
node[len] = start;

do k = 1 to 10000; /* Catch infinite loop, just in case */
    if missing(node[len]) then do; /* Move to sibling or back off*/
        if len = 0 then leave;
        p = arc[len] + 1; 
        if p > maxp then do; /* Last arc: no sibling. Back off */
            len + (-1);
            call missing(node[len]);
            end;
        else do;
            set &network point=p nobs=maxp;
            if from = node[len-1] then do; /* Sibling found */
                node[len] = to;
                arc[len] = p;
                end;
            else do; /* No more sibling. Back off */
                len + (-1);
                call missing(node[len]);
                end;
            end;
        end;
    else do; /* Go further down the path */
        if node[len] = end then do; /* Simple path found */
            pathNo + 1;
            do j = 1 to len;
                from = node[j-1]; to = node[j]; nodeOrder=j; output;
                end;
            call missing(node[len]);
            end;
        else if &which(node[len], of node[*]) < len+1 then do; /* Cycle detected in path */
            call missing(node[len]);
            end;
        else do; /* Move ahead in path */
            from = node[len];
            set &network key=from/unique curobs=p;
            if p and len < dim(arc) then do; /* To node found, add it to path, if possible */
                len + 1;
                node[len] = to;
                arc[len] = p;
                end;
            else do; /* No arc found or no room left on path. Dead end */
                _error_ = 0;
                call missing(node[len]);
                end;
            end;
        end;
    end;
if pathNo = 0 then do; /* No path found. Output missing values */
    call missing(from, to, nodeOrder); output; end;
run;
%mend allPaths;

/* Call the allPaths macro */
%allpaths(links, request, flowList);

/* Reformat the output */
data flows;
set flowList; by pathNo;
flow = pathNo;
question_id = from;
output;
if last.pathNo then do;
    question_id = to;
    output;
    end;
keep flow question_id;
run;

/* Print the flows */
proc print data=flows noobs; 
by flow; id flow; 
run;
PG
Occasional Contributor
Posts: 19

Re: Determining flows in a survey

Hi PG,

    I get the following error when I run the code:

 

NOTE: Line generated by the invoked macro "ALLPATHS".

7     );             end;         else do;              from = node[len];             set

7   ! &network key=from/unique curobs=p;             if p and len < dim(arc) then do;

                                     -

                                     22

                                     76

7   !       len + 1;                 node[len] = to;                 arc[len] = p;

 

ERROR 22-322: Syntax error, expecting one of the following: END, INDSNAME, KEY, KEYS, NOBS,

              POINT.

 

ERROR 76-322: Syntax error, statement will be ignored.

 

NOTE: The SAS System stopped processing this step because of errors.

 

I included it as a screenshot as well as it may be easier to see.  I am not too familiar with complex macros, do you have any suggestions on how I could fix this?

 

Thanks.

 

 

 


image001.png
Respected Advisor
Posts: 4,804

Re: Determining flows in a survey

Unfortunately, set statement option curobs= appeared only in SAS 9.4.

PG
Solution
‎04-11-2016 08:43 AM
Super User
Posts: 9,856

Re: Determining flows in a survey

I love this question.
Assuming 'question_id' has unique value.
Good Luck.





data have;
infile cards truncover;
input question_id $ skip_id $;
cards;
1.1.1   
1.1.2   
1.1.3   1.4.1
1.1.4   1.2.1
1.1.5   1.3.1
1.1.6   
1.1.7   
1.2.1   
1.2.2   
1.2.3   1.4.1
1.3.1   
1.3.2   
1.3.3   
1.4.1   
2.1.1 
2.1.2 2.2.3
2.1.3 2.2.4
2.2.1 
2.2.2
2.2.3
2.2.4
;
run;
data have;
 set have;
 idx+1;
 if not missing(lag(skip_id)) and missing(skip_id) then group+1;
run;
proc sql;
create table key as
 select question_id,skip_id
  from (select group,question_id from have where skip_id is missing) as a,
       (select group,skip_id from have where skip_id not is missing) as b
	where a.group=b.group
 union all
 select question_id,skip_id from have where skip_id not is missing ;
quit;

data want;
if _n_ eq 1 then do;
length path _path  $ 4000 _question_id $ 200;

if 0 then set have;
declare hash ha(dataset:'have',ordered:'y');
declare hiter hi('ha');
ha.definekey('idx');
ha.definedata('question_id','skip_id');
ha.definedone();

declare hash p(dataset:'have');
p.definekey('question_id');
p.definedata('idx');
p.definedone();

if 0 then set key;
declare hash k(hashexp:20,dataset:'key(where=(question_id is not missing and skip_id is not missing))',multidata&colon;'y');
k.definekey('question_id');
k.definedata('skip_id');
k.definedone();
 
declare hash pa(ordered:'y');
declare hiter hi_path('pa');
pa.definekey('n');
pa.definedata('path');
pa.definedone();
 
end;
 
set have(where=(question_id is not missing and skip_id is not missing));
n=1;
path=catx(' ',question_id,skip_id);
pa.add();
do while(hi_path.next()=0);
_path=path;   
question_id=scan(path,-1,' ');
rc=k.find();  

if rc ne 0 then do;
 flow+1; 
 do i=1 to countw(path);
  question_id=scan(path,i,' ');
  rr=p.find();
  if rr=0 then do;
    if i=1 then do;
      _idx=0;
      do while(hi.next()=0);
	   _idx+1;
       if missing(skip_id) then output;
	   if idx=_idx then do;output;leave;end;
	  end;
    end;
    else do;
	 _question_id=scan(path,i+1,' ');
     rx=hi.setcur();
     do while(rx=0);
      if missing(skip_id) then output;
	  else do;       	
        if _question_id=skip_id then do;output;leave;end;
      end;
      rx=hi.next();
     end;
    end;
  end;
 end;
end;

do while(rc=0);
  if not findw(path,strip(skip_id)) then do;
   if length(path)+length(skip_id)+1 gt lengthc(path) then do; 
     putlog 'ERROR: The length of path and _path are set too short';
     stop;
   end;
   n+1;
   path=catx(' ',path,skip_id);
   pa.add();
   path=_path;
  end;
   
  rc=k.find_next();
end;
end;
pa.clear();
keep flow question_id skip_id ;
run;




Occasional Contributor
Posts: 19

Re: Determining flows in a survey

Hi Xia,

     Your solution is great!  I hadn't thought of adding an ID to the flows and keeping everything in one table.  This will be nice to have as a peripheral table as well.  You are brilliant.  Thank you so much for the time you put into this! 

Occasional Contributor
Posts: 19

Re: Determining flows in a survey

Thank you for your solution PGstats. I unfortunately don't have sas at home but am excited to try your solution on Monday.
Super User
Posts: 9,856

Re: Determining flows in a survey

multidata&colon;'y'); -->> mul...a : 'y');
Occasional Contributor
Posts: 19

Re: Determining flows in a survey

Thank you Ksharp! I went from thinking I may have the daunting task of doing this manually to having two solutions to try. I've never looked so forward to a Monday!
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 529 views
  • 2 likes
  • 4 in conversation