I am trying to put all questions and answers for a given ID on a single row, but am having difficulties doing so. There are some other fields but the main ones are concerned with re-arranging the data are:CaseID QuestionText AnswerText. The other columns are TestPeriod, TestName, RecordPrimaryID CaseWorkDT AssignedOwner ManagerName
Data I have is like this (with the additional columns but that's just extra noise at this point):
CaseID QuestionText AnswerText
1111 Claim Pass
1111 Memos Pass
1111 Procedure Fail
1112 Documented Pass
1112 Sent Pass
1113 Contents Pass
1114 Completed Pass
1114 Investigated Pass
Data Want:
CaseID Q1 A1 Q2 A2 Q3 A3
1111 Claim Pass Memos Pass Procedure Fail
1112 Documented Pass Sent Pass
1113 Contents Pass
1114 Completed Pass Investigated Pass
I first tried doing proc transpose but that didn't seem to get me what I needed. I then tried an array, and it looked like it was working, but we found some weird data stuff going on.
Here is what I was trying below (Note that the number of questions and answers can vary on which tests were doing that given month. I tried using variables for the array and was given errors, but 50 seemed to be enough for any given test):
data work.TM_Assessment_4c;
set work.TM_Assessment_3c;
by CaseID;
array questionarray {50} $400 Q1-Q50;
array answerarray {50} $400 A1-A50;
if first.CaseID then
i = 1;
questionarray(i) = QuestionText;
answerarray(i) = AnswerText;
if last.CaseID then
output;
i+1;
retain Q1-Q50;
retain A1-A50;
run;
proc sql;
create table work.TM_Assessment_5c AS
select TestPeriod
, TestName
, CaseID
, RecordPrimaryID
, CaseWorkDT
, AssignedOwner
, ManagerName
, Q1
, A1
, Q2
, A2
, Q3
, A3
, Q4
, A4
, Q5
, A5
, Q6
, A6
, Q7
, A7
, Q8
, A8
, Q9
, A9
, Q10
, A10
, Q11
, A11
, Q12
, A12
, Q13
, A13
, Q14
, A14
, Q15
, A15
, Q16
, A16
, Q17
, A17
, Q18
, A18
, Q19
, A19
, Q20
, A20
, Q21
, A21
, Q22
, A22
, Q23
, A23
, Q24
, A24
, Q25
, A25
, Q26
, A26
, Q27
, A27
, Q28
, A28
, Q29
, A29
, Q30
, A30
, Q31
, A31
, Q32
, A32
, Q33
, A33
, Q34
, A34
, Q35
, A35
, Q36
, A36
, Q37
, A37
, Q38
, A38
, Q39
, A39
, Q40
, A40
, Q41
, A41
, Q42
, A42
, Q43
, A43
, Q44
, A44
, Q45
, A45
, Q46
, A46
, Q47
, A47
, Q48
, A48
, Q49
, A49
, Q50
, A50
from work.TM_Assessment_4c
order by AssignedOwner, TestPeriod, TestName, RecordPrimaryID
;
quit;
The output data looked like it was what we wanted at first, but the actual data was not correct. We were seeing instances where a case would have only one question and one answer, and while it was showing that in the data, it was also showing additional questions and answers on the same row that had nothing to do with it.
Can anyone help me out?
Thanks all.
Personally I find your "want" structure to be subject to many problems. I can see a strong possibility the the Q value of "Claim" could end up in many different variables is just processing in order.
I suggest something like this may be more appropriate where each "question text" becomes the variable name.
data example; input id $ Activity :$15. Status :$6.; datalines; 1111 Claim Pass 1111 Memos Pass 1111 Procedure Fail 1112 Documented Pass 1112 Sent Pass 1113 Contents Pass 1114 Completed Pass 1114 Investigated Pass ; proc transpose data=example out=trans (drop=_name_); by id; var status; id Activity; run;
Note providing some data in the form of a data step that you can run to duplicate the results and posted into a text box.
@Maldra wrote:
Unfortunately that won't work really since there are thousands of different questions since there are thousands of test with different questions. It was just dummy data that I provided; the actual questions are long sentences. I may just have to go to them with the "have" data format if I can't get the "want" format.
Possibly even more reason not to make a data set in that format.
"Long sentences" might typically be a Label for a specific variable.
I would take a document, you do have one don't you, with all the "long sentence" values and make a data set or informat that links each specific sentence to a variable name to a specific order number. Then you have something that could add "Q1" or "A1" consistently.
Example:
proc format; invalue text2order (upcase) 'CLAIM' = 1 'MEMOS' = 2 'PROCEDURE' = 3 'DOCUMENTED' = 4 'SENT' = 5 'CONTENTS' = 6 'COMPLETED' = 7 'INVESTIGATED' = 8 ; run; data example; input CaseID :$10. QuestionText :$40. AnswerText :$10.; datalines; 1111 Claim Pass 1111 Memos Pass 1111 Procedure Fail 1112 Documented Pass 1112 Sent Pass 1113 Contents Pass 1114 Completed Pass 1114 Investigated Pass ; data want; set example; array q(8) $ 40; array a(8) $ 10; retain q: a:; by caseid; if first.caseid then call missing(of q(*), of a(*)); q[input(questiontext,text2order.)]= questiontext; a[input(questiontext,text2order.)]= AnswerText; if last.caseid; keep caseid q1-q8 a1-a8; run;
If you have any form of electronic document with the "questions" you should be able to create the invalue using Proc Format and a CNTLIN data set.
Could you explain the logic of how the data is stored?
What I have observed is Q1 is always the first.caseid which you will save as QuestionText and A1 as the Answertext
Then you remove the first.caseid record, in the next iteration Q2 is the first.caseid.
and so on for Q3
Will this iterative process work with your survey data structre?
Before you can make the data you need to answer this question:
Do you want the same question/answer pair to always end up in the same pair of variables across the cases?
For example do you want CLAIM to always go into Q1? And if a case does not have the CLAIM quesiton that it has missing values of Q1? Or perhaps missing values for A1 but Q1 still says CLAIM?
In that case you need to first map the Question text to the variable number. Do you have a full set of possible values? Or do you want to just use the set of questions in the current file? If the later will you ever get any other files of this type? Do they need to be mapped into the same variables?
Or is it ok that for the first case Q1 is CLAIM and for the second case Q1 is DOCUMENTED?
To just numbers them sequentially within each CASE is the easiest.
I like to use PROC SUMMARY.
First you need to know the maximum number of questions any single CASE could have.
Let's call that number N and stick it into a macro variable named N. For your dummy data N is 3.
data have ;
input CaseID :$10. QuestionText :$40. AnswerText :$10.;
cards;
1111 Claim Pass
1111 Memos Pass
1111 Procedure Fail
1112 Documented Pass
1112 Sent Pass
1113 Contents Pass
1114 Completed Pass
1114 Investigated Pass
;
%let N=3;
proc summary data=have nway ;
by caseid;
output out=want(drop=_type_) idgroup (out[&n] (questiontext answertext)=);
run;
Results:
Case Question Question Question Answer Answer Answer Obs ID _FREQ_ Text_1 Text_2 Text_3 Text_1 Text_2 Text_3 1 1111 3 Claim Memos Procedure Pass Pass Fail 2 1112 2 Documented Sent Pass Pass 3 1113 1 Contents Pass 4 1114 2 Completed Investigated Pass Pass
Nothing to identify what the source of the questions is???
If there were such a field then this is partially defensible but I bet you are going to deal with issues of dates and numeric values converting to text somewhere along the line.
After talking to the stakeholders some more, the reason they want it in this format is so they can enter it into a system, and that is how it takes the data. The solution of having gaps also wouldn't work. It needs to look like my original post. Is this something SAS is able to do?
You have a couple of answer to produce what you asked for.
Please explain it what way they are not what you need?
If they are entering into a "system" what type of file do they need?
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.