I have a data set which is the result of a matrix/likert/grid table in a survey. However, I received the data in a strange format and am having trouble re-parsing it into separate columns of data - 1 column per likert prompt.
In the sample data, row one has all the right columns of data and the answers are just following it in separate brackets (possible answers are VERY IMPORTANT4, 3, 2, Not AT ALLIMPORTANT1), but as you can see, the following rows have some missing data.
Any thoughts on how to get a column of result data per prompt and show missing data where appropriate? I am pretty rusty on string matches and don't really know how to get the missing information to show as missing when not there. Create separate tables and merge things back together?
Thanks!
Sample Data is here:
DATA WORK.Book1;
LENGTH
ID 8
Results $ 145 ;
FORMAT
ID BEST12.
Results $CHAR145. ;
INFORMAT
ID BEST12.
Results $CHAR145. ;
INFILE DATALINES4
DLM='7F'x
MISSOVER
DSD ;
INPUT
ID : BEST32.
Results : $CHAR145. ;
DATALINES4;
1{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
2{Spousal considerations}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
3{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{3}|{Financial considerations }{NOT AT ALLIMPORTANT1}
;;;;
Hi @altatunc
It's a cold, rainy and stormy day here, so I really enjoyed working with your problem. Many years ago I often worked with survey data in all sort of unreadable proprietary formats, and I went down memory lane and tried to give your sample a full treatment. The principles used are:
Given your input, the result is:
Dataset "as is" without labels and formats:
Dataset as presented using labels and formats:
Here is the full code (assuming your Book1 data set exists). Note: It is fully dynamic and data driven, as long as answers follow the pattern <some text>category, like Very Important4 or 3 , it will need modifications in Step 1 to handle other patterns, and further modifications to handle not-numeric answers with texts only and no category number.
* Split in one observation per ID / question with answer category and -text;
data wa1 (drop=Results w1 w2); set book1;
length w1 w2 $84 Qtext $40 Avalue 8 Alabel $40;
do Pos = 1 to countc(results,'|') + 1;
w1 = scan(Results,Pos,'|');
Qtext = scan(w1,1,'{}');
w2 = scan(w1,2,'{}');
Avalue = input(prxchange('s/(\D*)(\d+)/$2/',-1,w2),8.);
Alabel = prxchange('s/(\D*)(\d+)/$1/',-1,w2);
output;
end;
run;
/*----------------------------------------------------------------------
Transform to "full questionnaire" with all questions present for each ID
----------------------------------------------------------------------*/
* Extract all possible questions from survey data;
* Number is the last position where a question is found for any ID;
proc sql;
create table qlist as
select distinct
Qtext,
max(Pos) as Qnum
from wa1
group by Qtext
order by Qnum;
quit;
* get number of questions - will be used in a later section;
proc sql noprint;
select max(Qnum) into :MaxQ
from qlist;
quit;
* Expand numbered list with ID's to get full list of questions for all ID's;
proc sql;
create table wqlist as
select
a.ID,
b.Qtext,
b.Qnum
from qlist as b,
(select distinct ID from wa1) as a
where 1 = 1;
quit;
* Merge list with survey data to get "full questionnaires" for all ID's;
proc sql;
create table wa2 as
select
a.ID,
a.Qnum,
b.Qtext,
b.Avalue,
b.Alabel
from wqlist as a
left join wa1 as b
on
a.ID = b.ID
and a.Qtext = b.Qtext
order by
ID,
Qnum;
quit;
/*--------------------------------------------------------------------
Formats - value texts for all questions
--------------------------------------------------------------------*/
* Get data to create formats;
proc sql;
create table wfmt1 as
select distinct
Qnum,
Avalue,
Alabel
from wa2
where Avalue ne .
order by
Qnum,
Avalue;
quit;
* Create formats for all questions/answer categories;
data _null_; set wfmt1 end=end;
by Qnum;
length katlist $1000 w $30;
retain katlist;
if _N_ = 1 then call execute('proc format;');
if first.Qnum then do;
katlist = 'value q' || strip(put(Qnum,8.)) || 'fmt';
katlist = catx(' ',katlist,'. = "N/A"');
end;
nolabel = strip(put(Avalue,8.)) || ' <no label>';
katlist = catx(' ',katlist,strip(put(Avalue,8.)) || ' = "' || coalescec(trim(Alabel),trim(nolabel)) || '"');
if last.Qnum then do;
katlist = catt(katlist,';');
call execute(katlist);
end;
if end then call execute('run;');
run;
/*-------------------------------------------------------------------
Result data - Convert to columns V1-Vn per ID and move question and
category texts to variable labels and formats.
--------------------------------------------------------------------*/
* Transpose data to one observation per ID with V1-Vn populated - drop all texts;
data wa3; set wa2;
drop Qnum Qtext Avalue Alabel;
by ID;
array A 8 V1-V%eval(&MaxQ);
retain V1-V%eval(&MaxQ);
if first.ID then call missing (of V1-V%eval(&MaxQ));
A{Qnum} = Avalue;
if last.ID then output;
run;
* Initiate empty result data from list of questions;
* Variables are numbered V1-Vn, has question text as label and answer text as format;
data _null_; set qlist end=end;
length attr $100;
if _N_ = 1 then call execute('data result; attrib ID length=8;');
attr = 'attrib V' || strip(put(Qnum,8.)) || ' length=8 label="' || trim(Qtext) || '" format=q' || strip(put(Qnum,8.)) || 'fmt.;';
call execute(attr);
if end then call execute('stop; run;');
run;
* Append data to get final result;
proc append base=result data=wa3;
run;
Something like below should give you a start.
DATA WORK.Book1;
LENGTH
ID $10
key $40
value $40;
INFILE DATALINES4
DLM=',{}|'
truncover
;
INPUT
ID :$10. @;
do while(1);
input key :$40. value :$40. @;
if missing(key) then leave;
output;
end;
DATALINES4;
1,{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
2,{Spousal considerations}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
3,{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{3}|{Financial considerations }{NOT AT ALLIMPORTANT1}
;;;;
Should you have issue to read the actual data then please post a few lines from there and post them as a .txt attachment (so that the EG import wizard doesn't change delimiters).
Thank you! This is helpful. It doesn't save a place for missing values but that could be solved in later processing. Thanks again, Adam
@altatunc wrote:
I have a data set which is the result of a matrix/likert/grid table in a survey. However, I received the data in a strange format and am having trouble re-parsing it into separate columns of data - 1 column per likert prompt.
In the sample data, row one has all the right columns of data and the answers are just following it in separate brackets (possible answers are VERY IMPORTANT4, 3, 2, Not AT ALLIMPORTANT1), but as you can see, the following rows have some missing data.
Any thoughts on how to get a column of result data per prompt and show missing data where appropriate? I am pretty rusty on string matches and don't really know how to get the missing information to show as missing when not there. Create separate tables and merge things back together?
Thanks!
Sample Data is here:
DATA WORK.Book1; LENGTH ID 8 Results $ 145 ; FORMAT ID BEST12. Results $CHAR145. ; INFORMAT ID BEST12. Results $CHAR145. ; INFILE DATALINES4 DLM='7F'x MISSOVER DSD ; INPUT ID : BEST32. Results : $CHAR145. ; DATALINES4; 1{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1} 2{Spousal considerations}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1} 3{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{3}|{Financial considerations }{NOT AT ALLIMPORTANT1} ;;;;
When I copied your datalines there is a non-printable character appearing after the ID value. I replaced that with a | and used the | as a delimiter. I don't know if that is practical from your data or not.
You really should show what you expect for a result. Without any such actual guideline I built something that has one variable per "topic" or question or what ever and read the likert values as a numeric with a custom informat. I suspect you skipped some likert values but this may give you a starting point for one way to approach this.
Proc format library=work ; invalue mylikert (default=25) "{NOT AT ALLIMPORTANT1}"=1 "{3}" =3 "{VERY IMPORTANT4}" =4 ; run; DATA WORK.Book1; INFILE DATALINES4 DLM='|' MISSOVER DSD ; informat ID $5. SC LS Hours FC Mylikert.; INPUT ID @"{Spousal considerations}" Sc @"{Lifestyle}" LS @"{Hours}" Hours @"{Financial considerations }" FC ; label SC = "Spousal considerations" LS = "Lifestyle" Hours= "Hours" FC = "Financial considerations" ; DATALINES4; 1|{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1} 2|{Spousal considerations}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1} 3|{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{3}|{Financial considerations }{NOT AT ALLIMPORTANT1} ;;;;
The key that works here is the @"some string" on an input statement finds the value in the string and then starts reading the column after. I didn't do anything with the {} in case you have a topic or question that had something along the lines of {I consider x VERY IMPORTANT} and many approaches that could ignore the braces might have difficulty. Leaving the braces in makes complete strings.
This is a great solution! Thanks for the quick update to my problem!
Adam
Hi @altatunc
It's a cold, rainy and stormy day here, so I really enjoyed working with your problem. Many years ago I often worked with survey data in all sort of unreadable proprietary formats, and I went down memory lane and tried to give your sample a full treatment. The principles used are:
Given your input, the result is:
Dataset "as is" without labels and formats:
Dataset as presented using labels and formats:
Here is the full code (assuming your Book1 data set exists). Note: It is fully dynamic and data driven, as long as answers follow the pattern <some text>category, like Very Important4 or 3 , it will need modifications in Step 1 to handle other patterns, and further modifications to handle not-numeric answers with texts only and no category number.
* Split in one observation per ID / question with answer category and -text;
data wa1 (drop=Results w1 w2); set book1;
length w1 w2 $84 Qtext $40 Avalue 8 Alabel $40;
do Pos = 1 to countc(results,'|') + 1;
w1 = scan(Results,Pos,'|');
Qtext = scan(w1,1,'{}');
w2 = scan(w1,2,'{}');
Avalue = input(prxchange('s/(\D*)(\d+)/$2/',-1,w2),8.);
Alabel = prxchange('s/(\D*)(\d+)/$1/',-1,w2);
output;
end;
run;
/*----------------------------------------------------------------------
Transform to "full questionnaire" with all questions present for each ID
----------------------------------------------------------------------*/
* Extract all possible questions from survey data;
* Number is the last position where a question is found for any ID;
proc sql;
create table qlist as
select distinct
Qtext,
max(Pos) as Qnum
from wa1
group by Qtext
order by Qnum;
quit;
* get number of questions - will be used in a later section;
proc sql noprint;
select max(Qnum) into :MaxQ
from qlist;
quit;
* Expand numbered list with ID's to get full list of questions for all ID's;
proc sql;
create table wqlist as
select
a.ID,
b.Qtext,
b.Qnum
from qlist as b,
(select distinct ID from wa1) as a
where 1 = 1;
quit;
* Merge list with survey data to get "full questionnaires" for all ID's;
proc sql;
create table wa2 as
select
a.ID,
a.Qnum,
b.Qtext,
b.Avalue,
b.Alabel
from wqlist as a
left join wa1 as b
on
a.ID = b.ID
and a.Qtext = b.Qtext
order by
ID,
Qnum;
quit;
/*--------------------------------------------------------------------
Formats - value texts for all questions
--------------------------------------------------------------------*/
* Get data to create formats;
proc sql;
create table wfmt1 as
select distinct
Qnum,
Avalue,
Alabel
from wa2
where Avalue ne .
order by
Qnum,
Avalue;
quit;
* Create formats for all questions/answer categories;
data _null_; set wfmt1 end=end;
by Qnum;
length katlist $1000 w $30;
retain katlist;
if _N_ = 1 then call execute('proc format;');
if first.Qnum then do;
katlist = 'value q' || strip(put(Qnum,8.)) || 'fmt';
katlist = catx(' ',katlist,'. = "N/A"');
end;
nolabel = strip(put(Avalue,8.)) || ' <no label>';
katlist = catx(' ',katlist,strip(put(Avalue,8.)) || ' = "' || coalescec(trim(Alabel),trim(nolabel)) || '"');
if last.Qnum then do;
katlist = catt(katlist,';');
call execute(katlist);
end;
if end then call execute('run;');
run;
/*-------------------------------------------------------------------
Result data - Convert to columns V1-Vn per ID and move question and
category texts to variable labels and formats.
--------------------------------------------------------------------*/
* Transpose data to one observation per ID with V1-Vn populated - drop all texts;
data wa3; set wa2;
drop Qnum Qtext Avalue Alabel;
by ID;
array A 8 V1-V%eval(&MaxQ);
retain V1-V%eval(&MaxQ);
if first.ID then call missing (of V1-V%eval(&MaxQ));
A{Qnum} = Avalue;
if last.ID then output;
run;
* Initiate empty result data from list of questions;
* Variables are numbered V1-Vn, has question text as label and answer text as format;
data _null_; set qlist end=end;
length attr $100;
if _N_ = 1 then call execute('data result; attrib ID length=8;');
attr = 'attrib V' || strip(put(Qnum,8.)) || ' length=8 label="' || trim(Qtext) || '" format=q' || strip(put(Qnum,8.)) || 'fmt.;';
call execute(attr);
if end then call execute('stop; run;');
run;
* Append data to get final result;
proc append base=result data=wa3;
run;
This is great! I had to do some adjustments given my actual data, but I was able to get most of this to work. The first bit of code was especially helpful in getting the data parsed into a usable format.
Thanks so much again and I hope the rain passes eventually!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.