Solved: Parsing strange data (which has missing values)

altatunc · Posted 01-03-2020 04:24 PM

I have a data set which is the result of a matrix/likert/grid table in a survey. However, I received the data in a strange format and am having trouble re-parsing it into separate columns of data - 1 column per likert prompt.

In the sample data, row one has all the right columns of data and the answers are just following it in separate brackets (possible answers are VERY IMPORTANT4, 3, 2, Not AT ALLIMPORTANT1), but as you can see, the following rows have some missing data.

Any thoughts on how to get a column of result data per prompt and show missing data where appropriate? I am pretty rusty on string matches and don't really know how to get the missing information to show as missing when not there. Create separate tables and merge things back together?

Thanks!

Sample Data is here:

DATA WORK.Book1;
    LENGTH
        ID                 8
        Results          $ 145 ;
    FORMAT
        ID               BEST12.
        Results          $CHAR145. ;
    INFORMAT
        ID               BEST12.
        Results          $CHAR145. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        ID               : BEST32.
        Results          : $CHAR145. ;
DATALINES4;
1{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
2{Spousal considerations}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
3{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{3}|{Financial considerations }{NOT AT ALLIMPORTANT1}
;;;;

ErikLund_Jensen · Posted 01-04-2020 07:50 AM

Hi @altatunc

It's a cold, rainy and stormy day here, so I really enjoyed working with your problem. Many years ago I often worked with survey data in all sort of unreadable proprietary formats, and I went down memory lane and tried to give your sample a full treatment. The principles used are:

No meaningful variable names - All answers are named V1-Vn according to their position in the questionnaire, and the corresponding prompt text is applied to the variable as a SAS variable label.

No text variables for categories - All answers contain only a category number, and the corresponding text is applied to the variable as a SAS format..

Given your input, the result is:

Dataset "as is" without labels and formats:

Dataset as presented using labels and formats:

Here is the full code (assuming your Book1 data set exists). Note: It is fully dynamic and data driven, as long as answers follow the pattern <some text>category, like Very Important4 or 3 , it will need modifications in Step 1 to handle other patterns, and further modifications to handle not-numeric answers with texts only and no category number.

* Split in one observation per ID / question with answer category and -text;
data wa1 (drop=Results w1 w2); set book1;
	length w1 w2 $84 Qtext $40 Avalue 8 Alabel $40;
	do Pos = 1 to countc(results,'|') + 1;
		w1 = scan(Results,Pos,'|');
		Qtext = scan(w1,1,'{}');
		w2 = scan(w1,2,'{}');
		Avalue = input(prxchange('s/(\D*)(\d+)/$2/',-1,w2),8.);
		Alabel = prxchange('s/(\D*)(\d+)/$1/',-1,w2);
		output;
	end;
run;

/*----------------------------------------------------------------------
Transform to "full questionnaire" with all questions present for each ID
----------------------------------------------------------------------*/

* Extract all possible questions from survey data;
* Number is the last position where a question is found for any ID;
proc sql;
	create table qlist as 
		select distinct 
			Qtext,
			max(Pos) as Qnum
		from wa1
		group by Qtext
		order by Qnum;
quit;

* get number of questions - will be used in a later section;
proc sql noprint;
	select max(Qnum) into :MaxQ
	from qlist;
quit;

* Expand numbered list with ID's to get full list of questions for all ID's;
proc sql;
	create table wqlist as 
		select 
			a.ID,
			b.Qtext,
			b.Qnum
		from qlist as b, 
		(select distinct ID from wa1) as a
		where 1 = 1;
quit;
			
* Merge list with survey data to get "full questionnaires" for all ID's;
proc sql;
	create table wa2 as
		select 
			a.ID,
			a.Qnum,
			b.Qtext,
			b.Avalue,
			b.Alabel
		from wqlist as a
		left join wa1 as b
		on 
			a.ID = b.ID
			and a.Qtext = b.Qtext 
		order by
			ID,
			Qnum;
quit;

/*--------------------------------------------------------------------
Formats - value texts for all questions
--------------------------------------------------------------------*/

* Get data to create formats;
proc sql;
	create table wfmt1 as 
		select distinct
			Qnum,
			Avalue,
			Alabel
		from wa2
		where Avalue ne .
		order by
			Qnum,
			Avalue;
quit;

* Create formats for all questions/answer categories;
data _null_; set wfmt1 end=end;
	by Qnum;
	length katlist $1000 w $30;
	retain katlist;
	if _N_ = 1 then call execute('proc format;');

	if first.Qnum then do;
		katlist = 'value q' || strip(put(Qnum,8.)) || 'fmt';
		katlist = catx(' ',katlist,'. = "N/A"');
	end;
	nolabel = strip(put(Avalue,8.)) || ' <no label>';
	katlist = catx(' ',katlist,strip(put(Avalue,8.)) || ' = "' || coalescec(trim(Alabel),trim(nolabel)) || '"');
	if last.Qnum then do; 
		katlist = catt(katlist,';');
		call execute(katlist);
	end;

	if end then call execute('run;');
run;

/*-------------------------------------------------------------------
Result data - Convert to columns V1-Vn per ID and move question and 
category texts to variable labels and formats.
--------------------------------------------------------------------*/

* Transpose data to one observation per ID with V1-Vn populated - drop all texts; 
data wa3; set wa2; 
	drop Qnum Qtext Avalue Alabel;
	by ID;
	array A 8 V1-V%eval(&MaxQ);
	retain V1-V%eval(&MaxQ);
	if first.ID then call missing (of V1-V%eval(&MaxQ));
	A{Qnum} = Avalue;
	if last.ID then output;
run;

* Initiate empty result data from list of questions;
* Variables are numbered V1-Vn, has question text as label and answer text as format;
data _null_; set qlist end=end;
	length attr $100;
	if _N_ = 1 then call execute('data result; attrib ID length=8;');
	attr = 'attrib V' || strip(put(Qnum,8.)) || ' length=8 label="' || trim(Qtext) || '" format=q' || strip(put(Qnum,8.)) || 'fmt.;';
	call execute(attr);
	if end then call execute('stop; run;');
run;

* Append data to get final result;
proc append base=result data=wa3;
run;

View solution in original post

Patrick · Posted 01-03-2020 05:12 PM

Something like below should give you a start.

DATA WORK.Book1;
    LENGTH
    ID    $10
    key   $40
    value $40;
  INFILE DATALINES4
    DLM=',{}|'
    truncover
    ;
  INPUT
    ID    :$10. @;
  do while(1);
    input key :$40. value :$40. @;
    if missing(key) then leave;
    output;
  end;
    
  DATALINES4;
1,{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
2,{Spousal considerations}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
3,{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{3}|{Financial considerations }{NOT AT ALLIMPORTANT1}
;;;;

Should you have issue to read the actual data then please post a few lines from there and post them as a .txt attachment (so that the EG import wizard doesn't change delimiters).

altatunc · Posted 01-05-2020 10:22 AM

Thank you! This is helpful. It doesn't save a place for missing values but that could be solved in later processing. Thanks again, Adam

ballardw · Posted 01-03-2020 05:17 PM

@altatunc wrote:

I have a data set which is the result of a matrix/likert/grid table in a survey. However, I received the data in a strange format and am having trouble re-parsing it into separate columns of data - 1 column per likert prompt.

In the sample data, row one has all the right columns of data and the answers are just following it in separate brackets (possible answers are VERY IMPORTANT4, 3, 2, Not AT ALLIMPORTANT1), but as you can see, the following rows have some missing data.

Any thoughts on how to get a column of result data per prompt and show missing data where appropriate? I am pretty rusty on string matches and don't really know how to get the missing information to show as missing when not there. Create separate tables and merge things back together?

Thanks!

Sample Data is here:
DATA WORK.Book1;
    LENGTH
        ID                 8
        Results          $ 145 ;
    FORMAT
        ID               BEST12.
        Results          $CHAR145. ;
    INFORMAT
        ID               BEST12.
        Results          $CHAR145. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        ID               : BEST32.
        Results          : $CHAR145. ;
DATALINES4;
1{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
2{Spousal considerations}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
3{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{3}|{Financial considerations }{NOT AT ALLIMPORTANT1}
;;;;

When I copied your datalines there is a non-printable character appearing after the ID value. I replaced that with a | and used the | as a delimiter. I don't know if that is practical from your data or not.

You really should show what you expect for a result. Without any such actual guideline I built something that has one variable per "topic" or question or what ever and read the likert values as a numeric with a custom informat. I suspect you skipped some likert values but this may give you a starting point for one way to approach this.

Proc format library=work ;
invalue mylikert (default=25)
"{NOT AT ALLIMPORTANT1}"=1
"{3}"                   =3
"{VERY IMPORTANT4}"     =4
;
run;


DATA WORK.Book1;
    INFILE DATALINES4
        DLM='|'
        MISSOVER
        DSD ;
    informat ID $5. SC LS Hours FC Mylikert.;
    INPUT
        ID               
        @"{Spousal considerations}"    Sc 
        @"{Lifestyle}"                 LS 
        @"{Hours}"                     Hours 
        @"{Financial considerations }" FC 
   ;
   label
      SC = "Spousal considerations"
      LS = "Lifestyle"
      Hours= "Hours"
      FC   = "Financial considerations"
   ;
DATALINES4;
1|{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
2|{Spousal considerations}{VERY IMPORTANT4}|{Hours}{VERY IMPORTANT4}|{Financial considerations }{NOT AT ALLIMPORTANT1}
3|{Spousal considerations}{VERY IMPORTANT4}|{Lifestyle}{3}|{Financial considerations }{NOT AT ALLIMPORTANT1}
;;;;

The key that works here is the @"some string" on an input statement finds the value in the string and then starts reading the column after. I didn't do anything with the {} in case you have a topic or question that had something along the lines of {I consider x VERY IMPORTANT} and many approaches that could ignore the braces might have difficulty. Leaving the braces in makes complete strings.

altatunc · Posted 01-05-2020 10:25 AM

This is a great solution! Thanks for the quick update to my problem!

Adam

ErikLund_Jensen · Posted 01-04-2020 07:50 AM

Hi @altatunc

It's a cold, rainy and stormy day here, so I really enjoyed working with your problem. Many years ago I often worked with survey data in all sort of unreadable proprietary formats, and I went down memory lane and tried to give your sample a full treatment. The principles used are:

No meaningful variable names - All answers are named V1-Vn according to their position in the questionnaire, and the corresponding prompt text is applied to the variable as a SAS variable label.

No text variables for categories - All answers contain only a category number, and the corresponding text is applied to the variable as a SAS format..

Given your input, the result is:

Dataset "as is" without labels and formats:

Dataset as presented using labels and formats:

Here is the full code (assuming your Book1 data set exists). Note: It is fully dynamic and data driven, as long as answers follow the pattern <some text>category, like Very Important4 or 3 , it will need modifications in Step 1 to handle other patterns, and further modifications to handle not-numeric answers with texts only and no category number.

* Split in one observation per ID / question with answer category and -text;
data wa1 (drop=Results w1 w2); set book1;
	length w1 w2 $84 Qtext $40 Avalue 8 Alabel $40;
	do Pos = 1 to countc(results,'|') + 1;
		w1 = scan(Results,Pos,'|');
		Qtext = scan(w1,1,'{}');
		w2 = scan(w1,2,'{}');
		Avalue = input(prxchange('s/(\D*)(\d+)/$2/',-1,w2),8.);
		Alabel = prxchange('s/(\D*)(\d+)/$1/',-1,w2);
		output;
	end;
run;

/*----------------------------------------------------------------------
Transform to "full questionnaire" with all questions present for each ID
----------------------------------------------------------------------*/

* Extract all possible questions from survey data;
* Number is the last position where a question is found for any ID;
proc sql;
	create table qlist as 
		select distinct 
			Qtext,
			max(Pos) as Qnum
		from wa1
		group by Qtext
		order by Qnum;
quit;

* get number of questions - will be used in a later section;
proc sql noprint;
	select max(Qnum) into :MaxQ
	from qlist;
quit;

* Expand numbered list with ID's to get full list of questions for all ID's;
proc sql;
	create table wqlist as 
		select 
			a.ID,
			b.Qtext,
			b.Qnum
		from qlist as b, 
		(select distinct ID from wa1) as a
		where 1 = 1;
quit;
			
* Merge list with survey data to get "full questionnaires" for all ID's;
proc sql;
	create table wa2 as
		select 
			a.ID,
			a.Qnum,
			b.Qtext,
			b.Avalue,
			b.Alabel
		from wqlist as a
		left join wa1 as b
		on 
			a.ID = b.ID
			and a.Qtext = b.Qtext 
		order by
			ID,
			Qnum;
quit;

/*--------------------------------------------------------------------
Formats - value texts for all questions
--------------------------------------------------------------------*/

* Get data to create formats;
proc sql;
	create table wfmt1 as 
		select distinct
			Qnum,
			Avalue,
			Alabel
		from wa2
		where Avalue ne .
		order by
			Qnum,
			Avalue;
quit;

* Create formats for all questions/answer categories;
data _null_; set wfmt1 end=end;
	by Qnum;
	length katlist $1000 w $30;
	retain katlist;
	if _N_ = 1 then call execute('proc format;');

	if first.Qnum then do;
		katlist = 'value q' || strip(put(Qnum,8.)) || 'fmt';
		katlist = catx(' ',katlist,'. = "N/A"');
	end;
	nolabel = strip(put(Avalue,8.)) || ' <no label>';
	katlist = catx(' ',katlist,strip(put(Avalue,8.)) || ' = "' || coalescec(trim(Alabel),trim(nolabel)) || '"');
	if last.Qnum then do; 
		katlist = catt(katlist,';');
		call execute(katlist);
	end;

	if end then call execute('run;');
run;

/*-------------------------------------------------------------------
Result data - Convert to columns V1-Vn per ID and move question and 
category texts to variable labels and formats.
--------------------------------------------------------------------*/

* Transpose data to one observation per ID with V1-Vn populated - drop all texts; 
data wa3; set wa2; 
	drop Qnum Qtext Avalue Alabel;
	by ID;
	array A 8 V1-V%eval(&MaxQ);
	retain V1-V%eval(&MaxQ);
	if first.ID then call missing (of V1-V%eval(&MaxQ));
	A{Qnum} = Avalue;
	if last.ID then output;
run;

* Initiate empty result data from list of questions;
* Variables are numbered V1-Vn, has question text as label and answer text as format;
data _null_; set qlist end=end;
	length attr $100;
	if _N_ = 1 then call execute('data result; attrib ID length=8;');
	attr = 'attrib V' || strip(put(Qnum,8.)) || ' length=8 label="' || trim(Qtext) || '" format=q' || strip(put(Qnum,8.)) || 'fmt.;';
	call execute(attr);
	if end then call execute('stop; run;');
run;

* Append data to get final result;
proc append base=result data=wa3;
run;

altatunc · Posted 01-05-2020 10:19 AM

This is great! I had to do some adjustments given my actual data, but I was able to get most of this to work. The first bit of code was especially helpful in getting the data parsed into a usable format.

Thanks so much again and I hope the rain passes eventually!

Parsing strange data (which has missing values)

Re: Parsing strange data (which has missing values)

Re: Parsing strange data (which has missing values)

Re: Parsing strange data (which has missing values)

Re: Parsing strange data (which has missing values)

Re: Parsing strange data (which has missing values)

Re: Parsing strange data (which has missing values)

Re: Parsing strange data (which has missing values)

SAS Innovate 2025: Call for Content

Classroom Training Available!