Solved
Contributor
Posts: 45

# Messy Data

Hi,

I have some messy raw data as follows:

Question

What is the Capital of Malawi?

A.

Paris

B.

London

C.

Lilongwe

Question

How many prime numbers are there between 1 - 100?

A.

Three

B.

Four

C.

five

D.

Can't be determined

How would I write a program which will extract a variable question by reading what comes after the observation "Question", and similarly variables A, B, C, D by reading what comes after each of those.

Pardon me if I have not illustrated my question clearly.

Result would look like what it is in the attachment:

Jijil Ramakrishnan

Message was edited by: Jijil Ramakrishnan

Accepted Solutions
Solution
‎07-29-2013 03:37 PM
Posts: 3,167

## Re: Messy Data

If it is already in SAS dataset, the following quick_n_dirty code may help:

data have;

infile cards4 truncover;

input var;

cards4;

Question

What is the Capital of Malawi?

A.

Paris

B.

London

C.

Lilongwe

Question

How many prime numbers are there between 1 - 100?

A.

Three

B.

Four

C.

five

D.

Can't be determined

;;;;

data want;

merge have have(firstobs=2 rename=var=_var) end=last;

length Question \$ 100

A \$ 100

B \$ 100

C  \$ 100

D \$ 100

;

retain A B C D question;

if _n_>1 and var='Question' or last then do;output;call missing (a, b, c, d);end;

select (var);

when("Question") do;question=_var;end;

when ("A.") do;  A=_var;end;

when ("B.") do;  B=_var;end;

when ("C.") do;  C=_var;end;

when ("D.") do;  D=_var;end;

otherwise;

end;

drop _var var;

run;

Haikuo

All Replies
Super User
Posts: 6,762

## Re: Messy Data

You've illustrated the situation very clearly.  But you need to provide a little more guidance on your intended outcome.  For example, you could want an outcome that includes one observation per question, and these variables:

question

In that case, you might want to provide the largest possible number of answers per question.

Or you might want an outcome that includes many observations per question, one for each possible answer.  The variables might be:

question

answer_code (a, b, c, d ...)

Or you might want to design something slightly different from either of these.

That sort of information would be helpful.

Ooooops ... looks like your diagram already maps that out.  So assuming that you want one observation per question, and that four answers will be the maximum ...

data want;

infile rawdata truncover end=alldone;

length test \$ 8;

input test;

if test = 'Question' then do;

if _n_ > 1 then output;

input question;

end;

else if test='A.' then input question_a;

else if test='B.' then input question_b;

else if test='C.' then input question_c;

else if test='D.' then input question_d;

if alldone then output;

run;

It's untested code, but should be at least approximatley right.

Good luck.

Contributor
Posts: 45

## Re: Messy Data

Dear Astounding,

This is works. Hats off to your writing untested code with near perfection.

This, however, has a problem I can't fix. When it reads a string with more than one word (such as a question), it reads only the first word.

Jijil

Super User
Posts: 6,762

## Re: Messy Data

For the original problem, reading just a single word, I thought the INFORMAT would take care of that.  I guess SAS needs more explicit instructions.  The INFORMAT statement isn't needed, but change 5 INPUT statements:

input question_a \$char100.;

...

input question_d \$char100.;

The issue about already having a SAS data set to begin is a different animal entirely.  What is the variable in your SAS data set?

Contributor
Posts: 45

## Re: Messy Data

There is only one variable named 'record', all of the above are strings are observations in it.

Jijil

Valued Guide
Posts: 2,191

## Re: Messy Data

If no one has explained it, have a look at the documentation of the INPUT statement and give particular attention to "the trailing @ sign"

Contributor
Posts: 45

## Re: Messy Data

Ooops... I have one more doubt:

Suppose all of rawdata is in SAS format under work.have (sas7bdaat format)

Question

What is the Capital of Malawi?

A.

Paris

B.

London

C.

Lilongwe

Question

How many prime numbers are there between 1 - 100?

A.

Three

B.

Four

C.

five

D.

Can't be determined

How do I use a "set" (instead of "infile") to achieve the same?

Jijil Ramakrishnan

Solution
‎07-29-2013 03:37 PM
Posts: 3,167

## Re: Messy Data

If it is already in SAS dataset, the following quick_n_dirty code may help:

data have;

infile cards4 truncover;

input var;

cards4;

Question

What is the Capital of Malawi?

A.

Paris

B.

London

C.

Lilongwe

Question

How many prime numbers are there between 1 - 100?

A.

Three

B.

Four

C.

five

D.

Can't be determined

;;;;

data want;

merge have have(firstobs=2 rename=var=_var) end=last;

length Question \$ 100

A \$ 100

B \$ 100

C  \$ 100

D \$ 100

;

retain A B C D question;

if _n_>1 and var='Question' or last then do;output;call missing (a, b, c, d);end;

select (var);

when("Question") do;question=_var;end;

when ("A.") do;  A=_var;end;

when ("B.") do;  B=_var;end;

when ("C.") do;  C=_var;end;

when ("D.") do;  D=_var;end;

otherwise;

end;

drop _var var;

run;

Haikuo

Occasional Contributor
Posts: 14

## Re: Messy Data

Hi ,

if the raw file is the exactly in the form as you have mentioned

you could try below code :

date temp;

infile '/path/temp.txt'  DLM='09'x DSD MISSOVER;

input mydata \$ 50;

run;

data temp_new;

lenght question \$ 50 a \$ 25 b \$ 25 c \$ 25 d \$ 25 ;

do i=1 to nobs/10;

m=2+m;

set temp point=m nobs=nobs;
question=mydata;

m=m+2;

set temp point=m nobs=nobs;
a=mydata;

m=m+2;

set temp point=m nobs=nobs;
b=mydata;

m=m+2;

set temp point=m nobs=nobs;
c=mydata;

m=m+2;

set temp point=m nobs=nobs;
d=mydata;

retain m;

output;

if nobs then;

stop

run;

i believe this will do

Occasional Contributor
Posts: 14

## Re: Messy Data

you can try the below which is quite simple

data temp;

infile '/path/file.txt' scanover;

input @'question' question \$char100.;

input @'a' a \$char20.;

input @'b' b \$char20.;

input @'c' c \$char20.;

input @'d' d \$char20.;

run;

🔒 This topic is solved and locked.