Solved: Re: Messy Data

JAR · Posted 07-29-2013 11:55 AM

Hi,

I have some messy raw data as follows:

Question

What is the Capital of Malawi?

A.

Paris

B.

London

C.

Lilongwe

Question

How many prime numbers are there between 1 - 100?

A.

Three

B.

Four

C.

five

D.

Can't be determined

How would I write a program which will extract a variable question by reading what comes after the observation "Question", and similarly variables A, B, C, D by reading what comes after each of those.

Pardon me if I have not illustrated my question clearly.

Result would look like what it is in the attachment:

Jijil Ramakrishnan

Message was edited by: Jijil Ramakrishnan

Haikuo · Posted 07-29-2013 03:37 PM

If it is already in SAS dataset, the following quick_n_dirty code may help:

data have;

infile cards4 truncover;

input var;

cards4;

Question

What is the Capital of Malawi?

A.

Paris

B.

London

C.

Lilongwe

Question

How many prime numbers are there between 1 - 100?

A.

Three

B.

Four

C.

five

D.

Can't be determined

;;;;

data want;

merge have have(firstobs=2 rename=var=_var) end=last;

length Question $ 100

A $ 100

B $ 100

C $ 100

D $ 100

;

retain A B C D question;

if _n_>1 and var='Question' or last then do;output;call missing (a, b, c, d);end;

select (var);

when("Question") do;question=_var;end;

when ("A.") do; A=_var;end;

when ("B.") do; B=_var;end;

when ("C.") do; C=_var;end;

when ("D.") do; D=_var;end;

otherwise;

end;

drop _var var;

run;

Haikuo

View solution in original post

Astounding · Posted 07-29-2013 12:06 PM

You've illustrated the situation very clearly. But you need to provide a little more guidance on your intended outcome. For example, you could want an outcome that includes one observation per question, and these variables:

question

answer_a

answer_b

answer_c

answer_d

In that case, you might want to provide the largest possible number of answers per question.

Or you might want an outcome that includes many observations per question, one for each possible answer. The variables might be:

question

answer_code (a, b, c, d ...)

answer_text

Or you might want to design something slightly different from either of these.

That sort of information would be helpful.

Ooooops ... looks like your diagram already maps that out. So assuming that you want one observation per question, and that four answers will be the maximum ...

data want;

infile rawdata truncover end=alldone;

length test $ 8;

input test;

length question answer_a answer_b answer_c answer_d $ 100;

informat question answer_a answer_b answer_c answer_d $char100.;

retain question answer_a answer_b answer_c answer_d;

keep question answer_a answer_b answer_c answer_d;

if test = 'Question' then do;

if _n_ > 1 then output;

answer_a = ' ';

answer_b = ' ';

answer_c = ' ';

answer_d = ' ';

input question;

end;

else if test='A.' then input question_a;

else if test='B.' then input question_b;

else if test='C.' then input question_c;

else if test='D.' then input question_d;

if alldone then output;

run;

It's untested code, but should be at least approximatley right.

Good luck.

JAR · Posted 07-29-2013 02:34 PM

Dear Astounding,

This is works. Hats off to your writing untested code with near perfection.

This, however, has a problem I can't fix. When it reads a string with more than one word (such as a question), it reads only the first word.

Please tell me how can I read the whole line..

Thanks in advance,

Jijil

Astounding · Posted 07-29-2013 03:08 PM

For the original problem, reading just a single word, I thought the INFORMAT would take care of that. I guess SAS needs more explicit instructions. The INFORMAT statement isn't needed, but change 5 INPUT statements:

input question_a $char100.;

...

input question_d $char100.;

The issue about already having a SAS data set to begin is a different animal entirely. What is the variable in your SAS data set?

JAR · Posted 07-29-2013 04:27 PM

There is only one variable named 'record', all of the above are strings are observations in it.

Jijil

Peter_C · Posted 07-30-2013 03:15 AM

If no one has explained it, have a look at the documentation of the INPUT statement and give particular attention to "the trailing @ sign"

JAR · Posted 07-29-2013 02:52 PM

Ooops... I have one more doubt:

Suppose all of rawdata is in SAS format under work.have (sas7bdaat format)

Question

What is the Capital of Malawi?

A.

Paris

B.

London

C.

Lilongwe

Question

How many prime numbers are there between 1 - 100?

A.

Three

B.

Four

C.

five

D.

Can't be determined

How do I use a "set" (instead of "infile") to achieve the same?

Thanks in advance,

Jijil Ramakrishnan

Haikuo · Posted 07-29-2013 03:37 PM

If it is already in SAS dataset, the following quick_n_dirty code may help:

data have;

infile cards4 truncover;

input var;

cards4;

Question

What is the Capital of Malawi?

A.

Paris

B.

London

C.

Lilongwe

Question

How many prime numbers are there between 1 - 100?

A.

Three

B.

Four

C.

five

D.

Can't be determined

;;;;

data want;

merge have have(firstobs=2 rename=var=_var) end=last;

length Question $ 100

A $ 100

B $ 100

C $ 100

D $ 100

;

retain A B C D question;

if _n_>1 and var='Question' or last then do;output;call missing (a, b, c, d);end;

select (var);

when("Question") do;question=_var;end;

when ("A.") do; A=_var;end;

when ("B.") do; B=_var;end;

when ("C.") do; C=_var;end;

when ("D.") do; D=_var;end;

otherwise;

end;

drop _var var;

run;

Haikuo

sugeshnambiar · Posted 07-30-2013 07:32 AM

Hi ,

if the raw file is the exactly in the form as you have mentioned

you could try below code :

date temp;

infile '/path/temp.txt' DLM='09'x DSD MISSOVER;

input mydata $ 50;

run;

data temp_new;

lenght question $ 50 a $ 25 b $ 25 c $ 25 d $ 25 ;

do i=1 to nobs/10;

m=2+m;

set temp point=m nobs=nobs;
question=mydata;

m=m+2;

set temp point=m nobs=nobs;
a=mydata;

m=m+2;

set temp point=m nobs=nobs;
b=mydata;

m=m+2;

set temp point=m nobs=nobs;
c=mydata;

m=m+2;

set temp point=m nobs=nobs;
d=mydata;

retain m;

output;

if nobs then;

stop

run;

i believe this will do

sugeshnambiar · Posted 07-30-2013 08:45 AM

you can try the below which is quite simple

data temp;

infile '/path/file.txt' scanover;

input @'question' question $char100.;

input @'a' a $char20.;

input @'b' b $char20.;

input @'c' c $char20.;

input @'d' d $char20.;

run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!