JAR
Obsidian | Level 7

Hi,

I have some messy raw data as follows:

Question
What is the Capital of Malawi?
A.
Paris
B.
London
C.
Lilongwe
Question
How many prime numbers are there between 1 - 100?
A.
Three
B.
Four
C.
five
D.
Can't be determined

How would I write a program that creates a variable Question from the line that follows each "Question" record, and similarly variables A, B, C, and D from the line that follows each of those markers?

Pardon me if I have not illustrated my question clearly.

The result should look like the attachment:

Jijil Ramakrishnan


Want.png
9 REPLIES
Astounding
PROC Star

You've illustrated the situation very clearly.  But you need to provide a little more guidance on your intended outcome.  For example, you could want an outcome that includes one observation per question, and these variables:

question

answer_a

answer_b

answer_c

answer_d

In that case, you would need to tell us the largest possible number of answers per question.

Or you might want an outcome that includes many observations per question, one for each possible answer.  The variables might be:

question

answer_code (a, b, c, d ...)

answer_text

Or you might want to design something slightly different from either of these.

That sort of information would be helpful.

Ooooops ... looks like your diagram already maps that out.  So assuming that you want one observation per question, and that four answers will be the maximum ...

data want;
   infile rawdata truncover end=alldone;
   length test $ 8;
   input test;
   length question answer_a answer_b answer_c answer_d $ 100;
   informat question answer_a answer_b answer_c answer_d $char100.;
   retain question answer_a answer_b answer_c answer_d;
   keep question answer_a answer_b answer_c answer_d;
   if test = 'Question' then do;
      /* a new question starts: write out the previous one first */
      if _n_ > 1 then output;
      answer_a = ' ';
      answer_b = ' ';
      answer_c = ' ';
      answer_d = ' ';
      input question;
   end;
   else if test='A.' then input answer_a;
   else if test='B.' then input answer_b;
   else if test='C.' then input answer_c;
   else if test='D.' then input answer_d;
   /* write out the final question */
   if alldone then output;
run;

It's untested code, but should be at least approximately right.

Good luck.
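For the second layout described above (many observations per question, one per answer), a minimal sketch could look like the following. It is untested against the real file and assumes the same `rawdata` fileref, no blank lines between records, and answer markers limited to `A.` through `D.`; the data set name `want_long` is made up for illustration.

```sas
data want_long;
   infile rawdata truncover;
   length test $ 8 question answer_text $ 100 answer_code $ 1;
   retain question;                        /* carry the question down to each answer row */
   keep question answer_code answer_text;
   input test $8.;                         /* marker record: Question, A., B., ... */
   if test = 'Question' then input question $char100.;
   else if test in ('A.', 'B.', 'C.', 'D.') then do;
      answer_code = substr(test, 1, 1);    /* 'A.' becomes 'A' */
      input answer_text $char100.;         /* the next record holds the answer text */
      output;                              /* one observation per answer */
   end;
run;
```

Because each answer is written out as soon as it is read, this layout does not need to know the maximum number of answers per question in advance.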

JAR
Obsidian | Level 7

Dear Astounding,

This works. Hats off to you for writing untested code with near perfection.

This, however, has a problem I can't fix. When it reads a string with more than one word (such as a question), it reads only the first word.

Please tell me how I can read the whole line.

Thanks in advance,

Jijil

Astounding
PROC Star

For the original problem, reading just a single word, I thought the INFORMAT would take care of that.  I guess SAS needs more explicit instructions.  The INFORMAT statement isn't needed, but change 5 INPUT statements:

input question $char100.;
input answer_a $char100.;
...
input answer_d $char100.;

The issue about already having a SAS data set to begin with is a different animal entirely.  What is the variable in your SAS data set?
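With an explicit informat on every INPUT statement, the whole step from the earlier reply might read as follows. This is a sketch, not verified against the real file: it assumes the `rawdata` fileref points at the text file and that the file has no blank lines between records.

```sas
data want;
   infile rawdata truncover end=alldone;
   length test $ 8 question answer_a answer_b answer_c answer_d $ 100;
   retain question answer_a answer_b answer_c answer_d;
   keep question answer_a answer_b answer_c answer_d;
   input test $8.;                            /* marker record */
   if test = 'Question' then do;
      if _n_ > 1 then output;                 /* finish the previous question */
      call missing(answer_a, answer_b, answer_c, answer_d);
      input question $char100.;               /* whole next record, embedded blanks included */
   end;
   else if test = 'A.' then input answer_a $char100.;
   else if test = 'B.' then input answer_b $char100.;
   else if test = 'C.' then input answer_c $char100.;
   else if test = 'D.' then input answer_d $char100.;
   if alldone then output;                    /* write out the final question */
run;
```

Formatted input with `$char100.` reads a fixed span of columns, so it does not stop at the first blank the way list input does, and TRUNCOVER keeps short records from spilling onto the next line.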

JAR
Obsidian | Level 7

There is only one variable, named 'record'; all of the strings above are observations in it.

Jijil

Peter_C
Rhodochrosite | Level 12

If no one has explained it, have a look at the documentation of the INPUT statement and give particular attention to "the trailing @ sign"
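As a small illustration of what the trailing @ does: it holds the current record so that a later INPUT statement in the same iteration keeps reading from it instead of moving to the next line. The layout below is hypothetical (marker and text on the same record, unlike the file in this thread), and the names `demo`, `tag`, and `text` are made up for the sketch.

```sas
/* Hypothetical layout: marker and text share one record */
data demo;
   infile datalines truncover;
   length tag $ 8 text $ 100;
   input tag $ @;                                  /* read the first word, then hold the record */
   if tag in ('Q:', 'A.') then input text $100.;   /* continue reading the same record */
   datalines;
Q: What is the Capital of Malawi?
A. Paris
;
run;
```

In the file from this thread the marker and its text sit on separate records, so a plain second INPUT statement (which moves to the next record) is what picks up the text.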

JAR
Obsidian | Level 7

Oops... I have one more question:

Suppose all of the raw data is already in a SAS data set, work.have (sas7bdat format):

Question
What is the Capital of Malawi?
A.
Paris
B.
London
C.
Lilongwe
Question
How many prime numbers are there between 1 - 100?
A.
Three
B.
Four
C.
five
D.
Can't be determined

How do I use a SET statement (instead of INFILE) to achieve the same result?

Thanks in advance,

Jijil Ramakrishnan

Haikuo
Onyx | Level 15

If it is already in a SAS data set, the following quick-and-dirty code may help:

data have;
  infile cards4 truncover;
  input var $100.;   /* character informat: keeps the whole line, embedded blanks included */
  cards4;
Question
What is the Capital of Malawi?
A.
Paris
B.
London
C.
Lilongwe
Question
How many prime numbers are there between 1 - 100?
A.
Three
B.
Four
C.
five
D.
Can't be determined
;;;;

data want;
  /* look-ahead merge: _var holds the value of var from the next observation */
  merge have have(firstobs=2 rename=(var=_var)) end=last;
  length Question A B C D $ 100;
  retain A B C D question;
  /* write out a finished question when the next one starts, or at the end of the data */
  if (_n_>1 and var='Question') or last then do;
    output;
    call missing(a, b, c, d);
  end;
  select (var);
    when ("Question") question=_var;
    when ("A.") A=_var;
    when ("B.") B=_var;
    when ("C.") C=_var;
    when ("D.") D=_var;
    otherwise;
  end;
  drop _var var;
run;

Haikuo
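A variation on the same idea, for readers who prefer a single SET statement to the look-ahead merge: remember the previous observation's value and use it to decide where the current value belongs. This is a sketch under the same assumptions as the code above (a data set `have` with one character variable `var`, no blank observations); `prev` is a helper variable introduced only for illustration. If the variable is called `record`, as in the original data set, substitute that name.

```sas
data want;
  set have end=last;
  length prev question a b c d $ 100;
  retain prev question a b c d;
  keep question a b c d;
  /* a new "Question" marker closes the previous question */
  if var = 'Question' and _n_ > 1 then do;
    output;
    call missing(question, a, b, c, d);
  end;
  /* the previous observation's marker says where the current text goes */
  select (prev);
    when ('Question') question = var;
    when ('A.') a = var;
    when ('B.') b = var;
    when ('C.') c = var;
    when ('D.') d = var;
    otherwise;
  end;
  prev = var;
  if last then output;   /* write out the final question */
run;
```

Either version would misfile a record whose text is itself exactly `Question` or one of the markers, so this relies on the markers never appearing as question or answer text.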

sugeshnambiar
Fluorite | Level 6

Hi,

If the raw file is exactly in the form you have shown, you could try the code below:

data temp;
  infile '/path/temp.txt' dlm='09'x dsd truncover;
  input mydata $50.;   /* TRUNCOVER keeps lines shorter than 50 characters */
run;

data temp_new;
  length question $ 50 a $ 25 b $ 25 c $ 25 d $ 25;
  /* Assumes no blank lines and exactly 10 records per question:
     each of Question, A., B., C. and D. followed by its text */
  m = 0;
  do i = 1 to nobs / 10;
    m = m + 2;
    set temp point=m nobs=nobs;
    question = mydata;
    m = m + 2;
    set temp point=m nobs=nobs;
    a = mydata;
    m = m + 2;
    set temp point=m nobs=nobs;
    b = mydata;
    m = m + 2;
    set temp point=m nobs=nobs;
    c = mydata;
    m = m + 2;
    set temp point=m nobs=nobs;
    d = mydata;
    output;
  end;
  stop;   /* required with POINT=, otherwise the step never ends */
  drop i mydata;
run;

I believe this will do.

sugeshnambiar
Fluorite | Level 6

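You can also try the code below, which is quite simple, provided every question has all four answers:

data temp;
  infile '/path/file.txt' scanover truncover;
  /* @'text' scans forward to the marker (case-sensitive);
     / then moves to the next record, which holds the value.
     If a marker is missing, the scan runs on into the next question. */
  input @'Question' / question $char100.;
  input @'A.' / a $char20.;
  input @'B.' / b $char20.;
  input @'C.' / c $char20.;
  input @'D.' / d $char20.;
run;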
