BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Wei2017
Calcite | Level 5

Hi ,

 

I have a unfriendly text file like below:

 

Math Exam
Student Report
Calculus I
Fall 2016
Student: Eric
ID: 1234567
Date of Birth: 7/1/1990
Grade: 95
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison

Description : This is an easy exam

 

Math Exam
Student Report
Calculus I
Fall 2016
Student: Amy
ID: 1234519
Date of Birth: 12/18/1990
Grade: 77
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison

Description : This is a real tough exam


Math Exam
Student Report
Calculus I
Fall 2016
Student: John
ID: 1234569
Date of Birth: 3/1/1991
Grade: 90
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison

Description : This is an easy exam

 

Math Exam
Student Report
Calculus I
Fall 2016
Student: Ava
ID: 1234523
Date of Birth: 9/13/1992
Grade: 89
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison

Description : This is a not a hard exam

 

----------------------------

I have read tons of papers to extract data from text file, like https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012. I tried their ways,  but still have no idea how to extract

1 course name

2 student name

3 student id number 

4 student date of birth 

5 grade

6 report date 

from such text file, can anyone please help. 

 

 -------------------

my final goal is to obtain a dataset like below 

Calculus I   Eric   1234567  7/1/1990       95 12/20/2016

Calculus I   Amy  1234519  12/18/1990   77 12/20/2016

Calculus I   John  1234569  3/1/1991      90 12/20/2016

Calculus I   Ava   1234523   9/13/1992    89 12/20/2016

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Define your variables. Don't force SAS to guess what you meant them to be by how they first appear in the code.

Don't bother with all of those nested IF/THEN statements. Let the natural SAS processing do that work for you. Take advantage of the power of the INPUT statement.  

data temp;
  infile 'exam.txt' truncover dsd dlm=':';
  length Course $30 StudentName $30 Grade Reportdate 8 ID $10 SchoolCode $8 School $30 DoB 8 District $30 ;
  format ReportDate DoB mmddyy10. ;
  retain Course -- District; 
  length chk $100 ;
  input chk @; 
  if chk = 'Calculus' then input Course ;
  if chk = 'Student' then input StudentName ;
  if chk = 'ID' then input ID ;
  if chk = 'Grade' then input Grade ;
  if chk = 'Date of Birth' then input Dob mmddyy12. ;
  if chk = 'School' then input School ;
  if chk = 'District' then input District ;
  if chk ='Report Date' then do; 
    input Reportdate mmddyy12. ;
    schoolcode =scan(school,1,' ');
    school = substr(school,length(schoolcode)+2);
    output;
    call missing(of _all_);
  end;
run;

Note that SAS can normally read date values with 2 digit years and make the right choice is to what century is meant, but why would want to print your dates without the century?  You are just inviting trouble later.

View solution in original post

20 REPLIES 20
Reeza
Super User

Do the rows repeat at all, or are you reading a single record from a file?

I think you need to provide more information.

 

You can read from a multi lines in SAS using an INPUT statement.

Wei2017
Calcite | Level 5

Hi, I add more detailed information about this.

 

There are multiple students in the text file. I listed four students just as an example.

 

 

Reeza
Super User

Here's the mechanism for how you'd do this:

https://stats.idre.ucla.edu/sas/code/reading-a-multiple-line-per-subject-data-file/

 

If you've tried something make sure to include it, otherwise that will likely be the first suggestion you get and it's a waste of everyones time. 

HB
Barite | Level 11 HB
Barite | Level 11

@Reeza That's an excellent reference.  Saving that in case i need it!

 

@@@Wei2017  That text looks a lot like the structure of an XML document without the tags.  If whereever you are getting that text dump from could give you that as XML instead, it might read easier.

 

<exam_record>

<examtype>Math</examtype>
<reporttype>Student</reporttype>
<subject>Calculus I</subject>
<date>Fall 2016</date>
<studentname>Eric</studentname>
<studentid>1234567</studentid>
<dob>7/1/1990</dob>
<grade>95</grade>

<reportdate>12/20/2016</reportdate>
<school>123 Edison</school>
<district>123 Edison</district>

<description>This is an easy exam</description>

 </exam_record>

 

 

Wei2017
Calcite | Level 5

Thank you for your link, but that is not what i am looking for. 

I have tried https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, it is a little bit similar to my question.

error_prone
Barite | Level 11

The link posted by @Reeza has exactly what you need to solve the issue. One obersavation in your requested output dataset has the data of multiple lines from your source file. 

Reeza
Super User

If you don't post what you've tried then you're going to get suggestions that may not work. 

And it should work so you're likely doing something wrong, but we can't tell you since you haven't shown us any code. 

Or if you did do it correctly it would give us something to start with rather than starting from scratch. 

 


@Wei2017 wrote:

Thank you for your link, but that is not what i am looking for. 

I have tried https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, it is a little bit similar to my question.


 

Wei2017
Calcite | Level 5

Apologise for not posting my code and result part. 

 

@Reeza @HB @Tom @ballardw @error_prone

 

Here is my code and what i get in sas:

 Result.png

 

data temp;
infile 'exam.txt';
retain Course StudentName Grade Reportdate ID School DoB District;
input;
if _infile_ =: 'Calculus' then Course = scan(_infile_, 1, '\\');
  else if _infile_ =: 'Student' then StudentName = scan(_infile_, -1, ':');
   else if _infile_ =: 'ID' then ID = scan(_infile_, -1, ':');
    else if _infile_ =: 'Grade' then Grade = scan(_infile_, -1, ':');
     else if _infile_ =: 'Date of Birth' then Dob = scan(_infile_, -1, ':');
      else if _infile_ =: 'School' then School = scan(_infile_, -1, ':');
       else if _infile_ =: 'District' then District = scan(_infile_, -1, ':');
          else if (_infile_)=:'Report Date' then do 
            Reportdate = scan(_infile_,-1,':');
         output;
         end;
run;

proc print data = temp;
run;

 

I am still get lost in sas.

(1) Why is there something missing in the first row?

(2) How to convert Reportdate format into mmddyy? eg convert 12/20/2016 to 12/20/16

(3) Is there some ways that could split School into School code and School name?  eg "123 Edision" into "123" and "Edison"

 

Thank you for your time reading this. 

Tom
Super User Tom
Super User

Define your variables. Don't force SAS to guess what you meant them to be by how they first appear in the code.

Don't bother with all of those nested IF/THEN statements. Let the natural SAS processing do that work for you. Take advantage of the power of the INPUT statement.  

data temp;
  infile 'exam.txt' truncover dsd dlm=':';
  length Course $30 StudentName $30 Grade Reportdate 8 ID $10 SchoolCode $8 School $30 DoB 8 District $30 ;
  format ReportDate DoB mmddyy10. ;
  retain Course -- District; 
  length chk $100 ;
  input chk @; 
  if chk = 'Calculus' then input Course ;
  if chk = 'Student' then input StudentName ;
  if chk = 'ID' then input ID ;
  if chk = 'Grade' then input Grade ;
  if chk = 'Date of Birth' then input Dob mmddyy12. ;
  if chk = 'School' then input School ;
  if chk = 'District' then input District ;
  if chk ='Report Date' then do; 
    input Reportdate mmddyy12. ;
    schoolcode =scan(school,1,' ');
    school = substr(school,length(schoolcode)+2);
    output;
    call missing(of _all_);
  end;
run;

Note that SAS can normally read date values with 2 digit years and make the right choice is to what century is meant, but why would want to print your dates without the century?  You are just inviting trouble later.

Wei2017
Calcite | Level 5

Thank you for your code and exmplanation. The code is so awesome. I impleneted in my jupyter notebook, why some part is still missing like SchoolCode for Eric?  

 

Thank you.

 

Result_1.png

Tom
Super User Tom
Super User

@Wei2017 wrote:

Thank you for your code and exmplanation. The code is so awesome. I impleneted in my jupyter notebook, why some part is still missing like SchoolCode for Eric?  

 

Thank you.

 

 


Most likely the line with the SCHOOL on it is formatted slightly differently or does not exist for that block of lines. Watch out for changes in case, extra spaces, or things like tabs and other characters that look like spaces but aren't.

error_prone
Barite | Level 11

Can you explain "The expected results like below :" further? Do you need a dataset, a report in the posted format, the result printed in the log?

Wei2017
Calcite | Level 5

Yes. 

my final goal is to obtain a dataset like below 

Calculus I   Eric   1234567  7/1/1990       95 12/20/2016

Calculus I   Amy  1234519  12/18/1990   77 12/20/2016

Calculus I   John  1234569  3/1/1991      90 12/20/2016

Calculus I   Ava   1234523   9/13/1992    89 12/20/2016

 

I have tried example like https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, but still have no idea how to deal with this one. Thank you.

HB
Barite | Level 11 HB
Barite | Level 11

EDIT:  I don't know why this answer formatted this way, with no breaks in the SAS code and a scroll on the last piece of text. 

 

 

Given a text file like:

Math Exam
Student Report
Calculus I
Fall 2016
Student: Eric
ID: 1234567
Date of Birth: 7/1/1990
Grade: 95
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is an easy exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: Amy
ID: 1234519
Date of Birth: 12/18/1990
Grade: 77
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is a real tough exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: John
ID: 1234569
Date of Birth: 3/1/1991
Grade: 90
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is an easy exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: Ava
ID: 1234523
Date of Birth: 9/13/1992
Grade: 89
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is a not a hard exam

SAS code like

data incoming_text;
infile 'J:\some location where your file is\unfriendly_text_file.txt';
input 
exam_type & $20.

#2 report_type & $20.

#3 subject & $20.

#4 mydate & $20.
#5 @10 student & $20.
#6 @5 id & $20.
#7 @16 dob & $20.
#8 @8 grade & $20.
#9 @14 reportdate & $20.
#10 @9 school & $20.
#11 @11 district & $20.
#12 @15 description & $45
.;
run;

will give you

exam_type report_type subject mydate student id dob grade reportdate school district description
Math Exam Student Report Calculus I Fall 2016 Eric 1234567 7/1/1990 95 12/20/2016 123 Edison 123 Edison This is an easy exam
Math Exam Student Report Calculus I Fall 2016 Amy 1234519 12/18/1990 77 12/20/2016 123 Edison 123 Edison This is a real tough exam
Math Exam Student Report Calculus I Fall 2016 John 1234569 3/1/1991 90 12/20/2016 123 Edison 123 Edison This is an easy exam
Math Exam Student Report Calculus I Fall 2016 Ava 1234523 9/13/1992 89 12/20/2016 123 Edison 123 Edison This is a not a hard exam

 

 

 

You can mess around with the read  instructions and read only the lines you want or just read everything and drop the variables you don't want.  You can also read things properly as dates and numbers and not generic character variables of length 20 as I have.

 

Maybe like

#6 @5 id & 7.
#7 @16 dob:mmddyy10.

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 20 replies
  • 2682 views
  • 2 likes
  • 6 in conversation