Calcite | Level 5

## Extract data from a unfriendly text file

Hi ,

I have a unfriendly text file like below：

Math Exam
Student Report
Calculus I
Fall 2016
Student: Eric
ID: 1234567
Date of Birth: 7/1/1990
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison

Description : This is an easy exam

Math Exam
Student Report
Calculus I
Fall 2016
Student: Amy
ID: 1234519
Date of Birth: 12/18/1990
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison

Description : This is a real tough exam

Math Exam
Student Report
Calculus I
Fall 2016
Student: John
ID: 1234569
Date of Birth: 3/1/1991
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison

Description : This is an easy exam

Math Exam
Student Report
Calculus I
Fall 2016
Student: Ava
ID: 1234523
Date of Birth: 9/13/1992
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison

Description : This is a not a hard exam

－－－－－－－－－－－－－－－－－－－－－－－－－－－－

I have read tons of papers to extract data from text file, like https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012. I tried their ways,  but still have no idea how to extract

1 course name

2 student name

3 student id number

4 student date of birth

6 report date

－－－－－－－－－－－－－－－－－－－

my final goal is to obtain a dataset like below

Calculus I   Eric   1234567  7/1/1990       95 12/20/2016

Calculus I   Amy  1234519  12/18/1990   77 12/20/2016

Calculus I   John  1234569  3/1/1991      90 12/20/2016

Calculus I   Ava   1234523   9/13/1992    89 12/20/2016

1 ACCEPTED SOLUTION

Accepted Solutions
Super User

## Re: Extract data from a unfriendly text file

Define your variables. Don't force SAS to guess what you meant them to be by how they first appear in the code.

Don't bother with all of those nested IF/THEN statements. Let the natural SAS processing do that work for you. Take advantage of the power of the INPUT statement.

``````data temp;
infile 'exam.txt' truncover dsd dlm=':';
length Course \$30 StudentName \$30 Grade Reportdate 8 ID \$10 SchoolCode \$8 School \$30 DoB 8 District \$30 ;
format ReportDate DoB mmddyy10. ;
retain Course -- District;
length chk \$100 ;
input chk @;
if chk = 'Calculus' then input Course ;
if chk = 'Student' then input StudentName ;
if chk = 'ID' then input ID ;
if chk = 'Date of Birth' then input Dob mmddyy12. ;
if chk = 'School' then input School ;
if chk = 'District' then input District ;
if chk ='Report Date' then do;
input Reportdate mmddyy12. ;
schoolcode =scan(school,1,' ');
school = substr(school,length(schoolcode)+2);
output;
call missing(of _all_);
end;
run;``````

Note that SAS can normally read date values with 2 digit years and make the right choice is to what century is meant, but why would want to print your dates without the century?  You are just inviting trouble later.

20 REPLIES 20
Super User

## Re: Extract data from a unfriendly text file

Do the rows repeat at all, or are you reading a single record from a file?

You can read from a multi lines in SAS using an INPUT statement.

Calcite | Level 5

## Re: Extract data from a unfriendly text file

There are multiple students in the text file. I listed four students just as an example.

Super User

## Re: Extract data from a unfriendly text file

Here's the mechanism for how you'd do this:

If you've tried something make sure to include it, otherwise that will likely be the first suggestion you get and it's a waste of everyones time.

Barite | Level 11

## Re: Extract data from a unfriendly text file

@Reeza That's an excellent reference.  Saving that in case i need it!

@@@Wei2017  That text looks a lot like the structure of an XML document without the tags.  If whereever you are getting that text dump from could give you that as XML instead, it might read easier.

<exam_record>

<examtype>Math</examtype>
<reporttype>Student</reporttype>
<subject>Calculus I</subject>
<date>Fall 2016</date>
<studentname>Eric</studentname>
<studentid>1234567</studentid>
<dob>7/1/1990</dob>

<reportdate>12/20/2016</reportdate>
<school>123 Edison</school>
<district>123 Edison</district>

<description>This is an easy exam</description>

</exam_record>

Calcite | Level 5

## Re: Extract data from a unfriendly text file

Thank you for your link, but that is not what i am looking for.

I have tried https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, it is a little bit similar to my question.

Barite | Level 11

## Re: Extract data from a unfriendly text file

The link posted by @Reeza has exactly what you need to solve the issue. One obersavation in your requested output dataset has the data of multiple lines from your source file.

Super User

## Re: Extract data from a unfriendly text file

If you don't post what you've tried then you're going to get suggestions that may not work.

And it should work so you're likely doing something wrong, but we can't tell you since you haven't shown us any code.

Or if you did do it correctly it would give us something to start with rather than starting from scratch.

@Wei2017 wrote:

Thank you for your link, but that is not what i am looking for.

I have tried https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, it is a little bit similar to my question.

Calcite | Level 5

## Re: Extract data from a unfriendly text file

Apologise for not posting my code and result part.

Here is my code and what i get in sas:

```data temp;
infile 'exam.txt';
retain Course StudentName Grade Reportdate ID School DoB District;
input;
if _infile_ =: 'Calculus' then Course = scan(_infile_, 1, '\\');
else if _infile_ =: 'Student' then StudentName = scan(_infile_, -1, ':');
else if _infile_ =: 'ID' then ID = scan(_infile_, -1, ':');
else if _infile_ =: 'Date of Birth' then Dob = scan(_infile_, -1, ':');
else if _infile_ =: 'School' then School = scan(_infile_, -1, ':');
else if _infile_ =: 'District' then District = scan(_infile_, -1, ':');
else if (_infile_)=:'Report Date' then do
Reportdate = scan(_infile_,-1,':');
output;
end;
run;

proc print data = temp;
run;```

I am still get lost in sas.

(1) Why is there something missing in the first row?

(2) How to convert Reportdate format into mmddyy? eg convert 12/20/2016 to 12/20/16

(3) Is there some ways that could split School into School code and School name?  eg "123 Edision" into "123" and "Edison"

Super User

## Re: Extract data from a unfriendly text file

Define your variables. Don't force SAS to guess what you meant them to be by how they first appear in the code.

Don't bother with all of those nested IF/THEN statements. Let the natural SAS processing do that work for you. Take advantage of the power of the INPUT statement.

``````data temp;
infile 'exam.txt' truncover dsd dlm=':';
length Course \$30 StudentName \$30 Grade Reportdate 8 ID \$10 SchoolCode \$8 School \$30 DoB 8 District \$30 ;
format ReportDate DoB mmddyy10. ;
retain Course -- District;
length chk \$100 ;
input chk @;
if chk = 'Calculus' then input Course ;
if chk = 'Student' then input StudentName ;
if chk = 'ID' then input ID ;
if chk = 'Date of Birth' then input Dob mmddyy12. ;
if chk = 'School' then input School ;
if chk = 'District' then input District ;
if chk ='Report Date' then do;
input Reportdate mmddyy12. ;
schoolcode =scan(school,1,' ');
school = substr(school,length(schoolcode)+2);
output;
call missing(of _all_);
end;
run;``````

Note that SAS can normally read date values with 2 digit years and make the right choice is to what century is meant, but why would want to print your dates without the century?  You are just inviting trouble later.

Calcite | Level 5

## Re: Extract data from a unfriendly text file

Thank you for your code and exmplanation. The code is so awesome. I impleneted in my jupyter notebook, why some part is still missing like SchoolCode for Eric?

Thank you.

Super User

## Re: Extract data from a unfriendly text file

@Wei2017 wrote:

Thank you for your code and exmplanation. The code is so awesome. I impleneted in my jupyter notebook, why some part is still missing like SchoolCode for Eric?

Thank you.

Most likely the line with the SCHOOL on it is formatted slightly differently or does not exist for that block of lines. Watch out for changes in case, extra spaces, or things like tabs and other characters that look like spaces but aren't.

Barite | Level 11

## Re: Extract data from a unfriendly text file

Can you explain "The expected results like below :" further? Do you need a dataset, a report in the posted format, the result printed in the log?

Calcite | Level 5

## Re: Extract data from a unfriendly text file

Yes.

my final goal is to obtain a dataset like below

Calculus I   Eric   1234567  7/1/1990       95 12/20/2016

Calculus I   Amy  1234519  12/18/1990   77 12/20/2016

Calculus I   John  1234569  3/1/1991      90 12/20/2016

Calculus I   Ava   1234523   9/13/1992    89 12/20/2016

I have tried example like https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, but still have no idea how to deal with this one. Thank you.

Barite | Level 11

## Re: Extract data from a unfriendly text file

EDIT:  I don't know why this answer formatted this way, with no breaks in the SAS code and a scroll on the last piece of text.

Given a text file like:

```Math Exam
Student Report
Calculus I
Fall 2016
Student: Eric
ID: 1234567
Date of Birth: 7/1/1990
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is an easy exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: Amy
ID: 1234519
Date of Birth: 12/18/1990
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is a real tough exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: John
ID: 1234569
Date of Birth: 3/1/1991
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is an easy exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: Ava
ID: 1234523
Date of Birth: 9/13/1992
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is a not a hard exam```

SAS code like

``````data incoming_text;
infile 'J:\some location where your file is\unfriendly_text_file.txt';
input
exam_type & \$20.

#2 report_type & \$20.

#3 subject & \$20.

#4 mydate & \$20.
#5 @10 student & \$20.
#6 @5 id & \$20.
#7 @16 dob & \$20.
#9 @14 reportdate & \$20.
#10 @9 school & \$20.
#11 @11 district & \$20.
#12 @15 description & \$45
.;
run;``````

will give you

 exam_type report_type subject mydate student id dob grade reportdate school district description Math Exam Student Report Calculus I Fall 2016 Eric 1234567 7/1/1990 95 12/20/2016 123 Edison 123 Edison This is an easy exam Math Exam Student Report Calculus I Fall 2016 Amy 1234519 12/18/1990 77 12/20/2016 123 Edison 123 Edison This is a real tough exam Math Exam Student Report Calculus I Fall 2016 John 1234569 3/1/1991 90 12/20/2016 123 Edison 123 Edison This is an easy exam Math Exam Student Report Calculus I Fall 2016 Ava 1234523 9/13/1992 89 12/20/2016 123 Edison 123 Edison This is a not a hard exam

You can mess around with the read  instructions and read only the lines you want or just read everything and drop the variables you don't want.  You can also read things properly as dates and numbers and not generic character variables of length 20 as I have.

Maybe like

#6 @5 id & 7.
#7 @16 dob:mmddyy10.

Discussion stats
• 20 replies
• 2361 views
• 2 likes
• 6 in conversation