Hi ,
I have a unfriendly text file like below:
Math Exam
Student Report
Calculus I
Fall 2016
Student: Eric
ID: 1234567
Date of Birth: 7/1/1990
Grade: 95
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is an easy exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: Amy
ID: 1234519
Date of Birth: 12/18/1990
Grade: 77
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is a real tough exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: John
ID: 1234569
Date of Birth: 3/1/1991
Grade: 90
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is an easy exam
Math Exam
Student Report
Calculus I
Fall 2016
Student: Ava
ID: 1234523
Date of Birth: 9/13/1992
Grade: 89
Report Date: 12/20/2016
School: 123 Edison
District: 123 Edison
Description : This is a not a hard exam
----------------------------
I have read tons of papers to extract data from text file, like https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012. I tried their ways, but still have no idea how to extract
1 course name
2 student name
3 student id number
4 student date of birth
5 grade
6 report date
from such text file, can anyone please help.
-------------------
my final goal is to obtain a dataset like below
Calculus I Eric 1234567 7/1/1990 95 12/20/2016
Calculus I Amy 1234519 12/18/1990 77 12/20/2016
Calculus I John 1234569 3/1/1991 90 12/20/2016
Calculus I Ava 1234523 9/13/1992 89 12/20/2016
Define your variables. Don't force SAS to guess what you meant them to be by how they first appear in the code.
Don't bother with all of those nested IF/THEN statements. Let the natural SAS processing do that work for you. Take advantage of the power of the INPUT statement.
data temp;
  infile 'exam.txt' truncover dsd dlm=':';
  length Course $30 StudentName $30 Grade Reportdate 8 ID $10 SchoolCode $8 School $30 DoB 8 District $30 ;
  format ReportDate DoB mmddyy10. ;
  retain Course -- District; 
  length chk $100 ;
  input chk @; 
  if chk = 'Calculus' then input Course ;
  if chk = 'Student' then input StudentName ;
  if chk = 'ID' then input ID ;
  if chk = 'Grade' then input Grade ;
  if chk = 'Date of Birth' then input Dob mmddyy12. ;
  if chk = 'School' then input School ;
  if chk = 'District' then input District ;
  if chk ='Report Date' then do; 
    input Reportdate mmddyy12. ;
    schoolcode =scan(school,1,' ');
    school = substr(school,length(schoolcode)+2);
    output;
    call missing(of _all_);
  end;
run;Note that SAS can normally read date values with 2 digit years and make the right choice is to what century is meant, but why would want to print your dates without the century? You are just inviting trouble later.
Do the rows repeat at all, or are you reading a single record from a file?
I think you need to provide more information.
You can read from a multi lines in SAS using an INPUT statement.
Hi, I add more detailed information about this.
There are multiple students in the text file. I listed four students just as an example.
Here's the mechanism for how you'd do this:
https://stats.idre.ucla.edu/sas/code/reading-a-multiple-line-per-subject-data-file/
If you've tried something make sure to include it, otherwise that will likely be the first suggestion you get and it's a waste of everyones time.
@Reeza That's an excellent reference. Saving that in case i need it!
@@@Wei2017 That text looks a lot like the structure of an XML document without the tags. If whereever you are getting that text dump from could give you that as XML instead, it might read easier.
<exam_record>
<examtype>Math</examtype>
<reporttype>Student</reporttype>
<subject>Calculus I</subject>
<date>Fall 2016</date>
<studentname>Eric</studentname>
<studentid>1234567</studentid>
<dob>7/1/1990</dob>
<grade>95</grade>
<reportdate>12/20/2016</reportdate>
<school>123 Edison</school>
<district>123 Edison</district>
<description>This is an easy exam</description>
</exam_record>
Thank you for your link, but that is not what i am looking for.
I have tried https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, it is a little bit similar to my question.
The link posted by @Reeza has exactly what you need to solve the issue. One obersavation in your requested output dataset has the data of multiple lines from your source file.
If you don't post what you've tried then you're going to get suggestions that may not work.
And it should work so you're likely doing something wrong, but we can't tell you since you haven't shown us any code.
Or if you did do it correctly it would give us something to start with rather than starting from scratch.
@Wei2017 wrote:
Thank you for your link, but that is not what i am looking for.
I have tried https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, it is a little bit similar to my question.
Apologise for not posting my code and result part.
@Reeza @HB @Tom @ballardw @error_prone
Here is my code and what i get in sas:
 
data temp;
infile 'exam.txt';
retain Course StudentName Grade Reportdate ID School DoB District;
input;
if _infile_ =: 'Calculus' then Course = scan(_infile_, 1, '\\');
  else if _infile_ =: 'Student' then StudentName = scan(_infile_, -1, ':');
   else if _infile_ =: 'ID' then ID = scan(_infile_, -1, ':');
    else if _infile_ =: 'Grade' then Grade = scan(_infile_, -1, ':');
     else if _infile_ =: 'Date of Birth' then Dob = scan(_infile_, -1, ':');
      else if _infile_ =: 'School' then School = scan(_infile_, -1, ':');
       else if _infile_ =: 'District' then District = scan(_infile_, -1, ':');
          else if (_infile_)=:'Report Date' then do 
            Reportdate = scan(_infile_,-1,':');
         output;
         end;
run;
proc print data = temp;
run;
I am still get lost in sas.
(1) Why is there something missing in the first row?
(2) How to convert Reportdate format into mmddyy? eg convert 12/20/2016 to 12/20/16
(3) Is there some ways that could split School into School code and School name? eg "123 Edision" into "123" and "Edison"
Thank you for your time reading this.
Define your variables. Don't force SAS to guess what you meant them to be by how they first appear in the code.
Don't bother with all of those nested IF/THEN statements. Let the natural SAS processing do that work for you. Take advantage of the power of the INPUT statement.
data temp;
  infile 'exam.txt' truncover dsd dlm=':';
  length Course $30 StudentName $30 Grade Reportdate 8 ID $10 SchoolCode $8 School $30 DoB 8 District $30 ;
  format ReportDate DoB mmddyy10. ;
  retain Course -- District; 
  length chk $100 ;
  input chk @; 
  if chk = 'Calculus' then input Course ;
  if chk = 'Student' then input StudentName ;
  if chk = 'ID' then input ID ;
  if chk = 'Grade' then input Grade ;
  if chk = 'Date of Birth' then input Dob mmddyy12. ;
  if chk = 'School' then input School ;
  if chk = 'District' then input District ;
  if chk ='Report Date' then do; 
    input Reportdate mmddyy12. ;
    schoolcode =scan(school,1,' ');
    school = substr(school,length(schoolcode)+2);
    output;
    call missing(of _all_);
  end;
run;Note that SAS can normally read date values with 2 digit years and make the right choice is to what century is meant, but why would want to print your dates without the century? You are just inviting trouble later.
Thank you for your code and exmplanation. The code is so awesome. I impleneted in my jupyter notebook, why some part is still missing like SchoolCode for Eric?
Thank you.
@Wei2017 wrote:
Thank you for your code and exmplanation. The code is so awesome. I impleneted in my jupyter notebook, why some part is still missing like SchoolCode for Eric?
Thank you.
Most likely the line with the SCHOOL on it is formatted slightly differently or does not exist for that block of lines. Watch out for changes in case, extra spaces, or things like tabs and other characters that look like spaces but aren't.
Can you explain "The expected results like below :" further? Do you need a dataset, a report in the posted format, the result printed in the log?
Yes.
my final goal is to obtain a dataset like below
Calculus I Eric 1234567 7/1/1990 95 12/20/2016
Calculus I Amy 1234519 12/18/1990 77 12/20/2016
Calculus I John 1234569 3/1/1991 90 12/20/2016
Calculus I Ava 1234523 9/13/1992 89 12/20/2016
I have tried example like https://communities.sas.com/t5/SAS-Procedures/Extract-data-from-text-file/td-p/278012, but still have no idea how to deal with this one. Thank you.
EDIT: I don't know why this answer formatted this way, with no breaks in the SAS code and a scroll on the last piece of text.
Given a text file like:
Math Exam Student Report Calculus I Fall 2016 Student: Eric ID: 1234567 Date of Birth: 7/1/1990 Grade: 95 Report Date: 12/20/2016 School: 123 Edison District: 123 Edison Description : This is an easy exam Math Exam Student Report Calculus I Fall 2016 Student: Amy ID: 1234519 Date of Birth: 12/18/1990 Grade: 77 Report Date: 12/20/2016 School: 123 Edison District: 123 Edison Description : This is a real tough exam Math Exam Student Report Calculus I Fall 2016 Student: John ID: 1234569 Date of Birth: 3/1/1991 Grade: 90 Report Date: 12/20/2016 School: 123 Edison District: 123 Edison Description : This is an easy exam Math Exam Student Report Calculus I Fall 2016 Student: Ava ID: 1234523 Date of Birth: 9/13/1992 Grade: 89 Report Date: 12/20/2016 School: 123 Edison District: 123 Edison Description : This is a not a hard exam
SAS code like
data incoming_text;
infile 'J:\some location where your file is\unfriendly_text_file.txt';
input 
exam_type & $20.
#2 report_type & $20.
#3 subject & $20.
#4 mydate & $20.
#5 @10 student & $20.
#6 @5 id & $20.
#7 @16 dob & $20.
#8 @8 grade & $20.
#9 @14 reportdate & $20.
#10 @9 school & $20.
#11 @11 district & $20.
#12 @15 description & $45
.;
run;will give you
| exam_type | report_type | subject | mydate | student | id | dob | grade | reportdate | school | district | description | 
| Math Exam | Student Report | Calculus I | Fall 2016 | Eric | 1234567 | 7/1/1990 | 95 | 12/20/2016 | 123 Edison | 123 Edison | This is an easy exam | 
| Math Exam | Student Report | Calculus I | Fall 2016 | Amy | 1234519 | 12/18/1990 | 77 | 12/20/2016 | 123 Edison | 123 Edison | This is a real tough exam | 
| Math Exam | Student Report | Calculus I | Fall 2016 | John | 1234569 | 3/1/1991 | 90 | 12/20/2016 | 123 Edison | 123 Edison | This is an easy exam | 
| Math Exam | Student Report | Calculus I | Fall 2016 | Ava | 1234523 | 9/13/1992 | 89 | 12/20/2016 | 123 Edison | 123 Edison | This is a not a hard exam | 
You can mess around with the read instructions and read only the lines you want or just read everything and drop the variables you don't want. You can also read things properly as dates and numbers and not generic character variables of length 20 as I have.
Maybe like
#6 @5 id & 7.
#7 @16 dob:mmddyy10.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
