BookmarkSubscribeRSS Feed
nickb
Calcite | Level 5
I'm working on a fixed format file where the file contains multiple entities. For example, the file may contain data for a student from row 1-99 and then another student will start at 100. I was thinking about loading the whole file in a dataset and parse it with proc sql. Below is a very brief example of 2 students.

StudentA

ADVISOR:
PLACEMENT TESTS: DISMISSALS: 0
CURRENT STANDING: GS

Completion of program will be determined by completion of each
requirement, required credit hours and final Institutional GPA.

Program Status: In Progress
Current.......... Anticipated(*).......
Required Earned Remaining Additional Remaining
Institutional Credits: 30.00 87.00 0.00 0.00 0.00
Institutional GPA....: 2.000 3.132 Met
Combined Credits: 94.00 87.00 7.00 0.00 7.00
Combined GPA....: 2.000 3.132 Met


StudentB

Completion of program will be determined by completion of each
requirement, required credit hours and final Institutional GPA.

Program Status: In Progress
Current.......... Anticipated(*).......
Required Earned Remaining Additional Remaining
Institutional Credits: 30.00 49.00 0.00 8.00 0.00
Institutional GPA....: 2.000 3.174 Met
Combined Credits: 106.00 49.00 57.00 8.00 49.00
Combined GPA....: 2.000 3.174 Met
20 REPLIES 20
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
PROC SQL?

Recommend using a SAS DATA step approach, where you load the current input record buffer, using IF THEN DO / END logic, detect what specify "record format type" it is, then inside the DO/END portion have INPUT statement code to parse each "type".

Use LENGTH or ATTRIB statements to declare your SAS CHARACTER and NUMERIC formatted variables.

Also, you will likely need to use a RETAIN statement for those variables that will be input in records/rows prior to the last one where you would do your OUTPUT to create a composite observation.

Scott Barry
SBBWorks, Inc.

DATA ;
* define LENGTH/ATTRIB and RETAIN statement info here. ;
input @;
if _infile_ =: 'A' then do;
* input code here for row-type A. ;
end;
else if _infile_ =: 'B' then do;
* input code here for row-type B. ;
end;
else do;
* output logic goes here - depends on what detects no ;
* more input for the current item - maybe a blank record? ;
OUTPUT;
end;
nickb
Calcite | Level 5
What if there is unknown amount of students? In my example I listed 2 but in reality there could be 22,000 students.
Patrick
Opal | Level 21
Hi

"What if there is unknown amount of students?"

Read Scott's example carefully. It fully covers this. You might have to look up the function of an '@' used in an input statement.

Cheers
Patrick
DBailey
Lapis Lazuli | Level 10
SBR's solution seems to hard code for each student. Is there something in the input file that you could use as a trigger to indicate when a new student's records is starting?
nickb
Calcite | Level 5
In looking into the file deeper it looks like after the second row of "=====" is the "meat" of the file.

COM-206 INTERPERSONAL COM.......... 92/WI B 3
================================================================================
N) 7: MULTICULTURAL - 3 HOURS REQUIRED
> SELECT ONE COURSE FROM:
> GEO 102, 201, 202
> LIT 217, 234
> PLS 200
> PSY 225
04/04/11 Page 3
-------------------------------------------------------------------------------
> SOC 145, 215
_____________________________________________________ 1 course needed
================================================================================
N) 8: FRESHMAN EXPERIENCE - 2 HOURS REQUIRED
> ASE 101
_____________________________________________________ 1 course needed
================================================================================
C) 9: ELECTIVES - 15 HOURS REQUIRED
> ELECTIVE COURSES MAY BE SELECTED FROM THE FOLLOWING AREAS:
> - THE EMPHASIS AREA LIST
DBailey
Lapis Lazuli | Level 10
As this appears to be sort of an old school printed report...you may have to resort to old school tactics....

I'd have to look for indicators of what kind of data you have on a certain row. I can't tell where the student # is, but if you can map out the exact logic, SAS can parse files like this.

Back in the day..entire software programs were created to parse printed output files. You had to map the layouts and then the programs would intepret the map.
nickb
Calcite | Level 5
I'm starting to try to parse this and need a few tips on finding text.

For example, one piece of data that I need is a line that looks like this:
Student............: John Doe (0012345)

The row will always start with the word 'Student' and the value after the first parenthesis will be a length of 7. All I need is the number inside the parens.

data work.audit;
infile 'D:\temp\nick_tmp\DA.txt' length=len;
input row $varying200. len;
IF _infile_ ='' then delete;

IF Studentid= indexc(row, '(') then do;
input Studentid ;
END


run;
DBailey
Lapis Lazuli | Level 10
how about:

if substr(row,1,8)='Student.' then StudentId=mid(row,index(row,'(')+1,7);
Ksharp
Super User
[pre]
data temp(keep=studentid where=(studentid is not missing));
infile datalines length=len ;
input row $varying200. len;
if upcase(row) eq: 'STUDENT' then studentid=substr(row,findc(row,'(')+1,7);
datalines;
Student............: John Doe (0012345)
C) 9: ELECTIVES - 15 HOURS REQUIRED
;
run;
[/pre]

Ksharp
nickb
Calcite | Level 5
Worked great. I'm working on the last key pieces of data and could use some advice. The data that I ultimately need is the line that has data like this:
MAT-116 COLLEGE ALGEBRA............ 93/FA C 5
and so on.


(*) Anticipates completion of in-progress and registered courses
================================================================================
Statuses: W=Waived, C=Complete, I=In progress, N=Not started
P=Pending completion of unfinished activity
================================================================================
C) 1: ENGLISH COMPOSITION - 9 HOURS REQUIRED
> ENG 111, 112, 113
Credits: 9
ENG-111 ENGLISH COMPOSITION I...... 92/WI B 3
ENG-112 ENGLISH COMPOSITION II..... 92/SP B 3
ENG-113 ENGLISH COMPOSITION III.... 92/FA B 3
================================================================================
C) 2: MATHEMATICS/NATURAL & PHYSICAL SCIENCES
Credits: 31
Complete all 3 subrequirements:
C) A: MATHEMATICS
> MINIMUM ONE COURSE FROM THE FOLLOWING: MAT 116, 117, 122,
> 201, 202, 203, 204, 215, 216 OR 218
MAT-116 COLLEGE ALGEBRA............ 93/FA C 5

C) B: SCIENCE SEQUENCE
> COMPLETE A MINIMUM OF ONE SEQUENCE WITH LABS:
Credits: 12
C) BIO 111, 112, 113 WITH LABS
Credits: 12
BIO-111 GENERAL BIOLOGY I.......... 93/SP C 4
BIO-112 GENERAL BIOLOGY II......... 93/FA C 4
BIO-113 GENERAL BIOLOGY III........ 94/WI C 4

C) C: ADD MATH & SCIENCE
> ADDITIONAL HOURS OF MATH AND/OR SCIENCE TO TOTAL 29:
04/04/11 Ayanna A. Archie Page 2
-------------------------------------------------------------------------------
Credits: 14
MAT-117 TRIGONOMETRY............... 94/WI C 4
CHE-151 GENERAL CHEMISTRY I........ 94/FA B 5
MAT-201 CALCULUS + ANALY GEOM I.... 94/FA C 5
================================================================================
C) 3: SOCIAL & BEHAVIORAL SCIENCES - 15 HOURS REQUIRED
> SELECT 15 HOURS FROM AT LEAST TWO SUBJECT AREAS:
> ECO 201, 202, 203
> GEO 101, 102, 201, 202
> PLS 101, 102, 103, 104, 200, 201
> PSY 121 AND 122 OR PSY 119
> PSY 205 AND/OR 206 - OR PSY 208
> PSY 207, 217, 223, 225, 228, 242
> SOC 111 AND 112 OR SOC 120
> SOC 145, 160, 205, 208, 215, 226
Credits: 15
PSY-121 GENERAL PSYCHOLOGY I....... 92/WI C 3
PLS-101 AMERICAN FEDERAL GOVT I.... 92/SP C 3
ECO-201 PRIN OF ECONOMICS I........ 92/SP B 3
ECO-202 PRIN OF ECONOMICS II....... 93/WI C 3
PLS-102 AMERICAN FEDERAL GOVT II... 93/WI B 3
================================================================================
I) 4: ARTS & HUMANITIES - 15 HOURS REQUIRED
> SELECT 15 HOURS FROM AT LEAST TWO SUBJECT AREAS:
> ART 101, 102, 125, 231, 232, 233, 235, 236
> DAN 155, 157
> HIS 101, 102, 103, 105, 111, 112, 113, 214, 215, 216, 217
> HUM 125, 130, 131, 255
> LIT 201, 202, 203, 211, 212, 213, 217, 227, 230, 234
> MUS 115, 131, 132, 133
> PHI 204, 205, 206
> REL 111, 112, 135, 204
> THE 105, 201, 202, 203
Credits: 9 Required: 15 Remaining: 6
MUS-115 MUSIC APPRECIATION......... 93/WI B 3
HIS-111 WESTERN CIV (0-1300)....... 96/WI B 3
HIS-112 WEST CIV (1300-1815)....... 96/SP B 3
____________________________________________________ 6 credits needed
================================================================================
C) 5: COMPUTER COMPETENCY - 3 HOURS REQUIRED
> TAKE ONE COURSE FROM THE FOLLOWING:
> CIS 111, CIS/BIS 119, CIS/BIS/0IS 160
> CHE 152
> MAT 220
CIS-119 PC APPLICATIONS IN BUS..... 92/FA B 3
================================================================================
Ksharp
Super User
You do not post what you want output dataset looks like?
If you only want obs start with MIT then use:
[pre]
if upcase(row) eq: 'MIT' then output;
[/pre]
to clean data.
If you want much more obs different from 'MIT' ,Emmmm there might be to need more code.

Ksharp
nickb
Calcite | Level 5
If the data matches this format:

BIO-111 GENERAL BIOLOGY I.......... 93/SP C 4
BIO-112 GENERAL BIOLOGY II......... 93/FA C 4
BIO-113 GENERAL BIOLOGY III........ 94/WI C 4

I need a dataset that contains:

ID Name Term Status credit
BIO-111 General Biology 93/SP C 4
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
You have the DATA step DO/END with an internal INPUT statement logic, as demonstrated below, given the "data section border" you revealed earlier:

* earlier DATA step code omitted. ;
DONE=0;
DO UNTIL(DONE);
INPUT @;
IF _INFILE_ NE: '======' THEN DO;
* SOME INPUT STATEMENT CODE GOES HERE. ;
* SOME OUTPUT STATEMENT CODE GOES HERE. ;
END;
ELSE DONE=1;
END;
* later DATA step code omitted. ;

Scott Barry
SBBWorks, Inc.
nickb
Calcite | Level 5
Thanks everyone for all the help. I have everything parsed except for 1 more item.

from file: HIS-103 U.S. HIST (1919-PRESENT)... 08/FA B 3

I need this: U.S. HIST (1919-PRESENT)

I know the starting position but with this example the course name has periods in it and that through off my approach using substr and findc to find the first occurence of '..' and -1 as the end point. My new approach is to use REVERSE but it isn't working out.

IF index(row, '-')=14 and index(row, '*')=0 then
do
revrow=reverse(trim(row));
Course_Description = substr(revrow,findc(revrow, '..')+1, findc(revrow, '-')-5);
c=reverse(trim(course_description));

*Course_Description=substr(row,Course_Desc_Start,(Course_Desc_End - Course_Desc_Start));
end;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 20 replies
  • 2945 views
  • 0 likes
  • 6 in conversation