Solved: Re: reading multiple records per obs from a csv file

GreggB · Posted 09-06-2013 12:50 PM

I have delimited data with multiple obs per record. Plus, the number of records may vary from 3 to 8. The unique identifier only appears once per record. I want the output to read as such:

T00000000 Tom Bryson Freshman ENG 101 MAT 110

Here's a sample of the csv file

T00000000, Tom, Bryson, Freshman,

ENG 101,

MAT 110,

T00000001, Gregg, Smith, Senior,

ACC 101,

ENG 101,

MAT 110,

MAT 120,

T00000002, Roger Sparks, Junior,

ART 101

Tom · Posted 09-06-2013 01:31 PM

That is easy, you just need to read enough to figure out whether the line has the student information or the course name.

Note that your sample was poorly formed. I added a comma between first and last names for Roger Sparks. If you real data is that inconsistent then your program will need to be more complex.

data want ;

infile cards dsd truncover ;

length id $20 fname lname $25 class $10 course $10 dummy $10;

retain id fname lname class ;

input course dummy @1 @ ;

drop dummy ;

if dummy ne ' ' then input id fname lname class ;

else output;

cards;

T00000000, Tom, Bryson, Freshman,

ENG 101,

MAT 110,

T00000001, Gregg, Smith, Senior,

ACC 101,

ENG 101,

MAT 110,

MAT 120,

T00000002, Roger, Sparks, Junior,

ART 101

run;

proc print; run;

That will generate one observation per course, which is probably what you really want in order to be able to do anything with it. But if you really want to create multiple course variables so that you can get down to one observation per student then you could just add a PROC TRANSPOSE step to do that for you.

View solution in original post

ballardw · Posted 09-06-2013 12:59 PM

Does the first line actually vary in the number of elements as shown by your third record with Roger Sparks or was that just a typo in the formum and should read: Roger, Sparks ?

What do you want your output to look like?

Vince28_Statcan · Posted 09-06-2013 01:07 PM

I recommend giving a read to Bruno's response in

https://communities.sas.com/message/179168#179168

There is a lot of information as to how to handle reading multiline inputs with some dummy variables and whatnot.

I would opt for a totally different approach of reading the text as a giant stream and reformating it using PRX to match T\d{8} as a delimiter when parsing (basically reading a text file and outputing a new text file with appropriate format to read) and then read it as a normal csv, by line with enough dummy variables for the classes each student follows.

art297 · Posted 09-06-2013 01:24 PM

If you plan to do further analyses on the data, I'd think that you would be better off with a different resulting output file, namely multiple records for each id. The following approach first produces that kind of file and then transposes it into the kind of output file you indicated that you wanted:

filename have 'c:\art\have.csv';

data need;

infile have;

retain id first last grade;

input;

if substr(_infile_,1,1) eq 'T' then do;

id=scan(_infile_,1,',');

first=scan(_infile_,2,',');

last=scan(_infile_,3,',');

grade=scan(_infile_,4,',');

end;

else do;

course=compress(strip(_infile_),',');

output;

end;

run;

proc transpose data=need out=want (drop=_:) prefix=course;

var course;

by id first last grade notsorted;

run;

Tom · Posted 09-06-2013 01:31 PM

That is easy, you just need to read enough to figure out whether the line has the student information or the course name.

Note that your sample was poorly formed. I added a comma between first and last names for Roger Sparks. If you real data is that inconsistent then your program will need to be more complex.

data want ;

infile cards dsd truncover ;

length id $20 fname lname $25 class $10 course $10 dummy $10;

retain id fname lname class ;

input course dummy @1 @ ;

drop dummy ;

if dummy ne ' ' then input id fname lname class ;

else output;

cards;

T00000000, Tom, Bryson, Freshman,

ENG 101,

MAT 110,

T00000001, Gregg, Smith, Senior,

ACC 101,

ENG 101,

MAT 110,

MAT 120,

T00000002, Roger, Sparks, Junior,

ART 101

run;

proc print; run;

That will generate one observation per course, which is probably what you really want in order to be able to do anything with it. But if you really want to create multiple course variables so that you can get down to one observation per student then you could just add a PROC TRANSPOSE step to do that for you.

GreggB · Posted 09-06-2013 01:50 PM

This one worked. Thank you. I haven't tried the other replies yet.

reading multiple records per obs from a csv file

Re: reading multiple records per obs from a csv file

Re: reading multiple records per obs from a csv file

Re: reading multiple records per obs from a csv file

Re: reading multiple records per obs from a csv file

Re: reading multiple records per obs from a csv file

Re: reading multiple records per obs from a csv file

Registration is open

SAS Training: Just a Click Away