BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
GreggB
Pyrite | Level 9

I have delimited data with multiple obs per record.  Plus, the number of records may vary from 3 to 8.  The unique identifier only appears once per record.  I want the output to read as such:

T00000000 Tom Bryson  Freshman ENG 101 MAT 110

Here's a sample of the csv file

T00000000, Tom, Bryson, Freshman,

                    ENG 101,

                  MAT 110,

T00000001, Gregg, Smith, Senior,

                  ACC 101,

                  ENG  101,

                  MAT 110,

                  MAT 120,

T00000002, Roger Sparks, Junior,

                  ART  101

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

That is easy, you just need to read enough to figure out whether the line has the student information or the course name.

Note that your sample was poorly formed.  I added a comma between first and last names for Roger Sparks.  If you real data is that inconsistent then your program will need to be more complex.

data want ;

  infile cards dsd truncover ;

  length id $20 fname lname $25 class $10 course $10 dummy $10;

  retain id fname lname class ;

  input course dummy @1 @ ;

  drop dummy ;

  if dummy ne ' ' then input id fname lname class ;

  else output;

cards;

T00000000, Tom, Bryson, Freshman,

                    ENG 101,

                  MAT 110,

T00000001, Gregg, Smith, Senior,

                  ACC 101,

                  ENG  101,

                  MAT 110,

                  MAT 120,

T00000002, Roger, Sparks, Junior,

                  ART  101

run;

proc print; run;

That will generate one observation per course, which is probably what you really want in order to be able to do anything with it.  But if you really want to create multiple course variables so that you can get down to one observation per student then you could just add a PROC TRANSPOSE step to do that for you.

View solution in original post

5 REPLIES 5
ballardw
Super User

Does the first line actually vary in the number of elements as shown by your third record with Roger Sparks or was that just a typo in the formum and should read: Roger, Sparks ?

What do you want your output to look like?


Vince28_Statcan
Quartz | Level 8

I recommend giving a read to Bruno's response in

https://communities.sas.com/message/179168#179168

There is a lot of information as to how to handle reading multiline inputs with some dummy variables and whatnot.

I would opt for a totally different approach of reading the text as a giant stream and reformating it using PRX to match T\d{8} as a delimiter when parsing (basically reading a text file and outputing a new text file with appropriate format to read) and then read it as a normal csv, by line with enough dummy variables for the classes each student follows.

art297
Opal | Level 21

If you plan to do further analyses on the data, I'd think that you would be better off with a different resulting output file, namely multiple records for each id. The following approach first produces that kind of file and then transposes it into the kind of output file you indicated that you wanted:

filename have 'c:\art\have.csv';

data need;

  infile have;

  retain id first last grade;

  input;

  if substr(_infile_,1,1) eq 'T' then do;

    id=scan(_infile_,1,',');

    first=scan(_infile_,2,',');

    last=scan(_infile_,3,',');

    grade=scan(_infile_,4,',');

  end;

  else do;

    course=compress(strip(_infile_),',');

    output;

  end;

run;

proc transpose data=need out=want (drop=_:) prefix=course;

  var course;

  by id first last grade notsorted;

run;

Tom
Super User Tom
Super User

That is easy, you just need to read enough to figure out whether the line has the student information or the course name.

Note that your sample was poorly formed.  I added a comma between first and last names for Roger Sparks.  If you real data is that inconsistent then your program will need to be more complex.

data want ;

  infile cards dsd truncover ;

  length id $20 fname lname $25 class $10 course $10 dummy $10;

  retain id fname lname class ;

  input course dummy @1 @ ;

  drop dummy ;

  if dummy ne ' ' then input id fname lname class ;

  else output;

cards;

T00000000, Tom, Bryson, Freshman,

                    ENG 101,

                  MAT 110,

T00000001, Gregg, Smith, Senior,

                  ACC 101,

                  ENG  101,

                  MAT 110,

                  MAT 120,

T00000002, Roger, Sparks, Junior,

                  ART  101

run;

proc print; run;

That will generate one observation per course, which is probably what you really want in order to be able to do anything with it.  But if you really want to create multiple course variables so that you can get down to one observation per student then you could just add a PROC TRANSPOSE step to do that for you.

GreggB
Pyrite | Level 9

This one worked.  Thank you.  I haven't tried the other replies yet.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 2161 views
  • 2 likes
  • 5 in conversation