DATA Step, Macro, Functions and more

Reading messy raw data from a tab-delimited txt file

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

Reading messy raw data from a tab-delimited txt file

Hi, I'm attempting to read a tab delimited text file into SAS.  This particular file contains manufacturer names, location, registration#, year and product lists (all data is public).  For a given company name there may anywhere from 1 to 15 listed products.  Each product listing is followed by a description of the manufacturer type.  Here is a sample (raw data file is also attached):


Airgas USA, LLC

NC/USA

1054016

2014

Gas, Calibration (Specified Concentration) - Blood Gas/ Lung Diffusion Gas

Contract Manufacturer; Manufacturer

Gas, Laser Generating - Laser Gas

Contract Manufacturer; Manufacturer

Anaerobic Box Glove - Biological Atmosphere (Anaerobic)

Contract Manufacturer; Manufacturer

Bath, Incubators/Water, All - Biological Atmosphere (Aerobic)

Contract Manufacturer; Manufacturer



I begin by using the infile and input statements to create variables called Company, Location, Registration#, Year:


data mfr;

     infile 'file-specification'  dsd  dlm='09'X  firstobs=5;

     input Company  :$CHAR15.  Location  $CHAR6.  Registration#  Year  @;



At this point I would like to create two variables (one called Product and ProductDescr) from the first line which lists the product and a variables (called MfrType) from the second line which lists the kind of manufacturer.  To do this I first read the entire line and then break it up based on certain delimiting characters:


input  ProductListing  :$100.  @ ;        /*reads entire line listing the product and product description*/

     where  find(ProductListing , "-" , "t" , 1 )  ~=  0;        /*tells SAS to check if a "-" character is present in ProductListing*/

           ProductDescr = SCAN(ProductListing, 1, '-') & Product = SCAN(ProductListing, -2, '-' );       /*creates two variables using "-" character as the delimiter*/


input MfrListing :$CHAR55. @ ;        /*reads entire line listing the manufacturer information*/

     if find (MfrListing, "Contract Manufacturer" ,"i", 1) ~= 0 

          ||   find(MfrListing,"Manufacturer", "i", 1) THEN MfrType = scan(MfrListing, 1, ';' );       /*tells SAS to look for words "Manufacturer" or "Contract Manufacturer" and create variable called MfrType*/



At this point there may be several more listings of product and manufacturer.  I do not want SAS to return to the top of the data step until it has read through all of the products/manufacturers, so it seems like I would use a loop (e.g. do while, do until) but I haven't been able to make that work.  I am relatively new to SAS so I may be making this more complicated than it needs to be.  I am using SAS University Edition.  I greatly appreciate any suggestions, assistance you can provide.

Attachment

Accepted Solutions
Solution
‎11-24-2014 01:06 PM
Super User
Super User
Posts: 6,497

Re: Reading messy raw data from a tab-delimited txt file

Here is one way to read it that seems to work.  The idea is to read the first four fields from the the first two lines for each company and then loop reading PRODUCT strings until an input line has a tab (indicating the beginning of the next companies information).

data want ;

  infile "&path/&fname" firstobs=5  dsd dlm='09'x truncover end=eof ;

  length company $100 location $20 registration $20 year $4 ;

  length product $600 ;

  input company location / registration year / @@ ;

  do until ( index(_infile_,'09'x) ) ;

    product=_infile_;

    output;

    input ;

    if eof then leave;

    else input @@ ;

  end;

run;

View solution in original post


All Replies
Solution
‎11-24-2014 01:06 PM
Super User
Super User
Posts: 6,497

Re: Reading messy raw data from a tab-delimited txt file

Here is one way to read it that seems to work.  The idea is to read the first four fields from the the first two lines for each company and then loop reading PRODUCT strings until an input line has a tab (indicating the beginning of the next companies information).

data want ;

  infile "&path/&fname" firstobs=5  dsd dlm='09'x truncover end=eof ;

  length company $100 location $20 registration $20 year $4 ;

  length product $600 ;

  input company location / registration year / @@ ;

  do until ( index(_infile_,'09'x) ) ;

    product=_infile_;

    output;

    input ;

    if eof then leave;

    else input @@ ;

  end;

run;

Occasional Contributor
Posts: 6

Re: Reading messy raw data from a tab-delimited txt file

Thanks for your help Tom!  That seems to have worked pretty well. 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 2 replies
  • 386 views
  • 0 likes
  • 2 in conversation