Re: Handling incomplete data rows in unstructured data

dunga · Posted 07-15-2017 10:50 PM

Hi SAS Forum,
I am having the attached “.txt” file that should be converted to a SAS data set.

It has following 8 variables. And has 5 observations..

Category
SurName
First_Name
salary_1
salary_2
company
Profession
Year

Q:

When I imported using "Import Data" functionality in SAS, first 3 records get correctly imported.

Howerver, a part of values for "profession" variable for last 3 records get running into the next variable, which is "Income".

I have attahced an Excel table to show the wrong and correct answers.

Could you help me to correctly import the ".txt" file.

Thanks

Mirisa

dunga · Posted 07-15-2017 10:53 PM

Hi Forum again,

This is the txt file needs to be converted to a SAS data set.

Mirisa

Jagadishkatam · Posted 07-16-2017 03:33 AM

With the below code i was able to generate the dataset want. The delimiter need to be used is tab like delimiter='09'x

the only exemption i observed is that the last three observations do not have a year, the people held multiple professions. So had to slightly update the txt file as the multiple professions were separated by tab instead i updated to a single space. due to which now i get year blank. Hope this is what you are expecting.

proc import datafile='~path\post_2017_Jul_15.txt' out=want
dbms=dlm replace;
delimiter='09'x;
getnames=yes;
run;

Thanks,
Jag

LinusH · Posted 07-16-2017 03:32 AM

Since there are tons of ways to import data in SAS, please clarify how you are doing the import, attaching the code and log if possible.

Data never sleeps

ballardw · Posted 07-16-2017 03:41 AM

What options did you use when importing the text?

It appears that you text may be tab delimited text. And that you have tabs beford "Member (Part-time) / Membre (à temps partiel)" and "Director of Clinical Training" which why those appear under the column heading "Year".

It helps to talk about the column labels as appearing in the data NOT a concept like "Income" which does not appear in the data

If you used the wizard to import the data the log should contain data step code used to read the file. Post that along with any messages from from log. Please use a code box opened with the forum {i} menu icon to preserve formatting of the code as the message windows will do some text reformatting.

dunga · Posted 07-16-2017 11:04 AM

/* --------------------------------------------------------------------
   Code generated by a SAS task
   
   Generated on Sunday, July 16, 2017 at 10:37:09 AM
   By task:     Import Data Wizard
   
   Source file: xxx:\post_2017_Jul_15.txt
   Server:      Local File System
   
   Output data: WORK.post_2017_Jul_151
   Server:      
   
   Note: In preparation for running the following code, the Import
   Data wizard has used internal routines to transfer the source data
   file from the local file system to xxxxxxx. There is no SAS code
   available to represent this action.
   -------------------------------------------------------------------- */

DATA WORK.post_2017_Jul_151;
    LENGTH
        Category         $ 14
        SurName          $ 6
        First_Name       $ 7
        salary_1           8
        salary_2           8
        company          $ 25
        Profession       $ 63
        Year             $ 45 ;
    FORMAT
        Category         $CHAR14.
        SurName          $CHAR6.
        First_Name       $CHAR7.
        salary_1         DOLLAR12.2
        salary_2         DOLLAR7.2
        company          $CHAR25.
        Profession       $CHAR63.
        Year             $CHAR45. ;
    INFORMAT
        Category         $CHAR14.
        SurName          $CHAR6.
        First_Name       $CHAR7.
        salary_1         DOLLAR12.
        salary_2         DOLLAR7.
        company          $CHAR25.
        Profession       $CHAR63.
        Year             $CHAR45. ;
    INFILE 'T:\SASWork\_xxxxxxxxxxxx_\#LN00038'
        LRECL=181
        ENCODING="WLATIN1"
        TERMSTR=CRLF
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        Category         : $CHAR14.
        SurName          : $CHAR6.
        First_Name       : $CHAR7.
        salary_1         : ?? DOLLAR12.
        salary_2         : ?? DOLLAR7.
        company          : $CHAR25.
        Profession       : $CHAR63.
        Year             : $CHAR45. ;
RUN;

Hi Jag, LinusH, Ballrdw,

Thanks to each one of you. Please see the log attahced. Also attahced screenshots to show how I have imported.

Hi Jag,

Actually last three observations should not have a year, which is what I want.

I think you have manually updated the txt file to a single space to get the correct answer (as the multiple professions were separated by tab instead you updated to a single space).

However, my original file is having over million records, so I cannot update it to a single space manually.

Could you help me.

art297 · Posted 07-16-2017 12:48 PM

Not sure if you want to cleanup the non-ascii characters in Profession, or what the maximum length of that field is, but the following will let you read all of the fields and let you adjust as necessary:

data want (drop=csalary_1 position length);
  infile '/folders/myfolders/post_2017_Jul_15.txt' dlm='09'x lrecl=32767 firstobs=2;
  input @;
  if not anyalpha(scan(_infile_,-1,'09'x)) then do;
    year=input(scan(_infile_,-1,'09'x),4.);
    CALL SCAN(_infile_, -1, position, length, '09'x);
    _infile_=substr(_infile_,1,position-2);
  end;
  informat
    Category         $14.
    SurName          $20.
    First_Name       $20.
    csalary_1         $14.
    salary_2         DOLLAR7.
    company          $25.
  ;
  length profession $512;
  input Category SurName First_Name csalary_1 salary_2 company@;
  salary_1=input(compress(csalary_1,'"'),dollar12.);
  CALL SCAN(_infile_, 6, position, length, '09'x);
  _infile_=translate(substr(_infile_,position+length+1),'/','09'x);
  profession=_infile_;
  input;
run;

Art, CEO, AnalystFinder.com

dunga · Posted 07-18-2017 08:58 AM

Thanks Art for your code.

It works correctly for the sample dataset I have attahced.

However, I was not able to adjust it to my big dataset (but it is my problem).

Thanks

Mirisa

art297 · Posted 07-18-2017 09:15 AM

In order for anyone to be able to help you would have to post a couple of the lines that didn't import correctly.

Art, CEO, AnalystFinder.com

Register Today!

SAS Training: Just a Click Away