DATA Step, Macro, Functions and more

Handling incomplete data rows in unstructured data

Reply
Contributor
Posts: 33

Handling incomplete data rows in unstructured data

Hi SAS Forum,
I am having the attached “.txt” file that should be converted to a SAS data set.

It has following 8 variables. And has 5 observations..


Category
SurName 
First_Name 
salary_1 
salary_2 
company 
Profession 
Year

 

Q:

When I imported using "Import Data" functionality in SAS, first 3 records get correctly imported.

Howerver, a part of values for "profession" variable for last 3 records get running into the next variable, which is "Income".

I have attahced an Excel table to show the wrong and correct answers.

 

Could you help me to correctly import the ".txt" file.

Thanks

 

Mirisa

Contributor
Posts: 33

Re: Handling incomplete data rows in unstructured data

Hi Forum again,

This is the txt file needs to be converted to a SAS data set.

Mirisa

Trusted Advisor
Posts: 1,131

Re: Handling incomplete data rows in unstructured data

With the below code i was able to generate the dataset want. The delimiter need to be used is tab like delimiter='09'x

 

the only exemption i observed is that the last three observations do not have a year, the people held multiple professions. So had to slightly update the txt file as the multiple professions were separated by tab instead i updated to a single space. due to which now i get year blank. Hope this is what you are expecting.

 

proc import datafile='~path\post_2017_Jul_15.txt' out=want
dbms=dlm replace;
delimiter='09'x;
getnames=yes;
run;

 

Thanks,
Jag
Super User
Posts: 5,260

Re: Handling incomplete data rows in unstructured data

Since there are tons of ways to import data in SAS, please clarify how you are doing the import, attaching the code and log if possible.
Data never sleeps
Super User
Posts: 10,538

Re: Handling incomplete data rows in unstructured data

What options did you use when importing the text?

It appears that you text may be tab delimited text. And that you have tabs beford "Member (Part-time) / Membre (à temps partiel)" and "Director of Clinical Training" which why those appear under the column heading "Year".

 

It helps to talk about the column labels as appearing in the data NOT a concept like "Income" which does not appear in the data

 

If you used the wizard to import the data the log should contain data step code used to read the file. Post that along with any messages from from log. Please use a code box opened with the forum {i} menu icon to preserve formatting of the code as the message windows will do some text reformatting.

 

 

Contributor
Posts: 33

Re: Handling incomplete data rows in unstructured data

 

/* --------------------------------------------------------------------
   Code generated by a SAS task
   
   Generated on Sunday, July 16, 2017 at 10:37:09 AM
   By task:     Import Data Wizard
   
   Source file: xxx:\post_2017_Jul_15.txt
   Server:      Local File System
   
   Output data: WORK.post_2017_Jul_151
   Server:      
   
   Note: In preparation for running the following code, the Import
   Data wizard has used internal routines to transfer the source data
   file from the local file system to xxxxxxx. There is no SAS code
   available to represent this action.
   -------------------------------------------------------------------- */

DATA WORK.post_2017_Jul_151;
    LENGTH
        Category         $ 14
        SurName          $ 6
        First_Name       $ 7
        salary_1           8
        salary_2           8
        company          $ 25
        Profession       $ 63
        Year             $ 45 ;
    FORMAT
        Category         $CHAR14.
        SurName          $CHAR6.
        First_Name       $CHAR7.
        salary_1         DOLLAR12.2
        salary_2         DOLLAR7.2
        company          $CHAR25.
        Profession       $CHAR63.
        Year             $CHAR45. ;
    INFORMAT
        Category         $CHAR14.
        SurName          $CHAR6.
        First_Name       $CHAR7.
        salary_1         DOLLAR12.
        salary_2         DOLLAR7.
        company          $CHAR25.
        Profession       $CHAR63.
        Year             $CHAR45. ;
    INFILE 'T:\SASWork\_xxxxxxxxxxxx_\#LN00038'
        LRECL=181
        ENCODING="WLATIN1"
        TERMSTR=CRLF
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        Category         : $CHAR14.
        SurName          : $CHAR6.
        First_Name       : $CHAR7.
        salary_1         : ?? DOLLAR12.
        salary_2         : ?? DOLLAR7.
        company          : $CHAR25.
        Profession       : $CHAR63.
        Year             : $CHAR45. ;
RUN;

Hi Jag, LinusH, Ballrdw,

 

 

Thanks to each one of you. Please see the log attahced. Also attahced screenshots to show how I have imported.

 

Hi Jag,

Actually last three observations should not have a year, which is what I want.

I think you have manually updated the txt file to a single space to get the correct answer (as the multiple professions were separated by tab instead you updated to a single space).

 

However, my original file is having over million records, so I cannot update it to a single space manually.

 

Could you help me.

 

 

 

 

 

PROC Star
Posts: 7,364

Re: Handling incomplete data rows in unstructured data

Not sure if you want to cleanup the non-ascii characters in Profession, or what the maximum length of that field is, but the following will let you read all of the fields and let you adjust as necessary:

data want (drop=csalary_1 position length);
  infile '/folders/myfolders/post_2017_Jul_15.txt' dlm='09'x lrecl=32767 firstobs=2;
  input @;
  if not anyalpha(scan(_infile_,-1,'09'x)) then do;
    year=input(scan(_infile_,-1,'09'x),4.);
    CALL SCAN(_infile_, -1, position, length, '09'x);
    _infile_=substr(_infile_,1,position-2);
  end;
  informat
    Category         $14.
    SurName          $20.
    First_Name       $20.
    csalary_1         $14.
    salary_2         DOLLAR7.
    company          $25.
  ;
  length profession $512;
  input Category SurName First_Name csalary_1 salary_2 company@;
  salary_1=input(compress(csalary_1,'"'),dollar12.);
  CALL SCAN(_infile_, 6, position, length, '09'x);
  _infile_=translate(substr(_infile_,position+length+1),'/','09'x);
  profession=_infile_;
  input;
run;

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 33

Re: Handling incomplete data rows in unstructured data

Thanks Art for your code.

It works correctly for the sample dataset I have attahced.

However, I was not able to adjust it to my big dataset (but it is my problem).

Thanks

Mirisa

 

PROC Star
Posts: 7,364

Re: Handling incomplete data rows in unstructured data

In order for anyone to be able to help you would have to post a couple of the lines that didn't import correctly.

 

Art, CEO, AnalystFinder.com

 

Ask a Question
Discussion stats
  • 8 replies
  • 246 views
  • 2 likes
  • 5 in conversation