I am trying to read data coming from AP systems in our company. The system inserts some random repeated rows into the data. I need to delete those rows and then read the remaining data. In the example below, the header text from "ID" through "Country", along with the '_____' underline, is repeated multiple times. I need to remove it, and then the data follows a pattern of 3 rows. Each group of 3 rows below should ideally become 1 row in the output SAS dataset. There are certain scenarios where the 2nd and 3rd rows may be missing as well, but the ID will always be present. I have close to 13,000 records, with more files to process as well. Can anyone please help?
ID Name Phone extension Address City State Country
__________________________________________________________
11050ATR 1105ABCCORP (999) 999-9999 Ext. 0000
PO Box 99999
Los Angeles CA 9999 USA
11050ATS 1105ABCCORP1 (999) 999-9998 Ext. 0000
PO Box 888
Los Angeles CA 9999 USA
11050ATQ 1105ABCCORPr (999) 999-9999 Ext. 0000
11050ATC 1105ABCCORPq (999) 999-9999 Ext. 0000
PO Box 0000
Los Angeles CA 9999 USA
ID Name Phone extension Address City State Country
___________________________________________________________
11050ATQ 1105ABCCORPr (999) 999-9999 Ext. 0000
11050ATC 1105ABCCORPq (999) 999-9999 Ext. 0000
PO Box 0000
Los Angeles CA 9999 USA
In a data step, retain the ID so that you have a column of IDs in addition to the rows of data you show below. You would then have two columns: one with the ID (repeated) and one with the data. Once you do this, select the distinct responses in your data, grouped by the ID, so that each response appears only once. Then delete the rows where the data value is just a repeat of the ID or header text.
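A minimal sketch of that retain idea, assuming the file path and the ID pattern (five digits followed by letters, as in 11050ATR) — adjust the regular expression to your real IDs:

```sas
data tall;
   infile "&path\Amar_test.txt" truncover;   /* path is hypothetical */
   length id $16 line $200;
   retain id;                                /* carry the ID down to every line */
   input line $char200.;
   /* drop blank lines, underline separators, and header lines */
   if line = ' ' or line =: '____' or scan(line, 1, ' ') = 'ID' then delete;
   /* assume a record starts with an ID token such as 11050ATR */
   if prxmatch('/^\d{5}[A-Z]+/', strip(line)) then id = scan(line, 1, ' ');
run;

/* keep each (id, line) pair only once */
proc sql;
   create table deduped as
   select distinct id, line
   from tall;
quit;
```

From the deduped two-column table you can then transpose or re-read the lines per ID into the final one-row-per-record layout.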
Posting data from a file like this should be done in a code box opened with the forum {i} icon. The main message window will reformat things, making the actual problem harder to understand.
For instance, it appears as if this bit of text repeats as a header to each record:
ID Name Phone extension Address City State Country __________________________________________________________
Is that a correct understanding of the file layout?
You should provide some way to know what the desired output should actually look like. For instance, you say "where 2nd and 3rd rows may be missing". Which are the rows that may be missing? There are multiple rows of values:
11050ATR 1105ABCCORP (999) 999-9999 Ext. 0000
PO Box 99999
Los Angeles CA 9999 USA
11050ATS 1105ABCCORP1 (999) 999-9998 Ext. 0000
PO Box 888
Los Angeles CA 9999 USA
11050ATQ 1105ABCCORPr (999) 999-9999 Ext. 0000
11050ATC 1105ABCCORPq (999) 999-9999 Ext. 0000
PO Box 0000
Los Angeles CA 9999 USA
Which one is going to be there? Which are the "2nd and 3rd"? And if they are included in the output data, what would they look like?
Apologies for not being able to format the question properly.
To answer your questions:
1) Yes, you are correct: this text (rather, group of text lines) appears multiple times, but not with every record - it may be a random repeat.
2) The desired output would be 8 columns with values of ID, Name, Phone, Extension, Address, City, State, Country. Anything may be missing except ID; ID will always be present. So records may look like:
101C0 101ABC (000) 000-0000 Ext. 0000 (missing) (missing)
101C0 101ABC (000) 000-0000 Ext. 0000 (missing) (missing)
101C0 101ABC (000) 000-0000 Ext. 0000 2/48 Newways Los Angeles CA 90189 USA
The "desired results" include records with an ID of 101C0, which isn't even shown in the originally posted data.
I'm sure people would be better able to assist in a solution if a more complete presentation of the problem was offered.
I doubt it's random; more likely the source is a formatted report converted to a text file, and the repeated text is a page header/footer in the original. Check whether the "random" text is identical and repeated throughout the document. If it is, you can probably write some quick code to delete all the occurrences.
Have you considered processing the file twice?
One pass to remove the repeated text and standardize the format, and a second pass to actually read the data in the file.
You could try filtering out the extra lines with:
data _null_;
   infile "&sasforum\datasets\Amar_test.txt" truncover;
   file "&sasforum\datasets\Amar_test_out.txt";
   /* hold the current line so it can be inspected before reading */
   input @;
   /* drop blank lines and the underline separators */
   if _infile_ = " " then delete;
   if _infile_ =: "____" then delete;
   /* read the record as three consecutive lines */
   input line1 $200. / line2 $200. / line3 $200.;
   /* skip header groups; write real records back out */
   if substr(line1, 1, 2) ne "ID" then put line1 / line2 / line3;
run;
Please provide a more elaborate test file if that doesn't work.
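For the second pass, here is a minimal sketch of reading the cleaned file into the desired columns. The file name, variable lengths, and the right-anchored parsing of the city/state/zip/country line are all assumptions, and it assumes every record made it to the cleaned file as exactly three lines - records missing the address or city line would need extra handling:

```sas
data want;
   infile "&sasforum\datasets\Amar_test_out.txt" truncover;
   length id $16 name $40 phone $20 extension $8
          address $60 city $40 state $2 zip $10 country $20;
   input line1 $char200. / line2 $char200. / line3 $char200.;
   /* line 1: ID, name, phone "(999) 999-9999", "Ext.", extension */
   id        = scan(line1, 1, ' ');
   name      = scan(line1, 2, ' ');
   phone     = catx(' ', scan(line1, 3, ' '), scan(line1, 4, ' '));
   extension = scan(line1, 6, ' ');
   /* line 2: the PO Box / street address */
   address   = strip(line2);
   /* line 3: city state zip country; parse from the right so a
      multi-word city like "Los Angeles" stays intact */
   country   = scan(line3, -1, ' ');
   zip       = scan(line3, -2, ' ');
   state     = scan(line3, -3, ' ');
   city      = substr(line3, 1, find(line3, catx(' ', state, zip, country)) - 1);
   drop line1-line3;
run;
```

The zip variable is an extra beyond the 8 requested columns; the sample city line ("Los Angeles CA 9999 USA") appears to carry a postal code, so it is split out here rather than left glued to the state.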