BookmarkSubscribeRSS Feed
Saurabh_Amar
Calcite | Level 5

I am trying to read data coming from AP systems in company. The system throws some random rows repeating in the data. I need to delete those rows and then read the remaining data. In the below example, text ID to Country along-with '_____' is repeated multiple times. i need to remove that, and then i can see a pattern of 3 rows. 3 rows in this below data should ideally be 1 row as output in sas dataset. there are certain scenarios where 2nd and 3rd rows may be missing as well. but ID will always be present. I have close 13,000 records with more files as well Can anyone please help?

 

ID Name Phone extension Address City State Country __________________________________________________________ 11050ATR 1105ABCCORP (999) 999-9999 Ext. 0000
PO Box 99999
Los Angeles CA 9999 USA 11050ATS 1105ABCCORP1 (999) 999-9998 Ext. 0000
PO Box 888
Los Angeles CA 9999 USA 11050ATQ 1105ABCCORPr (999) 999-9999 Ext. 0000

11050ATC 1105ABCCORPq (999) 999-9999 Ext. 0000
PO Box 0000
Los Angeles CA 9999 USA

**ID Name Phone extension

Address

City State Country

___________________________________________________________**

11050ATQ 1105ABCCORPr (999) 999-9999 Ext. 0000

11050ATC 1105ABCCORPq (999) 999-9999 Ext. 0000
PO Box 0000
Los Angeles CA 9999 USA

8 REPLIES 8
thomp7050
Pyrite | Level 9

In a data step, retain the id so that you have a column of IDs in addition to the rows of data you have below.  So then you would have two columns: one with the ID (repeated) and one with the data you have below.  Once you do this, select the distinct responses in your data below, grouped by the ID.  That way, each response will only appear once.  Then, delete the rows where the IDs in your ID column match the data in your dataset.

 

 

ballardw
Super User

Posting data from a file like this should be done in a code box opened with the forum {i} icon. The main message window will reformat things making it harder to understand in terms of the actual problem.

For instance, it appears as if this bit of text repeats as a header to each record:

ID              Name             Phone          extension
Address
City            State            Country
__________________________________________________________

Is that a correct understanding of the file layout?

 

 

You should provide some way to know what the desired output it should actually look like. For instance you say "where 2nd and 3rd rows may be missing". Which are the rows that may be missing? There are multiple rows of values:

11050ATR	1105ABCCORP	(999) 999-9999  Ext. 0000		
PO Box 99999					
Los Angeles	CA	9999	USA
11050ATS	1105ABCCORP1	(999) 999-9998  Ext. 0000		
PO Box 888					
Los Angeles	CA	9999	USA
11050ATQ	1105ABCCORPr	(999) 999-9999  Ext. 0000		


11050ATC	1105ABCCORPq	(999) 999-9999  Ext. 0000		
PO Box 0000					
Los Angeles	CA	9999	USA

WHICH one is going to be there, which are "2nd and 3rd", and if they are included in the output data what would they look like.

 

Saurabh_Amar
Calcite | Level 5

Apologies for not being able to format the question properly.

 

To your questions

 

1) Yes you are correct this text (rather group of text) appears multiple times but not with every record - it may be a random repeat.

 

2) Desired out put would be 8 columns with values of ID, Name, Phone, Extension, address, City, State, Country - We may have anything missing but for ID. Id will always be present. So records may look like

 

101C0	101ABC	(000) 000-0000  Ext. 0000
(missing)
(missing)

		
101C0	101ABC	(000) 000-0000  Ext. 0000
(missing)
(missing)

101C0	101ABC	(000) 000-0000  Ext. 0000
2/48 Newways 
Los Angeles	CA	90189	USA		
					

 

 

HB
Barite | Level 11 HB
Barite | Level 11

The "desired results" include records with an ID of 101c0 which isn't even shown in the originally posted data.

 

I'm sure people would be better able to assist in a solution if a more complete presentation of the problem was offered.

 

Reeza
Super User

I doubt it's random, more likely whatever it's coming from is having a formatted report converted to a txt file and the original file has this is a page footer/header. Check if the 'random' text is the same and repeated throughout the document. If it is, you can probably write some quick code to delete all the occurences. 

 

 

 

 

Saurabh_Amar
Calcite | Level 5
Yes the text is always the same but occurrence is random. No fixed pattern there
Reeza
Super User

Have you considered processing the file twice? 

Once to remove the repeated text/standardize the format and in the second pass, actually read the data in the file. 

 

PGStats
Opal | Level 21

You could try filtering out the extra lines with:

 

data _null_;
infile "&sasforum\datasets\Amar_test.txt" truncover;
file "&sasforum\datasets\Amar_test_out.txt";
input @;
if _infile_ = " " then delete;
if _infile_ =: "____" then delete;
input line1 $200. / line2 $200. / line3 $200.;
if substr(line1,1,2) ne "ID" then put line1 / line2 / line3;
run; 

please provide a more elaborate test file if that doesn't work.

PG

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1164 views
  • 0 likes
  • 6 in conversation