I am trying to read data coming from AP systems in our company. The system inserts some random repeated rows into the data. I need to delete those rows and then read the remaining data. In the example below, the header text from "ID" through "Country", along with the '_____' underline, is repeated multiple times. I need to remove it, and then the data follows a pattern of 3 rows. Each group of 3 rows below should ideally become 1 row in the output SAS dataset. There are certain scenarios where the 2nd and 3rd rows may be missing as well, but the ID will always be present. I have close to 13,000 records, with more files to process as well. Can anyone please help?
ID Name Phone extension Address City State Country
__________________________________________________________
11050ATR 1105ABCCORP (999) 999-9999 Ext. 0000
PO Box 99999
Los Angeles CA 9999 USA
11050ATS 1105ABCCORP1 (999) 999-9998 Ext. 0000
PO Box 888
Los Angeles CA 9999 USA
11050ATQ 1105ABCCORPr (999) 999-9999 Ext. 0000
11050ATC 1105ABCCORPq (999) 999-9999 Ext. 0000
PO Box 0000
Los Angeles CA 9999 USA
ID Name Phone extension Address City State Country
___________________________________________________________
11050ATQ 1105ABCCORPr (999) 999-9999 Ext. 0000
11050ATC 1105ABCCORPq (999) 999-9999 Ext. 0000
PO Box 0000
Los Angeles CA 9999 USA
In a data step, retain the ID so that you have a column of IDs in addition to the rows of data you show below. You would then have two columns: one with the ID (repeated) and one with the data. Once you do this, select the distinct responses in your data, grouped by the ID, so that each response appears only once. Then delete the rows where the data value is just a repeat of the ID or header text.
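A minimal sketch of that retain idea, assuming the file path and the ID pattern (five digits followed by letters, as in 11050ATR) — adjust the regular expression to your real IDs:

```sas
data tall;
   infile "&path\Amar_test.txt" truncover;   /* path is hypothetical */
   length id $16 line $200;
   retain id;                                /* carry the ID down to every line */
   input line $char200.;
   /* drop blank lines, underline separators, and header lines */
   if line = ' ' or line =: '____' or scan(line, 1, ' ') = 'ID' then delete;
   /* assume a record starts with an ID token such as 11050ATR */
   if prxmatch('/^\d{5}[A-Z]+/', strip(line)) then id = scan(line, 1, ' ');
run;

/* keep each (id, line) pair only once */
proc sql;
   create table deduped as
   select distinct id, line
   from tall;
quit;
```

From the deduped two-column table you can then transpose or re-read the lines per ID into the final one-row-per-record layout.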
Posting data from a file like this should be done in a code box opened with the forum {i} icon. The main message window will reformat things, making the actual problem harder to understand.
For instance, it appears as if this bit of text repeats as a header to each record:
ID Name Phone extension Address City State Country __________________________________________________________
Is that a correct understanding of the file layout?
You should provide some way to know what the desired output should actually look like. For instance, you say "where 2nd and 3rd rows may be missing". Which are the rows that may be missing? There are multiple rows of values:
11050ATR 1105ABCCORP (999) 999-9999 Ext. 0000
PO Box 99999
Los Angeles CA 9999 USA
11050ATS 1105ABCCORP1 (999) 999-9998 Ext. 0000
PO Box 888
Los Angeles CA 9999 USA
11050ATQ 1105ABCCORPr (999) 999-9999 Ext. 0000
11050ATC 1105ABCCORPq (999) 999-9999 Ext. 0000
PO Box 0000
Los Angeles CA 9999 USA
Which one is going to be there? Which are the "2nd and 3rd"? And if they are included in the output data, what would they look like?
Apologies for not being able to format the question properly.
To answer your questions:
1) Yes, you are correct: this text (rather, group of text lines) appears multiple times, but not with every record - it may be a random repeat.
2) The desired output would be 8 columns with values of ID, Name, Phone, Extension, Address, City, State, Country. Anything may be missing except ID; ID will always be present. So records may look like:
101C0 101ABC (000) 000-0000 Ext. 0000 (missing) (missing)
101C0 101ABC (000) 000-0000 Ext. 0000 (missing) (missing)
101C0 101ABC (000) 000-0000 Ext. 0000 2/48 Newways Los Angeles CA 90189 USA
The "desired results" include records with an ID of 101C0, which isn't even shown in the originally posted data.
I'm sure people would be better able to assist in a solution if a more complete presentation of the problem was offered.
I doubt it's random; more likely the source is a formatted report converted to a text file, and the repeated text is a page header/footer in the original. Check whether the "random" text is identical and repeated throughout the document. If it is, you can probably write some quick code to delete all the occurrences.
Have you considered processing the file twice?
One pass to remove the repeated text and standardize the format, and a second pass to actually read the data in the file.
You could try filtering out the extra lines with:
data _null_;
   infile "&sasforum\datasets\Amar_test.txt" truncover;
   file "&sasforum\datasets\Amar_test_out.txt";
   /* hold the current line so it can be inspected before reading */
   input @;
   /* drop blank lines and the underline separators */
   if _infile_ = " " then delete;
   if _infile_ =: "____" then delete;
   /* read the record as three consecutive lines */
   input line1 $200. / line2 $200. / line3 $200.;
   /* skip header groups; write real records back out */
   if substr(line1, 1, 2) ne "ID" then put line1 / line2 / line3;
run;
Please provide a more elaborate test file if that doesn't work.
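For the second pass, here is a minimal sketch of reading the cleaned file into the desired columns. The file name, variable lengths, and the right-anchored parsing of the city/state/zip/country line are all assumptions, and it assumes every record made it to the cleaned file as exactly three lines - records missing the address or city line would need extra handling:

```sas
data want;
   infile "&sasforum\datasets\Amar_test_out.txt" truncover;
   length id $16 name $40 phone $20 extension $8
          address $60 city $40 state $2 zip $10 country $20;
   input line1 $char200. / line2 $char200. / line3 $char200.;
   /* line 1: ID, name, phone "(999) 999-9999", "Ext.", extension */
   id        = scan(line1, 1, ' ');
   name      = scan(line1, 2, ' ');
   phone     = catx(' ', scan(line1, 3, ' '), scan(line1, 4, ' '));
   extension = scan(line1, 6, ' ');
   /* line 2: the PO Box / street address */
   address   = strip(line2);
   /* line 3: city state zip country; parse from the right so a
      multi-word city like "Los Angeles" stays intact */
   country   = scan(line3, -1, ' ');
   zip       = scan(line3, -2, ' ');
   state     = scan(line3, -3, ' ');
   city      = substr(line3, 1, find(line3, catx(' ', state, zip, country)) - 1);
   drop line1-line3;
run;
```

The zip variable is an extra beyond the 8 requested columns; the sample city line ("Los Angeles CA 9999 USA") appears to carry a postal code, so it is split out here rather than left glued to the state.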