BookmarkSubscribeRSS Feed
buddha_d
Pyrite | Level 9

Dear Community,

         I have a messy real time data and that almost looks like this data. I am trying to delete rows where caller (name) is distributed among 4 fields on the same row (caller_info1 - caller_info4). If this caller name is there in between concatenated 4 fields then I want to delete those rows.

   (eg: TULSA REAL ESTATES HOLDING COMPANY LLC  is distributed with additional information in caller_info1 - caller_info4 fields).

I am proving a program and I listed the output as well .

data customer;
INFILE CARDS dsd DeLiMiter= '  ' missover ;
input caller & $38. caller_info1 & $18. caller_info2 & $10. caller_info3 & $10. caller_info4 & $20. ; 
cards;
TULSA REAL ESTATES HOLDING COMPANY LLC  FOR:TULSA REAL EST  ATES HOLDI  NG COMPANY  LLC
Siva T Brown  For further credit  SIVA BROWN  1110  1234
*Christopher F Lindner 2010 special Tr  RE:Christopher F L  indner 201  0 special   Trust dtd 11-4-02
SILVER LINING PARTNERS LLC  FOR FINAL CREDIT  FOR SILVER   LINING   PARTNERS LLC
RONNIE B MILLER  MEMO: RON MILLER 
Warren K Buffet  Reference #1234
;
run;

/*
I want to see the output window which contains only this row

Warren K Buffet  Reference #1234
*/

 I used compress function with 'ka' option and other stuff. But, my attempts were futile. Any solutions or thoughts to solve this issue please. 

Thanks 

5 REPLIES 5
Shmuel
Garnet | Level 18

Use option firstobs=n to skip n lines as in:

data customer;
 set customer(firstobs=6);
run;

or if you want only this line use where to select required lines:

data customer;
 set customer;
    where index(name,"Warren") > 0; 
run;

 

buddha_d
Pyrite | Level 9

Shmuel, I am not just trying to go to line 6 and get the result. I have more data and I am trying to filter it out by matching word pattern. Thanks for your response. 

SASKiwi
PROC Star

I don't think this is easy. However, first off I would read each data line into one variable so you have a single text string. By reading a line with multiple variables you are making assumptions about where word boundaries are, so it is best not to create a problem like this.

 

Then I would try to write a rule in plain English like: find the first word, delimited by blanks, in the string and then search for the same pattern of characters in the rest of the string. This rule works for data lines 1,2,4 and 6, if you ignore case, but not 3 (*Christopher doesn't match with Christopher) and 5 (RONNIE doesn't match with RON).

 

If you add - * - as a word delimiter to the rule then that will fix line 3. To match line 5 you would have to compare the third word in the string (MILLER) and see if the same character pattern exists in the rest of the string.

 

I'm guessing that there are many examples in your data like this that you haven't posted that will cause even more problems trying to match. If that is the case then it is going to be extremely difficult to cater for all possible scenarios and you may have to live with a solution that works less than 100 percent of the time.

buddha_d
Pyrite | Level 9

SASKiwi, Thanks for the reply. Even if it works for 1,2,4 and 6 that would help me a lot. Like I said, I am trying to delete most of the matching data. I read your comments, but didn't get what you are trying to convey ("I don't think this is easy. However, first off I would read each data line into one variable so you have a single text string. By reading a line with multiple variables you are making assumptions about where word boundaries are, so it is best not to create a problem like this.

Then I would try to write a rule in plain English like: find the first word, delimited by blanks, in the string and then search for the same pattern of characters in the rest of the string"). Could you help me here. 

Thanks !!

ballardw
Super User

If possible I might go back to the source of this and ask if it can be prepared in a more consistent form such as a CSV file.

Or if there is a document that describes what the content of this file should look like such as start/end columns for values, lengths of variables or something.

 

IF there is a specific value (key word such as "Reference #") that occurs in the data you might be able to search for that such as

data customer;
INFILE CARDS dsd DeLiMiter= '  ' missover ;
input string $ 1-100; 
if index(string,"Reference #");
cards;
TULSA REAL ESTATES HOLDING COMPANY LLC  FOR:TULSA REAL EST  ATES HOLDI  NG COMPANY  LLC
Siva T Brown  For further credit  SIVA BROWN  1110  1234
*Christopher F Lindner 2010 special Tr  RE:Christopher F L  indner 201  0 special   Trust dtd 11-4-02
SILVER LINING PARTNERS LLC  FOR FINAL CREDIT  FOR SILVER   LINING   PARTNERS LLC
RONNIE B MILLER  MEMO: RON MILLER 
Warren K Buffet  Reference #1234
;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 587 views
  • 3 likes
  • 4 in conversation