Isolating text in a variable field that contains both characters and numeric values?

New Contributor
Posts: 2

If someone can help with this problem, you will make my life!

I have a variable ("paragraph") that holds unstructured text containing both characters and numeric values. For example, one observation looks like this:

CUSIP NO. 90130N 10 3 - --------------------- - -------------------------------------------------------------------------------- (1) Name of Reporting PersonS.S. or I.R.S. Identification No. of Above Person American International Group, Inc. (I.R.S. Identification No. 13-2592361) - --------------------------------------------------------------------------------

I've tried parse macros and explode macros but the data seems to be too messy for both of these. The only information I need is the underlined part (American International Group, Inc.), but I have no idea how to tell SAS to do this. Furthermore, there's no standardization between observations and the information I need from each observation changes. Said differently, I need the name of each reporting person, which changes with each observation.

Any help would be very much appreciated!

Thanks so much!

Super Contributor
Posts: 275

Re: Isolating text in a variable field that contains both characters and numeric values?

data have;
  infile cards truncover;
  input;
  /* Keep only the text between the last " Person " and the last " (" */
  name=prxchange('s/.* Person (.*) \(.*/$1/io',-1,_infile_);
cards;
CUSIP NO. 90130N 10 3 - -------------- (1) Name of Reporting PersonS.S. or I.R.S. Identification No. of Above Person American International Group, Inc. (I.R.S. Identification No. 13-2592361) - ---
;
run;

Respected Advisor
Posts: 3,894

Re: Isolating text in a variable field that contains both characters and numeric values?

Using Regular Expressions feels like the way to go. To come up with a RegEx that is realistic for your data, could you please provide some more sample data (as heterogeneous as possible)?

Super User
Posts: 7,403

Re: Isolating text in a variable field that contains both characters and numeric values?

Irrespective of the solution (RegEx, index+substr, or something else), you need some indicator of where the data you want starts and ends. If the markers are the words "Above Person" and the text "(I.R.S", then it's pretty straightforward:

substr(text,index(text,"Above Person")+13,index(text,"(I.R.S")-index(text,"Above Person")-13);
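The index+substr idea can be sketched as a complete DATA step. This is only a minimal illustration under stated assumptions: the name is assumed to sit between the literal markers "Above Person" and "(I.R.S", which holds for the first sample observation but not necessarily for the rest of the data. The variable names text, marker, p, q and name are hypothetical.

```sas
data want;
    length text $ 400 name $ 100;
    /* First sample observation from the question, dashes removed */
    text = 'CUSIP NO. 90130N 10 3 (1) Name of Reporting PersonS.S. or I.R.S. Identification No. of Above Person American International Group, Inc. (I.R.S. Identification No. 13-2592361)';

    /* Assumed markers: the name follows "Above Person" and precedes "(I.R.S" */
    marker = 'Above Person';
    p = index(text, marker);      /* start of the opening marker, 0 if absent */
    q = index(text, '(I.R.S');    /* start of the closing marker, 0 if absent */

    /* Extract only when both markers exist and appear in the expected order */
    if p > 0 and q > p + length(marker) then
        name = strip(substr(text, p + length(marker), q - p - length(marker)));

    put name=;
run;
```

The guard on p and q matters because substr with a zero or negative length writes a note to the log and returns an unusable value; rows without both markers simply keep a missing name.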

Respected Advisor
Posts: 3,894

Re: Isolating text in a variable field that contains both characters and numeric values?

Yep - that's why I'm asking for more sample data so that we can get an idea if there is a pattern at all which allows us to identify the wanted sub-string.

New Contributor
Posts: 2

Re: Isolating text in a variable field that contains both characters and numeric values?

Thank you for your help with this request! Some additional examples are below. I've underlined the part I need. It seems like the text I need either follows "(entities only)" or is sandwiched between "name of reporting persons." and "I.R.S. Identification."

1 NAMES OF REPORTING PERSONS CLARUS CAPITAL GROUP MANAGEMENT LPI.R.S. IDENTIFICATION NO. OF ABOVE PERSON (ENTITIES ONLY)20-8098367

SCHEDULE 13D CUSIP No. 068306109 1) NAMES OF REPORTING PERSONS I.R.S. IDENTIFICATION NOS. OF ABOVE PERSONS (ENTITIES ONLY) Bernard C. Sherman 2) CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP (SEE INSTRUCTIONS)

1. Names of Reporting Persons. I.R.S. Identification Nos. of above persons (entities only) Textron Inc.

1 NAMES OF REPORTING PERSONS Lonnie J. Stout II


1. Names of Reporting Persons. P STYLEmargin-top0pxmargin-bottom0pxI.R.S. Identification Nos. of above persons (entities only) P STYLEmargin-top0pxmargin-bottom1pxWal-Mart Stores, Inc.

Super User
Posts: 9,681

Re: Isolating text in a variable field that contains both characters and numeric values?

data have;
  length a $ 200;
  a='1 NAMES OF REPORTING PERSONS CLARUS CAPITAL GROUP MANAGEMENT LPI.R.S. IDENTIFICATION NO. OF ABOVE PERSON (ENTITIES ONLY)20-8098367 ';output;
  a='SCHEDULE 13D CUSIP No. 068306109 1) NAMES OF REPORTING PERSONS I.R.S. IDENTIFICATION NOS. OF ABOVE PERSONS (ENTITIES ONLY) Bernard C. Sherman 2) CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP (SEE INSTRUCTIONS)';output;
run;

data want;
  set have;
  /* First try: the text between "NAMES OF REPORTING PERSONS" and "I.R.S." */
  re = prxparse('/NAMES OF REPORTING PERSONS(.+)I\.R\.S\./io');
  if prxmatch(re, a) then first = prxposn(re, 1, a);
  /* Fallback: the non-digit text that follows "ENTITIES ONLY)" */
  if missing(first) then do;
    re = prxparse('/ENTITIES ONLY\)(\D+)/io');
    if prxmatch(re, a) then first = prxposn(re, 1, a);
  end;
run;

Xia Keshan
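
The same fallback idea might be extended to cover more of the posted samples. The sketch below is an assumption-laden variation, not the poster's method: it adds a third pattern for rows that have no I.R.S. text at all, and it uses anyalpha to reject a captured value that contains no letters (for example the ". " that sits between "Persons" and "I.R.S." in the Textron row). The names re_between, re_after, re_tail and name are hypothetical, and the sample with the "P STYLE" remnants is not handled.

```sas
data have;
    length a $ 300;
    a = '1 NAMES OF REPORTING PERSONS CLARUS CAPITAL GROUP MANAGEMENT LPI.R.S. IDENTIFICATION NO. OF ABOVE PERSON (ENTITIES ONLY)20-8098367'; output;
    a = '1. Names of Reporting Persons. I.R.S. Identification Nos. of above persons (entities only) Textron Inc.'; output;
    a = '1 NAMES OF REPORTING PERSONS Lonnie J. Stout II'; output;
run;

data want;
    set have;
    length name $ 200;
    retain re_between re_after re_tail;

    /* Compile each pattern once, on the first observation */
    if _n_ = 1 then do;
        re_between = prxparse('/NAMES OF REPORTING PERSONS(.+?)I\.R\.S\./i');
        re_after   = prxparse('/ENTITIES ONLY\)(\D+)/i');
        re_tail    = prxparse('/NAMES OF REPORTING PERSONS\W*(.+)/i');
    end;

    /* 1. Text between the heading and "I.R.S." */
    if prxmatch(re_between, a) then name = prxposn(re_between, 1, a);

    /* 2. Non-digit text after "(entities only)", if step 1 found no letters */
    if not anyalpha(name) and prxmatch(re_after, a) then
        name = prxposn(re_after, 1, a);

    /* 3. Everything after the heading, if steps 1 and 2 found no letters */
    if not anyalpha(name) and prxmatch(re_tail, a) then
        name = prxposn(re_tail, 1, a);

    /* Discard captures that are only punctuation, then remove leading blanks */
    if not anyalpha(name) then name = ' ';
    name = strip(name);

    drop re_between re_after re_tail;
run;
```

On these three rows the step should return the Clarus, Textron and Lonnie J. Stout II names respectively. The order of the checks is a design choice: the most specific pattern runs first and the catch-all runs last, so a looser pattern never overrides a better match.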
