DATA Step, Macro, Functions and more

Extract information between strings

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

Extract information between strings

We have a very old system that tracks notes entered by several different systems. These notes are in one way or another surrounded by **'s. All of the notes are in a single column and string.

 

EXAMPLE:

|  ID_NUMBER  |  NOTE 

|   12345678      |  **system message**    **system update**     **mm/dd/yy - blablablablabla**

|   12345679      |  **system message**     **mm/dd/yy - blablabla **   **mm/dd/yy - wahwahwahwah**

 

Each of these NOTES have a UNIQUE 8 digit number.

 

I am trying to find a DATA STEP that will create a new table with EACH of the notes seperated into their own row.

 

something like:

 

|  ID_NUMBER   |   NOTE

|   12345678       |  system message

|   12345678       |  system update

|   12345678       |  mm/dd/yyyy - blablablablabla

|   12345679       |  system message

|   12345679       | mm/dd/yyyy - blablablabla

|   12345679       |  mm/dd/yyyy -  wahwahwahwah

 

 

Thanks,

 


Accepted Solutions
Solution
‎06-16-2017 04:04 PM
PROC Star
Posts: 7,363

Re: Extract information between strings

Without seeing an example of the actual file we can only guess at possible solutions.

 

If your starting file looks like the one below, then the code I suggest below might work:

libname library '/folders/myfolders';
data library.sg_notes;
  informat notes $100.;
  input case_no notes &;
  notes=dequote(notes);
  cards;
12345678    "**system message** **system update** **mm/dd/yy - blablablablabla**"
12345679    "**system message** **mm/dd/yy - blablabla ** **mm/dd/yy - wahwahwahwah**"
;

data LIBRARY.SG_NOTES_TEST (KEEP=CASE_NO NOTE);
  SET LIBRARY.SG_NOTES;
  NOTES=transtrn(NOTES,'**','"');
  do i=0 to 7;
    note=scan(notes,(1+(i*2)),'"');
    if not missing(note) then output;
  end;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
PROC Star
Posts: 7,363

Re: Extract information between strings

Here is one way:

data want (keep=id_number note);
  infile cards firstobs=2;
  informat note note1-note3 $40.;
  array notes(*) $ note1-note3;
  input @;
  _infile_=compress(transtrn(_infile_,'**','"'),'|');
  input id_number (note1-note3) (&);
  do i=1 to 3;
    note=dequote(notes(i));
    output;
  end;
  cards;
|  ID_NUMBER  |  NOTE 
|   12345678      |  **system message**    **system update**     **mm/dd/yy - blablablablabla**
|   12345679      |  **system message**     **mm/dd/yy - blablabla **   **mm/dd/yy - wahwahwahwah**
;

Art, CEO, AnalystFinder.com

 

SAS Employee
Posts: 7

Re: Extract information between strings

[ Edited ]

Here is a next way

data have;

ID_NUMBER= 12345678;

NOTE ='**system message** **system update** **mm/dd/yy - blablablablabla**';

output;

ID_NUMBER= 12345679;

NOTE ='**system message** **mm/dd/yy - blablabla ** **mm/dd/yy - wahwahwahwah';

output;

run;

 

data want;

set have;

 

ExpressionID = prxparse('/[A-Z0-9._%+\/]+(\s+|\s+\-\s+)[A-Z0-9._%+\/]+/i');

start = 1;

stop = length(NOTE);

/* Use PRXNEXT to find the first instance of the pattern, */

/* then use DO WHILE to find all further instances. */

/* PRXNEXT changes the start parameter so that searching */

/* begins again after the last match. */

call prxnext(ExpressionID, start, stop, NOTE, position, length);

do while (position > 0);

found = substr(NOTE, position, length);

put found= position= length=;

call prxnext(ExpressionID, start, stop, NOTE, position, length);

output;

end;

drop ExpressionID start stop position length;

run;

Super User
Posts: 5,083

Re: Extract information between strings

If we can assume that this version of the HAVE data set exists, here's another way to break out the NOTE values:

 

data want;

set have;

all_notes = note;

drop all_notes;

do _n_=1 to 99 by 2 until (note=' ');

   note = scan(all_notes, _n_, '*');

   if note > ' ' then output;

end;

run;

 

It does run into trouble if one of the notes actually contains an asterisk.

New Contributor
Posts: 4

Re: Extract information between strings

[ Edited ]

@art297 Sorry, I'm still learning SAS data steps.

This solution appears to do exactly what I need. Instead of using the "cards;" as my input, how would I go about calling my DATA that I have in a SAS Library to this solution? I attempted to do a SET library.table; and replace the relevant field names, but I get an error in regards to INFILE.

PROC Star
Posts: 7,363

Re: Extract information between strings

You have to declare the location where the SAS dataset resides. You can do that with a libname statement.

 

Can't answer your other questions until you let us know what that file looks like. i.e., how many variables does the file have and what are the variable names?

 

Art, CEO, AnalystFinder.com

New Contributor
Posts: 4

Re: Extract information between strings

[ Edited ]

Sure, it contains a column with a unique Case Number, and each call number has a notes column which contains system notes surrounded by two asterisks on each side. Each call number is unique, and each Note string can contain as many as 8 different notes within it.

 

There are around 500000 rows, but only about 10,000 new records will need to be processed at a time after the initial run.

 

So 1 table, two columns (case_number, note).

Case_number is a unique number, and a Note is several hundred characters with two astericks on either side of a string containing a individual note. There could be as many as 8 notes in a row, so I expanded your note1-note3 logic to 8. Here's what I have right now, but it is obviously throwing errors.

 

data LIBRARY.SG_NOTES_TEST (KEEP=CASE_NO NOTE);
SET LIBRARY.SG_NOTES;
  informat NOTE note1-note8 $225.;
  array notes(*) $ note1-note8;
  NOTE=compress(transtrn(NOTE,'**','"'));
  input CASE_NO (note1-note8) (&);
  do i=1 to 8;
    NOTE=dequote(notes(i));
    output;
  end;

Thanks again!

 

Solution
‎06-16-2017 04:04 PM
PROC Star
Posts: 7,363

Re: Extract information between strings

Without seeing an example of the actual file we can only guess at possible solutions.

 

If your starting file looks like the one below, then the code I suggest below might work:

libname library '/folders/myfolders';
data library.sg_notes;
  informat notes $100.;
  input case_no notes &;
  notes=dequote(notes);
  cards;
12345678    "**system message** **system update** **mm/dd/yy - blablablablabla**"
12345679    "**system message** **mm/dd/yy - blablabla ** **mm/dd/yy - wahwahwahwah**"
;

data LIBRARY.SG_NOTES_TEST (KEEP=CASE_NO NOTE);
  SET LIBRARY.SG_NOTES;
  NOTES=transtrn(NOTES,'**','"');
  do i=0 to 7;
    note=scan(notes,(1+(i*2)),'"');
    if not missing(note) then output;
  end;
run;

Art, CEO, AnalystFinder.com

 

New Contributor
Posts: 4

Re: Extract information between strings

Great, this appears to do exactly what I want, and VERY fast! thank you!
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 183 views
  • 8 likes
  • 4 in conversation