Solved: Re: Extract information between strings

SGrisham · Posted 06-16-2017 11:41 AM

We have a very old system that tracks notes entered by several different systems. These notes are in one way or another surrounded by **'s. All of the notes are in a single column and string.

EXAMPLE:

| ID_NUMBER | NOTE

| 12345678 | **system message** **system update** **mm/dd/yy - blablablablabla**

| 12345679 | **system message** **mm/dd/yy - blablabla ** **mm/dd/yy - wahwahwahwah**

Each of these NOTES have a UNIQUE 8 digit number.

I am trying to find a DATA STEP that will create a new table with EACH of the notes seperated into their own row.

something like:

| ID_NUMBER | NOTE

| 12345678 | system message

| 12345678 | system update

| 12345678 | mm/dd/yyyy - blablablablabla

| 12345679 | system message

| 12345679 | mm/dd/yyyy - blablablabla

| 12345679 | mm/dd/yyyy - wahwahwahwah

Thanks,

art297 · Posted 06-16-2017 03:29 PM

Without seeing an example of the actual file we can only guess at possible solutions.

If your starting file looks like the one below, then the code I suggest below might work:

libname library '/folders/myfolders';
data library.sg_notes;
  informat notes $100.;
  input case_no notes &;
  notes=dequote(notes);
  cards;
12345678    "**system message** **system update** **mm/dd/yy - blablablablabla**"
12345679    "**system message** **mm/dd/yy - blablabla ** **mm/dd/yy - wahwahwahwah**"
;

data LIBRARY.SG_NOTES_TEST (KEEP=CASE_NO NOTE);
  SET LIBRARY.SG_NOTES;
  NOTES=transtrn(NOTES,'**','"');
  do i=0 to 7;
    note=scan(notes,(1+(i*2)),'"');
    if not missing(note) then output;
  end;
run;

Art, CEO, AnalystFinder.com

View solution in original post

art297 · Posted 06-16-2017 12:12 PM

Here is one way:

data want (keep=id_number note);
  infile cards firstobs=2;
  informat note note1-note3 $40.;
  array notes(*) $ note1-note3;
  input @;
  _infile_=compress(transtrn(_infile_,'**','"'),'|');
  input id_number (note1-note3) (&);
  do i=1 to 3;
    note=dequote(notes(i));
    output;
  end;
  cards;
|  ID_NUMBER  |  NOTE 
|   12345678      |  **system message**    **system update**     **mm/dd/yy - blablablablabla**
|   12345679      |  **system message**     **mm/dd/yy - blablabla **   **mm/dd/yy - wahwahwahwah**
;

Art, CEO, AnalystFinder.com

WoolieKyat · Posted 06-16-2017 12:37 PM

Here is a next way

data have;

ID_NUMBER= 12345678;

NOTE ='**system message** **system update** **mm/dd/yy - blablablablabla**';

output;

ID_NUMBER= 12345679;

NOTE ='**system message** **mm/dd/yy - blablabla ** **mm/dd/yy - wahwahwahwah';

output;

run;

data want;

set have;

ExpressionID = prxparse('/[A-Z0-9._%+\/]+(\s+|\s+\-\s+)[A-Z0-9._%+\/]+/i');

start = 1;

stop = length(NOTE);

/* Use PRXNEXT to find the first instance of the pattern, */

/* then use DO WHILE to find all further instances. */

/* PRXNEXT changes the start parameter so that searching */

/* begins again after the last match. */

call prxnext(ExpressionID, start, stop, NOTE, position, length);

do while (position > 0);

found = substr(NOTE, position, length);

put found= position= length=;

call prxnext(ExpressionID, start, stop, NOTE, position, length);

output;

end;

drop ExpressionID start stop position length;

run;

Astounding · Posted 06-16-2017 01:22 PM

If we can assume that this version of the HAVE data set exists, here's another way to break out the NOTE values:

data want;

set have;

all_notes = note;

drop all_notes;

do _n_=1 to 99 by 2 until (note=' ');

note = scan(all_notes, _n_, '*');

if note > ' ' then output;

end;

run;

It does run into trouble if one of the notes actually contains an asterisk.

SGrisham · Posted 06-16-2017 02:25 PM

@art297 Sorry, I'm still learning SAS data steps.

This solution appears to do exactly what I need. Instead of using the "cards;" as my input, how would I go about calling my DATA that I have in a SAS Library to this solution? I attempted to do a SET library.table; and replace the relevant field names, but I get an error in regards to INFILE.

art297 · Posted 06-16-2017 02:38 PM

You have to declare the location where the SAS dataset resides. You can do that with a libname statement.

Can't answer your other questions until you let us know what that file looks like. i.e., how many variables does the file have and what are the variable names?

Art, CEO, AnalystFinder.com

SGrisham · Posted 06-16-2017 02:51 PM

Sure, it contains a column with a unique Case Number, and each call number has a notes column which contains system notes surrounded by two asterisks on each side. Each call number is unique, and each Note string can contain as many as 8 different notes within it.

There are around 500000 rows, but only about 10,000 new records will need to be processed at a time after the initial run.

So 1 table, two columns (case_number, note).

Case_number is a unique number, and a Note is several hundred characters with two astericks on either side of a string containing a individual note. There could be as many as 8 notes in a row, so I expanded your note1-note3 logic to 8. Here's what I have right now, but it is obviously throwing errors.

data LIBRARY.SG_NOTES_TEST (KEEP=CASE_NO NOTE);
SET LIBRARY.SG_NOTES;
  informat NOTE note1-note8 $225.;
  array notes(*) $ note1-note8;
  NOTE=compress(transtrn(NOTE,'**','"'));
  input CASE_NO (note1-note8) (&);
  do i=1 to 8;
    NOTE=dequote(notes(i));
    output;
  end;

Thanks again!

art297 · Posted 06-16-2017 03:29 PM

Without seeing an example of the actual file we can only guess at possible solutions.

If your starting file looks like the one below, then the code I suggest below might work:

libname library '/folders/myfolders';
data library.sg_notes;
  informat notes $100.;
  input case_no notes &;
  notes=dequote(notes);
  cards;
12345678    "**system message** **system update** **mm/dd/yy - blablablablabla**"
12345679    "**system message** **mm/dd/yy - blablabla ** **mm/dd/yy - wahwahwahwah**"
;

data LIBRARY.SG_NOTES_TEST (KEEP=CASE_NO NOTE);
  SET LIBRARY.SG_NOTES;
  NOTES=transtrn(NOTES,'**','"');
  do i=0 to 7;
    note=scan(notes,(1+(i*2)),'"');
    if not missing(note) then output;
  end;
run;

Art, CEO, AnalystFinder.com

SGrisham · Posted 06-16-2017 04:05 PM

Great, this appears to do exactly what I want, and VERY fast! thank you!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away