Solved: Re: Extract words between strings/characters

monali · Posted 08-06-2021 06:11 PM

I am trying to get the given name and family name from the following string:

[{"name": [{"given": ["TESTG "], "family": "TESTF"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]

The following code works for the above string:

First_Name = compress(strip(scan(authors,5,':[]')),'"');
Last_Name = compress(strip(scan(authors,7,':[]}')),'"');

However, the characters are not consistent, so in some cases I have:

[{"name": [{"given": ["TESTG1 "], "given": ["TESTG2 "], "family": "TESTF1", "family": "TESTF2"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]

Is there any way I can extract the first and last names for the first person and then the first and last names for the second person?

Thanks

ChrisNZ · Posted 08-07-2021 02:30 AM

My take:

data HAVE;
A='[{"name": [{"given": ["TESTG1 "], "given": ["TESTG2 "], "family": "TESTF1", "family": "TESTF2"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]';
output;
A='[{"name": [{"given": ["TESTG "], "family": "TESTF"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]';
output;
run;

data WANT;
  set HAVE;  
  B=compress(A,'"{}[] ');
  POS1=1;
  POS2=1;
  do until(POS1=0);
    POS1=find(B,'given:',POS1);
    if POS1 then do;
      FIRST=scan(substr(B,POS1),2,':,'); 
      POS2 =find(B,'family:',POS2);
      LAST =scan(substr(B,POS2),2,':,'); 
      output;
      POS1+1;
      POS2+1;
    end;
  end;
  keep FIRST LAST;
run;

1

TESTG1

TESTF1

2

TESTG2

TESTF2

3

TESTG

TESTF

High-Performance SAS Coding - Third Edition

View solution in original post

Reeza · Posted 08-06-2021 06:27 PM

Have you tried parsing your file with the XML or JSON libraries instead of manually?

monali · Posted 08-06-2021 06:56 PM

Nope, this is already parsed and this is how I see it in my table.

jimbarbour · Posted 08-06-2021 08:03 PM

I would key off of "given" and "family". See code, below. In the second type, there are two given names and two family names. I'm just giving you the first, but you could modify the code if you need both.

I went step by step. It's late in the day here, and I'm a bit tired. Brain not working well. You could nest the functions and make a much more compact program. No doubt @ChrisNZ or someone who knows regular expressions far better than I do will come along and do it all in 2 lines or less. 🙂

%LET	Author1	=	%BQUOTE([{"name": [{"given": ["TESTG "], "family": "TESTF"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]);
%LET	Author2	=	%BQUOTE([{"name": [{"given": ["TESTG1 "], "given": ["TESTG2 "], "family": "TESTF1", "family": "TESTF2"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]);

DATA	Want;
	DROP	_:;

	_Authors="&Author1";
	_Temp		=	COMPRESS(_Authors,'[]{} :,');
	_Start		=	INDEX(LOWCASE(STRIP(_Temp)), 'given');
	_Start		=	_Start	+	7;
	_Temp2		=	SUBSTR(_Temp, _Start);
	_Length		=	INDEX(_Temp2, '"') - 1;
	First_Name	=	SUBSTR(STRIP(_Temp), _Start, _Length);
	_Start		=	INDEX(LOWCASE(STRIP(_Temp)), 'family');
	_Start		=	_Start	+	8;
	_Temp2		=	SUBSTR(_Temp, _Start);
	_Length		=	INDEX(_Temp2, '"') - 1;
	Last_Name	=	SUBSTR(STRIP(_Temp), _Start, _Length);
	PUTLOG	"NOTE:  1. "  First_Name=  Last_Name=;
	OUTPUT;

	_Authors="&Author2";
	_Temp		=	COMPRESS(_Authors,'[]{} :,');
	_Start		=	INDEX(LOWCASE(STRIP(_Temp)), 'given');
	_Start		=	_Start	+	7;
	_Temp2		=	SUBSTR(_Temp, _Start);
	_Length		=	INDEX(_Temp2, '"') - 1;
	First_Name	=	SUBSTR(STRIP(_Temp), _Start, _Length);
	_Start		=	INDEX(LOWCASE(STRIP(_Temp)), 'family');
	_Start		=	_Start	+	8;
	_Temp2		=	SUBSTR(_Temp, _Start);
	_Length		=	INDEX(_Temp2, '"') - 1;
	Last_Name	=	SUBSTR(STRIP(_Temp), _Start, _Length);
	PUTLOG	"NOTE:  2. "  First_Name=  Last_Name=;
	PUTLOG	"NOTE-  ";
	OUTPUT;
RUN;

Results:

Jim

ChrisNZ · Posted 08-07-2021 02:30 AM

My take:

data HAVE;
A='[{"name": [{"given": ["TESTG1 "], "given": ["TESTG2 "], "family": "TESTF1", "family": "TESTF2"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]';
output;
A='[{"name": [{"given": ["TESTG "], "family": "TESTF"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]';
output;
run;

data WANT;
  set HAVE;  
  B=compress(A,'"{}[] ');
  POS1=1;
  POS2=1;
  do until(POS1=0);
    POS1=find(B,'given:',POS1);
    if POS1 then do;
      FIRST=scan(substr(B,POS1),2,':,'); 
      POS2 =find(B,'family:',POS2);
      LAST =scan(substr(B,POS2),2,':,'); 
      output;
      POS1+1;
      POS2+1;
    end;
  end;
  keep FIRST LAST;
run;

1

TESTG1

TESTF1

2

TESTG2

TESTF2

3

TESTG

TESTF

High-Performance SAS Coding - Third Edition

monali · Posted 08-09-2021 09:28 AM

thank you! this worked perfectly!

ChrisNZ · Posted 08-08-2021 06:06 AM

Another way I remember reading about, that would be appropriate for parsing this string: Using the @'string' syntax of the input statement. It can probably be coded better.

data T;
   infile cards truncover;
   input @;
   do until(LASTOBS);
     set SASHELP.CLASS end=LASTOBS;
     _INFILE_=catt('age:', AGE, ',name:', upcase(NAME));
     input @1 @'name:' NM $ @;
     output;
   end;
   keep NAME NM;
cards;
dummy
run;

The input statement is doing all the positioning and extracting work here.

High-Performance SAS Coding - Third Edition

jimbarbour · Posted 08-09-2021 06:31 PM

@ChrisNZ,

Interesting technique using a "dummy" datalines (cards) to generate a test file. Kind of handy. I had not seen that before.

Also an interesting technique using the INPUT with @'literal'. I didn't even think of that although I have (rarely) seen it used before.

I noticed that you used a comma as a separator. I believe that if there were additional data beyond the name that you would pull in that data. For example:

I modified it to a space, and it worked as coded:

data T;
   infile cards truncover;
   input @;
   do until(LASTOBS);
     set SASHELP.CLASS end=LASTOBS;
	 LENGTH	NM	$32;
     _INFILE_=catt('age:', AGE, ' name:', upcase(NAME), ' sex:', UPCASE(SEX));
	 PUTLOG	"NOTE:  "  _INFILE_;
     input @1 @'name:' NM $ @;
     output;
   end;
   keep NAME NM;
   LENGTH	NM	$32;
cards;
dummy
run;

Alternatively, one could specify a comma delimiter.

INFILE	Cards	DSD	DLM=',';

Yes, sorry. Fussing over small details, I know, but I like to know how things work. I was curious about the comma and wondered if I understood the syntax of

input @1 @'name:' NM $ @;

so I just had to play with it. 🙂

Jim

monali · Posted 08-09-2021 09:29 AM

This solution works too!! thank you so much.

Ksharp · Posted 08-07-2021 07:09 AM

data HAVE;
A='[{"name": [{"given": ["TESTG1 "], "given": ["TESTG2 "], "family": "TESTF1", "family": "TESTF2"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]';
output;
A='[{"name": [{"given": ["TESTG "], "family": "TESTF"}], "identifier": [{"value": "TESTNO", "system": "TESTSTRING"}]}]';
output;
run;
data want;
 set have;
 id+1;
 do i=1 to countw(a,'{},');
  temp=scan(a,i,'{},');
  vname=dequote(strip(scan(strip(temp),1,':[]')));
  value=dequote(strip(scan(strip(temp),-1,':[]')));
  if vname in ('given' 'family') then output;
 end;
 keep id vname value;
run;

Registration is open

SAS Training: Just a Click Away