split variable which include () or '' to different variables

Alexxxxxxx · Posted 03-08-2019 08:02 PM

Dear all,

How can I split variable to different variables based on () and ''?

For example, for table a, I expect to get table b.

table a

me (DJI) Kim

juice 'apple' org

apple 'jan' five (ooo) KKK

apple 'jan' (ooo) KKK

I expect to get

table b

name, name_amonge_brackets, name_amonge_single_quotation_marks

me Kim, DJI

juice org,,apple

apple five KKK,ooo,jan

data a;
  infile datalines dlm=',' truncover;
informat HRM_L2_Step6 $1000. ;
input HRM_L2_Step6 ;
datalines;
UNVL.CERAMIC MATL.
CATHAY INTL.HDG. PREV GOLDEN HDG.
PILLAR PR.
IRISH CONT.GP.UTS
ASHANTI GOLDFIELD (ISE)
ASUSTEK CMP. GDR (XSQ)REG S
ALCATEL 'A' (XSQ)
ALL NIPPON AIRWAYS (XSQ)
ALPS ELECTRIC (XSQ)
AMER 'A' (XSQ)
ATLAS COPCO 'B' (XYZ)
AU.MIKRO SYSTEME (XSQ)
AUTOLIV (XSQ)
AVESTA SHEFFIELD (XSQ)
BNP (XSQ)
BNP CERTS (XSQ)
BANCA DI ROMA (XSQ)
BANCAIRE CIE (XSQ)
BNC.BILBAO ADS (XSQ)
BARLOW (XSQ)
BCI (XSQ)
BENETTON (XSQ)
BERGESEN 'B'N/V (XSQ)
BLYVOORUITIZICHT (XSQ)
BMW (XSQ)DUPL.SEE 740095
BOUYGUES (XSQ)
BUFETE ADS (XSQ)
CS HDG.BR (XSQ)
CSM (XSQ)
CANAL + (XSQ)
;
run;

Could you please give me some suggestions about this? thanks in advance.

Reeza · Posted 03-08-2019 11:01 PM

SCAN() will get you the word in the middle.

TRANWRD() will remove that word once you've found it.

Having demo variables as KKK probably isn't a great choice by the way.

@Alexxxxxxx wrote:

Dear all,

How can I split variable to different variables based on () and ''?

For example, for table a, I expect to get table b.

table a

me (DJI) Kim

juice 'apple' org

apple 'jan' five (ooo) KKK

apple 'jan' (ooo) KKK

I expect to get

table b

name, name_amonge_brackets, name_amonge_single_quotation_marks

me Kim, DJI

juice org,,apple

apple five KKK,ooo,jan

apple five KKK,ooo,jan
data a;
  infile datalines dlm=',' truncover;
informat HRM_L2_Step6 $1000. ;
input HRM_L2_Step6 ;
datalines;
UNVL.CERAMIC MATL.
CATHAY INTL.HDG. PREV GOLDEN HDG.
PILLAR PR.
IRISH CONT.GP.UTS
ASHANTI GOLDFIELD (ISE)
ASUSTEK CMP. GDR (XSQ)REG S
ALCATEL 'A' (XSQ)
ALL NIPPON AIRWAYS (XSQ)
ALPS ELECTRIC (XSQ)
AMER 'A' (XSQ)
ATLAS COPCO 'B' (XYZ)
AU.MIKRO SYSTEME (XSQ)
AUTOLIV (XSQ)
AVESTA SHEFFIELD (XSQ)
BNP (XSQ)
BNP CERTS (XSQ)
BANCA DI ROMA (XSQ)
BANCAIRE CIE (XSQ)
BNC.BILBAO ADS (XSQ)
BARLOW (XSQ)
BCI (XSQ)
BENETTON (XSQ)
BERGESEN 'B'N/V (XSQ)
BLYVOORUITIZICHT (XSQ)
BMW (XSQ)DUPL.SEE 740095
BOUYGUES (XSQ)
BUFETE ADS (XSQ)
CS HDG.BR (XSQ)
CSM (XSQ)
CANAL + (XSQ)
;
run;
Could you please give me some suggestions about this? thanks in advance.

Patrick · Posted 03-09-2019 12:18 AM

Below code works for your sample data. It might or might not work for your real data (i.e. if you've got an uneven number of single quotes in a string).

data have;
  infile datalines dlm=',' truncover;
  informat HRM_L2_Step6 $1000.;
  input HRM_L2_Step6;
  datalines;
UNVL.CERAMIC MATL.
CATHAY INTL.HDG. PREV GOLDEN HDG.
PILLAR PR.
IRISH CONT.GP.UTS
ASHANTI GOLDFIELD (ISE)
ASUSTEK CMP. GDR (XSQ)REG S
ALCATEL 'A' (XSQ)
ALL NIPPON 'b' AIRWAYS (XSQ) 'A' (XYZ)  
ALPS ELECTRIC (XSQ)
AMER 'A' (XSQ)
ATLAS COPCO 'B' (XYZ)
AU.MIKRO SYSTEME (XSQ)
AUTOLIV (XSQ)
AVESTA SHEFFIELD (XSQ)
BNP (XSQ)
BNP CERTS (XSQ)
BANCA DI ROMA (XSQ)
BANCAIRE CIE (XSQ)
BNC.BILBAO ADS (XSQ)
BARLOW (XSQ)
BCI (XSQ)
BENETTON (XSQ)
BERGESEN 'B'N/V (XSQ)
BLYVOORUITIZICHT (XSQ)
BMW (XSQ)DUPL.SEE 740095
BOUYGUES (XSQ)
BUFETE ADS (XSQ)
CS HDG.BR (XSQ)
CSM (XSQ)
CANAL + (XSQ)
;
run;


data want(drop=_:);

  set have;
  HRM_L2_Step6_orig=HRM_L2_Step6;
  length str_in_brackets str_in_quotes $1000;

  if _n_=1 then
    do;
      _rx_bracket_id=prxparse('/\([^\)]*\)/oi');
      _rx_quote_id=prxparse("/'[^']*'/oi");
      retain _rx_bracket_id _rx_quote_id;
    end;

  do while(1);
    call prxsubstr(_rx_bracket_id, left(hrm_l2_step6), _pos, _len);
    if _len=0 then leave;
    str_in_brackets=catx(' ',str_in_brackets,substrn(HRM_L2_Step6,_pos+1,_len-2));
    HRM_L2_Step6=substrn(HRM_L2_Step6,1,_pos-1)||substrn(HRM_L2_Step6,_pos+_len);
  end;

  do while(1);
    call prxsubstr(_rx_quote_id, left(hrm_l2_step6), _pos, _len);
    if _len=0 then leave;
    str_in_quotes=catx(' ',str_in_quotes,substrn(HRM_L2_Step6,_pos+1,_len-2));
    HRM_L2_Step6=substrn(HRM_L2_Step6,1,_pos-1)||substrn(HRM_L2_Step6,_pos+_len);
  end;

run;

If the code is not sufficient for your real data then try to amend the Regular Expressions to fit what you're dealing with.

PGStats · Posted 03-09-2019 12:52 AM

You can go at it quite systematically with pattern matching functions:

data b;
if not patQ then patQ + prxParse("/'.*'/o");
if not patB then patB + prxParse("/\(.*\)/o");
set a;
length name nameQ nameB $1000;
name = HRM_L2_Step6;
start = 1;
call prxNext(patQ, start, -1, name, pos, len);
if len >= 2 then do;
    nameQ = substrn(name, pos+1, len-2);
    name = catx(" ", substrn(name, 1, pos-1), substr(name, pos+len));
    end;
start = 1;
call prxNext(patB, start, -1, name, pos, len);
if len >= 2 then do;
    nameB = substrn(name, pos+1, len-2);
    name = catx(" ", substrn(name, 1, pos-1), substr(name, pos+len));
    end;
drop patQ patB start pos len;
run;

PG

split variable which include () or '' to different variables

Re: split variable which include () or '' to different variables

Re: split variable which include () or '' to different variables

Re: split variable which include () or '' to different variables

split variable which include () or '' to different variables

Re: split variable which include () or '' to different variables

Re: split variable which include () or '' to different variables

Re: split variable which include () or '' to different variables

Click image to register for webinar

Classroom Training Available!