Solved: capture data within brackets into respective column along with alignme...

Abraham · Posted 02-22-2019 11:45 AM

Hello Everyone!

In the attached excel file, I need to split the sentence into line based on semicolon(;) and then capture every word present in square brackets as separate column.

My input data follow two formats Ifirst one with six enclosed parenthesis and then comment column, second format with two parenthesis and one comment col) which need to be inserted into seven columns to be named like below

It means all keyword that end with 'info' must align in column number 3 like below

[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji';

[Manufacture info] [Goods It] - Iron sold in very less price

Output

EID	Prod_info	Instruct	Tab_info	product	issue	Rating	Comment
101	Biscuit	Sweet	Product info	Marigold	Quality	Good	It is made from 'Maida' to 'Suji'
102			Manufacture info	Goods It			Iron sold in very less price

Thanks in advance

Patrick · Posted 02-27-2019 02:10 AM

@Abraham

Below code works in my SAS Windows environment:

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);
  infile "c:\temp\specialchar.csv" truncover dlm=',;' firstobs=2 termstr=crlf lrecl=1000;
  input Empid:$10. @;
  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    /* remove enclosing double quotes if any */
    _string=prxchange('s/^"(.*)"$/\1/',1,strip(_string));

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;

run;

There are three things which might have caused issues for you:

1. The double quotes around the source string. That's something I didn't code for. I've now added logic for this (the PRXCHANGE() bit)

"[HR] [QUERY] [Dept info] [ISSUE] [Quality] [NO IDEA] - Decide your ""salary"";

2. Your data starts on the 2nd row only. I've added FIRSTOBS=2 to the infile statement to deal with this.

3. Not sure if this just happened when transferring your .csv to my environment or if this is in your actual data. What I've got is CRLF as line delimiter.

The path you're using indicates that your OS is Unix/Linux. There the default line delimiter is LF only. I've added TERMSTR=crlf to the infile statement to deal with this. Remove it in case your actual file only uses LF.

View solution in original post

Reeza · Posted 02-22-2019 12:05 PM

COUNTC() to count characters (such as semicolons)

SCAN() to separate components

I would recommend two do loops, one to partition out each line and the second to partition out each item.

How do you know where an item maps to what columns? For example with your second ID, and the value "Manufacture Info", it was put under Tab Info column? How did you know to do that? How can we tell the computer to do that?

@Abraham wrote:

Hello Everyone!

In the attached excel file, I need to split the sentence into line based on semicolon(;) and then capture every word present in square brackets as separate column.

My input data follow two formats Ifirst one with six enclosed parenthesis and then comment column, second format with two parenthesis and one comment col) which need to be inserted into seven columns to be named like below

It means all keyword that end with 'info' must align in column number 3 like below

[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji';

[Manufacture info] [Goods It] - Iron sold in very less price

Output

EID Prod_info Instruct Tab_info product issue Rating Comment

101 Biscuit Sweet Product info Marigold Quality Good It is made from 'Maida' to 'Suji'

102 Manufacture info Goods It Iron sold in very less price

Thanks in advance

Abraham · Posted 02-22-2019 07:19 PM

Thanks Reeza for the guidance.

As the first format contain information of two columns ([Biscuit] [Sweet]), I want all word end with 'info' to be the third column renamed with 'tab_info' . It is because the second format always start with 'info' text [Manufacture info] which need to be inserted into tab_info column.

The last column (comment) contain the line starting from hyphen till end . Please check the input file for more example. I am stuck and not able to proceed further

Patrick · Posted 02-22-2019 08:54 PM

@Abraham

Below code will only work if you've always got a bracket with an "info" word in it AND this bracket can always get mapped to output column Tab_Info. If that's not the case then the code below will likely generate array out of range errors and the like.

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);

  infile datalines truncover dlm='|;';
  input Empid:$10. @;

  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
  end;

  input;

  datalines4;
101|[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; 
102|[Manufacture info] [Goods It] - Iron sold in very less price
103|[Soap] [Tless] [ABX Info] [Lux] [SVD] [Moisturiser] - Smoothening skin
104|[ITEM] [Content] [Goods info] [Sugar] [chs] [Good] - excess cause diabetes
105|[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; [ITEM1] [Content2] [Goods21 info] [Jaggery] [chs] [Medic] - Healthier;
106|[Cust info] [Employee] - 300 resigned today
107|[QUERY] [Dept info] [ISSUE] [Quality] [NO IDEA] - Decide your "salary"; 
;;;;
run;

Abraham · Posted 02-25-2019 12:06 AM

Thank you very much Partrick for your time and work on my query.

I was working on the program but did not find the expected output. If you check EID 105 , I need two records to be displayed as every ID contain two records separated by semicolon.

The program should split into number of lines based on semicolon and then store all data in a table format. Please have a look into the attachment where you can check the expected output sheet .

Thanks in advance

105	Chocolate	Sweet	Product info	Cadburry	Quality	Good	It is made from cococa seeds
105	ITEM1	Content2	Goods21 info	Jaggery	chs	Medic	Healthier

Patrick · Posted 02-25-2019 01:11 AM

@Abraham

Ooops... that was a last second change which I didn't test anymore.

All you need to do is add an OUTPUT statement at the end of the loop and things should work.

...

comments=substrn(_string,findc(_string,'-')+1);
output;
end;

...

Below the fixed code version.

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);

  infile datalines truncover dlm='|;';
  input Empid:$10. @;

  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;

  input;

  datalines4;
101|[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; 
102|[Manufacture info] [Goods It] - Iron sold in very less price
103|[Soap] [Tless] [ABX Info] [Lux] [SVD] [Moisturiser] - Smoothening skin
104|[ITEM] [Content] [Goods info] [Sugar] [chs] [Good] - excess cause diabetes
105|[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; [ITEM1] [Content2] [Goods21 info] [Jaggery] [chs] [Medic] - Healthier;
106|[Cust info] [Employee] - 300 resigned today
107|[QUERY] [Dept info] [ISSUE] [Quality] [NO IDEA] - Decide your "salary"; 
;;;;
run;

Abraham · Posted 02-26-2019 12:18 PM

Thank you Patrick once again for your great help. Your program is amazing. Its working perfectly now.

As I am new comer to SAS, if possible, can you please guide how to insert the same program if I use proc import to pull the input file directly like below. I use it like below but I am missing somewhere.

Thanks once again for your extreme support

proc import datafile="/proj/sastmp/pvdm/01_QC/specialchar.csv"
     out=test_output
     dbms=csv
     replace;
     getnames=yes;	 
	 guessingrows=32767;
run;

data testing1;
length empid $5 _string $200;
length Comments $80;
set test_output;
array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;
run;

Patrick · Posted 02-26-2019 03:42 PM

@Abraham

Don't use Proc Import here but an INFILE/INPUT statement as this gives you much more control over reading the data. In doing so almost no code changes are required. You just need to point to the external .csv and define appropriate delimiting characters - i.e. dlm=',;'

  infile "/proj/sastmp/pvdm/01_QC/specialchar.csv" truncover dlm='|;';
  input Empid:$10. @;

If your .csv contains more then two columns AND the "_string" is not the last column then please share a few lines of the actual .csv structure (as attachment) if you need help.

Abraham · Posted 02-26-2019 07:15 PM

Thanks Patrick.

My CSV file contain two columns only, attachment for your reference.

The code is not working. I change the infile path and input statement.

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);
  infile "/Saquee/final/specialchar.csv" truncover dlm='|;';
  input Empid:$10. @;
  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;

run;

he code is not working. I think there is an issue with import file.

Patrick · Posted 02-27-2019 02:10 AM

@Abraham

Below code works in my SAS Windows environment:

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);
  infile "c:\temp\specialchar.csv" truncover dlm=',;' firstobs=2 termstr=crlf lrecl=1000;
  input Empid:$10. @;
  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    /* remove enclosing double quotes if any */
    _string=prxchange('s/^"(.*)"$/\1/',1,strip(_string));

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;

run;

There are three things which might have caused issues for you:

1. The double quotes around the source string. That's something I didn't code for. I've now added logic for this (the PRXCHANGE() bit)

"[HR] [QUERY] [Dept info] [ISSUE] [Quality] [NO IDEA] - Decide your ""salary"";

2. Your data starts on the 2nd row only. I've added FIRSTOBS=2 to the infile statement to deal with this.

3. Not sure if this just happened when transferring your .csv to my environment or if this is in your actual data. What I've got is CRLF as line delimiter.

The path you're using indicates that your OS is Unix/Linux. There the default line delimiter is LF only. I've added TERMSTR=crlf to the infile statement to deal with this. Remove it in case your actual file only uses LF.

Abraham · Posted 02-27-2019 05:37 AM

Perfet Patick. Thanks once again for your time in looking into my issue.

If possible, check that I am getting one duplicate records with quote in the last row which is not required. Based on your suggestion, I will convert all double quote ("") with ("""")

Thanks once again

capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Re: capture data within brackets into respective column along with alignment

Registration is open

Registration is open

SAS Training: Just a Click Away