BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Abraham
Obsidian | Level 7

Hello Everyone!

In the attached excel file, I need to split the sentence into line based on semicolon(;) and then capture every word present in square brackets as separate column.

My input data follow two formats Ifirst one with six enclosed parenthesis and then comment column, second format with two parenthesis and one comment col) which need to be inserted into seven columns to be named like below

It means all keyword that end with 'info' must align in column number 3 like below

 

[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; 
[Manufacture info] [Goods It] - Iron sold in very less price

 

Output

EID Prod_info Instruct Tab_info product issue Rating Comment
101 Biscuit Sweet  Product info Marigold  Quality  Good  It is made from 'Maida' to 'Suji'
102     Manufacture info Goods It     Iron sold in very less price

 

Thanks in advance

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@Abraham 

Below code works in my SAS Windows environment:

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);
  infile "c:\temp\specialchar.csv" truncover dlm=',;' firstobs=2 termstr=crlf lrecl=1000;
  input Empid:$10. @;
  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    /* remove enclosing double quotes if any */
    _string=prxchange('s/^"(.*)"$/\1/',1,strip(_string));

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;

run;

There are three things which might have caused issues for you:

1. The double quotes around the source string. That's something I didn't code for. I've now added logic for this (the PRXCHANGE() bit)

"[HR] [QUERY] [Dept info] [ISSUE] [Quality] [NO IDEA] - Decide your ""salary"";

2. Your data starts on the 2nd row only. I've added FIRSTOBS=2 to the infile statement to deal with this.

3. Not sure if this just happened when transferring your .csv to my environment or if this is in your actual data. What I've got is CRLF as line delimiter.

Capture.JPG

The path you're using indicates that your OS is Unix/Linux. There the default line delimiter is LF only. I've added TERMSTR=crlf to the infile statement to deal with this. Remove it in case your actual file only uses LF.

View solution in original post

10 REPLIES 10
Reeza
Super User

COUNTC() to count characters (such as semicolons)

SCAN() to separate components

 

I would recommend two do loops, one to partition out each line and the second to partition out each item. 

 

How do you know where an item maps to what columns? For example with your second ID, and the value "Manufacture Info", it was put under Tab Info column? How did you know to do that? How can we tell the computer to do that?


@Abraham wrote:

Hello Everyone!

In the attached excel file, I need to split the sentence into line based on semicolon(;) and then capture every word present in square brackets as separate column.

My input data follow two formats Ifirst one with six enclosed parenthesis and then comment column, second format with two parenthesis and one comment col) which need to be inserted into seven columns to be named like below

It means all keyword that end with 'info' must align in column number 3 like below

 

[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; 
[Manufacture info] [Goods It] - Iron sold in very less price

 

Output

EID Prod_info Instruct Tab_info product issue Rating Comment
101 Biscuit Sweet  Product info Marigold  Quality  Good  It is made from 'Maida' to 'Suji'
102     Manufacture info Goods It     Iron sold in very less price

 

Thanks in advance


 

Abraham
Obsidian | Level 7

Thanks Reeza for the guidance.

As the first format contain information of two columns ([Biscuit] [Sweet]), I want all word end with 'info' to be the third column renamed with 'tab_info' . It is because the second format always start with 'info' text [Manufacture info] which need to be inserted into tab_info column.

 

The last column (comment) contain the line starting from hyphen till end . Please check the input file for more example. I am stuck and not able to proceed further

 

 

 

 

Patrick
Opal | Level 21

@Abraham 

Below code will only work if you've always got a bracket with an "info" word in it AND this bracket can always get mapped to output column Tab_Info. If that's not the case then the code below will likely generate array out of range errors and the like.

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);

  infile datalines truncover dlm='|;';
  input Empid:$10. @;

  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
  end;

  input;

  datalines4;
101|[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; 
102|[Manufacture info] [Goods It] - Iron sold in very less price
103|[Soap] [Tless] [ABX Info] [Lux] [SVD] [Moisturiser] - Smoothening skin
104|[ITEM] [Content] [Goods info] [Sugar] [chs] [Good] - excess cause diabetes
105|[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; [ITEM1] [Content2] [Goods21 info] [Jaggery] [chs] [Medic] - Healthier;
106|[Cust info] [Employee] - 300 resigned today
107|[QUERY] [Dept info] [ISSUE] [Quality] [NO IDEA] - Decide your "salary"; 
;;;;
run;
Abraham
Obsidian | Level 7

Thank you very much Partrick for your time and work on my query.  

I was working on the program but did not find the expected output. If you check EID 105 , I need two records to be displayed as every ID contain two records separated by semicolon.

 

The program should split into number of lines based on semicolon and then store all data in a table format. Please have a look into the attachment where you can check the expected output sheet .

 

Thanks in advance

 

105 Chocolate  Sweet  Product info Cadburry  Quality  Good  It is made from cococa seeds
105 ITEM1  Content2  Goods21 info Jaggery chs  Medic  Healthier

 

Patrick
Opal | Level 21

@Abraham 

Ooops... that was a last second change which I didn't test anymore.

All you need to do is add an OUTPUT statement at the end of the loop and things should work.

...

  comments=substrn(_string,findc(_string,'-')+1);
  output;
end;

...

 

Below the fixed code version.

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);

  infile datalines truncover dlm='|;';
  input Empid:$10. @;

  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;

  input;

  datalines4;
101|[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; 
102|[Manufacture info] [Goods It] - Iron sold in very less price
103|[Soap] [Tless] [ABX Info] [Lux] [SVD] [Moisturiser] - Smoothening skin
104|[ITEM] [Content] [Goods info] [Sugar] [chs] [Good] - excess cause diabetes
105|[Biscuit] [Sweet] [Product info] [Marigold] [Quality] [Good] - It is made from 'Maida' to 'Suji'; [ITEM1] [Content2] [Goods21 info] [Jaggery] [chs] [Medic] - Healthier;
106|[Cust info] [Employee] - 300 resigned today
107|[QUERY] [Dept info] [ISSUE] [Quality] [NO IDEA] - Decide your "salary"; 
;;;;
run;
Abraham
Obsidian | Level 7

Thank you Patrick once again for your great help. Your program is amazing. Its working perfectly now.

 

As  I am new comer to SAS, if possible, can you please guide how to insert the same program if I use proc import to pull the input file directly like below. I use it like below but I am missing somewhere.

 

Thanks once again for your extreme support

 

proc import datafile="/proj/sastmp/pvdm/01_QC/specialchar.csv"
     out=test_output
     dbms=csv
     replace;
     getnames=yes;	 
	 guessingrows=32767;
run;

data testing1;
length empid $5 _string $200;
length Comments $80;
set test_output;
array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;
run;
Patrick
Opal | Level 21

@Abraham 

Don't use Proc Import here but an INFILE/INPUT statement as this gives you much more control over reading the data. In doing so almost no code changes are required. You just need to point to the external .csv and define appropriate delimiting characters - i.e. dlm=',;'

  infile "/proj/sastmp/pvdm/01_QC/specialchar.csv" truncover dlm='|;';
  input Empid:$10. @;

If your .csv contains more then two columns AND the "_string" is not the last column then please share a few lines of the actual .csv structure (as attachment) if you need help.

Abraham
Obsidian | Level 7

Thanks Patrick.

My CSV file contain two columns only, attachment for your reference.

The code is not working. I change the infile path and input statement.

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);
  infile "/Saquee/final/specialchar.csv" truncover dlm='|;';
  input Empid:$10. @;
  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;

run;

he code is not working. I think there is an issue with import file.

 

Patrick
Opal | Level 21

@Abraham 

Below code works in my SAS Windows environment:

data sample(drop=_:);
  length empid $5 _string $200;
  array nodes {6} $20. Prod_info Instruct Tab_info product issue Rating;
  length Comments $80;
  call missing(of _all_);
  infile "c:\temp\specialchar.csv" truncover dlm=',;' firstobs=2 termstr=crlf lrecl=1000;
  input Empid:$10. @;
  do while(1);
    input _string:$200. @;
    if missing(_string) then leave;

    /* remove enclosing double quotes if any */
    _string=prxchange('s/^"(.*)"$/\1/',1,strip(_string));

    _n_nodes=countc(_string,'[');
    /* find node with keyword "info" */
    do _i=1 to 2*_n_nodes by 2;
      if findw(scan(_string,_i,'[]'),'info',' ','i')>0 then 
        do;
          _info_node=(_i+1)/2;
          leave;
        end;
    end;

    /* populate node variables */
    do _i=1 to 2*_n_nodes by 2;
      nodes[(_i+1)/2 + (3-_info_node)]=scan(_string,_i,'[]');
    end;
    /* populate comments */
    comments=substrn(_string,findc(_string,'-')+1);
    output;
  end;

run;

There are three things which might have caused issues for you:

1. The double quotes around the source string. That's something I didn't code for. I've now added logic for this (the PRXCHANGE() bit)

"[HR] [QUERY] [Dept info] [ISSUE] [Quality] [NO IDEA] - Decide your ""salary"";

2. Your data starts on the 2nd row only. I've added FIRSTOBS=2 to the infile statement to deal with this.

3. Not sure if this just happened when transferring your .csv to my environment or if this is in your actual data. What I've got is CRLF as line delimiter.

Capture.JPG

The path you're using indicates that your OS is Unix/Linux. There the default line delimiter is LF only. I've added TERMSTR=crlf to the infile statement to deal with this. Remove it in case your actual file only uses LF.

Abraham
Obsidian | Level 7

Perfet Patick. Thanks once again for your time in looking into my issue.

 

If possible, check that I am getting one duplicate records with quote in the last row which is not required. Based on your suggestion, I will convert all double quote ("") with ("""")

duplicate.jpg

 

 

Thanks once again

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 5232 views
  • 2 likes
  • 3 in conversation