BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
agbpilot
Obsidian | Level 7

Hi, I'm new to arrays and do loops and I've read through the posts but can't seem to get my code to work as intended.  I'm trying to create new variables using an array.  Basically, I'm trying to extract the words from a string of words contained within owner_city field.  Since the # of words fluctuates across obs I've derived a field, num_words_own_city, using countw.  Now, I'm attempting to use this field, num_words_own_city, to define the last element in the range of the array.  Can the last element of the array be dynamic across obs based upon a variable count of words within num_words_own_city?

 

I'm hoping to create new columns (parsed words) vs. new rows.  Any ideas or assistance would be much appreciated.

 

Andy

 

data intl;
set stacked;
/*length word $60.; set longest length of word*/

where owner_type2 = 'OUT_CNTRY' /*; test*/
and prpty_adrs = '3317 CANYON VALLEY TRL' ;

/*count words delimited by a space within the owner city field*/

delim = ' ';
num_words_own_city = countw(owner_city, delim);

/*create an array to represent the words extracted from the owner_city field - count of words is variable*/

array words_parsed {*} $60 words_parsed1-words_parsed(num_words_own_city) ;

/*iterate through owner city field parsing out word by word untill iterations > the count of words*/

do i=1 to dim(words_parsed);
words_parsed{i} = strip(upcase(scan(owner_city, i, delim)));
/* output;*/
end;

run;

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

An example of @Reeza's approach number 2.

 

Obviously untested code as I don't have your data set.

data _null_;
   set stacked (keep= owner_city);
   retain maxwords;
   num_words_own_city = countw(owner_city, ' ');
   maxwords=max(maxwords,num_words_own_city);
   Call symputx('maxwords',maxwords);
run;

data intl;
   set stacked;
   /*create an array to represent the words extracted from the owner_city field - count of words is variable*/
   array words_parsed {&maxwords.} $60 ;
   /*iterate through owner city field parsing out word by word untill iterations > the count of words*/
   do i=1 to countw(owner_city);
      words_parsed{i} = strip(upcase(scan(owner_city, i,' ')));
   end;
run;

Absolutely no claims to efficiency but the data _null_ to find the largest number of words should be simple to follow.

 

Not quite sure what this gains you but it is your data. I just hope you don't have values I found in some addresses like "see the woman in the back apartment".

View solution in original post

14 REPLIES 14
Reeza
Super User
Can the last element of the array be dynamic across obs based upon a variable count of words within num_words_own_city?

Unfortunately the answer to that is no 😞

Some alternatives, in order of preference:
1. Output each word to a unique line and transpose after via PROC TRANSPOSE (fully dynamic)
2. Find the maximum number of words in a prior step and set the array using that value
3. Make the array arbitrarily large and then delete empty variables after the fact

agbpilot
Obsidian | Level 7

Hi Reeza, thanks for your reply.  I've never used proc transpose.  Could you propose sample code that could make option 1 you suggest work?

 

Andy

Astounding
PROC Star

If you want the last word, rather than a count of how many words, that's easy:

lastword = scan(long_string, -1);

It's not clear what advantage you get by splitting the text into separate words, however.

ballardw
Super User

An example of @Reeza's approach number 2.

 

Obviously untested code as I don't have your data set.

data _null_;
   set stacked (keep= owner_city);
   retain maxwords;
   num_words_own_city = countw(owner_city, ' ');
   maxwords=max(maxwords,num_words_own_city);
   Call symputx('maxwords',maxwords);
run;

data intl;
   set stacked;
   /*create an array to represent the words extracted from the owner_city field - count of words is variable*/
   array words_parsed {&maxwords.} $60 ;
   /*iterate through owner city field parsing out word by word untill iterations > the count of words*/
   do i=1 to countw(owner_city);
      words_parsed{i} = strip(upcase(scan(owner_city, i,' ')));
   end;
run;

Absolutely no claims to efficiency but the data _null_ to find the largest number of words should be simple to follow.

 

Not quite sure what this gains you but it is your data. I just hope you don't have values I found in some addresses like "see the woman in the back apartment".

agbpilot
Obsidian | Level 7

Ok great, thank you very much!  I really appreciate the detailed sample code.  Ultimately I'm trying to parse out words within this field which I later compare to an international file, which contains intl city and country names.  This will be very helpful b/c these particular intl records are "sloppy" as the intl addresses are being squeezed into U.S. address formats, which they don't naturally fit within.  Hence why I have to parse out the words, and then later match the parsed words against the intl file.  Let me give it a test run, and again, thank you very much!

 

Andy

agbpilot
Obsidian | Level 7
that did the trick, thanks again!
agbpilot
Obsidian | Level 7

Hi, thanks again for all of your help.  As follow up to this string, I'm attempting to define another array from the first array, words_parsed.  The second array, search_fields, I'm hoping to use the full range of words being parsed, which is based upon &maxwords macro variable.  However, in this new array I want to add an additional field, owner_city, to be a part of the elements.  So I created a separate macro, num_elements, which is just &maxwords + 1.  However, I'm encountering below log error.  I'm not sure how to reference the &maxwords macro when defining the elements in the array.  The elements I want to include in the search_fields array are: owner_city words_parsed1-words_parsed(# based upon &maxwords. macro) .  Here's the code as well.

 

Any assistance would be much appreciated.

 

Andy

 


data _null_;
set stacked (keep= owner_city owner_type2);
where owner_type2 = 'OUT_CNTRY';

retain maxwords;

num_words_own_city = countw(owner_city, ' ');
maxwords=max(maxwords,num_words_own_city);
num_elements = maxwords + 1;

call symput('maxwords',maxwords);
call symput('num_elements',num_elements);
run;

%put
max word count: &maxwords
number of search elements: &num_elements
;

*-------------------------------------------------------------------------------------------------------*;
* Based upon max word count, parse & extract words to new columns for comparison against intl file *;
*-------------------------------------------------------------------------------------------------------*;

data intl;
set stacked;
length owner_country $2.;
where owner_type2 = 'OUT_CNTRY';

/*create an array to represent the words extracted from the owner_city field - count of words is variable*/

array words_parsed {&maxwords.} $60 ;

/*iterate through owner city field parsing out word by word untill iterations > the count of words*/

do i=1 to countw(owner_city, ' ');
words_parsed{i} = strip(upcase(scan(owner_city, i,' ')));
end;

array search_fields {&num_elements.} $60 /*owner_city words_parsed1 words_parsed2 words_parsed3*/ owner_city words_parsed1-words_parsed{&maxwords.};

run;

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

39 array search_fields {&num_elements.} $60 /*owner_city words_parsed1 words_parsed2 words_parsed3*/ owner_city
39 ! words_parsed1-words_parsed{&maxwords.};
_
22
76
ERROR 22-322: Syntax error, expecting one of the following: a name, (, ;, _ALL_, _CHARACTER_, _CHAR_, _NUMERIC_.

ERROR 76-322: Syntax error, statement will be ignored.

 

 

 

mkeintz
PROC Star

In the statement

 

array search_fields {&num_elements.} $60
  /*owner_city words_parsed1 words_parsed2 words_parsed3*/
    owner_city words_parsed1-words_parsed{&maxwords.};

 

  get rid of the braces at the end. 

 

Change it to

 

 

array search_fields {&num_elements.} $60
  /*owner_city words_parsed1 words_parsed2 words_parsed3*/
    owner_city words_parsed1-words_parsed&maxwords.;

 

 

You can't have namelists of the from    X1-X{30}.     It should be  X1-X30.
 words_parsed1-words_parsed&maxwords.;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
agbpilot
Obsidian | Level 7

@mkeintz  @ballardw  Thank you for your helpful suggestions.  It's probably operator error...  I attempted implementing the revised code, see below.  However, the log error makes it seem that the macro variable, &maxwords., is not being referenced properly in the numbered variable list within the array.

 

Revised code:

data intl;
set stacked;
length owner_country $2.;
where owner_type2 = 'OUT_CNTRY';

/*create an array to represent the words extracted from the owner_city field - count of words is variable*/

array words_parsed {&maxwords.} $60 ;

/*iterate through owner city field parsing out word by word untill iterations > the count of words*/

do i=1 to countw(owner_city, ' ');
words_parsed{i} = strip(upcase(scan(owner_city, i,' ')));
end;

array search_fields {&num_elements.} $60 owner_city words_parsed1-words_parsed&maxwords.;

run;

 

Log:

24 data intl;
25 set stacked;
26 length owner_country $2.;
27 where owner_type2 = 'OUT_CNTRY';
28
29 /*create an array to represent the words extracted from the owner_city field - count of words is variable*/
30
31 array words_parsed {&maxwords.} $60 ;
32
33 /*iterate through owner city field parsing out word by word untill iterations > the count of words*/
34
35 do i=1 to countw(owner_city, ' ');
36 words_parsed{i} = strip(upcase(scan(owner_city, i,' ')));
37 end;
38
39 array search_fields {&num_elements.} $60 owner_city words_parsed1-words_parsed&maxwords;
NOTE: Line generated by the macro variable "MAXWORDS".
39 words_parsed 3
_
22
200
ERROR: Missing numeric suffix on a numbered variable list (words_parsed1-words_parsed).
ERROR: Too few variables defined for the dimension(s) specified for the array search_fields.
ERROR 22-322: Syntax error, expecting one of the following: a name, (, ;, _ALL_, _CHARACTER_, _CHAR_, _NUMERIC_.

ERROR 200-322: The symbol is not recognized and will be ignored.

ballardw
Super User

I'm afraid that you need to provide the starting data and how you created the Maxwords and Num_element macro variables. It appears as if somehow you have created &maxwords to include a leading space. The space wouldn't cause an issue with an array statement as the index size as SAS will treat  { 3} {3} { 3 } the same. However if you make a variable name such as "words_parsed&maxwords" the space would yield "words_parsed 3" which is not a valid variable.

agbpilot
Obsidian | Level 7

Thank you for your reply and insight.  Here is the code in its entirety starting from where I create the macro variable.

 

*-------------------------------------------------------------------------------------------*;
* Create macro based upon max word count within owner city field *;
*-------------------------------------------------------------------------------------------*;

data _null_;
set stacked (keep= owner_city owner_type2);
where owner_type2 = 'OUT_CNTRY';

retain maxwords;

num_words_own_city = countw(owner_city, ' ');
maxwords=max(maxwords,num_words_own_city);
num_elements = maxwords + 1;

call symput('maxwords',maxwords);
call symput('num_elements',num_elements);
run;

%put
max word count: &maxwords
number of search elements: &num_elements
;

*-------------------------------------------------------------------------------------------------------*;
* Based upon max word count, parse & extract words to new columns for comparison against intl file *;
*-------------------------------------------------------------------------------------------------------*;

data intl;
set stacked;
length owner_country $2.;
where owner_type2 = 'OUT_CNTRY';

/*create an array to represent the words extracted from the owner_city field - count of words is variable*/

array words_parsed {&maxwords.} $60 ;

/*iterate through owner city field parsing out word by word untill iterations > the count of words*/

do i=1 to countw(owner_city, ' ');
words_parsed{i} = strip(upcase(scan(owner_city, i,' ')));
end;

array search_fields {&num_elements.} $60 owner_city words_parsed1-words_parsed&maxwords.;

run;

Reeza
Super User
Use call symputX not call symput() to remove leading and trailing spaces.
ballardw
Super User

@agbpilot wrote:

Thank you for your reply and insight.  Here is the code in its entirety starting from where I create the macro variable.

 

*-------------------------------------------------------------------------------------------*;
* Create macro based upon max word count within owner city field *;
*-------------------------------------------------------------------------------------------*;

data _null_;
set stacked (keep= owner_city owner_type2);
where owner_type2 = 'OUT_CNTRY';

retain maxwords;

num_words_own_city = countw(owner_city, ' ');
maxwords=max(maxwords,num_words_own_city);
num_elements = maxwords + 1;

call symput('maxwords',maxwords);
call symput('num_elements',num_elements);
run;

%put
max word count: &maxwords
number of search elements: &num_elements
;

*-------------------------------------------------------------------------------------------------------*;
* Based upon max word count, parse & extract words to new columns for comparison against intl file *;
*-------------------------------------------------------------------------------------------------------*;

data intl;
set stacked;
length owner_country $2.;
where owner_type2 = 'OUT_CNTRY';

/*create an array to represent the words extracted from the owner_city field - count of words is variable*/

array words_parsed {&maxwords.} $60 ;

/*iterate through owner city field parsing out word by word untill iterations > the count of words*/

do i=1 to countw(owner_city, ' ');
words_parsed{i} = strip(upcase(scan(owner_city, i,' ')));
end;

array search_fields {&num_elements.} $60 owner_city words_parsed1-words_parsed&maxwords.;

run;


If you look closely at the code I provided earlier uses CALL SYMPUTX to prevent leading spaces from appearing. What CALL SYMPUT('maxwords', maxwords) does, in effect, is create a string value from the maxwords with an implied Put(maxwords,<some format>). The format generally used will likely be BEST12. if you don't supply one. Which means the value has a LOT of blanks. Since you posted the log in the main message window the text from your log was reformatted by removing duplicate blanks.

Here is some example code that shows this behavior. The %put to display the macro variables is sandwiched between pipes to see the start and end more clearly. And the log is copied and pasted into a code window on the forum using the </> icon to preserve the text formatting.

60   data _null_;
61      call symput('v1',3);
62      call symputx('v2',3);
63   run;

NOTE: Numeric values have been converted to character
      values at the places given by: (Line):(Column).
      61:21
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
64
65   %put V1 is |&v1.|;
V1 is |           3|
66   %put V2 is |&v2.|;
V2 is |3|

 

agbpilot
Obsidian | Level 7

@ballardw @Reeza  excellent!  thank you both.  it works!

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 902 views
  • 11 likes
  • 5 in conversation