Hi all,
I'd like your help with this problem.
I have 2 tables:
I want to take each string in the STRINGS table, go over the list of values of each variable in VARIABLES table and if the value from this list occurs in the string, the value should be written into a new column that has the same name as the variable.
See the output table for better understanding:
The number of columns in VARIABLES table may vary, and the column names won't always be as regular as var1, var2...
Also, if the string contains more than 1 value fromt he list (e.g. the last observation in STRINGS table), it has to have assigned only 1 value in the new column, no matter which one (so in case of string "a b", it could be assigned either "a", "b" or "a b"). Important thing is for at least one value to be there without duplication of the original string in another row.
I am a beginner in coding in SAS, so the only way I could do this is by typing out a zillion of IF conditions for all the values for creating each new column. However, I'd like the solution to be automatic so that it doesn't matter how many variables there are, what are their names and how many values each one has.
Please, can anyone help me with this?
Here's the code for creating my data:
data strings;
input string $6.;
datalines;
a x
aa l z
b k
b y m
c z
c x k
a b
;
data variables;
infile datalines delimiter=',' dsd;
input var1 $ var2 $ var3 $;
datalines;
a,k,x
b,l,y
a b,m,z
,n,
;
Thanks!
Well, in this case my program in its core remains the same, except:
The rest is just diligent, even if somewhat tedious, coding:
data strings ;
input string $30.;
cards ;
Ford carrot
Honda beans
south america black beans
green beans Audi
x5 chili pepper lettuce
garlic north America
central europe bmw
lettuce onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;
run ;
data variables;
infile datalines delimiter=',' dsd;
length cars $6 vegetables $12 regions $14;
input cars $ vegetables $ regions $;
datalines;
Audi,beans,America
BMW,black beans,Central Europe
BMW X5,carrot,East Europe
X5,chili pepper,Europe
Honda,green beans,North America
,lettuce,South America
,onion,
,pepper,
;
run;
proc sql noprint ;
select max (length) into :maxvlen from dictionary.columns
where libname="WORK" and memname="VARIABLES"
;
quit ;
data want (drop = _:) ;
set strings ;
if _n_ = 1 then do ;
length _s _z $ &maxvlen ;
dcl hash h () ;
h.definekey ("_s") ;
h.definedata ("_z", "_i_") ;
h.definedone () ;
do until (z) ;
set variables end = z ;
array v cars vegetables regions ;
do over v ;
if cmiss (v) then continue ;
do _i = 1 to countw (v) ;
_z = scan (v, _i) ;
h.ref (key: upcase (_z), data: _z, data: _i_) ;
do _j = _i + 1 to countw (v) ;
_z = catx (" ", _z, scan (v, _j)) ;
h.ref (key: upcase (_z), data: _z, data: _i_) ;
end ;
end ;
end ;
end ;
end ;
call missing (of v[*]) ;
do _i = 1 to countw (string) ;
_s = scan (string, _i) ;
if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;
do _j = _i + 1 to countw (string) ;
_s = catx (" ", _s, scan (string, _j)) ;
if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;
end ;
end ;
run ;
Kind regards
Paul D.
Up until your value of 'a b' this looked moderately routine. However since you are looking for a single value of 'a' from the first line of your Strings data it is going to be a bit difficult to search for just the 'a' except when 'a b' is present in a generic sense.
I ask for clarity sake, does your actual working data have exactly 1 or many Variables values that include other variable values as substrings? If so then you may need to provide more information or a more complex example closer to your actual data.
Best might be to find some way to split up that String variable into the component parts that you need if possible so that you have values of 'a' alone instead of 'a x'.
Hi ballardw,
My working data can contain many variable values that include other variable values as substrings.
Here's a more complex example:
You made me realize more things that are necessary here, so all in all I would like:
@ballardw, @hashman how would you go about this?
In case you want to give it a try, here's the code for data:
data strings ;
input string $30.;
cards ;
Ford carrot
Honda beans
south america black beans
green beans Audi
x5 chili pepper lettuce
garlic north America
central europe bmw
lettuce onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;
run ;
data variables;
infile datalines delimiter=',' dsd;
length cars $6 vegetables $12 regions $14;
input cars $ vegetables $ regions $;
datalines;
Audi,beans,America
BMW,black beans,Central Europe
BMW X5,carrot,East Europe
X5,chili pepper,Europe
Honda,green beans,North America
,lettuce,South America
,onion,
,pepper,
;
run;
I still don't get what you want to do with the multi-word strings in the VARIABLES table.
Do you only want to match 'BMW X5' to STRINGS that contain 'BMW X5' in that order and not to strings that have 'X5 BMW' or 'BMW model X5'?
Do you want to match also to strings that contain only 'BMW' or only 'X5'?
Do you only want to match 'BMW X5' to STRINGS that contain 'BMW X5' in that order and not to strings that have 'X5 BMW' or 'BMW model X5'? YES, exactly.
Do you want to match also to strings that contain only 'BMW' or only 'X5'? These are matched to values 'BMW' and 'X5' that are also in the variable list, but not because they are part of 'BMW X5' that is in this list. You basically always scan the string for the exact value in the variable list. And if the value list contains values 'BMW', 'X5' and 'BMW X5', a string like the one in observation #12 would match all of these, but I'd like the output to be only the longest exact match - which is 'BMW X5' in this case. In obs #10 it matches 'X5' and 'BMW', so 'BMW' goes into output.
In that case use the FINDW() function.
proc sql ;
create table want as
select string
, case when (findw(string,cars,' ','sit')) then cars else ' ' end as cars
, case when (findw(string,vegetables,' ','sit')) then vegetables else ' ' end as vegetables
, case when (findw(string,regions,' ','sit')) then regions else ' ' end as regions
from strings,variables
having not missing(cats(calculated cars,calculated vegetables,calculated regions))
order by 1,2,3,4
;
quit;
This program finds all the matches and creates an observation for each match. How do I make it to select only the longest match for each variable?
Probably easier to do each "variable" separately then. I added row numbers to your sample dataset to make it easier to see what is happening. And to give a tiebreaker when there are multiple
data strings ;
input string $30.;
snum+1;
cards;
Ford carrot
Honda beans
south america black beans
green beans Audi
x5 chili pepper lettuce
garlic north America
central europe bmw
lettuce onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;
data variables;
infile cards delimiter=',' dsd;
length cars $6 vegetables $12 regions $14;
input cars vegetables regions;
vnum+1;
cards;
Audi,beans,America
BMW,black beans,Central Europe
BMW X5,carrot,East Europe
X5,chili pepper,Europe
Honda,green beans,North America
,lettuce,South America
,onion,
,pepper,
;
proc sql ;
create table vegetables as
select snum,string
, lengthn(vegetables) as length
, vnum,vegetables
from strings left join variables
on findw(string,vegetables,' ','sit')
order by snum,string,length,snum
;
quit;
data vegetables;
set vegetables;
by snum;
if last.snum;
run;
proc print data=vegetables width=min;
run;
Obs snum string length vnum vegetables 1 1 Ford carrot 6 3 carrot 2 2 Honda beans 5 1 beans 3 3 south america black beans 11 2 black beans 4 4 green beans Audi 11 5 green beans 5 5 x5 chili pepper lettuce 12 4 chili pepper 6 6 garlic north America 0 . 7 7 central europe bmw 0 . 8 8 lettuce onion X5 7 6 lettuce 9 9 America pepper 6 8 pepper 10 10 X5 BMW Europe 0 . 11 11 Central East Europe 0 . 12 12 bmw X5 america 0 .
For some reason my output from your code looks different:
Any idea why? There are no errors or warnings in the log.
Well, in this case my program in its core remains the same, except:
The rest is just diligent, even if somewhat tedious, coding:
data strings ;
input string $30.;
cards ;
Ford carrot
Honda beans
south america black beans
green beans Audi
x5 chili pepper lettuce
garlic north America
central europe bmw
lettuce onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;
run ;
data variables;
infile datalines delimiter=',' dsd;
length cars $6 vegetables $12 regions $14;
input cars $ vegetables $ regions $;
datalines;
Audi,beans,America
BMW,black beans,Central Europe
BMW X5,carrot,East Europe
X5,chili pepper,Europe
Honda,green beans,North America
,lettuce,South America
,onion,
,pepper,
;
run;
proc sql noprint ;
select max (length) into :maxvlen from dictionary.columns
where libname="WORK" and memname="VARIABLES"
;
quit ;
data want (drop = _:) ;
set strings ;
if _n_ = 1 then do ;
length _s _z $ &maxvlen ;
dcl hash h () ;
h.definekey ("_s") ;
h.definedata ("_z", "_i_") ;
h.definedone () ;
do until (z) ;
set variables end = z ;
array v cars vegetables regions ;
do over v ;
if cmiss (v) then continue ;
do _i = 1 to countw (v) ;
_z = scan (v, _i) ;
h.ref (key: upcase (_z), data: _z, data: _i_) ;
do _j = _i + 1 to countw (v) ;
_z = catx (" ", _z, scan (v, _j)) ;
h.ref (key: upcase (_z), data: _z, data: _i_) ;
end ;
end ;
end ;
end ;
end ;
call missing (of v[*]) ;
do _i = 1 to countw (string) ;
_s = scan (string, _i) ;
if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;
do _j = _i + 1 to countw (string) ;
_s = catx (" ", _s, scan (string, _j)) ;
if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;
end ;
end ;
run ;
Kind regards
Paul D.
Thank you! This is exactly what I need 🙂
The only thing I changed is that I created a macro variable with the list of variable names and then used it in array "v".
>I created a macro variable with the list of variable names and then used it in array "v"<
Given that you'd been striving to make the program as little hard coded as possible, this totally makes sense.
Hi, I discovered 2 more things that aren't working as I expected:
My output:
Desired output:
Please, do you have any tip on how to modify the code to achieve this?
The code so far with new data:
data strings ;
input string $30.;
cards ;
Ford carrot
Honda beans South Africa
south america black beans
green beans Audi
x5 chili pepper
garlic north America
central europe bmw
lettuce north onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;
run ;
data variables;
infile datalines delimiter=',' dsd;
length cars $6 vegetablegeneral $7 vegetable $12 regions $14;
input cars $ vegetablegeneral $ vegetable $ regions $;
datalines;
Audi,beans,beans,America
BMW,carrot,black beans,Central Europe
BMW X5,lettuce,carrot,East Europe
X5,onion,chili pepper,Europe
Honda,pepper,green beans,North America
,,lettuce,South America
,,onion,
,,pepper,
;
run;
proc sql noprint ;
select max (length) into :maxvlen from dictionary.columns
where libname="WORK" and memname="VARIABLES"
;
quit ;
proc sql noprint;
select name into :varlist separated by ' ' from dictionary.columns
where libname="WORK" and memname="VARIABLES";
quit;
data want (drop = _:) ;
set strings ;
if _n_ = 1 then do ;
length _s _z $ &maxvlen ;
dcl hash h () ;
h.definekey ("_s") ;
h.definedata ("_z", "_i_") ;
h.definedone () ;
do until (z) ;
set variables end = z ;
array v &varlist ;
do over v ;
if cmiss (v) then continue ;
do _i = 1 to countw (v) ;
_z = scan (v, _i) ;
h.ref (key: upcase (_z), data: _z, data: _i_) ;
do _j = _i + 1 to countw (v) ;
_z = catx (" ", _z, scan (v, _j)) ;
h.ref (key: upcase (_z), data: _z, data: _i_) ;
end ;
end ;
end ;
end ;
end ;
call missing (of v[*]) ;
do _i = 1 to countw (string) ;
_s = scan (string, _i) ;
if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;
do _j = _i + 1 to countw (string) ;
_s = catx (" ", _s, scan (string, _j)) ;
if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;
end ;
end ;
run ;
@LuciaCekanakova: Will take a look when I get a piece of free quality time.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.