Solved: help with function for finding a word and lookup value for the word

Mani1 · Posted 01-29-2019 10:23 PM

Hi,

I need suggestion which function in SAS can be used to achieve below example requirement

I got data column as single variable

ABC123;ABC245;ABC678;

Also got lookup table as

123 mean good

245 mean average

678 mean bad and so on

I need to write some code to read cell ABC123;ABC245;ABC678; in a data set and lookup equivalent value in and data table and create new column with description .

Looking for any suggestion.

ErikLund_Jensen · Posted 01-31-2019 12:24 PM

Hi

I will start with creating some test data based on your description and examples. As mentioned a few times in this community, it would be a great help, if the person asking a question also provided the code to make test data, so it could be copy-pasted into SAS.

/* Simulate input data table */
data have;
	string = 'xyz0123; xyz0435; xyz0987;'; output;
	string = 'wcz0453; wcz0657; wcz0987;'; output;
	string = 'abc0123; abcsdfsdf0435;'; output;
	string = 'erfsdf0453; fgh0657; hij1111;'; output;
run;

/* Simulate existing code table */
data codes;
	length Code $4 Description $20 Category $10; 
	Code = '0123'; Description = 'M'; Category = 'Gender'; output;
	Code = '0657'; Description = 'F'; Category = 'Gender'; output;
	Code = '0435'; Description = 'California'; Category = 'State'; output;
	Code = '0453'; Description = 'Arizona'; Category = 'State'; output;
	Code = '0987'; Description = 'US'; Category = 'Country'; output;
run;

Then we should find a way to use code descriptions and categories in a recoding program. In this case hash objects would be the right choice, but it has a steep learning curve and requires a deep understanding af what's going on in a data step, so I will - as suggested by Reeza - use formats. Wi will create two formats (= translation tables) out of the code table. SAS Proc format can create a format from a table with a specified structure, so we build the tables and then use them as input to proc format.

/* Create format to translate from Code to Description */
data descfmt; set codes end=eof;
	keep fmtname type start label hlo;
	retain fmtname 'descfmt' type 'C';
	start = Code;
	label = Description;
	output;
run;
proc format cntlin=descfmt;
run;

/* Create format to translate from Code to Category */
data catfmt; set codes end=eof;
	keep fmtname type start label hlo;
	retain fmtname 'catfmt' type 'C';
	start = Code;
	label = Category;
run;
proc format cntlin=catfmt;
run;

Now we are ready to do the recoding String in a data step. Other variables besides String in the input data set are transferred "as is" to the output data set.

The idea is to process each record and loop over the elements in the input string. In the loop, we extract the element and split it in Prefix and Code. We then use the Category-format to find out which variable should hold the description, and use the Description-format to get the value. All variables are initiated with a not-ascertainable-value, that will be kept if a code is not found in the current string to match a given category.

If there is anything in the code you don't understand, like the functions used, then try googling it, just remember always to precede your search term (like scan function) with SAS. The built-in help is actually pretty good, too.

* Use formats to recode input;
data want; set have;
	length Code $4 wpref GenderPrefix Gender StatePrefix State CountryPrefix Country $40;
	drop String Code i wstr strl wpref;

	* Initiate variables with n.a;
	Gender = 'n.a.'; State = 'n.a.'; Country = 'n.a.';
	GenderPrefix = 'n.a.'; StatePrefix = 'n.a.'; CountryPrefix = 'n.a.';
	
	* Loop over elements in String;
	do i = 1 to count(string,';');

		* Split in single results and set result length;
		wstr = left(scan(string,i,';'));
		strl = length(trim(wstr));

		* Extract prefix as everything except last four characters, and code as rest;
		wpref = substr(wstr,1,strl-4);
		Code = substr(wstr,strl-3);

		* Set Description in proper variable according to Category; 
		if put(Code,$catfmt.) = 'Gender' then do;
			GenderPrefix = wpref;
			Gender = put(Code,$descfmt.);
		end;
		else if put(Code,$catfmt.) = 'State' then do;
			StatePrefix = wpref;
			State = put(Code,$descfmt.);
		end;
		else if put(Code,$catfmt.) = 'Country' then do;
			CountryPrefix = wpref;
			Country = put(Code,$descfmt.);
		end;
	end;
	output;
run;

And what do we get as output:

I hope this helps.

View solution in original post

PGStats · Posted 01-29-2019 10:39 PM

What do you want as the result?

PG

Reeza · Posted 01-29-2019 11:15 PM

SCAN() to separate the components.

PROC FORMAT to recode the values.

@Mani1 wrote:

Hi,

I need suggestion which function in SAS can be used to achieve below example requirement

I got data column as single variable

ABC123;ABC245;ABC678;

Also got lookup table as

123 mean good

245 mean average

678 mean bad and so on

I need to write some code to read cell ABC123;ABC245;ABC678; in a data set and lookup equivalent value in and data table and create new column with description .

Looking for any suggestion.

ErikLund_Jensen · Posted 01-30-2019 06:05 AM

Hi Mani1

Do you have your string 'ABC123;ABC245;....' as a variable in a SAS data set, or is it in some external data source, so the first task is to get it into SAS? - I ask because you use the spreadsheet-word CELL. If it is in SAS, use Reeza's suggestion and use Scan to split the string in variables, otherwise you could split while reading the source.

Does the string always contain exact 3 results, or could it be any number? - and is a result always exact 3 digits and always preceded by ABC, or is ABC used as a placeholder in the example for a course name or something? My example below works with any number of results and almost anything in a given result, as long as the last non-blank character before the result value is not a digit, like in "this is the 1. result 123" -> 123.

SAS formats are fine in a SAS report, and useful as a tool for recoding, but do not depend on using a format with a varable. You lose it if you keep your tables in a database, and it is often useful to have both the formatted and the unformatted value, so I perfer keeping the formatted value in a separate variable. And because you have a look-up table, you can get that with a simple join instead of having to transform your lookup-table into a format.

data results;
	id = 1; results = 'ABC123;ABC245;ABC678;'; output;
	id = 2; results = 'ABC123;ABC245;'; output;
run;
data scoreval;
	length score $10;
	result = 123; score = 'good'; output;
	result = 245; score = 'average'; output;
	result = 678; score = 'bad'; output;
run;

data inbetween; set results;
	keep id prefix result;
	length prefix $40. w $40;
	do i = 1 to count(results,';');
		* split in single results;
		w = scan(results,i,';');
		* extract prefix as everything until only digits to end of string;
		prefix = prxchange('s/(.*\D)(\d*$)/$1/',-1,trim(w));
		* extract numeric result as all digits following the last non-digit char in string;
		result = input(prxchange('s/(.*\D)(\d*$)/$2/',-1,trim(w)),8.);
		output;
	end;
run;

proc sql;
	create table scores as
		select a.id, a.prefix, a.result, b.score 
		from inbetween as a left join scoreval as b
		on a.result = b.result
                order by id;
quit;

Mani1 · Posted 01-30-2019 10:10 PM

Thank Jensen for your guidance. I will try to apply your logic and update you on results.

Answer to your question.

1. As you mentioned, its a string. I got single column in SAS dataset with multiple value separated by ";" the prefix for each value is not always same. Here I have show ABC as prefix. but its not always same. The dataset I have, Looks like I need to extract last four digit. (Can you help me with that)

2. To add more information to the type of data I have in the string column: Let me explain you with two data line.

Example: Decoded value

xyz123; xyz435; xyz987; M California US

wcz453; wcz657;wcz987; Arizona F US

Here description for codes are

123 - M (Gender Male)

657 - F (Gender Female)

435 - California (State )

987 - Arizona (State)

987 - US (Country)

If you see in above example; I got multiple information in single string and also not in same order. (as you see above - Gender code is first order in first dataline and 2nd order in second dataline).

I need to extract this information by scanning each string and look for gender code and apply correct gender description in gender column and so on for each data. If code is missing, then I need to say not available. I have separate table with possible code (4 digit) for gender, state and country. I need to lookup for possible value while scanning each string and generate description (the prefix is not always same, but code is common..example xyz123 or abc123, both are male)

Thanks for your help.

I'm beginner in SAS programming and trying to learn while working in my project.

ErikLund_Jensen · Posted 01-31-2019 06:14 AM

Hi

I need to know more about the code table.

First about values: You wrote: possible code (4 digit), but what is 4 digits in this case, it is a character variable with values like "0123", or is it a number with up to 4 digits? - and I guess that means that string "XYZ987" could also be 4 digits like "XYZ1139".

Next: In your example you have code 987 meaning both Arizona (State) and US (Country). I hope it's a typing error, because technically the same code could be used for State and Country, as long as you the type. But your input string "XYZ987" has no indication of whether 987 is Gender, State og Country, so the code has to be unique.

That leads to the last problem. If you want separate columns for Gender, State og Country in your output, you must have a table with the type of each code. You do not need it to associate code 987 with the text "M", but you need it to tell that "M" shold appeas in a column named "Gender".

Mani1 · Posted 01-31-2019 08:53 AM

Few additional points:

1. code 987 meaning both Arizona (State) and US (Country.

Its a typo error. every code has unique description and category as you said.

2.First about values: You wrote: possible code (4 digit), but what is 4 digits in this case, it is a character variable with values like "0123", or is it a number with up to 4 digits? - and I guess that means that string "XYZ987" could also be 4 digits like "XYZ1139".

a) Code with description are with 4 digits like 0123, 1100, 0200 etc; each falling under different category like Gender, Country , State. (I have data set with code, description and category)

b) the variable in the data set looks like ABC0100;2745L2000;600AT0100; etc. (prefix need not be always numeric/ character) as you have observed, last four digit (or character) are always code for which i need to map description for the given category column;

c)as i described in my previous post, category position are not always same in given string.

3) That leads to the last problem. If you want separate columns for Gender, State og Country in your output, you must have a table with the type of each code. You do not need it to associate code 987 with the text "M", but you need it to tell that "M" shold appeas in a column named "Gender".

a) I have lookup dataset with three variables - Code, code description and Category. For every category, i need to have column generated in output data set with code description for each dataline. If code is not available in given string, need to identify it as missing.

ErikLund_Jensen · Posted 01-31-2019 12:24 PM

Hi

I will start with creating some test data based on your description and examples. As mentioned a few times in this community, it would be a great help, if the person asking a question also provided the code to make test data, so it could be copy-pasted into SAS.

/* Simulate input data table */
data have;
	string = 'xyz0123; xyz0435; xyz0987;'; output;
	string = 'wcz0453; wcz0657; wcz0987;'; output;
	string = 'abc0123; abcsdfsdf0435;'; output;
	string = 'erfsdf0453; fgh0657; hij1111;'; output;
run;

/* Simulate existing code table */
data codes;
	length Code $4 Description $20 Category $10; 
	Code = '0123'; Description = 'M'; Category = 'Gender'; output;
	Code = '0657'; Description = 'F'; Category = 'Gender'; output;
	Code = '0435'; Description = 'California'; Category = 'State'; output;
	Code = '0453'; Description = 'Arizona'; Category = 'State'; output;
	Code = '0987'; Description = 'US'; Category = 'Country'; output;
run;

Then we should find a way to use code descriptions and categories in a recoding program. In this case hash objects would be the right choice, but it has a steep learning curve and requires a deep understanding af what's going on in a data step, so I will - as suggested by Reeza - use formats. Wi will create two formats (= translation tables) out of the code table. SAS Proc format can create a format from a table with a specified structure, so we build the tables and then use them as input to proc format.

/* Create format to translate from Code to Description */
data descfmt; set codes end=eof;
	keep fmtname type start label hlo;
	retain fmtname 'descfmt' type 'C';
	start = Code;
	label = Description;
	output;
run;
proc format cntlin=descfmt;
run;

/* Create format to translate from Code to Category */
data catfmt; set codes end=eof;
	keep fmtname type start label hlo;
	retain fmtname 'catfmt' type 'C';
	start = Code;
	label = Category;
run;
proc format cntlin=catfmt;
run;

Now we are ready to do the recoding String in a data step. Other variables besides String in the input data set are transferred "as is" to the output data set.

The idea is to process each record and loop over the elements in the input string. In the loop, we extract the element and split it in Prefix and Code. We then use the Category-format to find out which variable should hold the description, and use the Description-format to get the value. All variables are initiated with a not-ascertainable-value, that will be kept if a code is not found in the current string to match a given category.

If there is anything in the code you don't understand, like the functions used, then try googling it, just remember always to precede your search term (like scan function) with SAS. The built-in help is actually pretty good, too.

* Use formats to recode input;
data want; set have;
	length Code $4 wpref GenderPrefix Gender StatePrefix State CountryPrefix Country $40;
	drop String Code i wstr strl wpref;

	* Initiate variables with n.a;
	Gender = 'n.a.'; State = 'n.a.'; Country = 'n.a.';
	GenderPrefix = 'n.a.'; StatePrefix = 'n.a.'; CountryPrefix = 'n.a.';
	
	* Loop over elements in String;
	do i = 1 to count(string,';');

		* Split in single results and set result length;
		wstr = left(scan(string,i,';'));
		strl = length(trim(wstr));

		* Extract prefix as everything except last four characters, and code as rest;
		wpref = substr(wstr,1,strl-4);
		Code = substr(wstr,strl-3);

		* Set Description in proper variable according to Category; 
		if put(Code,$catfmt.) = 'Gender' then do;
			GenderPrefix = wpref;
			Gender = put(Code,$descfmt.);
		end;
		else if put(Code,$catfmt.) = 'State' then do;
			StatePrefix = wpref;
			State = put(Code,$descfmt.);
		end;
		else if put(Code,$catfmt.) = 'Country' then do;
			CountryPrefix = wpref;
			Country = put(Code,$descfmt.);
		end;
	end;
	output;
run;

And what do we get as output:

I hope this helps.

Reeza · Posted 01-31-2019 02:08 PM

A modification on @ErikLund_Jensen so that the position doesn't matter.

Note that the 11 doesn't have a match so it produces an error. I didn't design the code to handle errors, but you may want to.

Basically, rather than attempt to set values, I create a row for each element and transpose it afterwards instead.

Thanks to @ErikLund_Jensen for the starter code - I used his first few steps and code

/* Simulate input data table */
data have;
   string = 'xyz0123; xyz0435; xyz0987;'; output;
   string = 'wcz0453; wcz0657; wcz0987;'; output;
   string = 'abc0123; abcsdfsdf0435;'; output;
   string = 'erfsdf0453; fgh0657; hij1111;'; output;
run;

/* Simulate existing code table */
data codes;
   length Code $4 Description $20 Category $10; 
   Code = '0123'; Description = 'M'; Category = 'Gender'; output;
   Code = '0657'; Description = 'F'; Category = 'Gender'; output;
   Code = '0435'; Description = 'California'; Category = 'State'; output;
   Code = '0453'; Description = 'Arizona'; Category = 'State'; output;
   Code = '0987'; Description = 'US'; Category = 'Country'; output;
run;

/* Create format to translate from Code to Description */
data descfmt; set codes end=eof;
    keep fmtname type start label hlo;
    retain fmtname 'descfmt' type 'C';
    start = Code;
    label = Description;
    output;
run;
proc format cntlin=descfmt;
run;

/* Create format to translate from Code to Category */
data catfmt; set codes end=eof;
    keep fmtname type start label hlo;
    retain fmtname 'catfmt' type 'C';
    start = Code;
    label = Category;
run;
proc format cntlin=catfmt;
run;


data temp;
set have;
nElements = countw(string);
row = _n_;
do i=1 to nElements;
element = compress(scan(string, i, ';'), , 'kd');
category = put(element, catfmt.);
description = put(element, descfmt.);
output;
end;
run;

proc sort data=temp;
by row category description;
run;

proc transpose data=temp out=want;
by row;
id category;
var description;
run;

ErikLund_Jensen · Posted 01-31-2019 02:46 PM

@Reeza

Transpose is a much better solution, because it makes the code resistant to introduction of more categories. But splitting with compress is dangerous, as it is unknown if the prefix part of an element can contain digits.

And I am curious as to what you mean by "so that the position doesn't matter", because I tried the code with different positions of categories in a row and category duplicates, and it keeps the last occurrence of a given category regardless of position.

Though being a SAS user since V79, I never really got proc transpose into my "intuitive toolbox", so I wonder if transpose could handle the prefixes too like in my output example, if that was wanted. I would appreciate an answer on that.

Reeza · Posted 01-31-2019 03:18 PM

@ErikLund_Jensen wrote:

@Reeza

Transpose is a much better solution, because it makes the code resistant to introduction of more categories. But splitting with compress is dangerous, as it is unknown if the prefix part of an element can contain digits.

Agreed.

And I am curious as to what you mean by "so that the position doesn't matter", because I tried the code with different positions of categories in a row and category duplicates, and it keeps the last occurrence of a given category regardless of position.

I was referring to position of elements within the string, ie Gender being before Country. It doesn't matter with a Transpose approach.

Though being a SAS user since V79, I never really got proc transpose into my "intuitive toolbox", so I wonder if transpose could handle the prefixes too like in my output example, if that was wanted. I would appreciate an answer on that.

I'm not sure about the prefix part, I think that may require multiple transpose steps, but that information is all int he formats so I'm not sure how valuable it is to have it built into the data set. I don't think PROC TRANSPOSE can do it, but there's a macro via some users on here that supports transpose.

Recently (last 5 years) SAS has introduced more support for prefixes in names and such.

Mani1 · Posted 02-01-2019 03:24 AM

Thank you for your insights.

Mani.

Mani1 · Posted 02-01-2019 03:22 AM

This helped me perfectly. Thanks for your help. Mani.

help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Re: help with function for finding a word and lookup value for the word

Classroom Training Available!