BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mani1
Calcite | Level 5

Hi,

 

I need suggestion which function in SAS can be used to achieve below example requirement

 

I got data column as single variable

 

ABC123;ABC245;ABC678;

 

Also got lookup table as

 

123 mean good

245 mean average

678 mean bad and so on

 

I need to write some code to read cell ABC123;ABC245;ABC678; in a data set and lookup equivalent value in and data table and create new column with description .

 

Looking for any suggestion.

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi

 

I will start with creating some test data based on your description and examples. As mentioned a few times in this community, it would be a great help, if the person asking a question also provided the code to make test data, so it could be copy-pasted into SAS. 

 

 

/* Simulate input data table */
data have;
	string = 'xyz0123; xyz0435; xyz0987;'; output;
	string = 'wcz0453; wcz0657; wcz0987;'; output;
	string = 'abc0123; abcsdfsdf0435;'; output;
	string = 'erfsdf0453; fgh0657; hij1111;'; output;
run;

/* Simulate existing code table */
data codes;
	length Code $4 Description $20 Category $10; 
	Code = '0123'; Description = 'M'; Category = 'Gender'; output;
	Code = '0657'; Description = 'F'; Category = 'Gender'; output;
	Code = '0435'; Description = 'California'; Category = 'State'; output;
	Code = '0453'; Description = 'Arizona'; Category = 'State'; output;
	Code = '0987'; Description = 'US'; Category = 'Country'; output;
run;

 

 

Then we should find a way to use code descriptions and categories in a recoding program. In this case hash objects would be the right choice, but it has a steep learning curve and requires a deep understanding af what's going on in a data step, so I will - as suggested by Reeza - use formats. Wi will create two formats (= translation tables) out of the code table. SAS Proc format can create a format from a table with a specified structure, so we build the tables and then use them as input to proc format.

 

/* Create format to translate from Code to Description */
data descfmt; set codes end=eof;
	keep fmtname type start label hlo;
	retain fmtname 'descfmt' type 'C';
	start = Code;
	label = Description;
	output;
run;
proc format cntlin=descfmt;
run;

/* Create format to translate from Code to Category */
data catfmt; set codes end=eof;
	keep fmtname type start label hlo;
	retain fmtname 'catfmt' type 'C';
	start = Code;
	label = Category;
run;
proc format cntlin=catfmt;
run;

Now we are ready to do the recoding String in a data step. Other variables besides String in the input data set are transferred "as is" to the output data set. 

 

The idea is to process each record and loop over the elements in the input string. In the loop, we extract the element and split it in Prefix and Code. We then use the Category-format to find out which variable should hold the description, and use the Description-format to get the value. All variables are initiated with a not-ascertainable-value, that will be kept if a code is not found in the current string to match a given category.

 

If there is anything in the code you don't understand, like the functions used, then try googling it, just remember always to precede your search term (like scan function) with SAS. The built-in help is actually pretty good, too.

 

* Use formats to recode input;
data want; set have;
	length Code $4 wpref GenderPrefix Gender StatePrefix State CountryPrefix Country $40;
	drop String Code i wstr strl wpref;

	* Initiate variables with n.a;
	Gender = 'n.a.'; State = 'n.a.'; Country = 'n.a.';
	GenderPrefix = 'n.a.'; StatePrefix = 'n.a.'; CountryPrefix = 'n.a.';
	
	* Loop over elements in String;
	do i = 1 to count(string,';');

		* Split in single results and set result length;
		wstr = left(scan(string,i,';'));
		strl = length(trim(wstr));

		* Extract prefix as everything except last four characters, and code as rest;
		wpref = substr(wstr,1,strl-4);
		Code = substr(wstr,strl-3);

		* Set Description in proper variable according to Category; 
		if put(Code,$catfmt.) = 'Gender' then do;
			GenderPrefix = wpref;
			Gender = put(Code,$descfmt.);
		end;
		else if put(Code,$catfmt.) = 'State' then do;
			StatePrefix = wpref;
			State = put(Code,$descfmt.);
		end;
		else if put(Code,$catfmt.) = 'Country' then do;
			CountryPrefix = wpref;
			Country = put(Code,$descfmt.);
		end;
	end;
	output;
run;

And what do we get as output:

 

want.gif

 

I hope this helps.

 

 

View solution in original post

12 REPLIES 12
PGStats
Opal | Level 21

What do you want as the result?

PG
Reeza
Super User

SCAN() to separate the components. 

 

PROC FORMAT to recode the values. 

 


@Mani1 wrote:

Hi,

 

I need suggestion which function in SAS can be used to achieve below example requirement

 

I got data column as single variable

 

ABC123;ABC245;ABC678;

 

Also got lookup table as

 

123 mean good

245 mean average

678 mean bad and so on

 

I need to write some code to read cell ABC123;ABC245;ABC678; in a data set and lookup equivalent value in and data table and create new column with description .

 

Looking for any suggestion.

 

 

 

 


 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi Mani1

 

Do you have your string 'ABC123;ABC245;....' as a variable in a SAS data set, or is it in some external data source, so the first task is to get it into SAS? - I ask because you use the spreadsheet-word CELL. If it is in SAS, use Reeza's suggestion and use Scan to split the string in variables, otherwise you could split while reading the source.

 

Does the string always contain exact 3 results, or could it be any number? - and is a result always exact 3 digits and always preceded by ABC, or is ABC used as a placeholder in the example for a course name or something? My example below works with any number of results and almost anything in a given result, as long as the last non-blank character before the result value is not a digit, like in "this is the 1. result 123" -> 123.

 

SAS formats are fine in a SAS report, and useful as a tool for recoding, but do not depend on using a format with a varable. You lose it if you keep your tables in a database, and it is often useful to have both the formatted and the unformatted value, so I perfer keeping the formatted value in a separate variable. And because you have a look-up table, you can get that with a simple join instead of having to transform your lookup-table into a format.

 

data results;
	id = 1; results = 'ABC123;ABC245;ABC678;'; output;
	id = 2; results = 'ABC123;ABC245;'; output;
run;
data scoreval;
	length score $10;
	result = 123; score = 'good'; output;
	result = 245; score = 'average'; output;
	result = 678; score = 'bad'; output;
run;

data inbetween; set results;
	keep id prefix result;
	length prefix $40. w $40;
	do i = 1 to count(results,';');
		* split in single results;
		w = scan(results,i,';');
		* extract prefix as everything until only digits to end of string;
		prefix = prxchange('s/(.*\D)(\d*$)/$1/',-1,trim(w));
		* extract numeric result as all digits following the last non-digit char in string;
		result = input(prxchange('s/(.*\D)(\d*$)/$2/',-1,trim(w)),8.);
		output;
	end;
run;

proc sql;
	create table scores as
		select a.id, a.prefix, a.result, b.score 
		from inbetween as a left join scoreval as b
		on a.result = b.result
order by id; quit;

scores.gif

 

Mani1
Calcite | Level 5

Thank Jensen for your guidance. I will try to apply your logic and update you on results.

 

Answer to your question.

 

1. As you mentioned, its a string. I got single column in SAS dataset  with multiple value separated by ";" the prefix for each value is not always same. Here I have show ABC as prefix. but its not always same. The dataset I have, Looks like I need to extract last four digit. (Can you help me with that) 

 

2. To add more information to the type of data I have in the string column: Let me explain you with two data line.

 

Example:                                Decoded value

xyz123; xyz435; xyz987;       M   California   US

wcz453; wcz657;wcz987;      Arizona F US

 

Here description for codes are

123 - M (Gender Male)

657 - F (Gender Female)

435 - California (State )

987 - Arizona (State)

987 - US  (Country)

 

If you see in above example; I got multiple information in single string and also not in same order. (as you see above - Gender code is first order in first dataline  and 2nd order  in second dataline). 

 

I need to extract this information by scanning each string and look for gender code and apply correct gender description in gender column and so on for each data. If code is missing, then I need to say not available. I have separate table with possible code (4 digit) for gender, state and country. I need to lookup for possible value while scanning each string and generate description (the prefix is not always same, but code is common..example xyz123 or abc123, both are male)

 

Thanks for your help.

 

I'm beginner in SAS programming and trying to learn while working in my project.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi

 

I need to know more about the code table.

First about values: You wrote: possible code (4 digit), but what is 4 digits in this case, it is a character variable with values like "0123", or is it a number with up to 4 digits? - and I guess that means that string "XYZ987" could also be 4 digits like "XYZ1139".

 

Next: In your example you have code 987 meaning both Arizona (State) and US (Country). I hope it's a typing error, because technically  the same code could be used for State and Country, as long as you the type. But your input string "XYZ987" has no indication of whether 987 is Gender, State og Country, so the code has to be unique.

 

That leads to the last problem. If you want separate columns for Gender, State og Country in your output, you must have a table with the type of each code. You do not need it to associate code 987 with the text "M", but you need it to tell that "M" shold appeas in a column named "Gender".

Mani1
Calcite | Level 5

Few additional points:

 

1. code 987 meaning both Arizona (State) and US (Country.

Its a typo error. every code has unique description and category as you said.

 

2.First about values: You wrote: possible code (4 digit), but what is 4 digits in this case, it is a character variable with values like "0123", or is it a number with up to 4 digits? - and I guess that means that string "XYZ987" could also be 4 digits like "XYZ1139".

 

a) Code with description are with 4 digits like 0123, 1100, 0200 etc; each falling under different category like Gender, Country , State. (I have data set with code, description and category)

 

b) the variable in the data set looks like ABC0100;2745L2000;600AT0100; etc. (prefix need not be always numeric/ character) as you have observed, last four digit (or character) are always code for which i need to map description for the given category column;

 

c)as i described in my previous post, category position are not always same in given string. 

 

3) That leads to the last problem. If you want separate columns for Gender, State og Country in your output, you must have a table with the type of each code. You do not need it to associate code 987 with the text "M", but you need it to tell that "M" shold appeas in a column named "Gender".

 

a) I have lookup dataset with three variables - Code, code description and Category. For every category, i need to have column generated in output data set with code description for each dataline. If code is not available in given string, need to identify it as missing.

 

 

 

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi

 

I will start with creating some test data based on your description and examples. As mentioned a few times in this community, it would be a great help, if the person asking a question also provided the code to make test data, so it could be copy-pasted into SAS. 

 

 

/* Simulate input data table */
data have;
	string = 'xyz0123; xyz0435; xyz0987;'; output;
	string = 'wcz0453; wcz0657; wcz0987;'; output;
	string = 'abc0123; abcsdfsdf0435;'; output;
	string = 'erfsdf0453; fgh0657; hij1111;'; output;
run;

/* Simulate existing code table */
data codes;
	length Code $4 Description $20 Category $10; 
	Code = '0123'; Description = 'M'; Category = 'Gender'; output;
	Code = '0657'; Description = 'F'; Category = 'Gender'; output;
	Code = '0435'; Description = 'California'; Category = 'State'; output;
	Code = '0453'; Description = 'Arizona'; Category = 'State'; output;
	Code = '0987'; Description = 'US'; Category = 'Country'; output;
run;

 

 

Then we should find a way to use code descriptions and categories in a recoding program. In this case hash objects would be the right choice, but it has a steep learning curve and requires a deep understanding af what's going on in a data step, so I will - as suggested by Reeza - use formats. Wi will create two formats (= translation tables) out of the code table. SAS Proc format can create a format from a table with a specified structure, so we build the tables and then use them as input to proc format.

 

/* Create format to translate from Code to Description */
data descfmt; set codes end=eof;
	keep fmtname type start label hlo;
	retain fmtname 'descfmt' type 'C';
	start = Code;
	label = Description;
	output;
run;
proc format cntlin=descfmt;
run;

/* Create format to translate from Code to Category */
data catfmt; set codes end=eof;
	keep fmtname type start label hlo;
	retain fmtname 'catfmt' type 'C';
	start = Code;
	label = Category;
run;
proc format cntlin=catfmt;
run;

Now we are ready to do the recoding String in a data step. Other variables besides String in the input data set are transferred "as is" to the output data set. 

 

The idea is to process each record and loop over the elements in the input string. In the loop, we extract the element and split it in Prefix and Code. We then use the Category-format to find out which variable should hold the description, and use the Description-format to get the value. All variables are initiated with a not-ascertainable-value, that will be kept if a code is not found in the current string to match a given category.

 

If there is anything in the code you don't understand, like the functions used, then try googling it, just remember always to precede your search term (like scan function) with SAS. The built-in help is actually pretty good, too.

 

* Use formats to recode input;
data want; set have;
	length Code $4 wpref GenderPrefix Gender StatePrefix State CountryPrefix Country $40;
	drop String Code i wstr strl wpref;

	* Initiate variables with n.a;
	Gender = 'n.a.'; State = 'n.a.'; Country = 'n.a.';
	GenderPrefix = 'n.a.'; StatePrefix = 'n.a.'; CountryPrefix = 'n.a.';
	
	* Loop over elements in String;
	do i = 1 to count(string,';');

		* Split in single results and set result length;
		wstr = left(scan(string,i,';'));
		strl = length(trim(wstr));

		* Extract prefix as everything except last four characters, and code as rest;
		wpref = substr(wstr,1,strl-4);
		Code = substr(wstr,strl-3);

		* Set Description in proper variable according to Category; 
		if put(Code,$catfmt.) = 'Gender' then do;
			GenderPrefix = wpref;
			Gender = put(Code,$descfmt.);
		end;
		else if put(Code,$catfmt.) = 'State' then do;
			StatePrefix = wpref;
			State = put(Code,$descfmt.);
		end;
		else if put(Code,$catfmt.) = 'Country' then do;
			CountryPrefix = wpref;
			Country = put(Code,$descfmt.);
		end;
	end;
	output;
run;

And what do we get as output:

 

want.gif

 

I hope this helps.

 

 

Reeza
Super User

A modification on @ErikLund_Jensen  so that the position doesn't matter. 

Note that the 11 doesn't have a match so it produces an error. I didn't design the code to handle errors, but you may want to. 

 

Basically, rather than attempt to set values, I create a row for each element and transpose it afterwards instead. 

 

Thanks to @ErikLund_Jensen  for the starter code - I used his first few steps and code

 

/* Simulate input data table */
data have;
   string = 'xyz0123; xyz0435; xyz0987;'; output;
   string = 'wcz0453; wcz0657; wcz0987;'; output;
   string = 'abc0123; abcsdfsdf0435;'; output;
   string = 'erfsdf0453; fgh0657; hij1111;'; output;
run;

/* Simulate existing code table */
data codes;
   length Code $4 Description $20 Category $10; 
   Code = '0123'; Description = 'M'; Category = 'Gender'; output;
   Code = '0657'; Description = 'F'; Category = 'Gender'; output;
   Code = '0435'; Description = 'California'; Category = 'State'; output;
   Code = '0453'; Description = 'Arizona'; Category = 'State'; output;
   Code = '0987'; Description = 'US'; Category = 'Country'; output;
run;

/* Create format to translate from Code to Description */
data descfmt; set codes end=eof;
    keep fmtname type start label hlo;
    retain fmtname 'descfmt' type 'C';
    start = Code;
    label = Description;
    output;
run;
proc format cntlin=descfmt;
run;

/* Create format to translate from Code to Category */
data catfmt; set codes end=eof;
    keep fmtname type start label hlo;
    retain fmtname 'catfmt' type 'C';
    start = Code;
    label = Category;
run;
proc format cntlin=catfmt;
run;


data temp;
set have;
nElements = countw(string);
row = _n_;
do i=1 to nElements;
element = compress(scan(string, i, ';'), , 'kd');
category = put(element, catfmt.);
description = put(element, descfmt.);
output;
end;
run;

proc sort data=temp;
by row category description;
run;

proc transpose data=temp out=want;
by row;
id category;
var description;
run;


ErikLund_Jensen
Rhodochrosite | Level 12

@Reeza 

 

Transpose is a much better solution, because it makes the code resistant to introduction of more categories. But splitting with compress is dangerous, as it is unknown if the prefix part of an element can contain digits.

 

And I am curious as to what you mean by "so that the position doesn't matter", because I tried the code with different positions of categories in a row and category duplicates, and it keeps the last occurrence of a given category regardless of position.

 

Though being a SAS user since V79, I never really got proc transpose into my "intuitive toolbox", so I wonder if transpose could handle the prefixes too like in my output example, if that was wanted. I would appreciate an answer on that.

Reeza
Super User

@ErikLund_Jensen wrote:

@Reeza 

 

Transpose is a much better solution, because it makes the code resistant to introduction of more categories. But splitting with compress is dangerous, as it is unknown if the prefix part of an element can contain digits.


 

Agreed.

 


And I am curious as to what you mean by "so that the position doesn't matter", because I tried the code with different positions of categories in a row and category duplicates, and it keeps the last occurrence of a given category regardless of position.


I was referring to position of elements within the string, ie Gender being before Country. It doesn't matter with a Transpose approach. 

 


 

Though being a SAS user since V79, I never really got proc transpose into my "intuitive toolbox", so I wonder if transpose could handle the prefixes too like in my output example, if that was wanted. I would appreciate an answer on that.


 

I'm not sure about the prefix part, I think that may require multiple transpose steps, but that information is all int he formats so I'm not sure how valuable it is to have it built into the data set. I don't think PROC TRANSPOSE can do it, but there's a macro via some users on here that supports transpose. 

 

Recently (last 5 years) SAS has introduced more support for prefixes in names and such.

Mani1
Calcite | Level 5

Thank you for your insights.

 

Mani.

Mani1
Calcite | Level 5

This helped me perfectly. Thanks for your help. Mani.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1689 views
  • 1 like
  • 4 in conversation