Re: How to create dummy variables with this case

CynthiaWei · Posted 06-23-2020 07:30 PM

SAS Pros,

I want to create some dummy variables to describe the type of food listed in the lunch variable.

What I have is the variable ID and Lunch. Lunch is a character variable. It includes the food the people have for their lunch. The items included in the Lunch variable is kinda in a non-organized way. You will also see that ID=1 and ID=6 actually have the same food but are listed in the different order. So, how can I create dummy variables protein=1, carbon=1, fiber=1, and juice=1 to represent the lunch they have. Also, can the SAS code treat the Lunch variable as a whole instead of splitting it by its items.

What I want:

ID Lunch Protein Carbon Fiber Juice
1 egg; bread; carrot; 1 1 1
2 cheese; veggie juice; cucumber; 1 1 1
3 noodles; three pieces of pork ribs; lettuce 1 1 1
4 some bread stick; salmon; mango juice 1 1 1 1
5 milk; cookie; tomato; 1 1 1
6 bread; carrot; egg; 1 1 1

Mango juice fall into both fiber and juice.

Right now, I am using the following code:

data want;

set have;

if lunch in ('egg; bread; carrot;',
'cheese; veggie juice; cucumber;',
'noodles; three pieces of pork ribs; lettuce',
'some bread stick; salmon; mango juice',
'milk; cookie; tomato;',
'bread; carrot; egg; ') then protein=1;

else if if lunch in ('egg; bread; carrot;',
'noodles; three pieces of pork ribs; lettuce',
'some bread stick; salmon; mango juice',
'milk; cookie; tomato;',
'bread; carrot; egg; ') then carbon=1;

else if lunch in ('egg; bread; carrot;',
'cheese; veggie juice; cucumber;',
'noodles; three pieces of pork ribs; lettuce',
'some bread stick; salmon; mango juice',
'milk; cookie; tomato;',
'bread; carrot; egg; ') then fiber=1;

else if lunch in ('cheese; veggie juice; cucumber;',
'some bread stick; salmon; mango juice') then juice=1;

run;

But, it seems doesn't generate the correct dummy variables. Could you please help me debug the code or there is some better code I can use?

I really appreciate any kind of help!

Best regards,

C

Shmuel · Posted 06-23-2020 08:12 PM

Cancel the "else" in your code:

data want;
set have;

if lunch in ('egg; bread; carrot;',
'cheese; veggie juice; cucumber;',
'noodles; three pieces of pork ribs; lettuce',
'some bread stick; salmon; mango juice',
'milk; cookie; tomato;',
'bread; carrot; egg; ') then protein=1;

/*else*/ if if lunch in ('egg; bread; carrot;',
'noodles; three pieces of pork ribs; lettuce',
'some bread stick; salmon; mango juice',
'milk; cookie; tomato;',
'bread; carrot; egg; ') then carbon=1;

/*else*/ if lunch in ('egg; bread; carrot;',
'cheese; veggie juice; cucumber;',
'noodles; three pieces of pork ribs; lettuce',
'some bread stick; salmon; mango juice',
'milk; cookie; tomato;',
'bread; carrot; egg; ') then fiber=1;

 /*else*/ if lunch in ('cheese; veggie juice; cucumber;',
'some bread stick; salmon; mango juice') then juice=1;

run;

ChrisNZ · Posted 06-23-2020 08:12 PM

Like this?

if prxmatch('/egg|cheese|pork|salmon|milk/i',trim(LUNCH)) then PROTEIN=1;

if prxmatch('/bread|noodles|cookie/i' ,trim(LUNCH)) then CARBON =1;

if prxmatch('/carrot|veggie|tomato/i' ,trim(LUNCH)) then FIBER =1;

if prxmatch('/juice/i' ,trim(LUNCH)) then JUICE =1;

High-Performance SAS Coding - Third Edition

Reeza · Posted 06-23-2020 10:52 PM

I'm going to make a suggestion. Make a master table that maps each food to its category. Use that to create a format.
Then split your data so that you have one food for each line.

I suspect any further analysis in this structure will be painful and restructuring early will save you headaches.

ballardw · Posted 06-24-2020 10:51 AM

You have at least two data issues that bring up questions.

First, why do the values of "noodles; three pieces of pork ribs; lettuce" and "some bread stick; salmon; mango juice" not end in a semicolon like the others? Since you are comparing with an equals condition that may be an issue if the values do end in ;

The other and why @Reeza's comment about restructuring is particularly valid are the two values of

"egg; bread; carrot;" and "bread; carrot; egg;". Since again you are using an equals condition then neither of these would match the "carrot; bread; egg;" when it occurs in the data. And if some are missing the ; at the end and others aren't or the spaces aren't consistent you must list many combinations. Which gets worse if somehow you have values of Egg as well as egg.

One question is always "why are there multiple values in a single variable?" Anything like that often gets much harder to work with.

It might be a good idea to add variables that indicate each item such as a variable Egg that has a value of 1 when it is in the list and 0 otherwise.

Then you could use some code like:

Protein = (max(Egg, Pork, Salmon, <other variables indicating protein>) = 1);

CynthiaWei · Posted 06-27-2020 08:34 PM

Thank you so much for your response!

Actually, what you have pointed out is what the nature of the dataset I am using. It is true that there are multiple values in a single variable. Also, the difference in order and capital of certain values make it different even though they include the same items, e.g. 'Egg; bread; carrot;', ''bread; Carrot; Egg;', and 'carrot; Bread; egg'. It is also the nature of the dataset that some texts end in a semicolon but some don't and some have a trailing space after text and some don't. I can't agree with you and Reenza that I should treat the multiple values individually. However, there are so many observations and multiple values in each observation. So far, I have already completed the category for each combination of the multiple value. I think use each of these combination as a whole instead of individual one is a better way for me.

I am facing another QUESTIONS!

Specifically, I want to remove all the trailing space after text. I don't care about the leading space or the extra space in between. I have tried trim(var name), trimn(var name), strip(var name), and compress(var name). But none of them worked for my data. I don't why it doesn't.

Also, I am curious that why SAS cannot recognize the text with a trailing space, e.g. 'Egg; bread; carrot;' and 'Egg; bread; carrot; ' (there is a trailing space after). I found the texts with trailing spaces were not successfully coded.

Really appreciate!

Reeza · Posted 06-27-2020 09:47 PM

Because as character strings go, they are not the same. A space is a character to the computer system. And what looks like a space may also be a tab or different whitespace character. If you have a trailing space how should SAS know that it's supposed to match them? Should a tab match a regular space? In your particular case it should but I've seen times when it shouldn't. They have to pick an approach and most languages default to using exact comparisons.

Patrick · Posted 06-27-2020 09:49 PM

@CynthiaWei wrote:

I can't agree with you and Reenza that I should treat the multiple values individually.

You probably should listen to what people as experienced as @ballardw and @Reeza are recommending. It's much easier to work with a long and narrow data set and you can always transpose the data to generate the wide structures if and when you need it.

Below how you could at least "clean-up" your lunch strings a bit so that it's easier to work with. The code to transpose your data into a long format would be very similar.

data have;
  infile datalines truncover dlm='|';
  input id lunch :$100.;
  datalines4;
1|egg; bread; carrot;
2|cheese; veggie juice; cucumber;
3|noodles; three pieces of pork ribs; lettuce 
4|some bread   stick; salmon; mango juice
5|milk; cookie; tomato;
6|bread; carrot; egg;
;;;;

data have_stdz;
  set have;
  array items {10} $40 _temporary_;
  call missing(of items[*]);
  do _i=1 by 1;
    items[_i]=scan(lunch,_i,';');
    items[_i]=prxchange('s/\s+/ /',-1,strip(items[_i]));
    items[_i]=propcase(strip(items[_i]));
    if missing(items[_i]) then leave;
  end;
  call sortc(of items[*]);
  lunch_stdz=catx(';',of items[*]);
run;

proc print;
run;

Reeza · Posted 06-27-2020 11:22 PM

I can't agree with you and Reenza that I should treat the multiple values individually. However, there are so many observations and multiple values in each observation. So far, I have already completed the category for each combination of the multiple value. I think use each of these combination as a whole instead of individual one is a better way for me.

The reason I recommend this approach is for accuracy, ease of calculations later on and scalability.

Regarding accuracy, your proposed method is more prone to missing categories and not being able to easily find these issues in your data unless you explicitly look for it. This means each time you need to re-run this analysis or something changes you MUST check it in detail again. Your code/approach assumes if you don't find an item, it's not there, not that your code/logic may have missed it. If a single item is out of order or gets truncated you'll also miss it. In my approach, if you miscode an item it will get flagged and you'll see that issue.Ultimately it is your project and you can choose whatever method is easier for you, but I thought I'd explain why I recommend this approach in detail and provide an example for you below. There are other ways to also account for these concerns but this is one of the simplest IMHO.

data have;
  infile datalines truncover dlm='|';
  input id lunch :$100.;
  datalines4;
1|egg; bread; carrot;
2|cheese; veggie juice; cucumber;
3|noodles; three pieces of pork ribs; lettuce 
4|some bread stick; salmon; mango juice
5|milk; cookie; tomato;
6|bread; carrot; egg;
7|chips; chocolate; mango
;;;;

data item_list;
set have;

nWords = countc(lunch, ';');

do i=1 to nWords;
   item = lowcase(strip(trim(scan(lunch, i, ';'))));
   output;
end;

run;

proc freq data=item_list ;
table item / out=categorize_list;
run;

/*I would then output that to Excel or a document and create my categories of the food types
and then import that data back to SAS or do it in SAS directly with IF/THEN statemetns*/

data cg_list;
set categorize_list;
length food_group $15.;

if item in 
    ('egg', 'three pieces of pork ribs', 'salmon') 
        then food_group = "meat";
else if item in 
    ('carrot', 'lettuce', 'veggie juice', 'cucumber' )  
        then food_group = "vegetable";
else if item in 
    ('mango juice', 'tomato') 
        then food_group = "fruit";
else if item in 
    ('bread', 'noodles', 'cookie', 'some bread stick') 
        then food_group = "breads";
else if item in 
    ('cheese', 'milk') 
        then food_group = "dairy";
else if item in 
    ('chips', 'chocolate') 
        then food_group = "junk";
else food_group = "CHECKME";

run;

*check my mappings - make sure no CHECKME;
proc print data=cg_list;
where food_group = "CHECKME";
run;

*create a format out of it;
data cg_fmt;
set cg_list;
fmtname = 'food_group_fmt';
type ='C';
start = item;
label = food_group;
run;

proc format cntlin=cg_fmt;
run;

*apply format to data;
data items_categorized;
set item_list;
food_group = put(item, $food_group_fmt.);
count = 1;
run;

Examples of usage:


title 'how many people ate from each food group?';
proc sql;
select food_group, count(distinct ID) as Num_People
from items_categorized
group by food_group;
quit;

title 'which people ate how much of each food group?';
proc freq data=items_categorized;
table id*food_group / nopercent nocol norow;
run;

(The output, looks better in SAS than here but I was too lazy to paste a picture in)

how many people ate from each food group?


food_group	Num_People
breads	5
dairy	2
fruit	1
junk	1
meat	4
vegetable	3

which people ate how much of each food group?

The FREQ Procedure

Frequency


Table of id by food_group
id	food_group
id	breads	dairy	fruit	junk	meat	vegetable	Total
1	1	0	0	0	1	1	3
2	0	1	0	0	0	2	3
3	1	0	0	0	1	0	2
4	1	0	0	0	1	0	2
5	1	1	1	0	0	0	3
6	1	0	0	0	1	1	3
7	0	0	0	2	0	0	2
Total	5	2	1	2	4	4	18

Patrick · Posted 06-27-2020 11:38 PM

Here a coding approach how you can get to your desired result. It also demonstrates the advantages of using a long and narrow structure.

All you need to do now is amend the format to cover all your items. Should you actually have a master list somewhere with food names and categories then one could potentially even dynamically create both the format and the parsing of your Lunch string to pull out the food items.

data have;
  infile datalines truncover dlm='|';
  input id lunch :$100.;
  datalines4;
1|egg; bread; carrot;
2|cheese;	veggie juice; cucumber;
3|noodles; three pieces	of pork ribs; lettuce 
4|some bread   stick; salmon; mango juice
5|milk; cookie; tomato;	
6|bread; carrot; egg;
;;;;

data have_long(drop=_:);
  length id 8;
  set have(rename=(id=group_id));
  length item $40;
  do _i=1 by 1;
    item=scan(lunch,_i,';');
    item=prxchange('s/\s+/ /',-1,strip(item));
    item=upcase(strip(item));
    if missing(item) then leave;
    id+1;
    output;
  end;
run;

proc format;
  value $foodcat (multilabel)
    'BREAD' = 'Carbon'
    'CARROT' = 'Fiber'
    'CHEESE' = 'Protein'
    'COOKIE' = 'Carbon'
    'CUCUMBER' = 'Fiber'
    'EGG' = 'Protein'
    'LETTUCE' = 'Fiber'
    'MANGO JUICE' = 'Fiber'
    'MANGO JUICE' = 'Juice'
    'MILK' = 'Protein'
    'NOODLES' = 'Carbon'
    'SALMON' = 'Protein'
    'SOME BREAD STICK' = 'Carbon'
    'THREE PIECES OF PORK RIBS' = 'Protein'
    'TOMATO' = 'Fiber'
    'VEGGIE JUICE' = 'Fiber'
    'VEGGIE JUICE' = 'Juice'
    ;
run;

proc summary data=have_long n nway;
  class group_id;
  class item / mlf;
  id lunch;
  format item $foodcat.;
  output out=long_n;
run; 

proc transpose data=long_n out=want(drop=_:) ;
  by group_id lunch;
  id item;
  var _freq_;
run;

proc print data=want;
run;

Tom · Posted 06-27-2020 11:40 PM

Specifically, I want to remove all the trailing space after text.

I hope you understand by now that SAS stores character strings as fixed length. So if you remove the trailing spaces and then assign it to character variable the spaces will just be added back on so that the string is the length that the variable can hold. Because of this when SAS compares two strings it ignores the trailing spaces.

ChrisNZ · Posted 06-28-2020 01:51 AM

Have you tried my code? It works regardless of spaces or case. Why make your life difficult?

High-Performance SAS Coding - Third Edition

Shmuel · Posted 06-28-2020 02:02 PM

I suppose you have much more combinations of different kinds of food and many ingredients to associate with that food.

Next code steps include:

1) create a dataset per ingredient with all kinds of food having it

2) creating a format to supply a list of (1 or 0) of all ingredients

3) analyze the lunch for food contained

(check log for skipped food)

and assign ingredients into variables with value 1 or 0.

You can add ingredient datasets as foods per ingredient - as mush as you need,

and update line: %let ing_order = carbon,protein,fiber; to contain full list of ingredients;

Next is a tested code:

data protein;
  retain protein 1;
  infile cards;
  input food $;
cards;
egg
bread
carrot
cheese
pork
salmon
; run;

data carbon;
  retain carbon 1;
  infile cards;
  input food $;
cards;
bread
noodles
cookie
; run;

data fiber;
  retain fiber 1;
  infile cards;
  input food $;
cards;
carrot
lettuce
tomato
;run;

%let ing_order = carbon,protein,fiber; /* choose your prefered order */
/* each ingredient is a dataset and a variable name in the wanted dataset */

%macro sort_all;
   %let ni = %sysfunc(countw("&ing_order"));
   %do j=1 %to %eval(&ni);
       %let dsn = %scan(%quote(&ing_order),&j);
       proc sort data=&dsn; by food; run;
    %end;
%mend;
%sort_all;

data _null_;
     ing_order = "&ing_order";
     ni = countw(ing_order);
     length ing_list $100; /*adapt length to contain the full list */
     do i=1 to ni;
         ing_list = catx(' ',ing_list, scan(ing_order,i));
     end;
     put ing_list = ;
     call symput('ing_list',trim(ing_list));
run;

data all_food(drop=fmtname)
     cntl (rename=(food=start ingredients=label)
            keep=fmtname food ingredients);
merge protein carbon fiber /* other ... */ end=eof;
 by food;
    retain fmtname '$ingred';
    length ingredients $40;
    ingredients = catx(',',&ing_order);
    output;
    
    if eof then do;
       food = '**OTHER**';
       ingredients = '***';
       output;
    end;
run;
/* remark: as label is limited to $40 - this method is limited to 20 ingredients */
proc format lib=work cntlin=cntl; run;

data lunch;
 infile datalines4 truncover;
 input id lunch $char100.;
datalines4;
1 egg; bread; carrot;  
2 cheese; veggie juice; cucumber; 
3 noodles; three pieces of pork ribs; lettuce 
4 some bread stick; salmon; mango juice 
5 milk; cookie; tomato;   
6 bread; carrot; egg; 
;;;;
run;

options mprint ;
%macro spread;
   %let ni = %sysfunc(countw("&ing_order"));
   %do j=1 %to %eval(&ni);
     %let varname = %scan(%quote(&ing_order),&j);
     %put VARANME = &varname;
       %do; 
            ing = input(scan(ingredients,&j),1.);
            if ing = . then ing = 0;
            &varname = max(&varname, ing);
       %end;
   %end;
%mend spread;

data want(drop=nw i food ing ingredients);
 set lunch;
     length food $12;
     array ngr {*} &ing_list;
     nw = countw(lunch);
     do i=1 to nw;
        food = scan(lunch,i,' ;');
        ingredients = translate(put(food,$ingred.),'0','.');
        if substr(ingredients,1,1) not in ('1','0') then put food= ingredients=;
        else do;
           %spread; 
        end;
     end;
run;

SAS Innovate 2025: Call for Content

Classroom Training Available!