Solved: Creating new observations in a dataset if a value in one column matche...

fikel · Posted 01-18-2021 03:15 AM

Hello, I have a dataset from a nutrition database called fnddssr_wt_perc. There are three variables: foodcode (an 8-digit number) which represents a particular food, sr_code (either 4-5 digits or 8 digits) which represents an ingredient in that food, and wt_percent which represents the relative weight of each ingredient (sr_code) in the food (foodcode). For an example from the sample data here: for food 11445000, the ingredient 12142 is 2.89% and 1121 is 97.1%. There are 27908 observations because most of the ~4000 unique foodcodes are associated with multiple sr_codes. Here are two samples:

datalines;
11445000 12142 0.0289017341
11445000 1121 0.9710982659
11459990 11460170 0.8
11459990 11460160 0.2
11460000 19335 0.13
;

The creators of this data used some foodcodes as sr_codes which can be seen as the 8-digit codes in the second column (11460170) in order to simplify the database. I need to find each time they used an 8-digit foodcode as an sr_code (which should be 4 or 5 digits) and replace the 8-digit entry with its respective entries in the database while accounting for the wt_percent of the original entry.

As an example, for foodcode 11459990 above, I need to match the 8-digit SR code to their respective foodcode entries in the data which are:

datalines;
11460160 19335 0.13
11460160 19166 0.052
11460160 1117 0.818
11460170 19335 0.13
11460170 1117 0.87
;

and multiply the replacement sr_code wt_percent by the original entry's wt_percent, (0.8 for 11460160, 0.2 for 11460170) and replace the sr_codes, so that the new dataset would have these values:

datalines;
11445000 12142 0.0289017341
11445000 1121 0.9710982659
11459990 19335 0.104
11459990 1117 0.696
11459990 19335 0.026
11459990 19166 0.0104
11459990 1117 0.1636
11460000 19335 0.13
;

The goal is essentially to expand out the ingredients that are foods from column one into their actual ingredients that have the 4-5 digit sr_codes in column 2. I have tried looking at tutorials and papers, and it seems that somehow I would either need to use a temporary array or a hash table as a way to lookup sr_codes (column 2) to find the matching foodcode entries (column 1) and somehow output all of the associated matches (since there are multiple observations for some of the foodcodes) while also multiplying by the original 8-digit sr_code's wt_percent. I am very lost - I started programming 2 weeks ago and know very little about arrays, hash tables, or macros, so I would greatly appreciate any help.

PeterClemmensen · Posted 01-18-2021 01:36 PM

I think this gives the correct result.

data want(drop = sr wt w);

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = (sr_code = sr wt_percent = wt) where=(ceil(log10(sr+1)) < 7))", multidata : "Y");
      h.definekey("foodcode");
      h.definedata("sr", "wt");
      h.definedone();
   end;

   set have;
   sr = .;
   wt = .;

   if ceil(log10(sr_code+1)) = 8 then do;
      w = wt_percent;
      do while (h.do_over(key : sr_code) = 0);
         sr_code = sr;
         wt_percent = w * wt;
         output;
      end;
   end;

   else output;

   format wt_percent best12.;

run;

Result:

foodcode  sr_code  wt_percent 
11445000  12142    0.0289017341 
11445000  1121     0.9710982659 
11459990  19335    0.104 
11459990  1117     0.696 
11459990  19335    0.026 
11459990  19166    0.0104 
11459990  1117     0.1636 
11460000  19335    0.13 
11460160  19335    0.13 
11460160  19166    0.052 
11460160  1117     0.818 
11460170  19335    0.13 
11460170  1117     0.87

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 01-18-2021 03:48 AM

are the two data sets in the two first code snippets from the same original data? Or are they separate data sets?

The DATA to DATA Step Macro
Blog: SASnrd

fikel · Posted 01-18-2021 10:28 AM

They are from the same original dataset: fnddssr_wt_perc

PeterClemmensen · Posted 01-18-2021 11:13 AM

Ok. Are foodcode and sr_code both numeric?

The DATA to DATA Step Macro
Blog: SASnrd

fikel · Posted 01-18-2021 11:44 AM

Yes, foodcode and sr_code are both numeric.

PeterClemmensen · Posted 01-18-2021 12:42 PM

Ok. I think I understand the replacement part. See the code below. I combined the two data sets in the first two code snippets. You said they were from the data data set right? Does the code replace the values as you expect?

Regarding the multiplication, I'm not sure I follow. Can you be more specific? Perhaps illustrate with the data I post below.

Feel free to ask 🙂

data have;
input foodcode sr_code wt_percent 20.10;
datalines;
11445000 12142    0.0289017341
11445000 1121     0.9710982659
11459990 11460170 0.8         
11459990 11460160 0.2         
11460000 19335    0.13        
11460160 19335    0.13        
11460160 19166    0.052       
11460160 1117     0.818       
11460170 19335    0.13        
11460170 1117     0.87        
;

data want(drop = sr);

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = (sr_code = sr) where=(ceil(log10(sr+1)) < 7))", multidata : "Y");
      h.definekey("foodcode");
      h.definedata("sr", "wt_percent");
      h.definedone();
   end;

   set have;
   sr = .;

   if ceil(log10(sr_code+1)) = 8 then do;
      do while (h.do_over(key : sr_code) = 0);
         sr_code = sr;
         output;
      end;
   end;

   else output;

run;

Result:

foodcode  sr_code  wt_percent 
11445000  12142    0.0289017341 
11445000  1121     0.9710982659 
11459990  19335    0.13 
11459990  1117     0.87 
11459990  19335    0.13 
11459990  19166    0.052 
11459990  1117     0.818 
11460000  19335    0.13 
11460160  19335    0.13 
11460160  19166    0.052 
11460160  1117     0.818 
11460170  19335    0.13 
11460170  1117     0.87

The DATA to DATA Step Macro
Blog: SASnrd

fikel · Posted 01-18-2021 01:07 PM

Wow, it is so close. Like you said the difference is in the multiplication. The program you wrote correctly replaces the 8-digit sr_codes as expected. The last step would be that when we do the replacement, we multiply all of the incoming wt_percent's by the wt_percent being replaced. Ex: In the data you displayed, one of the entries that will be replaced is:

11459990 11460170 0.8  
11459990 11460160 0.2

We will be replacing these entries with the data for 11460170 and 11460160:

11460160 19335    0.13        
11460160 19166    0.052       
11460160 1117     0.818       
11460170 19335    0.13        
11460170 1117     0.87

However, the values for the new wt_percent needs to account for the original wt_percent of 11460170 which is 0.8 and 11460160 which is 0.2. So we would multiply the wt_percent's replacing 11460170 (0.13 and 0.87) by 0.8 each to get (0.104 and 0.696) and the wt_percent's replacing 11460160 (0.13, 0.052, and 0.818) by 0.2 each to get (0.026, 0.0104, 0.1636), so the finished product should be (notice that the sum of all wt_percent's for 11459990 ends up being 1):

11459990  19335    0.104 
11459990  1117     0.696 
11459990  19335    0.026 
11459990  19166    0.0104 
11459990  1117     0.1636

Instead of:

11459990  19335    0.13 
11459990  1117     0.87 
11459990  19335    0.13 
11459990  19166    0.052 
11459990  1117     0.818

I really hope that is clear. Thank you so much for the work you have done!

PeterClemmensen · Posted 01-18-2021 01:10 PM

Ok 5 min. Well solve this 🙂

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 01-18-2021 01:36 PM

I think this gives the correct result.

data want(drop = sr wt w);

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = (sr_code = sr wt_percent = wt) where=(ceil(log10(sr+1)) < 7))", multidata : "Y");
      h.definekey("foodcode");
      h.definedata("sr", "wt");
      h.definedone();
   end;

   set have;
   sr = .;
   wt = .;

   if ceil(log10(sr_code+1)) = 8 then do;
      w = wt_percent;
      do while (h.do_over(key : sr_code) = 0);
         sr_code = sr;
         wt_percent = w * wt;
         output;
      end;
   end;

   else output;

   format wt_percent best12.;

run;

Result:

foodcode  sr_code  wt_percent 
11445000  12142    0.0289017341 
11445000  1121     0.9710982659 
11459990  19335    0.104 
11459990  1117     0.696 
11459990  19335    0.026 
11459990  19166    0.0104 
11459990  1117     0.1636 
11460000  19335    0.13 
11460160  19335    0.13 
11460160  19166    0.052 
11460160  1117     0.818 
11460170  19335    0.13 
11460170  1117     0.87

The DATA to DATA Step Macro
Blog: SASnrd

fikel · Posted 01-18-2021 01:56 PM

That is awesome; it works. You have saved me a lot of time and helped me understand hash objects better with your example. Thank you so much!

PeterClemmensen · Posted 01-18-2021 01:59 PM

Anytime, glad to help.

Since you mention understanding hash object. If you truly want to learn them, read Data Management Solutions Using SAS Hash Table Operations. There is no better reference 🙂

The DATA to DATA Step Macro
Blog: SASnrd

Creating new observations in a dataset if a value in one column matches a value in another column

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Re: Creating new observations in a dataset if a value in one column matches a value in another colum

Ready to join fellow brilliant minds for the SAS Hackathon?