Hello, I have a dataset from a nutrition database called fnddssr_wt_perc. There are three variables: foodcode (an 8-digit number) which represents a particular food, sr_code (either 4-5 digits or 8 digits) which represents an ingredient in that food, and wt_percent which represents the relative weight of each ingredient (sr_code) in the food (foodcode). For an example from the sample data here: for food 11445000, the ingredient 12142 is 2.89% and 1121 is 97.1%. There are 27908 observations because most of the ~4000 unique foodcodes are associated with multiple sr_codes. Here are two samples:
datalines;
11445000 12142 0.0289017341
11445000 1121 0.9710982659
11459990 11460170 0.8
11459990 11460160 0.2
11460000 19335 0.13
;The creators of this data used some foodcodes as sr_codes which can be seen as the 8-digit codes in the second column (11460170) in order to simplify the database. I need to find each time they used an 8-digit foodcode as an sr_code (which should be 4 or 5 digits) and replace the 8-digit entry with its respective entries in the database while accounting for the wt_percent of the original entry.
As an example, for foodcode 11459990 above, I need to match the 8-digit SR code to their respective foodcode entries in the data which are:
datalines;
11460160 19335 0.13
11460160 19166 0.052
11460160 1117 0.818
11460170 19335 0.13
11460170 1117 0.87
;and multiply the replacement sr_code wt_percent by the original entry's wt_percent, (0.8 for 11460160, 0.2 for 11460170) and replace the sr_codes, so that the new dataset would have these values:
datalines;
11445000 12142 0.0289017341
11445000 1121 0.9710982659
11459990 19335 0.104
11459990 1117 0.696
11459990 19335 0.026
11459990 19166 0.0104
11459990 1117 0.1636
11460000 19335 0.13
;The goal is essentially to expand out the ingredients that are foods from column one into their actual ingredients that have the 4-5 digit sr_codes in column 2. I have tried looking at tutorials and papers, and it seems that somehow I would either need to use a temporary array or a hash table as a way to lookup sr_codes (column 2) to find the matching foodcode entries (column 1) and somehow output all of the associated matches (since there are multiple observations for some of the foodcodes) while also multiplying by the original 8-digit sr_code's wt_percent. I am very lost - I started programming 2 weeks ago and know very little about arrays, hash tables, or macros, so I would greatly appreciate any help.
I think this gives the correct result.
data want(drop = sr wt w);
   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = (sr_code = sr wt_percent = wt) where=(ceil(log10(sr+1)) < 7))", multidata : "Y");
      h.definekey("foodcode");
      h.definedata("sr", "wt");
      h.definedone();
   end;
   set have;
   sr = .;
   wt = .;
   if ceil(log10(sr_code+1)) = 8 then do;
      w = wt_percent;
      do while (h.do_over(key : sr_code) = 0);
         sr_code = sr;
         wt_percent = w * wt;
         output;
      end;
   end;
   else output;
   format wt_percent best12.;
run;
Result:
foodcode sr_code wt_percent 11445000 12142 0.0289017341 11445000 1121 0.9710982659 11459990 19335 0.104 11459990 1117 0.696 11459990 19335 0.026 11459990 19166 0.0104 11459990 1117 0.1636 11460000 19335 0.13 11460160 19335 0.13 11460160 19166 0.052 11460160 1117 0.818 11460170 19335 0.13 11460170 1117 0.87
are the two data sets in the two first code snippets from the same original data? Or are they separate data sets?
Ok. Are foodcode and sr_code both numeric?
Ok. I think I understand the replacement part. See the code below. I combined the two data sets in the first two code snippets. You said they were from the data data set right? Does the code replace the values as you expect?
Regarding the multiplication, I'm not sure I follow. Can you be more specific? Perhaps illustrate with the data I post below.
Feel free to ask 🙂
data have;
input foodcode sr_code wt_percent 20.10;
datalines;
11445000 12142    0.0289017341
11445000 1121     0.9710982659
11459990 11460170 0.8         
11459990 11460160 0.2         
11460000 19335    0.13        
11460160 19335    0.13        
11460160 19166    0.052       
11460160 1117     0.818       
11460170 19335    0.13        
11460170 1117     0.87        
;
data want(drop = sr);
   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = (sr_code = sr) where=(ceil(log10(sr+1)) < 7))", multidata : "Y");
      h.definekey("foodcode");
      h.definedata("sr", "wt_percent");
      h.definedone();
   end;
   set have;
   sr = .;
   if ceil(log10(sr_code+1)) = 8 then do;
      do while (h.do_over(key : sr_code) = 0);
         sr_code = sr;
         output;
      end;
   end;
   else output;
run;
Result:
foodcode sr_code wt_percent 11445000 12142 0.0289017341 11445000 1121 0.9710982659 11459990 19335 0.13 11459990 1117 0.87 11459990 19335 0.13 11459990 19166 0.052 11459990 1117 0.818 11460000 19335 0.13 11460160 19335 0.13 11460160 19166 0.052 11460160 1117 0.818 11460170 19335 0.13 11460170 1117 0.87
Wow, it is so close. Like you said the difference is in the multiplication. The program you wrote correctly replaces the 8-digit sr_codes as expected. The last step would be that when we do the replacement, we multiply all of the incoming wt_percent's by the wt_percent being replaced. Ex: In the data you displayed, one of the entries that will be replaced is:
11459990 11460170 0.8  
11459990 11460160 0.2                We will be replacing these entries with the data for 11460170 and 11460160:
11460160 19335    0.13        
11460160 19166    0.052       
11460160 1117     0.818       
11460170 19335    0.13        
11460170 1117     0.87However, the values for the new wt_percent needs to account for the original wt_percent of 11460170 which is 0.8 and 11460160 which is 0.2. So we would multiply the wt_percent's replacing 11460170 (0.13 and 0.87) by 0.8 each to get (0.104 and 0.696) and the wt_percent's replacing 11460160 (0.13, 0.052, and 0.818) by 0.2 each to get (0.026, 0.0104, 0.1636), so the finished product should be (notice that the sum of all wt_percent's for 11459990 ends up being 1):
11459990 19335 0.104 11459990 1117 0.696 11459990 19335 0.026 11459990 19166 0.0104 11459990 1117 0.1636
Instead of:
11459990  19335    0.13 
11459990  1117     0.87 
11459990  19335    0.13 
11459990  19166    0.052 
11459990  1117     0.818I really hope that is clear. Thank you so much for the work you have done!
I think this gives the correct result.
data want(drop = sr wt w);
   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = (sr_code = sr wt_percent = wt) where=(ceil(log10(sr+1)) < 7))", multidata : "Y");
      h.definekey("foodcode");
      h.definedata("sr", "wt");
      h.definedone();
   end;
   set have;
   sr = .;
   wt = .;
   if ceil(log10(sr_code+1)) = 8 then do;
      w = wt_percent;
      do while (h.do_over(key : sr_code) = 0);
         sr_code = sr;
         wt_percent = w * wt;
         output;
      end;
   end;
   else output;
   format wt_percent best12.;
run;
Result:
foodcode sr_code wt_percent 11445000 12142 0.0289017341 11445000 1121 0.9710982659 11459990 19335 0.104 11459990 1117 0.696 11459990 19335 0.026 11459990 19166 0.0104 11459990 1117 0.1636 11460000 19335 0.13 11460160 19335 0.13 11460160 19166 0.052 11460160 1117 0.818 11460170 19335 0.13 11460170 1117 0.87
Anytime, glad to help.
Since you mention understanding hash object. If you truly want to learn them, read Data Management Solutions Using SAS Hash Table Operations. There is no better reference 🙂
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.
