08-21-2016 05:23 AM
Appreciate if someone help me with the following interview questions which I came across yesterday.
I've a dataset called countries and it has variable 'Country'. It has values as follows.
Now I need to create a dataset for each country. In this case I need to create a datasets like Country_US , Country_UK , Country_IND and Country_JAP. Tomorrow other countries like CHN , RUS can be added to the country variable. Then I need to create datasets like Country_CHN, Country_RUS. Since it is a production system, I need to write one macro which would suite for 'n' number of countries. I should not hard code any of the country values in my macro.
Thanks in advance for any help you offer me.
08-21-2016 06:10 AM - edited 08-21-2016 06:13 AM
Of course there is more than one way to do this. Here is one approach, using SQL:
%macro createm; proc sql noprint; select distinct country into :country1-:country99 from countries; quit; %let cnum=&sqlobs; %do i=1 %to &cnum; %let dsname=country_&&country&i; %if not %dsexists(dsname) %then %do; data country_&&country&i; /* furtther statements */ run; %end; %end; %mend; %createm
The macro dsexists is not given but is easy to do. Ask if you want help here.
There is a shortcut for :country1-:country999 in 9.4 but I would have to look it up.
Hope this helps,
EDIT: %dsexist could be replaced with
08-21-2016 07:56 PM
Macros for creating datasets based on variable values [ New ] https://goo.gl/qFJSsL https://communities.sas.com/t5/General-SAS-Programming/Macros-for-creating-datasets-based-on-variable-values/m-p/292969 HAVE Up to 40 obs from class total obs=19 Obs NAME SEX AGE HEIGHT WEIGHT 1 Alice F 13 56.5 84.0 2 Barbara F 13 65.3 98.0 3 Carol F 14 62.8 102.5 4 Jane F 12 59.8 84.5 5 Janet F 15 62.5 112.5 6 Joyce F 11 51.3 50.5 7 Judy F 14 64.3 90.0 8 Louise F 12 56.3 77.0 9 Mary F 15 66.5 112.0 10 Alfred M 14 69.0 112.5 11 Henry M 14 63.5 102.5 12 James M 12 57.3 83.0 13 Jeffrey M 13 62.5 84.0 14 John M 12 59.0 99.5 15 Philip M 16 72.0 150.0 16 Robert M 12 64.8 128.0 17 Ronald M 15 67.0 133.0 18 Thomas M 11 57.5 85.0 19 William M 15 66.5 112.0 WANT Up to 40 obs from M total obs=10 Obs NAME SEX AGE HEIGHT WEIGHT 1 Alfred M 14 69.0 112.5 2 Henry M 14 63.5 102.5 3 James M 12 57.3 83.0 4 Jeffrey M 13 62.5 84.0 5 John M 12 59.0 99.5 6 Philip M 16 72.0 150.0 7 Robert M 12 64.8 128.0 8 Ronald M 15 67.0 133.0 9 Thomas M 11 57.5 85.0 10 William M 15 66.5 112.0 Up to 40 obs from F total obs=9 Obs NAME SEX AGE HEIGHT WEIGHT 1 Alice F 13 56.5 84.0 2 Barbara F 13 65.3 98.0 3 Carol F 14 62.8 102.5 4 Jane F 12 59.8 84.5 5 Janet F 15 62.5 112.5 6 Joyce F 11 51.3 50.5 7 Judy F 14 64.3 90.0 8 Louise F 12 56.3 77.0 9 Mary F 15 66.5 112.0 SOLUTION %symdel SEX; %put &=sex; * Probably not the most efficient method; proc sort data=sashelp.class out=class; by sex; ;run;quit; data _null_; set class; by sex; if first.sex then do; call symputx('sex',sex); rc=dosubl(resolve(' data &sex; set class(where=(sex="&sex")); run;quit; ')); end; ;run;quit;
08-22-2016 02:56 AM
First collect the countries into a dataset, then create the tables dynamically from that:
proc sort data=countries (keep=country) out=cnt nodupkey ; by country; run; data _null_; set cnt; call execute('data country_' !! trim(country) !! '; set countries; where country="' !! trim(country) !! '";run;'); run;
08-23-2016 02:21 AM
Could you please tell me why you've placed double quotes and single quotes in where clause?
When you look at it closely, you will find that the argument to call execute is one string, built from character literals and the country name in the variable.
Since within the where condition in the call execute the country name is a literal, it needs to be enclosed in quotes. The best method to put (literal) quotes in a string variable is to use one type (single or double) on the "outside" and the other in the "inside".
Which quotes you use on the outside depends on the presence of macro triggers. If you want to mask them (what is often the case with call execute), you have to use single quotes on the outside; if you need them to be active, use double ones, as they also disable the macro masking effect of the single quotes you will use in the inside.
Another method would be to use functions to create the quotes inside the string, but that reduces readability IMO.
08-22-2016 04:40 AM
You have your answers above, data _null_ call execute would be my choice. However there would also be a few questions I would ask before even setting down to do any code.
Firstly what is the reasoning behind having a dataset for each country, to me this doesn't make any sense, SAS i built with by group processing in mind, so keeping all the data in one dataset and then processing using by groups would be far easier to work with and more efficient.
Secondly, why is there two rows per country? That would worry me, that I am missing something from the question.
Thirdly what happens on dirty data in that list? You could have USA/US, or maybe CH - does this mean Switzerland or China?
Personally my main question would be, give me the detailed sotware development lifecycle plan on this piece of work and then I can suggest some techinical implementation. Unfortunately most of the people giving interviews either do no know about these, or have a set of fixed nonsense questions designed to see if you can remember some syntax rather than how to properly develop software - which is probably why every job I go into has gigabytes of garbage macro code lying about the place with no documentation or plan.
(Sorry, that ended up as a bit of a rant )
08-22-2016 08:39 AM
I'm a big fan of using hash tables for such dynamic splitting (recognizing that it's a rarely a good idea to split such data, as others have mentioned).
Below uses a DOW loop and a hash table. You read all the records for one country into a hash table, then output the data in the hash table to a dataset. And the hash output() method can use the value of a datastep variable to determine the name of the output datasets. This still feels like magic, after so many years of being required to name datasets during DATA step compilation time.
I learned this approach from one of Paul Dorfman's many many excellent papers on hash, perhaps:
137 proc sort data=sashelp.prdsale out=sale; 138 by country; 139 run; NOTE: There were 1440 observations read from the data set SASHELP.PRDSALE. NOTE: The data set WORK.SALE has 1440 observations and 10 variables. 140 141 data _null_ ; 142 if _n_=1 then do; 143 declare hash h () ; 144 h.definekey ("_n_") ; 145 h.definedata ("ACTUAL","PREDICT","COUNTRY","REGION"); 146 h.definedone () ; 147 end; 148 do _n_ = 1 by 1 until ( last.country ) ; 149 set sale ; 150 by country ; 151 h.add() ; 152 end ; 153 h.output (dataset: cats("OUT_",compress(country,".")) ) ; 154 h.clear() ; 155 run ; NOTE: The data set WORK.OUT_CANADA has 480 observations and 4 variables. NOTE: The data set WORK.OUT_GERMANY has 480 observations and 4 variables. NOTE: The data set WORK.OUT_USA has 480 observations and 4 variables. NOTE: There were 1440 observations read from the data set WORK.SALE.
For an interview question, if they ask for a macro solution I might show them a macro solution , and also show a non-macro alternative like this. I think it's fair for a potential employer to say "write a little macro" just to see if you can do it, and your style, etc. Just about any macro question has a non-macro solution.
I also would raise the points mentioned by @RW9. When interviewing as a candidate, pushing back a bit on the problem definition is often useful. Often such problems are designed with mistakes/poor specifications/etc, to see how you would respond. And even if this was not defined as such, it's helpful to see how your potential employers react to constructive feedback.
08-26-2016 01:23 PM
NEW SOLUTION * if you know just the upper limit on the number of levels; data _null_; retain index 1; array BeenThere $1 _temporary_ ; set sashelp.class; if whichc(sex,of BeenThere[*])=0 then do; BeenThere[index]=sex; if index=32 then do; putlog "More than 32 levels increase the dimension of BeenThere"; stop; end; Index=Index+1; call symputx('sex',sex); rc=dosubl(resolve(' data &sex; set sashelp.class(where=(sex="&sex")); run;quit; ')); end; ;run;quit;
08-26-2016 01:50 PM
It fun to build on others ideas, sometimes the last post is the best. Not here though.
Variation of an earlier post. I am not a huge fan of call execute. At one time it did not scale.
Also the quoting can give you a headache.
Dosubl executes and finishes between interations of the parent datastep.
Code is not stacked and run after the dataset. DOSUBL can also post back to
the parent(see some of my old SAS-L posts)
A hash without a sort would be interesting.
proc sort data=sashelp.class(keep=sex) out=unique nodupkey; by sex; ;run;quit; data _null_; set unique; call symputx('sex',sex); rc=dosubl(resolve(' data &sex; set class(where=(sex="&sex")); run;quit; ')); ;run;quit;