| SIC | Year | share | Dummy | 
| 5080 | 2010 | 1489.499 | |
| 5080 | 2010 | 68.57 | |
| 5080 | 2011 | 1193.963 | |
| 5080 | 2011 | 427.454 | |
| 5080 | 2012 | 134.285 | |
| 5080 | 2012 | 1151.153 | |
| 5080 | 2012 | 0 | |
| 5080 | 2012 | 157.123 | |
| 5080 | 2012 | 62.593 | |
| 4512 | 2010 | 2455.718 | |
| 4512 | 2010 | 873.324 | |
| 4512 | 2011 | 403.31 | |
| 4512 | 2011 | 0 | |
| 4512 | 2011 | 357.093 | |
| 4512 | 2011 | 243.378 | |
| 4512 | 2012 | 724.538 | |
| 4512 | 2012 | 421.442 | |
| 4512 | 2012 | 198.51 | |
| 3670 | 2010 | 34.59 | |
| 3670 | 2010 | 32.13 | |
| 3670 | 2010 | 143.4 | |
| 3670 | 2010 | 15.883 | |
| 3670 | 2010 | 0.522 | 
Hi, I want to create a dummy variable from the column "share". if the share is above the median then 1 else 0. Median is calculated by SIC 2, 3, and 4 digits and by year. I can solve it step by step procedure. It will take multiple steps. Is there any procedure by which I can do the above in one step?
Hi @abdulla I am not sure what you mean by -"by SIC 2, 3, and 4 digits "
If I can assume, if median of BY GROUP which is each SIC value is what you meant, then the logic is share>median then 1 else 0. If this is correct then-
A straightforward SQL should do like:-
data have;
input SIC	Year	share;*	Dummy;
cards;
5080	2010	1489.499	 
5080	2010	68.57	 
5080	2011	1193.963	 
5080	2011	427.454	 
5080	2012	134.285	 
5080	2012	1151.153	 
5080	2012	0	 
5080	2012	157.123	 
5080	2012	62.593	 
4512	2010	2455.718	 
4512	2010	873.324	 
4512	2011	403.31	 
4512	2011	0	 
4512	2011	357.093	 
4512	2011	243.378	 
4512	2012	724.538	 
4512	2012	421.442	 
4512	2012	198.51	 
3670	2010	34.59	 
3670	2010	32.13	 
3670	2010	143.4	 
3670	2010	15.883	 
3670	2010	0.522	 
;
proc sql;
create table want as
select *,share>median(share) as dummy
from have
group by SIC,Year;
quit;
/*And if you want median value in another column for reference*/
proc sql;
create table want as
select *,share>median(share) as dummy,median(share) as median
from have
group by SIC,Year;
quit;Hi @abdulla I am not sure what you mean by -"by SIC 2, 3, and 4 digits "
If I can assume, if median of BY GROUP which is each SIC value is what you meant, then the logic is share>median then 1 else 0. If this is correct then-
A straightforward SQL should do like:-
data have;
input SIC	Year	share;*	Dummy;
cards;
5080	2010	1489.499	 
5080	2010	68.57	 
5080	2011	1193.963	 
5080	2011	427.454	 
5080	2012	134.285	 
5080	2012	1151.153	 
5080	2012	0	 
5080	2012	157.123	 
5080	2012	62.593	 
4512	2010	2455.718	 
4512	2010	873.324	 
4512	2011	403.31	 
4512	2011	0	 
4512	2011	357.093	 
4512	2011	243.378	 
4512	2012	724.538	 
4512	2012	421.442	 
4512	2012	198.51	 
3670	2010	34.59	 
3670	2010	32.13	 
3670	2010	143.4	 
3670	2010	15.883	 
3670	2010	0.522	 
;
proc sql;
create table want as
select *,share>median(share) as dummy
from have
group by SIC,Year;
quit;
/*And if you want median value in another column for reference*/
proc sql;
create table want as
select *,share>median(share) as dummy,median(share) as median
from have
group by SIC,Year;
quit;First you have to create variables with 2-digit SIC and 3-digit SIC.
data have2;
    set have;
    sic2=substr(sic,1,2);
    sic3=substr(sic,1,3);
run;
Then you compute medians for sic2 sic3 and sic. This is a job for PROC SUMMARY!
#ProcSummaryRulez <== official hashtag for PROC SUMMARY
proc summary data=have2;
    class sic2 sic3 sic;
    ways 1;
    var share;
    output out=_medians_ median= /autoname;
run;
Then you can determine a dummy variable for sic2 and a different dummy variable for sic3 and a different dummy variable using sic from the computed medians. But you said you want only one dummy variable, I don't really know what you expect at this step, so I'll stop here and let you explain further.
You need to determine medians for 2-, 3-, and 4-digit Standard Industrial Codes. Now in your sample data, the medians will be the same for each 4-digit and its corresponding 3-digit and 2-digit SIC codes (i.e. the same median for SIC4=4512, SIC3=451, and SIC2=45). So you would really need only 1 dummy variable per record.
But if you have a normal mix of such codes, you'll have many instances of multiple SIC3 codes nested within a given SIC2 code, as well as multiple SIC4 codes within each SIC3 code. So I've generated some sample data below to demonstrate. As a results you will need 3 dummy variables DUMMY2, DUMMY3, and DUMMY4:
data have;
input SIC $4.	Year	share;*	Dummy;
cards;
5080	2010	1489.499	 
5080	2010	68.57	 
5080	2011	1193.963	 
5080	2011	427.454	 
5080	2012	134.285	 
5080	2012	1151.153	 
5080	2012	0	 
5080	2012	157.123	 
5080	2012	62.593	 
4512	2010	2455.718	 
4512	2010	873.324	 
4512	2011	403.31	 
4512	2011	0	 
4512	2011	357.093	 
4512	2011	243.378	 
4512	2012	724.538	 
4512	2012	421.442	 
4512	2012	198.51	 
4515	2010	3455.718	 
4515	2010	1873.324	 
4515	2011	1403.31	 
4515	2011	1000	 
4515	2011	1357.093	 
4515	2011	1243.378	 
4515	2012	1724.538	 
4515	2012	1421.442	 
4515	2012	1198.51	 
4521	2010	4455.718	 
4521	2010	2873.324	 
4521	2011	2403.31	 
4521	2011	2000	 
4521	2011	2357.093	 
4521	2011	2243.378	 
4521	2012	2724.538	 
4521	2012	2421.442	 
4521	2012	2198.51	 
3670	2010	34.59	 
3670	2010	32.13	 
3670	2010	143.4	 
3670	2010	15.883	 
3670	2010	0.522	 
;
data want  (drop=d _:);
  set have end=end_of_have;
  length sic2 $2 dummy2 8   sic3 $3 dummy3 8   sic4 $4 dummy4 8;
  array _sic {2:4} $4 sic2-sic4;
  array _dum {2:4}    dummy2-dummy4;
  do d=2 to 4; 
    _sic{d}=sic;
  end;
  if _n_=1 then do;
    declare hash hoh () ;            /* Hash of hashes */
      hoh.definekey('sic');
      hoh.definedata('sic','H','IH','_N','_MEDIAN');
      hoh.definedone();
    declare hiter ihoh ('hoh');
    declare hash h;  declare hiter ih;
  end;
  do d=4 to 2 by -1;
    /* Add this SIC4 (and SIC3 and SIC2) to hash HOH if not there yet */
    if hoh.find(key:_sic{d})^=0 then do;       /* "^0" means not already there */
      h=_new_ hash(ordered:'a',multidata:'Y');
        h.definekey('share');
        h.definedata('share');
        h.definedone();
      ih=_new_ hiter('h');
      sic=_sic{d};     /* Copy then shortened SIC code */
      hoh.add();       /* Add this (possible shortened) SIC to HOH (with N and _median yet to be determined) */
    end;
    /* And always add the share value to the corresponding hash H */
    h.add();
  end;
  if end_of_have;  /* End of data?  Then Hashes are built, now get N's and _medians */
  /* Now iterate over all HOH dataitems, calculating N and _median for each */
  do _rc=ihoh.first() by 0 until (ihoh.next()^=0);  /* Retrieve corresponding H */
    _N=h.num_items;
    do _i=1 to floor(_N/2);   /* Read H up to _median or just short of it */
      ih.next();
    end;
    _median=share;
    if mod(_N,2)=1 then do;   /* Read just beyond _median if needed */
      ih.next();
      _median=mean(_median,share);
    end;
    hoh.replace();   /* Having calclulated N and MEDIAN, re-write the HOH entry */
  end;
  hoh.output(dataset:'hohdata');  /* This is unneccesary but take a look at it */
  do until (end_of_have2);        /* Now re-read each record and compare to median */
    set have (rename=(sic=_orig_sic)) end=end_of_have2;
    do d=2 to 4;
      _sic{d}=_orig_sic;
      hoh.find(key:_sic{d});
      if share > _median then _dum{d}=1;
      else _dum{d}=0;
      sic=_orig_sic;
    end;
    output;
  end;
run;
The hash object HOH is built to have one "dataitem" (think one row) for each distinct SIC2, SIC3, and SIC4 code. It's meant to have a variable _N and _MEDIAN for each code. It is built up dynamically via the first pass of the data. Each HOH data item also has its own subordinate hash H, which keeps (in sorted order) all the SHARE values for that code.
At the end of the first pass, each dataitem in HOH is used to retrieve the corresponding H, and each H is processed half-way to determine the median. Then that median (and corresponding N) is re-written back into HOH, which is now complete with the needed stats for each code.
Then a second pass of the data is executed. For each incoming SIC (i.e. each incoming SIC4) the corresponding N and MEDIAN are retrieved from HOH, and the corresponding dummy4 value is set. Then the "parent" SIC3 and SIC2 codes are similarly implemented.
It's one data step, with two total passes of the data.
BTW, just to show the contents of the HOH hash, I (unnecessarily) output it to data set HOHDATA - take a look. It has all the medians.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
