BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
abdulla
Pyrite | Level 9
SICYearshareDummy
508020101489.499 
5080201068.57 
508020111193.963 
50802011427.454 
50802012134.285 
508020121151.153 
508020120 
50802012157.123 
5080201262.593 
451220102455.718 
45122010873.324 
45122011403.31 
451220110 
45122011357.093 
45122011243.378 
45122012724.538 
45122012421.442 
45122012198.51 
3670201034.59 
3670201032.13 
36702010143.4 
3670201015.883 
367020100.522 

 

Hi, I want to create a dummy variable from the column "share". if the share is above the median then 1 else 0. Median is calculated by SIC 2, 3, and 4 digits and by year. I can solve it step by step procedure. It will take multiple steps. Is there any procedure by which I can do the above in one step? 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @abdulla  I am not sure what you mean by  -"by SIC 2, 3, and 4 digits "

 

If I can assume, if median of BY GROUP which is each SIC value is what you meant, then the logic is share>median then 1 else 0. If this is correct then-

A straightforward SQL should do like:-



data have;
input SIC	Year	share;*	Dummy;
cards;
5080	2010	1489.499	 
5080	2010	68.57	 
5080	2011	1193.963	 
5080	2011	427.454	 
5080	2012	134.285	 
5080	2012	1151.153	 
5080	2012	0	 
5080	2012	157.123	 
5080	2012	62.593	 
4512	2010	2455.718	 
4512	2010	873.324	 
4512	2011	403.31	 
4512	2011	0	 
4512	2011	357.093	 
4512	2011	243.378	 
4512	2012	724.538	 
4512	2012	421.442	 
4512	2012	198.51	 
3670	2010	34.59	 
3670	2010	32.13	 
3670	2010	143.4	 
3670	2010	15.883	 
3670	2010	0.522	 
;
proc sql;
create table want as
select *,share>median(share) as dummy
from have
group by SIC,Year;
quit;

  

/*And if you want median value in another column for reference*/
proc sql;
create table want as
select *,share>median(share) as dummy,median(share) as median
from have
group by SIC,Year;
quit;

View solution in original post

4 REPLIES 4
novinosrin
Tourmaline | Level 20

Hi @abdulla  I am not sure what you mean by  -"by SIC 2, 3, and 4 digits "

 

If I can assume, if median of BY GROUP which is each SIC value is what you meant, then the logic is share>median then 1 else 0. If this is correct then-

A straightforward SQL should do like:-



data have;
input SIC	Year	share;*	Dummy;
cards;
5080	2010	1489.499	 
5080	2010	68.57	 
5080	2011	1193.963	 
5080	2011	427.454	 
5080	2012	134.285	 
5080	2012	1151.153	 
5080	2012	0	 
5080	2012	157.123	 
5080	2012	62.593	 
4512	2010	2455.718	 
4512	2010	873.324	 
4512	2011	403.31	 
4512	2011	0	 
4512	2011	357.093	 
4512	2011	243.378	 
4512	2012	724.538	 
4512	2012	421.442	 
4512	2012	198.51	 
3670	2010	34.59	 
3670	2010	32.13	 
3670	2010	143.4	 
3670	2010	15.883	 
3670	2010	0.522	 
;
proc sql;
create table want as
select *,share>median(share) as dummy
from have
group by SIC,Year;
quit;

  

/*And if you want median value in another column for reference*/
proc sql;
create table want as
select *,share>median(share) as dummy,median(share) as median
from have
group by SIC,Year;
quit;
abdulla
Pyrite | Level 9
Hi novinosrin,
Thank you very much. By SIC 2 means, I am trying to find a median based on SIC two-digit match. For example, SIC 5080 and 5081 are matched by two and three digit but not 4 digit. So, group can be all SIC based on two digit (50 as an example) or three digit (508 as an example) and by each year.
PaigeMiller
Diamond | Level 26

First you have to create variables with 2-digit SIC and 3-digit SIC.

 

data have2;
    set have;
    sic2=substr(sic,1,2);
    sic3=substr(sic,1,3);
run;

Then you compute medians for sic2 sic3 and sic. This is a job for PROC SUMMARY!

#ProcSummaryRulez <== official hashtag for PROC SUMMARY

 

proc summary data=have2;
    class sic2 sic3 sic;
    ways 1;
    var share;
    output out=_medians_ median= /autoname;
run;

Then you can determine a dummy variable for sic2 and a different dummy variable for sic3 and a different dummy variable using sic from the computed medians. But you said you want only one dummy variable, I don't really know what you expect at this step, so I'll stop here and let you explain further.

 
--
Paige Miller
mkeintz
PROC Star

 

You need to determine medians for 2-, 3-, and 4-digit Standard Industrial Codes.  Now in your sample data, the medians will be the same for each 4-digit and its corresponding 3-digit and 2-digit SIC codes (i.e. the same median for SIC4=4512, SIC3=451, and SIC2=45).  So you would really need only 1 dummy variable per record.

 

But if you have a normal mix of such codes, you'll have many instances of multiple SIC3 codes nested within a given SIC2 code, as well as multiple SIC4 codes within each SIC3 code.  So I've generated some sample data below to demonstrate.  As a results you will need 3 dummy variables DUMMY2, DUMMY3, and DUMMY4:

 

  1. To your 4512 codes, I've added a set of 4515 codes giving two sets for SIC3=451. Each of the 4515 codes has a share value of 1000 more than the corresponding 4512 code, so the 4515 median will be 1000 more,  and the 451 median will be between the 4512 and 4515 medians.

  2. And I also added a set of 4521 codes, giving three sets for SIC2=45.  The 4521 share values have another 1000 added to each of them.  So the 45 median will be lower than the 4521 median and higher than the 4512 median.

 

data have;
input SIC $4.	Year	share;*	Dummy;
cards;
5080	2010	1489.499	 
5080	2010	68.57	 
5080	2011	1193.963	 
5080	2011	427.454	 
5080	2012	134.285	 
5080	2012	1151.153	 
5080	2012	0	 
5080	2012	157.123	 
5080	2012	62.593	 
4512	2010	2455.718	 
4512	2010	873.324	 
4512	2011	403.31	 
4512	2011	0	 
4512	2011	357.093	 
4512	2011	243.378	 
4512	2012	724.538	 
4512	2012	421.442	 
4512	2012	198.51	 
4515	2010	3455.718	 
4515	2010	1873.324	 
4515	2011	1403.31	 
4515	2011	1000	 
4515	2011	1357.093	 
4515	2011	1243.378	 
4515	2012	1724.538	 
4515	2012	1421.442	 
4515	2012	1198.51	 
4521	2010	4455.718	 
4521	2010	2873.324	 
4521	2011	2403.31	 
4521	2011	2000	 
4521	2011	2357.093	 
4521	2011	2243.378	 
4521	2012	2724.538	 
4521	2012	2421.442	 
4521	2012	2198.51	 
3670	2010	34.59	 
3670	2010	32.13	 
3670	2010	143.4	 
3670	2010	15.883	 
3670	2010	0.522	 
;

data want  (drop=d _:);
  set have end=end_of_have;

  length sic2 $2 dummy2 8   sic3 $3 dummy3 8   sic4 $4 dummy4 8;
  array _sic {2:4} $4 sic2-sic4;
  array _dum {2:4}    dummy2-dummy4;
  do d=2 to 4; 
    _sic{d}=sic;
  end;

  if _n_=1 then do;
    declare hash hoh () ;            /* Hash of hashes */
      hoh.definekey('sic');
      hoh.definedata('sic','H','IH','_N','_MEDIAN');
      hoh.definedone();
    declare hiter ihoh ('hoh');

    declare hash h;  declare hiter ih;
  end;

  do d=4 to 2 by -1;
    /* Add this SIC4 (and SIC3 and SIC2) to hash HOH if not there yet */
    if hoh.find(key:_sic{d})^=0 then do;       /* "^0" means not already there */
      h=_new_ hash(ordered:'a',multidata:'Y');
        h.definekey('share');
        h.definedata('share');
        h.definedone();
      ih=_new_ hiter('h');

      sic=_sic{d};     /* Copy then shortened SIC code */
      hoh.add();       /* Add this (possible shortened) SIC to HOH (with N and _median yet to be determined) */
    end;
    /* And always add the share value to the corresponding hash H */
    h.add();
  end;

  if end_of_have;  /* End of data?  Then Hashes are built, now get N's and _medians */

  /* Now iterate over all HOH dataitems, calculating N and _median for each */
  do _rc=ihoh.first() by 0 until (ihoh.next()^=0);  /* Retrieve corresponding H */
    _N=h.num_items;
    do _i=1 to floor(_N/2);   /* Read H up to _median or just short of it */
      ih.next();
    end;
    _median=share;
    if mod(_N,2)=1 then do;   /* Read just beyond _median if needed */
      ih.next();
      _median=mean(_median,share);
    end;
    hoh.replace();   /* Having calclulated N and MEDIAN, re-write the HOH entry */
  end;

  hoh.output(dataset:'hohdata');  /* This is unneccesary but take a look at it */

  do until (end_of_have2);        /* Now re-read each record and compare to median */
    set have (rename=(sic=_orig_sic)) end=end_of_have2;
    do d=2 to 4;
      _sic{d}=_orig_sic;
      hoh.find(key:_sic{d});
      if share > _median then _dum{d}=1;
      else _dum{d}=0;
      sic=_orig_sic;
    end;
    output;
  end;
run;

The hash object HOH is built to have one "dataitem" (think one row) for each distinct SIC2, SIC3, and SIC4 code.  It's meant to have a variable _N and _MEDIAN for each code.  It is built up dynamically via the first pass of the data.  Each HOH data item also has its own subordinate hash H, which keeps (in sorted order) all the SHARE values for that code.

 

At the end of the first pass, each dataitem in HOH is used to retrieve the corresponding H, and each H is processed half-way to determine the median.  Then that median (and corresponding N) is re-written back into HOH, which is now complete with the needed stats for each code.

 

Then a second pass of the data is executed.  For each incoming SIC  (i.e. each incoming SIC4) the corresponding N and MEDIAN are retrieved from HOH, and the corresponding dummy4 value is set.  Then the "parent" SIC3 and SIC2 codes are similarly implemented.

 

It's one data step, with two total passes of the data.

 

BTW, just to show the contents of the HOH hash, I (unnecessarily) output it to data set HOHDATA - take a look.  It has all the medians.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1646 views
  • 3 likes
  • 4 in conversation