SIC | Year | share | Dummy |
5080 | 2010 | 1489.499 | |
5080 | 2010 | 68.57 | |
5080 | 2011 | 1193.963 | |
5080 | 2011 | 427.454 | |
5080 | 2012 | 134.285 | |
5080 | 2012 | 1151.153 | |
5080 | 2012 | 0 | |
5080 | 2012 | 157.123 | |
5080 | 2012 | 62.593 | |
4512 | 2010 | 2455.718 | |
4512 | 2010 | 873.324 | |
4512 | 2011 | 403.31 | |
4512 | 2011 | 0 | |
4512 | 2011 | 357.093 | |
4512 | 2011 | 243.378 | |
4512 | 2012 | 724.538 | |
4512 | 2012 | 421.442 | |
4512 | 2012 | 198.51 | |
3670 | 2010 | 34.59 | |
3670 | 2010 | 32.13 | |
3670 | 2010 | 143.4 | |
3670 | 2010 | 15.883 | |
3670 | 2010 | 0.522 |
Hi, I want to create a dummy variable from the column "share". if the share is above the median then 1 else 0. Median is calculated by SIC 2, 3, and 4 digits and by year. I can solve it step by step procedure. It will take multiple steps. Is there any procedure by which I can do the above in one step?
Hi @abdulla I am not sure what you mean by -"by SIC 2, 3, and 4 digits "
If I can assume, if median of BY GROUP which is each SIC value is what you meant, then the logic is share>median then 1 else 0. If this is correct then-
A straightforward SQL should do like:-
data have;
input SIC Year share;* Dummy;
cards;
5080 2010 1489.499
5080 2010 68.57
5080 2011 1193.963
5080 2011 427.454
5080 2012 134.285
5080 2012 1151.153
5080 2012 0
5080 2012 157.123
5080 2012 62.593
4512 2010 2455.718
4512 2010 873.324
4512 2011 403.31
4512 2011 0
4512 2011 357.093
4512 2011 243.378
4512 2012 724.538
4512 2012 421.442
4512 2012 198.51
3670 2010 34.59
3670 2010 32.13
3670 2010 143.4
3670 2010 15.883
3670 2010 0.522
;
proc sql;
create table want as
select *,share>median(share) as dummy
from have
group by SIC,Year;
quit;
/*And if you want median value in another column for reference*/
proc sql;
create table want as
select *,share>median(share) as dummy,median(share) as median
from have
group by SIC,Year;
quit;
Hi @abdulla I am not sure what you mean by -"by SIC 2, 3, and 4 digits "
If I can assume, if median of BY GROUP which is each SIC value is what you meant, then the logic is share>median then 1 else 0. If this is correct then-
A straightforward SQL should do like:-
data have;
input SIC Year share;* Dummy;
cards;
5080 2010 1489.499
5080 2010 68.57
5080 2011 1193.963
5080 2011 427.454
5080 2012 134.285
5080 2012 1151.153
5080 2012 0
5080 2012 157.123
5080 2012 62.593
4512 2010 2455.718
4512 2010 873.324
4512 2011 403.31
4512 2011 0
4512 2011 357.093
4512 2011 243.378
4512 2012 724.538
4512 2012 421.442
4512 2012 198.51
3670 2010 34.59
3670 2010 32.13
3670 2010 143.4
3670 2010 15.883
3670 2010 0.522
;
proc sql;
create table want as
select *,share>median(share) as dummy
from have
group by SIC,Year;
quit;
/*And if you want median value in another column for reference*/
proc sql;
create table want as
select *,share>median(share) as dummy,median(share) as median
from have
group by SIC,Year;
quit;
First you have to create variables with 2-digit SIC and 3-digit SIC.
data have2;
set have;
sic2=substr(sic,1,2);
sic3=substr(sic,1,3);
run;
Then you compute medians for sic2 sic3 and sic. This is a job for PROC SUMMARY!
#ProcSummaryRulez <== official hashtag for PROC SUMMARY
proc summary data=have2;
class sic2 sic3 sic;
ways 1;
var share;
output out=_medians_ median= /autoname;
run;
Then you can determine a dummy variable for sic2 and a different dummy variable for sic3 and a different dummy variable using sic from the computed medians. But you said you want only one dummy variable, I don't really know what you expect at this step, so I'll stop here and let you explain further.
You need to determine medians for 2-, 3-, and 4-digit Standard Industrial Codes. Now in your sample data, the medians will be the same for each 4-digit and its corresponding 3-digit and 2-digit SIC codes (i.e. the same median for SIC4=4512, SIC3=451, and SIC2=45). So you would really need only 1 dummy variable per record.
But if you have a normal mix of such codes, you'll have many instances of multiple SIC3 codes nested within a given SIC2 code, as well as multiple SIC4 codes within each SIC3 code. So I've generated some sample data below to demonstrate. As a results you will need 3 dummy variables DUMMY2, DUMMY3, and DUMMY4:
data have;
input SIC $4. Year share;* Dummy;
cards;
5080 2010 1489.499
5080 2010 68.57
5080 2011 1193.963
5080 2011 427.454
5080 2012 134.285
5080 2012 1151.153
5080 2012 0
5080 2012 157.123
5080 2012 62.593
4512 2010 2455.718
4512 2010 873.324
4512 2011 403.31
4512 2011 0
4512 2011 357.093
4512 2011 243.378
4512 2012 724.538
4512 2012 421.442
4512 2012 198.51
4515 2010 3455.718
4515 2010 1873.324
4515 2011 1403.31
4515 2011 1000
4515 2011 1357.093
4515 2011 1243.378
4515 2012 1724.538
4515 2012 1421.442
4515 2012 1198.51
4521 2010 4455.718
4521 2010 2873.324
4521 2011 2403.31
4521 2011 2000
4521 2011 2357.093
4521 2011 2243.378
4521 2012 2724.538
4521 2012 2421.442
4521 2012 2198.51
3670 2010 34.59
3670 2010 32.13
3670 2010 143.4
3670 2010 15.883
3670 2010 0.522
;
data want (drop=d _:);
set have end=end_of_have;
length sic2 $2 dummy2 8 sic3 $3 dummy3 8 sic4 $4 dummy4 8;
array _sic {2:4} $4 sic2-sic4;
array _dum {2:4} dummy2-dummy4;
do d=2 to 4;
_sic{d}=sic;
end;
if _n_=1 then do;
declare hash hoh () ; /* Hash of hashes */
hoh.definekey('sic');
hoh.definedata('sic','H','IH','_N','_MEDIAN');
hoh.definedone();
declare hiter ihoh ('hoh');
declare hash h; declare hiter ih;
end;
do d=4 to 2 by -1;
/* Add this SIC4 (and SIC3 and SIC2) to hash HOH if not there yet */
if hoh.find(key:_sic{d})^=0 then do; /* "^0" means not already there */
h=_new_ hash(ordered:'a',multidata:'Y');
h.definekey('share');
h.definedata('share');
h.definedone();
ih=_new_ hiter('h');
sic=_sic{d}; /* Copy then shortened SIC code */
hoh.add(); /* Add this (possible shortened) SIC to HOH (with N and _median yet to be determined) */
end;
/* And always add the share value to the corresponding hash H */
h.add();
end;
if end_of_have; /* End of data? Then Hashes are built, now get N's and _medians */
/* Now iterate over all HOH dataitems, calculating N and _median for each */
do _rc=ihoh.first() by 0 until (ihoh.next()^=0); /* Retrieve corresponding H */
_N=h.num_items;
do _i=1 to floor(_N/2); /* Read H up to _median or just short of it */
ih.next();
end;
_median=share;
if mod(_N,2)=1 then do; /* Read just beyond _median if needed */
ih.next();
_median=mean(_median,share);
end;
hoh.replace(); /* Having calclulated N and MEDIAN, re-write the HOH entry */
end;
hoh.output(dataset:'hohdata'); /* This is unneccesary but take a look at it */
do until (end_of_have2); /* Now re-read each record and compare to median */
set have (rename=(sic=_orig_sic)) end=end_of_have2;
do d=2 to 4;
_sic{d}=_orig_sic;
hoh.find(key:_sic{d});
if share > _median then _dum{d}=1;
else _dum{d}=0;
sic=_orig_sic;
end;
output;
end;
run;
The hash object HOH is built to have one "dataitem" (think one row) for each distinct SIC2, SIC3, and SIC4 code. It's meant to have a variable _N and _MEDIAN for each code. It is built up dynamically via the first pass of the data. Each HOH data item also has its own subordinate hash H, which keeps (in sorted order) all the SHARE values for that code.
At the end of the first pass, each dataitem in HOH is used to retrieve the corresponding H, and each H is processed half-way to determine the median. Then that median (and corresponding N) is re-written back into HOH, which is now complete with the needed stats for each code.
Then a second pass of the data is executed. For each incoming SIC (i.e. each incoming SIC4) the corresponding N and MEDIAN are retrieved from HOH, and the corresponding dummy4 value is set. Then the "parent" SIC3 and SIC2 codes are similarly implemented.
It's one data step, with two total passes of the data.
BTW, just to show the contents of the HOH hash, I (unnecessarily) output it to data set HOHDATA - take a look. It has all the medians.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.