SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

selecting only certain observations based on 2 variable values in a data step

Accepted Solution Solved
Reply
Contributor
Posts: 35
Accepted Solution

selecting only certain observations based on 2 variable values in a data step

Hi,

I have a dataset that contains a variable for year (2010-2013), a variable for unitID (LT0100 to LT2400) and some other variables (a_1-a_8;b_1-b_8;etc...).

In a data step below, I want to create a new variable (f_5) based on (one variable divided by the sum of that variable), where the summary procedure is done given they all belong to the same year and same unit Id.

I think that I need to create some do-loops or macro variables, but I can´t figure out the specific coding. This is what I have tried:

proc sql;

  select distinct unitID into :Unit  separated by ', ' from Dataset1;

  select distinct year into :year     separated by ', ' from Dataset1;

quit;

data Dataset1;

set Dataset1;

if &unitID and &year then

  a2_tot = sum(a_2);

f_5 = a_2/a2_tot;

  drop a2_tot;

run;

Any suggestions on how to create the desired new variable?

Best regards,

Hank


Accepted Solutions
Solution
‎02-10-2014 10:07 AM
PROC Star
Posts: 7,364

Re: selecting only certain observations based on 2 variable values in a data step

Like Tom said, sql will be a lot easier in this case and, in fact, can do everything in just a couple of lines.  Here is the sql code and yet another way to do it in a datastep:

proc sql;

  create table want as

    select *,a_2/sum(a_2) as f_5

      from have

        group by unitid,year

          order by year,unitid,code

  ;

quit;

/*or*/

proc sort data=have;

  by unitid year code;

run;

data want (drop=sum_f_5);

  do until (last.year);

    set have (where=(code ne 0));

    by unitid year;

    sum_f_5=ifn(first.year,a_2,sum(sum_f_5,a_2));

  end;

  do until (last.year);

    set have;

    by unitid year;

    f_5=a_2/sum_f_5;

    output;

  end;

run;

proc sort data=want;

  by year unitid code;

run;

View solution in original post


All Replies
PROC Star
Posts: 7,364

Re: selecting only certain observations based on 2 variable values in a data step

Example have and want datasets are usually an easier way of showing the forum what you really want to achieve.  From your description it appears like you only want to accomplish something like the following:

proc sql;

  create table want as

    select unitid,year,x,x/sum(x) as f_5

      from have

        group by unitid,year

  ;

quit;

Contributor
Posts: 35

Re: selecting only certain observations based on 2 variable values in a data step

Thanks for your reply! Here is a clarification:

In the dataset below, I have all the variables except f_5. I want to create f_5 as "a_2/SUM(a_2, where UnitID and year is the same and Code range between 1-6)".

In other words, I want to have the ratio of a specific code value, over the sum of all code values (except 0), for the variable a_2 for each specific UnitID and Year.

Preferably in a data step, since the creation of this new variable (f_5) is just one out of many more variables that needs to be created in ways similar to this one and I would prefer if all could be created in the same data step.

UnitIdYearCodea_1a_2a_3f_5
LT010020100
LT010020101295913932021216002440,137518
LT01002010245650101427159965570,06129
LT010020103330177440154020415130,172445
LT0100201047384528116317458558540,499536
LT01002010513576203337111254680,014332
LT010020106145428226749511100660,114878
LT030020100
LT0300201016497061719043269100,15413
LT0300201021058760972911678240,087232
LT03002010311564933487387333970,31268
LT030020104204778641534414442010,372399
LT030020105257839766021747770,068682
LT03002010624816654402410260,004878
LT040020100
LT040020101478329374002241110,091693
LT04002010274550116167652800,039637
LT0400201037054151339095421810,328304
LT0400201049019861827507128610,448047
LT040020105175015376551129650,092319
LT040020106185470185470
LT010020110
LT0100201112959139352233,216002440,137518
LT0100201124565010156986,59965570,06129
LT010020113330177444169420415130,172445
LT0100201147384528127949158558540,499536
LT010020115135762036708,111254680,014332
LT0100201161454282294244,511100660,114878
LT030020110
LT030020111649706189094,43269100,15413
LT0300201121058760107020,11678240,087232
LT0300201131156493383611,87333970,31268
LT0300201142047786456878,414442010,372399
LT03002011525783984262,21747770,068682
LT03002011624816659842410260,004878
LT040020110
LT040020111478329411402241110,091693
LT04002011274550117783,7652800,039637
LT040020113705415147299,95421810,328304
LT0400201149019862010257128610,448047
LT04002011517501541420,51129650,092319
LT040020116185470185470

Best regards,

Hank

PROC Star
Posts: 1,099

Re: selecting only certain observations based on 2 variable values in a data step

Hi, Hank

Thanks for the improved explanation.

While it is doable to accomplish everything you need in a single data step, it's fairly challenging. I would recommend Arthur's approach to get the summarized data first, in a step like this:

proc sql;

create table have_summ as

select UnitId, Year, sum(a_2) as Sum_a_2

from have

where Code > 0

group by UnitId, Year

order by UnitId, Year;

quit;

and then sort your original dataset:

proc sort data=have;

by UnitId Year;

run;

and then in your processing data step, you have access to the sum of a_2 as Sum_a_2, and can easily do your division, along with all of your other processing.

data want;

merge have have_summ;

by UnitId Year;

...

division, and other processing

...

run;

Solution
‎02-10-2014 10:07 AM
PROC Star
Posts: 7,364

Re: selecting only certain observations based on 2 variable values in a data step

Like Tom said, sql will be a lot easier in this case and, in fact, can do everything in just a couple of lines.  Here is the sql code and yet another way to do it in a datastep:

proc sql;

  create table want as

    select *,a_2/sum(a_2) as f_5

      from have

        group by unitid,year

          order by year,unitid,code

  ;

quit;

/*or*/

proc sort data=have;

  by unitid year code;

run;

data want (drop=sum_f_5);

  do until (last.year);

    set have (where=(code ne 0));

    by unitid year;

    sum_f_5=ifn(first.year,a_2,sum(sum_f_5,a_2));

  end;

  do until (last.year);

    set have;

    by unitid year;

    f_5=a_2/sum_f_5;

    output;

  end;

run;

proc sort data=want;

  by year unitid code;

run;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 437 views
  • 7 likes
  • 3 in conversation