Solved: Using information from one dataset to create another dataset

elbarto · Posted 12-14-2020 10:15 AM

Hello everyone, I have an excerpt of my data as the following:

DATA have ;
  input id	year	action	shock;
DATALINES;
1055	1981	0	.
1055	1982	0	.
1055	1983	0	.
1055	1984	0	.
1055	1985	1	1
1055	1986	1	.
1055	1987	1	.
1055	1988	1	.
1055	1989	0	.
1055	1990	0	.
1055	1991	0	.
1055	1992	0	.
1055	1993	0	.
1085	1981	0	.
1085	1982	1	1
1085	1983	0	.
1085	1984	0	.
1085	1985	0	.
1085	1986	0	.
1085	1987	0	.
1085	1988	0	.
1085	1989	0	.
1085	1990	0	.
1085	1991	1	1
1085	1992	1	.
1085	1993	1	.
1212	1981	0	.
1212	1982	0	.
1212	1983	1	1
1212	1984	1	.
1212	1985	1	.
1212	1986	0	.
1212	1987	0	.
1212	1988	1	1
1212	1989	1	.
1212	1990	1	.
1212	1991	0	.
1212	1992	0	.
1212	1993	0	.
1842	1981	0	.
1842	1982	0	.
1842	1983	0	.
1842	1984	0	.
1842	1985	0	.
1842	1986	0	.
1842	1987	0	.
1842	1988	0	.
1842	1989	0	.
1842	1990	0	.
1842	1991	0	.
1842	1992	0	.
1842	1993	0	.
2913	1981	0	.
2913	1982	0	.
2913	1983	0	.
2913	1984	0	.
2913	1985	1	1
2913	1986	1	.
2913	1987	1	.
2913	1988	0	.
2913	1989	0	.
2913	1990	0	.
2913	1991	1	1
2913	1992	1	.
2913	1993	1	.
;
RUN;

I want to create the following dataset:

DATA want ;
  input id	event_year	treat;
DATALINES;
1055	1985	1
1085	1991	1
1212	1983	1
1212	1988	1
2913	1985	1
2913	1991	1
1085	1985	0
1842	1985	0
1055	1991	0
1842	1991	0
1842	1983	0
1085	1988	0
1842	1988	0
;
RUN;

The want dataset is created as follows:

For each id, consider only shock = 1. If the corresponding value of action equals 1 for the corresponding year and the following two years and equals 0 for the previous two years, then in the want dataset, enter an observation for the id, the corresponding year (named as event_year) and set treat = 1. As an example, consider id = 1055, year = 1985 in have. Here shock = 1, and action = 0 for years 1983 and 1984 (the previous two years), and equals to 1 for years 1985 (the current year corresponding to shock = 1), 1986 and 1987 (the following two years), so in the want dataset, there is an entry for id = 1055, event_year = 1985 with treat = 1.

Once all the rows with treat = 1 are created, then to create the rows with treat = 0, we do the following:

For each treat = 1, we look at all the id's in the have dataset with year equal to the event_year but with action = 0 in that given year. If action also equals to 0 for the two years before and two years after event_year, then enter an observation for the id, the corresponding event_year and set treat = 0 in want. As an example, consider id = 1055, event_year = 1985, treat = 1 in want. In the have dataset, id = 1085 has action = 0 for year = 1985. Action also takes the value of 0 two years before and two years after 1985, so create the observation id = 1085, event_year = 1985 and treat = 0 in want. Similarly, in the have dataset, id = 1842 also has action = 0 for year = 1985 as well as for the two years before and two years after. So create the observation id = 1842, event_year = 1985 and treat = 0 in want.

There is no need to output duplicates in want, so for example, for id = 1055, event_year = 1985 and treat = 1, we already have rows for id = 1085, event_year = 1985, treat = 0, and id = 1842, event_year = 1985, treat = 0. So when we look at id = 2913, event_year = 1985 and treat = 1, this corresponds to exactly the same two rows, so there is no need to output this again in want.

Also, can the code be made flexible so that in the above rule, instead of being two years before and two years after, it can be any arbitrary window around the event_year? For example, one year before to two years after, or 3 years before to 3 years after.

RichardDeVen · Posted 12-14-2020 03:44 PM

Complicated group wise computations and rule application can be performed in SQL. Although a single very complex SQL statement can perform the work, it is better to separate the major distinct pieces into separate queries that can be debugged and studied independently for testing and performance.

In Proc SQL the groupwise SUM function of a logical expression is very helpful for counting the number of cases affirming or contradicting some rule condition.

Example:

Data (hidden as spoiler)

Spoiler

DATA have ;
  input id  year  action  shock;
DATALINES;
1055  1981  0 .
1055  1982  0 .
1055  1983  0 .
1055  1984  0 .
1055  1985  1 1
1055  1986  1 .
1055  1987  1 .
1055  1988  1 .
1055  1989  0 .
1055  1990  0 .
1055  1991  0 .
1055  1992  0 .
1055  1993  0 .
1085  1981  0 .
1085  1982  1 1
1085  1983  0 .
1085  1984  0 .
1085  1985  0 .
1085  1986  0 .
1085  1987  0 .
1085  1988  0 .
1085  1989  0 .
1085  1990  0 .
1085  1991  1 1
1085  1992  1 .
1085  1993  1 .
1212  1981  0 .
1212  1982  0 .
1212  1983  1 1
1212  1984  1 .
1212  1985  1 .
1212  1986  0 .
1212  1987  0 .
1212  1988  1 1
1212  1989  1 .
1212  1990  1 .
1212  1991  0 .
1212  1992  0 .
1212  1993  0 .
1842  1981  0 .
1842  1982  0 .
1842  1983  0 .
1842  1984  0 .
1842  1985  0 .
1842  1986  0 .
1842  1987  0 .
1842  1988  0 .
1842  1989  0 .
1842  1990  0 .
1842  1991  0 .
1842  1992  0 .
1842  1993  0 .
2913  1981  0 .
2913  1982  0 .
2913  1983  0 .
2913  1984  0 .
2913  1985  1 1
2913  1986  1 .
2913  1987  1 .
2913  1988  0 .
2913  1989  0 .
2913  1990  0 .
2913  1991  1 1
2913  1992  1 .
2913  1993  1 .
;

DATA have ; input id year action shock; DATALINES; 1055 1981 0 . 1055 1982 0 . 1055 1983 0 . 1055 1984 0 . 1055 1985 1 1 1055 1986 1 . 1055 1987 1 . 1055 1988 1 . 1055 1989 0 . 1055 1990 0 . 1055 1991 0 . 1055 1992 0 . 1055 1993 0 . 1085 1981 0 . 1085 1982 1 1 1085 1983 0 . 1085 1984 0 . 1085 1985 0 . 1085 1986 0 . 1085 1987 0 . 1085 1988 0 . 1085 1989 0 . 1085 1990 0 . 1085 1991 1 1 1085 1992 1 . 1085 1993 1 . 1212 1981 0 . 1212 1982 0 . 1212 1983 1 1 1212 1984 1 . 1212 1985 1 . 1212 1986 0 . 1212 1987 0 . 1212 1988 1 1 1212 1989 1 . 1212 1990 1 . 1212 1991 0 . 1212 1992 0 . 1212 1993 0 . 1842 1981 0 . 1842 1982 0 . 1842 1983 0 . 1842 1984 0 . 1842 1985 0 . 1842 1986 0 . 1842 1987 0 . 1842 1988 0 . 1842 1989 0 . 1842 1990 0 . 1842 1991 0 . 1842 1992 0 . 1842 1993 0 . 2913 1981 0 . 2913 1982 0 . 2913 1983 0 . 2913 1984 0 . 2913 1985 1 1 2913 1986 1 . 2913 1987 1 . 2913 1988 0 . 2913 1989 0 . 2913 1990 0 . 2913 1991 1 1 2913 1992 1 . 2913 1993 1 . ;

Code:

%let N_BEFORE = 2;
%let N_AFTER  = 2;

proc sql;
  create table treated as
  select
    shocked.id
    , shocked.year as event_year
    , 1 as treat
  from
    (select * from have where shock) as shocked
  join
    have as range
  on
    shocked.id = range.id
  where
    range.year between shocked.year - &N_BEFORE and shocked.year + &N_AFTER
  group
    by shocked.id, event_year
  having
      sum (range.year <  shocked.year and range.action = 0) = &N_BEFORE    /* count # number rule affirming cases */
    & sum (range.year >= shocked.year and range.action = 1) > &N_AFTER
  ;

  create table untreated as
  select 
    unactioned.id
    , unactioned.year as event_year
    , 0 as treat
  from
    ( select distinct have.*       /* distinct very important here */
      from have join treated 
      on have.year = treated.event_year
      where action = 0
    ) as unactioned
  join 
    have as range
  on
    unactioned.id = range.id
  where
    range.year between unactioned.year - &N_BEFORE and unactioned.year + &N_AFTER
    and range.action = 0    /* rule is all zeroes */
  group
    by unactioned.id, event_year
  having 
    count(*) = &N_BEFORE + &N_AFTER + 1  /* count number of zeroes */
  ;

  create table wanted as
  select * from treated
  union 
  select * from untreated
  ;

View solution in original post

RichardDeVen · Posted 12-14-2020 03:44 PM

Complicated group wise computations and rule application can be performed in SQL. Although a single very complex SQL statement can perform the work, it is better to separate the major distinct pieces into separate queries that can be debugged and studied independently for testing and performance.

In Proc SQL the groupwise SUM function of a logical expression is very helpful for counting the number of cases affirming or contradicting some rule condition.

Example:

Data (hidden as spoiler)

Spoiler

DATA have ;
  input id  year  action  shock;
DATALINES;
1055  1981  0 .
1055  1982  0 .
1055  1983  0 .
1055  1984  0 .
1055  1985  1 1
1055  1986  1 .
1055  1987  1 .
1055  1988  1 .
1055  1989  0 .
1055  1990  0 .
1055  1991  0 .
1055  1992  0 .
1055  1993  0 .
1085  1981  0 .
1085  1982  1 1
1085  1983  0 .
1085  1984  0 .
1085  1985  0 .
1085  1986  0 .
1085  1987  0 .
1085  1988  0 .
1085  1989  0 .
1085  1990  0 .
1085  1991  1 1
1085  1992  1 .
1085  1993  1 .
1212  1981  0 .
1212  1982  0 .
1212  1983  1 1
1212  1984  1 .
1212  1985  1 .
1212  1986  0 .
1212  1987  0 .
1212  1988  1 1
1212  1989  1 .
1212  1990  1 .
1212  1991  0 .
1212  1992  0 .
1212  1993  0 .
1842  1981  0 .
1842  1982  0 .
1842  1983  0 .
1842  1984  0 .
1842  1985  0 .
1842  1986  0 .
1842  1987  0 .
1842  1988  0 .
1842  1989  0 .
1842  1990  0 .
1842  1991  0 .
1842  1992  0 .
1842  1993  0 .
2913  1981  0 .
2913  1982  0 .
2913  1983  0 .
2913  1984  0 .
2913  1985  1 1
2913  1986  1 .
2913  1987  1 .
2913  1988  0 .
2913  1989  0 .
2913  1990  0 .
2913  1991  1 1
2913  1992  1 .
2913  1993  1 .
;

DATA have ; input id year action shock; DATALINES; 1055 1981 0 . 1055 1982 0 . 1055 1983 0 . 1055 1984 0 . 1055 1985 1 1 1055 1986 1 . 1055 1987 1 . 1055 1988 1 . 1055 1989 0 . 1055 1990 0 . 1055 1991 0 . 1055 1992 0 . 1055 1993 0 . 1085 1981 0 . 1085 1982 1 1 1085 1983 0 . 1085 1984 0 . 1085 1985 0 . 1085 1986 0 . 1085 1987 0 . 1085 1988 0 . 1085 1989 0 . 1085 1990 0 . 1085 1991 1 1 1085 1992 1 . 1085 1993 1 . 1212 1981 0 . 1212 1982 0 . 1212 1983 1 1 1212 1984 1 . 1212 1985 1 . 1212 1986 0 . 1212 1987 0 . 1212 1988 1 1 1212 1989 1 . 1212 1990 1 . 1212 1991 0 . 1212 1992 0 . 1212 1993 0 . 1842 1981 0 . 1842 1982 0 . 1842 1983 0 . 1842 1984 0 . 1842 1985 0 . 1842 1986 0 . 1842 1987 0 . 1842 1988 0 . 1842 1989 0 . 1842 1990 0 . 1842 1991 0 . 1842 1992 0 . 1842 1993 0 . 2913 1981 0 . 2913 1982 0 . 2913 1983 0 . 2913 1984 0 . 2913 1985 1 1 2913 1986 1 . 2913 1987 1 . 2913 1988 0 . 2913 1989 0 . 2913 1990 0 . 2913 1991 1 1 2913 1992 1 . 2913 1993 1 . ;

Code:

%let N_BEFORE = 2;
%let N_AFTER  = 2;

proc sql;
  create table treated as
  select
    shocked.id
    , shocked.year as event_year
    , 1 as treat
  from
    (select * from have where shock) as shocked
  join
    have as range
  on
    shocked.id = range.id
  where
    range.year between shocked.year - &N_BEFORE and shocked.year + &N_AFTER
  group
    by shocked.id, event_year
  having
      sum (range.year <  shocked.year and range.action = 0) = &N_BEFORE    /* count # number rule affirming cases */
    & sum (range.year >= shocked.year and range.action = 1) > &N_AFTER
  ;

  create table untreated as
  select 
    unactioned.id
    , unactioned.year as event_year
    , 0 as treat
  from
    ( select distinct have.*       /* distinct very important here */
      from have join treated 
      on have.year = treated.event_year
      where action = 0
    ) as unactioned
  join 
    have as range
  on
    unactioned.id = range.id
  where
    range.year between unactioned.year - &N_BEFORE and unactioned.year + &N_AFTER
    and range.action = 0    /* rule is all zeroes */
  group
    by unactioned.id, event_year
  having 
    count(*) = &N_BEFORE + &N_AFTER + 1  /* count number of zeroes */
  ;

  create table wanted as
  select * from treated
  union 
  select * from untreated
  ;

ChrisNZ · Posted 12-14-2020 06:36 PM

Like this?

%let n_before = 2;
%let n_after  = 2;

proc sql;
  create table MATCH1 as
  select unique a.*   
  from HAVE a
      ,HAVE b
  where a.SHOCK  = 1
    and a.ACTION = 1
    and a.ID     = b.ID
    and a.YEAR   between b.YEAR-&n_before. and b.YEAR+&n_after.
  group by a.ID, a.YEAR
  having sum(b.ACTION)=3
     and sum(b.ACTION*(a.YEAR <= b.YEAR))=3;

proc sql;
  create table MATCH2 as
  select unique b.ID, a.YEAR as YEAR
  from MATCH1 a
      ,HAVE   b
  where a.ID    ^= b.ID
    and b.ACTION = 0
    and a.YEAR   between b.YEAR-&n_before. and b.YEAR+&n_after.
  group by b.ID, a.ID, a.YEAR
  having sum(b.ACTION)=0 & count(b.ACTION)=5  order by a.ID , b.id, b.YEAR;

 proc sql;
   select ID, YEAR, 1 as TREAT from MATCH1
   union
   select unique *,0 from MATCH2
   order by TREAT, ID;

ID	YEAR	TREAT
1055	1991	0
1085	1985	0
1085	1988	0
1842	1983	0
1842	1985	0
1842	1988	0
1842	1991	0
1055	1985	1
1085	1991	1
1212	1983	1
1212	1988	1
2913	1985	1
2913	1991	1

SAS seems to have an issue with the first select unique.

High-Performance SAS Coding - Third Edition

Using information from one dataset to create another dataset

Re: Using information from one dataset to create another dataset

Re: Using information from one dataset to create another dataset

Re: Using information from one dataset to create another dataset

Using information from one dataset to create another dataset

Re: Using information from one dataset to create another dataset

Re: Using information from one dataset to create another dataset

Re: Using information from one dataset to create another dataset

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!