Solved: Re: Help with Proc SQL step

therock · Posted 11-11-2015 04:30 PM

Hi Everyone,

I need help with a problem that was solved last time. I am hoping that you guys will be able to help me with the new data problem. I have companies information that I need help with. Here is a sample of the data

CoIDYearProfitLossporin2K2

123	1998	-100	1
123	1999	-100	1
123	2000	-150	1
123	2001	200	1
456	1999	50	0
456	2000	10	0
456	2001	-25	0
789	1999	-150	1
789	2000	-100	1
789	2001	50	1
789	2002	100	1

The last column created a variable called "Lossporin2K2" which will be filled with 1 for every value of that company if the profit was below zero two years in a row or more during the year 2000 or before otherwise it will be 0. The last column was generously helped by the SAS community member Ksharp (Xia Keshan) with the following code:

proc sql;
create table want as 
 select *,
   case 
   when (select count(*) from have where coid=a.coid and year le 2000 and profit lt 0) gt 1 then 1
   else 0
   end as Lossporin2K2
   from have as a
   group by coid ;
quit;

Now I need another column, where create a variable called "Lossin3Cp2k" which will be filled with 1 for every value of that company if the profit was below zero three years in a row or more during the year 2000 or before otherwise it will be 0. So in this example, only the first company will have 1 and the bottom two company will have 0 for all years.

Thanks so much for help!

The Rock

RogerSpeas · Posted 11-19-2015 10:29 PM

Given this data...

data have;

infile datalines dlm='09'x truncover ;

input coid year profit ta;

datalines;

123 1995 -100 1

123 1996 -100 2

123 1997 -100 3

123 1998 -100 4

123 1999 -100 5

123 2000 -150 6

123 2001 200 7

223 1995 -100 1

223 1996 -100 2

223 1998 10 3

223 1999 -100 4

223 2000 -150 5

223 2001 200 6

456 1996 -10 1

456 1997 -10 2

456 1998 . 3

456 1999 -10 4

456 2000 -10 5

456 2001 . 6

789 1997 -100 0

789 1998 . 1

789 1999 100 2

789 2000 -100 3

789 2001 . 4

789 2002 . 5

890 1998 . 1

890 1999 . 2

890 2000 -100 3

890 2001 -100 4

890 2002 100 5

890 2003 200 6

890 2004 300 7

891 1997 100 1

891 1998 -100 2

891 1999 -200 3

891 2000 200 4

891 2001 200 5

891 2002 100 6

892 1998 -100 1

892 1999 . 2

892 2000 100 3

892 2001 100 4

;

run;

You would create rolling totals number of negative and number of non-missing for each year prior to 2000, the threeYear and twoYEar tables below. As they would represent multiple years prior to 2000, you would then need to summary the annual rolling count to determin the ternary logic. So if there are three negative years, then "1", else if there are three non-missing value, then "0", otherwise missing. You can review the rolling totals in the tables, and the pre-ternary summary results from the select clauses.

proc sql;

create table threeYear as

select a.coid, a.year, sum(b.profit<0 and b.profit is not missing) as ThreeBadYears,

sum(b.profit is not missing) as c3

from have as a, have as b

where a.coid=b.coid and (a.year between b.year-0 and b.year-2 and

a.year<=1998)

group by a.coid, a.year;

select coid, max(threeBadYears) as Bad3,

min(c3) as NM3

from threeyear

group by coid;

*if Bad3=3 and NM3=3 then 1

if Bad2<3 and NM3=3 then 0

if NM3<3 then . ;

create table twoYear as

select a.coid, a.year, sum(b.profit<=0 and b.profit is not missing ) as TwoBadYears,

sum(b.profit=.) as C2,

from have as a, have as b

where a.coid=b.coid and (a.year between b.year-0 and b.year-1 and

a.year<=1999)

group by a.coid, a.year;

select coid, max(TwoBadYears) as Bad2,

max(C2) as NM2

from TwoYear

group by coid;

*if Bad2=2 then 1

if NM=2 then 0

if NM<2 then . ;

quit;

You can combine those queries as inline views. Instead of include the maximum value, you would use the maximum values in the CASE expression to determine the ternary value.

As not all company might appears in the Two and Three years rolling total tables, you would need to use a LEFT JOIN so as not to remove company not represented in these two table. The ternary values would be returned for companies with pre-2000 data and the LEFT JOIN would set companies without data to missing.

proc sql;

select have.*, y2.TwoInRow, y3.ThreeInRow

from

(have

left join

(select coid,

case when (max(TwoBadYears)=2) then 1

when (max(C2)=2) then 0

else .

end as TwoInRow

from (

select a.coid, a.year,

sum(b.profit<0 and b.profit is not missing) as TwoBadYears,

sum(b.profit is not missing) as C2

from have as a, have as b

where a.coid=b.coid and (a.year between b.year-1 and b.year+0 and

a.year<=1999)

group by a.coid, a.year

)

group by coid) as Y2

on have.coid = y2.coid )

left join

(select coid,

case when (max(ThreeBadYears)=3) then 1

when (max(C3)=3) then 0

end as ThreeInRow

from (

select c.coid, c.year,

sum(d.profit<0 and d.profit is not missing) as ThreeBadYears,

sum(d.profit is not missing) as C3

from have as c, have as d

where c.coid=d.coid and (c.year between d.year-2 and d.year+0 and

c.year<=1998)

group by c.coid, c.year

)

group by coid) as Y3

on have.coid=y3.coid ;

quit;

The SQL code is highly recursive as SQL does not handle sequential processing as efficiently as the data step. The HAVE table is ready 5 times, and 4 intermediate summary tables are re-read as inline views (albeit they are narrow tables and if HAVE is small the OS might have cached the file). The DATA step is likely to read the table just once with a smaller memory/cpu footprint.

View solution in original post

stat_sas · Posted 11-11-2015 04:59 PM

I think a small modification in the above code will get you the desired results. Please see below, just replace gt 1 with =3 in sql code.

proc sql;
create table want as
select *,
   case
   when (select count(*) from have where coid=a.coid and year le 2000 and profit lt 0) =3 then 1
   else 0
   end as Lossin3Cp2k

from have as a
group by coid ;
quit;

RogerSpeas · Posted 11-11-2015 05:24 PM

data have; *see 789 modification - - + - ;

infile datalines dsd truncover;

input coid year profit code ;

datalines;

123,1998,-100,1

123,1999,-100,1

123,2000,-150,1

123,2001,200,1

456,1999,50,0

456,2000,10,0

456,2001,-25,0

789,1999,-150,1

789,2000,-100,1

789,2001,50,1

789,2002,-100,1

;

run;

RogerSpeas · Posted 11-11-2015 05:20 PM

You could do a range join, a.year between b.year and b.year+2.

Using a boolean expression to sum the number of negative events, and retain only the rows with a total of three.

proc sql;

select a.coid, a.year, sum(b.profit<0) as ThreeBadYears

from have as a, have as b

where a.coid=b.coid and (a.year between b.year+0 and b.year+2)

group by a.coid, a.year

having count(*)=3;

quit;

coid	year	ThreeBadYears
123	2000	3
123	2001	2
456	2001	1
789	2001	1
789	2002	2

Instead of return 3, use a boolean to return 1 and eliminate the zero events...

proc sql;

select a.coid, sum(b.profit<0)=3 as Lossporin3K3

from have as a, have as b

where a.coid=b.coid and (a.year between b.year+0 and b.year+2)

group by a.coid, a.year

having count(*)=3 and LossPorin3k3 ne 0;

quit;

coid	Lossporin3K3
123	1

You can LEFT join the result or the above query back into your query....sum (a.lossPorin3k3, 0)

or create a table and use a correlated query in a case expression.

PGStats · Posted 11-11-2015 11:03 PM

SQL is really not ideal for sequential data processing. I think you would do better with data step processing (assuming your data is sorted by coid and year) :

data have;
input coid year profit;
datalines;
123 1998 -100
123 1999 -100
123 2000 -150
123 2001 200
456 1999 50
456 2000 10
456 2001 -25
789 1999 -150
789 2000 -100
789 2001 50
789 2002 100
999 1998 -150
999 1999 50
999 2000 -100
999 2001 50
999 2002 100
;

data lossSpells;
spell = 0; y = -999;
do until(last.coid);
    set have; 
    where profit<0 and year<=2000; 
    by coid;
    if year = y + spell then spell + 1;
    else do; y = year; spell = 1; end;
    end;
keep coid spell;
run;

data want;
merge have lossSpells(in=loss); by coid;
code2 = loss and spell >= 2;
code3 = loss and spell >= 3;
drop spell;
run;

proc print data=want noobs; run;

BTW, KSharp's SQL query counted all loss years before 2000, not consecutive years.

PG

RogerSpeas · Posted 11-12-2015 03:01 AM

I failed to notice the detail consecutive years prior to 2001. I believe an additional filter a.year>=1998 would resolve that issue.

I would agree SQL might not be the best choice for sequential processing of data. The DATA step is likely a better choice if we can assume the data is sorted by COID and YEAR.

However, if the data is sorted, the SQL reflexive join would likely use the SORT/MERGE method when joining the tables. If so, the SQL code would likely read the table, HAVE sequentially by the COID. In other words, the preformance difference might be very similar to the data step. However, if one desires both a two years and three year run/spell, SQL might require multiple data passes.

If I was going to write the DATA step code, I would might mirror the reflexive join the DATA step. So as an alternative suggestion, one could take advantage of the OS disk caching to avoid having to pick the HAVE table twice.

data need;

set have(keep=coid year profit) ; *first pass check for runs;

by coid;

if first.coid then do;

count=0;

TwoInRow=0; ThreeInRow=0;

retain TwoInRow ThreeInRow;

end;

if year<=2000 then do; *exclude if check for all years;

if profit>=0 then do; *reset counter when profitable;

count=0;

end;

else do;

count+1; *count years of loss;

end;

if count=2 then TwoInRow=1; *flag two in a row;

else if count=3 then ThreeInRow=1; *flag three in a row;

drop count ;

end;

if last.coid then do until(last.coid); *if last, re-read the BY group;

set have;

by coid; *reset first./last. ;

output;

end;

run;

You might be able to add the reflexive code as a second UNTIL loop, after the first UNTIL loop in the first data step. However, the WHERE statement applies to all tables in the data step. And I suspect that if you used the WHERE= the two loop might lose GROUP BY sychronization.

The other thing that I might point out, your code assume only three years of data to check, 1998-2000 and rightly so, as the data dictates. If the data however went back 5 years, - - - + - , would yield a spell of 1, - + - - +, would yield a spell of 0. I suspect you could capture the CODE2 and CODE3 with some logic in the first DATA step for a more general solution.

..however, the WHERE statement applies to all tables in the DATA step, so you wouldn't have the synchronization of the two BY statments.

Steelers_In_DC · Posted 11-12-2015 08:38 AM

I redid lossporin2k2, with the below code you can easily add new variables for 2,3,4,5 consecutive losses:

data have;
infile cards dsd;
input CoID$ Year Profit;
cards;
123,1998,-100
123,1999,-100
123,2000,-150
123,2001,200
456,1999,50
456,2000,10
456,2001,-25
789,1999,-150
789,2000,-100
789,2001,50
789,2002,100
;

data prep;
set have;
by coid;
if profit < 0 then count + 1;
if first.coid then count = 1;
if profit >= 0 then count = 0;
run;

data want;
do until(last.coid);
set prep;
by coid;
if count = 2 then Lossporin2K2 = 1;
if count = 3 then Lossin3Cp2k = 1;
end;
do until(last.coid);
set prep;
by coid;
output;
end;
drop count;
run;

RogerSpeas · Posted 11-12-2015 10:47 AM

I was just responding to what if the solution was open to suggestions outside of SQL.

The DATA step is useful for sequential processing, however, a common issue with DATA step code, is that programmers are often picking things up and putting them down again.

Thus, I was suggesting that if you use two data steps, you are doing extra picking up and putting down. It appears you pick HAVE up, put PREP down, and the pick PREP up twice sychronoized. If I was to convert PGstat's code into a single data step, the WHERE statement or WHERE= data set option would make the algorithm of one data step solution more problematic.

If I was to combine your two data steps, then one would need to retain/reset the mask variables.

data prep;

set have;

by coid;

if profit < 0 then count + 1;

if first.coid then do; count=1; Lossporin2k2=0; Lossin3cp2k=0; end;

if profit >= 0 then count = 0;

if count = 2 then Lossporin2K2 = 1;

if count = 3 then Lossin3Cp2k = 1;

retain loss: ;

if last.coid;

do until(last.coid);

set have;

by coid;

output;

end;

drop count;

run;

Steelers_In_DC · Posted 11-12-2015 10:55 AM

Roger,

I like that solution. I doctored mine a little:

data want;
set have;
by coid;
if profit < 0 then count + 1;
if first.coid then count = 1;
if profit >= 0 then count = 0;
do until(last.coid);
set prep;
by coid;
if count = 2 then Lossporin2K2 = 1;
if count = 3 then Lossin3Cp2k = 1;
output;
end;
drop count;
run;

How you have yours gave me a new perspective on double dow, I wasn't using it to it's potential. Thanks!

RogerSpeas · Posted 11-12-2015 11:11 AM

But want would your count have in a - - - + - five year sequence?

therock · Posted 11-12-2015 11:44 AM

Hi Everyone,

Thanks so much for help! However I am apologizing in advance for the following comment.

I forgot to mention three important items:

1) My data starts from 1996-2013. There might be situation for a particular company (let's call it Company C) prior to 2000, where profit might be positive in year 1996 and 1997, but negative afterwards until 2000, then positive again. Because it has three years of consecutive negative profit prior to 2000 I need the "Lossporin2K2" and "Lossin3Cp2k" column to be equal to 1 for all the years I have data for that company.

2) I have data on some companies that I only have information on profit for particular years. For example, let's say for Company D, I have only have information on profit for years 1996-2000, but information for other variables on that company is available from 1996-2013. Because the profit information is given and based on that the last two column ("Lossporin2K2" and "Lossin3Cp2k") could either be 1 or 0.

3) On the other hand, I might have situation where there is no profit information for any year or only information for years after 2000. For those companies I need the the last two column ("Lossporin2K2" and "Lossin3Cp2k") to be missing (.).

I apologize for this confusion. Would it cause any change in code?

Thanks,

The Rock

PGStats · Posted 11-12-2015 12:31 PM

Item # 2 are you saying that you don't care what value variables Lossporin2K2" and "Lossin3Cp2k take in that case?

PG

therock · Posted 11-12-2015 01:39 PM

What I am saying is that based on what the profit is in years 1996-2000, that is what last two columns will be based on. If the profit is positive, then it will 0. If the profit is negative, then it will be 1. If the profit is missing for 1997, but it is there for 1998 and 1999 (and positive) and no information on profit following that, then based on the profit for those two years, the last two columns will be decided. In this case, it wil be missing for "Lossin3Cp2k" and 0 for "Lossporin2K2."

Hope this clears up the confusion!

RogerSpeas · Posted 11-12-2015 02:13 PM

After the BY statement in the second do loop...

if last.coid then do until(last.coid); *if last, re-read the BY group;

set have;

by coid; *reset first./last. ;

if first.coid=1 and year=2000 then do;

TwoInRow=.; ThreeInRow=.;

end;

*if first.coid=1 and year=1999 then do;

* ThreeInRow=.;

* end; output;

end;

However, I'm confused. If year 2000 has a negative profit is it included in the count?

If the first year of profit 2000 and will a negative value... would the mask variables be zero or missing?

If the first year of profit is 1999, TwoInARow could have a value of 0,1 would ThreeInARow be .?

Additionally, you are suggesting that there can be breaks in data in the years prior to 2000.... - - . - -

Would ThreeInARow be missing, as three consecutive years were never observed?

What if the prior five years were.... - , - , - would both mask variable be missing?

If so, ternary logic would need to be added, as you need to set the masks to missing when profit was missing..

else if profit=. then count=. ;

Fortunately, the sum statement, count+1, will ignore missing value when encountered.

Tthe MASK variable would now would need to be initatized to missing rather than zero.

if count=2 then TwoInRow+1; *flag two in a row;

else if count=3 then ThreeInRow+1; *flag three in a row;

And before the second loop noted above...

if TwoInRow ne . then twoInRow=1;

if ThreeInRow ne . then threeinRow=1;

That's just a quick stab, but I think I have the logic right. But I need a little more details on when missing would need to be set.

therock · Posted 11-12-2015 02:19 PM

So for the first question that you have, yes; the year 2000 is counted. However, if only the first year is given is 2000, then the whole company set will have a missing values. If the first year profit is in 1999 and it is positive and the second year profit is negative, then twoinarow will be 0, but threeinarow will be missing.

For your second question, yes there can be a break in a data. If there is break, then you might have only twoinarow, but not threeinarow.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!