10-17-2013 06:06 AM
we have the following problem:
We would like to add certain observations from the data set TEST to the same data set TEST (thereby creating a new data set TEST2) where these observations to be added fulfill three conditions.
In addition, we would like to see this merge performed several times, i.e., for every group of observations classified by id_group.
Maybe that sounds a bit complicated and/or is not explained very well, so please take a look at the pictures below which should give a better explanation by showing the inital data set and the one we would like to create...
The initial data set TEST looks like:
The data set we would like to create should look like:
Our preliminary code looks like (not working):
CREATE TABLE test2
AS SELECT a.*
FROM test a
LEFT JOIN test b
ON a.date_consensus > b.date_forecast
AND a.date_forecast_minus90d <= b.date_forecast_confirmation
AND a.date_forecast_minus90d >= b.date_forecast
GROUP BY id_group;
I.e., SAS should do the following:
Any help or comment is more than appreciated (sorry, we are SAS beginners).
Thank you very much,
Alex and Niklas
10-17-2013 09:26 AM
Something like this???
match = 0;
/* id 10 */
if date_consensus (26.01.1998) > date_forecast (07.10.1997)
and date_forecast_minus90d (28.10.1997) <= date_forecast_confirmation (16.01.1998)
and date_forecast_minus90d (28.10.1997) >= date_forecast (07.10.1997)
id_group_new = 10;
/* id 100 */
if date_consensus (27.01.1998) > date_forecast (07.10.1997)
and date_forecast_minus90d (29.10.1997) <= date_forecast_confirmation (16.01.1998)
and date_forecast_minus90d (29.10.1997) >= date_forecast (07.10.1997)
id_group_new = 100;
/* original post if not met by any condition above */
if match eq 0 then do;
id_group_new = 1;
10-17-2013 11:06 AM
hmm, like this....
create table res as select * from
(select *, id_group as id_group_new from kk.test)
select a.*,b.id_group_new from kk.test as a
(select *,id as bid,id_group as id_group_new from kk.test where id_group eq 10) as b
ON b.date_consensus > a.date_forecast
AND b.date_forecast_minus90d <= a.date_forecast_confirmation
AND b.date_forecast_minus90d >= a.date_forecast
and a.id_group eq 1
select c.*,d.id_group_new from kk.test as c
(select *,id_group as id_group_new from kk.test where id_group eq 100) as d
ON d.date_consensus > c.date_forecast
AND d.date_forecast_minus90d <= c.date_forecast_confirmation
AND d.date_forecast_minus90d >= c.date_forecast
and c.id_group eq 1
10-17-2013 11:09 AM
I don't think variables matter. The issue is getting a row to repeat, once for each group at the beginning of each group.
I'm thinking a data step with multiple set statements....
10-17-2013 11:30 AM
yeah, that's it (almost)! Thank you very much!
However, what shall we do if we have like 100,000 id_groups in our FULL sample?
Is it possible to implement some kind of loop/macro?
10-17-2013 11:55 AM
This works, though I think there should be a simpler answer...I did have to create an new ID group to merge in with, so you may want to rename them afterwards or drop one.
You could probably simplify it to one bigger SQL query, but I would find that harder to understand personally.
proc sql ;
create table merge1 as
select distinct t1.id_group as id_group, t3.id, t3.id_group as id_group_old, t3.date_consensus, t3.date_forecast, t3.date_forecast_minus90d, t3.date_forecast_confirmation, t3.eps_forecast
from test t1
(select * from test t2
where t2.id_group=1) t3;
set test (in=a where=(id_group ne 1)) merge1;
if a then id_group_old=id_group;
proc sort data=merge2;
by id_group id id_group_old;
10-17-2013 12:24 PM
thanks for sharing your approach, too. Yes, it works.
However, regarding the fact that we would like to run this code on our FULL sample, I think we will not be able to implement your approach as it requires knowing which ids should be merged to which id_group, i.e., we would have to manually look which ids fulfill our 3 conditions mentioned above for each id_group (... and our FULL sample consists of more than 100,000 id_groups).
10-18-2013 01:26 AM
Maybe loading the dataset into memory can help you (SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition).
It might help whn you read it multiple times....
10-19-2013 08:16 PM
If I understand the problem right then one of the challenges is to compare "everything with everything" without the need of a Cartesian product. Below is some code attempting to do this.
Based on the sample result you've given I wasn't sure what should happen with cases where you have multiple rows in the same id_group so eventually the code needs some tweaking to return what you're after. I've added a column "Source_Flg" telling whether a row comes from the source or got added based on your rules. That should help you to tell us whether the logic I've used is what you're after or not.
I was also thinking about using hash/hiter objects but as it appears you're dealing with quite a bit of rows you would first need to tell us how big your source data set is (uncompressed) and how much memory you've got available.
In the "Do Loop" section of the code the source data set gets read multiple times. As FrederikE suggested I've loaded the source data set into memory. If you haven't got enough memory then remove the "SASFILE" statements.
How much difference it makes whether the source data set is loaded into memory or not will depend on how the disk where you're source data set lives is attached to the server and how the disk caching options are set for this disk. If performance is "bad" then it's eventually worth to first copy the data set to "WORK" as the SAS work area is normally pointing to a storage area with throughput as good as available.
And here what I could come up with:
libname test 'C:\Tests';
SASFILE test.test load ;
length Source_flg 8;
length id_group_new 8;
set test.test nobs=_stop;
if _n_=1 then
if 0 then _id_group=id_group;
dcl hash h (hashexp:5);
do _i=_start to _stop;
keep=id_group date_consensus date_forecast_minus90d date_forecast_confirmation
rename=(id_group=_id_group date_consensus=_date_consensus date_forecast_minus90d=_date_forecast_minus90d date_forecast_confirmation=_date_forecast_confirmation)
if h.check()=0 then continue;
if id_group = _id_group then
if _date_consensus > date_forecast
and _date_forecast_minus90d <= date_forecast_confirmation
and _date_forecast_minus90d >= date_forecast
SASFILE test.test close ;