- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi, I have one file with company tickers and dates of different events, and each company can have several different events.
I have another file with each company's ticker and end of year report, say december 1 of a given year.
The first file looks like this:
date | ticker |
---|---|
12/12/2008 | ABC |
4/6/2008 | ABC |
25/3/2010 | DEF |
6/2/2002 | DEF |
The Second file looks like this:
End of year report | ticker |
---|---|
1/12/2010 | ABC |
1/12/2009 | ABC |
1/12/2008 | ABC |
1/12/2007 | ABC |
1/12/2006 | ABC |
What I want to do is for company ABC on 12/12/2008 in the first file to do a merge with the data from the second file for the same company but the row corresponding to 1/12/2008 because its the closest recent date.
For ABC on 4/6/2008 from the first file I would like to merge with ABC from the second file but row corresponding to 1/12/2007 because its the closest recent date here.
And so on for all the companies and dates
Thank you
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is a slight modification of 's code:
data h1;
infile cards dlm=',';
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
12/12/2008,ABC
4/6/2008,ABC
25/3/2010,DEF
6/2/2002,DEF
;
data h2;
infile cards dlm=',';
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
1/12/2010,ABC
1/12/2009,ABC
1/12/2008,ABC
1/12/2007,ABC
1/12/2006,ABC
1/12/2010,DEF
1/12/2009,DEF
1/12/2008,DEF
1/12/2007,DEF
1/12/2001,DEF
;
proc sql;
create table want as
select a.date,a.ticker, b.date as rpt_dt format=ddmmyy10.
from h1 a
left join
h2 b
on a.ticker=b.ticker
where b.date <= a.date
group by a.date,a.ticker
having abs(a.date-b.date)=min(abs(a.date-b.date))
order by a.ticker, b.date descending
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The following is NOT the most efficient way, so ask for Hash to speed it up:
data h1;
infile cards dlm='09'x;
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
12/12/2008 ABC
4/6/2008 ABC
25/3/2010 DEF
6/2/2002 DEF
;
data h2;
infile cards dlm='09'x;
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
1/12/2010 ABC
1/12/2009 ABC
1/12/2008 ABC
1/12/2007 ABC
1/12/2006 ABC
;
proc sql;
create table want as
select *, (select date from h2 where ticker=a.ticker and date <= a.date having a.date-date=min(a.date-date)) as rpt_dt format=ddmmyy10.
from h1 a
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Hai and thank you for replying,
I ran the code that you posted but the result that I got is the following:
1 | . | 4/6/2008 | . |
---|---|---|---|
2 | . | 6/2/2002 | . |
obs | date | ticker | rpt_dt |
what I wish that I could have is for example the row (with all columns) from h1 of say ABC 12/12/2008 merged with the row (with all columns) of ABC 1/12/2008 from h2
Thank you!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
when doing "copy and paste" from HTML to text (or vise versa) , sometimes those unprinted symbols are converted (eg. blank vs tab). So in this case, it is Tab converting to Blanks. So go ahead remove "infile" statement.
data h1;
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
12/12/2008 ABC
4/6/2008 ABC
25/3/2010 DEF
6/2/2002 DEF
;
data h2;
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
1/12/2010 ABC
1/12/2009 ABC
1/12/2008 ABC
1/12/2007 ABC
1/12/2006 ABC
;
proc sql;
create table want as
select *, (select date from h2 where ticker=a.ticker and date <= a.date having a.date-date=min(a.date-date)) as rpt_dt format=ddmmyy10.
from h1 a
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I did the code and it works thanks a lot !!!
Now suppose I have in each file an additional column:
h1
1 | 12/12/2008 | ABC | a |
---|---|---|---|
2 | 06/04/2008 | ABC | s |
3 | 03/25/2010 | DEF | d |
4 | 02/06/2002 | DEF | h |
h2
1 | 12/01/2010 | ABC | ef |
---|---|---|---|
2 | 12/01/2009 | ABC | dfgg |
3 | 12/01/2008 | ABC | rgrgrg |
4 | 12/01/2007 | ABC | thr |
5 | 12/01/2005 | ABC | thgt |
When I tried to do the merge in sql I got the following want:
1 | 12/12/2008 | ABC | a | ef | 01/12/2008 |
---|---|---|---|---|---|
2 | 12/12/2008 | ABC | a | dfgg | 01/12/2008 |
3 | 12/12/2008 | ABC | a | rgrgrg | 01/12/2008 |
4 | 12/12/2008 | ABC | a | thr | 01/12/2008 |
5 | 12/12/2008 | ABC | a | thgt | 01/12/2008 |
6 | 06/04/2008 | ABC | s | ef | 01/12/2007 |
7 | 06/04/2008 | ABC | s | dfgg | 01/12/2007 |
8 | 06/04/2008 | ABC | s | rgrgrg | 01/12/2007 |
9 | 06/04/2008 | ABC | s | thr | 01/12/2007 |
10 | 06/04/2008 | ABC | s | thgt | 01/12/2007 |
11 | 03/25/2010 | DEF | d | ef | . |
12 | 03/25/2010 | DEF | d | dfgg | . |
13 | 03/25/2010 | DEF | d | rgrgrg | . |
14 | 03/25/2010 | DEF | d | thr | . |
15 | 03/25/2010 | DEF | d | thgt | . |
16 | 02/06/2002 | DEF | h | ef | . |
17 | 02/06/2002 | DEF | h | dfgg | . |
18 | 02/06/2002 | DEF | h | rgrgrg | . |
19 | 02/06/2002 | DEF | h | thr | . |
20 | 02/06/2002 | DEF | h | thgt | . |
How do I get rid of all the unnecessary entries and get what you did (4 rows) but with the two additional columns (1 from h1 and 1 from h2) ?
Thank you
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not sure if it's going to work, but maybe try to do something like this
data dataset1;
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
12/12/2008 ABC
4/6/2008 ABC
25/3/2010 DEF
6/2/2002 DEF
;
run;
data dataset1;
set h1;
year=year(date);
eoy=mdy(12,1,year(date)); /*end of year date*/
format eoy ddmmyy10.;
dif=abs(intck('day', date, eoy)); /*calculates the number of days between the two dates*/
run;
proc sort data=dataset1;
by ticker year dif;
run;
data dataset1;
set dataset1;
by ticker year;
if first.year;
run;
In this way you get one observation for each ticker and year and you can merge it with the second dataset by date
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is a slight modification of 's code:
data h1;
infile cards dlm=',';
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
12/12/2008,ABC
4/6/2008,ABC
25/3/2010,DEF
6/2/2002,DEF
;
data h2;
infile cards dlm=',';
input date :ddmmyy10. ticker $;
format date ddmmyy10.;
cards;
1/12/2010,ABC
1/12/2009,ABC
1/12/2008,ABC
1/12/2007,ABC
1/12/2006,ABC
1/12/2010,DEF
1/12/2009,DEF
1/12/2008,DEF
1/12/2007,DEF
1/12/2001,DEF
;
proc sql;
create table want as
select a.date,a.ticker, b.date as rpt_dt format=ddmmyy10.
from h1 a
left join
h2 b
on a.ticker=b.ticker
where b.date <= a.date
group by a.date,a.ticker
having abs(a.date-b.date)=min(abs(a.date-b.date))
order by a.ticker, b.date descending
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Arthur,
I did your modification and now got rid of the of the "multiplicative effect" that I was having before, is it just because you made a left join?
thanks, and here is what my example file looks:
1 | 12/12/2008 | ABC | a | rgrgrg | 01/12/2008 |
---|---|---|---|---|---|
2 | 06/04/2008 | ABC | s | thr | 01/12/2007 |
3 | 02/06/2002 | DEF | h | . | |
4 | 03/25/2010 | DEF | d | . |
my next step will be to try this on my real data!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I applied Dr Tabachneck's and Hai.kuo's code to my real data which is 2 tables one having 76 000 rows and the other having 22 000 rows and my computer has been thinking for 3 hours already, is it normal?
Thank you
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You should include your actual code Otherwise impossible to say. Maybe forget to include quit or a join in your code, who knows :smileyconfused:
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
here included is my actual code:
PROC IMPORT OUT= bs8
DATAFILE= " K:\FilesFormatted\bs8.xlsx"
DBMS= EXCEL REPLACE;
sheet = "Sheet1";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
data bs8;
set bs8;
format date2 ddmmyy10. ;
run;
PROC IMPORT OUT= sdc
DATAFILE= " K:\FilesFormatted\new sdc.xlsx"
DBMS= EXCEL REPLACE;
sheet = "Sheet1";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
data sdc;
set sdc;
format date_an ddmmyy10. ;
format date_ef ddmmyy10. ;
run;
/*the merging by closest date code*/
proc sql;
create table sdc2 as
select *
from sdc a
left join
bs8 b
on a.acq_ticker=b.ticker1
where b.date2 <= a.date_an
group by a.date_an,a.acq_ticker
having abs(a.date_an-b.date2)=min(abs(a.date_an-b.date2))
order by a.acq_ticker, b.date2 descending
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The whole process should only take a couple of seconds. Have you tried to run the code step-by-step to see if there is a problem in one of the steps?
Can you post your log?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I did an inner join and after 10 minutes got 1.2 million rows.
I had many duplicates from the missing entries of both tables, because when I dropped these duplicates I was left with only 8k rows!
I guess that when I was doing a left join my computer was calculating even more rows and at some point probably got stuck...
Thank you for your help!!!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The DATA step way is pretty easy until you start adding additional variables:
proc sort data=events;
by ticker date;
run;
proc sort data=yearly_reports;
by ticker end_of_year_report;
run;
data want;
set yearly_reports (in=in1 rename=(end_of_year_report=date)) events (in=in2);
by ticker date;
if in1 then end_of_year_report=date;
retain end_of_year_report;
if in2;
run;