BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
hellohere
Pyrite | Level 9

Run merge by two ways. One by dataset merge, the other by SQL. The SAS log/Dataset shows same row count(as expected). 

But when run subset of the merge dataset, the outcomes are diff(one is zero, the other is count as expected)...

 

Awkward...

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So you made two datasets out of your original dataset.

One with the first date and the other with that last date.

Then you did a data step merge.

So the value of TIMESEC will be the one from the LAST date dataset since it will be the last value loaded into that variable.

 

Then you did a join between the THREE datasets.

This time the value of TIMESEC will come from the original dataset since it is the first time a variable with that name is included in the list of variables the SELECT is listing.

 

In SQL joins you have to be very careful when you are have multiple variables with the same name.  It is best to be very explicit about which version of the variable you want to keep.  Do not use * in the list of columns.  Use COALESCE() function to control the precedence for the selection.

 

Try this example where ID is the key variable and DAY and VALUE are two other variables that are common to both datasets.

data have first last;
  input id day value ;
  if day=1 then output have first;
  else output have last;
cards;
1 1 10
1 2 20
;

data test1;
  merge first last;
  by id ;
run;

data test2;
  merge last first;
  by id;
run;

proc sql;
create table test3 as select * from first,last where first.id=last.id;
create table test4 as select * from last,first where first.id=last.id;
quit;

data all ;
 length dsn dsn2 $41;
 set test: indsname=dsn2;
 dsn=dsn2;
run;
Obs       dsn        id    day    value

 1     WORK.TEST1     1     2       20
 2     WORK.TEST2     1     1       10
 3     WORK.TEST3     1     1       10
 4     WORK.TEST4     1     2       20

 

View solution in original post

7 REPLIES 7
mkeintz
PROC Star

Insufficient problem description.

 

Show the code and corresponding logs.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
hellohere
Pyrite | Level 9

%let ds=tick_out_all;

 

proc sort data=&ds.; by tick tradingday timesec; run;quit;

data &ds._first(keep=tick tradingday timesec lp_ rename=(lp_=lp_first)); set &ds.; by tick tradingday; if first.tradingday;run;quit;
data &ds._last (keep=tick tradingday timesec lp_ rename=(lp_=lp_last)); set &ds.; by tick tradingday; if last.tradingday;run;quit;

/*get ret and alike, merge outcome timesec=93100000 is gone, evne SAS log says not!?
*/
data &ds._; merge &ds. &ds._first &ds._last;
by tick tradingday;
run;quit;
proc sql;
create table &ds._2 as
select a.*, b.lp_first, c.lp_last
from &ds. as a
left join &ds._first as b
on a.tick=b.tick and a.tradingday=b.tradingday
left join &ds._last as c
on a.tick=c.tick and a.tradingday=c.tradingday
order by a.tick, a.tradingday, a.timesec;
quit;

proc sort data=&ds.; by tick tradingday timesec; run;quit;

%let ts=93100000;

data tick_out_all_ts_(keep=); set tick_out_all_(where=(timesec=&ts.)); run;quit;
data tick_out_all_ts_2(keep=); set tick_out_all_2(where=(timesec=&ts.)); run;quit;

 

______________________________________________________________________

 


3816 data &ds._; merge &ds. &ds._first &ds._last;
3817 by tick tradingday;
3818 run;

NOTE: There were 19397784 observations read from the data set WORK.TICK_OUT_ALL.
NOTE: There were 82194 observations read from the data set WORK.TICK_OUT_ALL_FIRST.
NOTE: There were 82194 observations read from the data set WORK.TICK_OUT_ALL_LAST.
NOTE: The data set WORK.TICK_OUT_ALL_ has 19397784 observations and 71 variables.
NOTE: Compressing data set WORK.TICK_OUT_ALL_ decreased size by 29.16 percent.
Compressed is 119484 pages; un-compressed would require 168677 pages.
NOTE: DATA statement used (Total process time):
real time 1:06.84
cpu time 1:02.64

 

3855 proc sql;
3856 create table &ds._2 as
3857 select a.*, b.lp_first, c.lp_last
3858 from &ds. as a
3859 left join &ds._first as b
3860 on a.tick=b.tick and a.tradingday=b.tradingday
3861 left join &ds._last as c
3862 on a.tick=c.tick and a.tradingday=c.tradingday
3863 order by a.tick, a.tradingday, a.timesec;
NOTE: Compressing data set WORK.TICK_OUT_ALL_2 decreased size by 29.16 percent.
Compressed is 119484 pages; un-compressed would require 168677 pages.
NOTE: Table WORK.TICK_OUT_ALL_2 created, with 19397784 rows and 71 columns.

3864 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 1:46.56
cpu time 1:25.90

 

3873
3874 data tick_out_all_ts_(keep=); set tick_out_all_(where=(timesec=&ts.)); run;

NOTE: There were 0 observations read from the data set WORK.TICK_OUT_ALL_.
WHERE timesec=93100000;
NOTE: The data set WORK.TICK_OUT_ALL_TS_ has 0 observations and 71 variables.
NOTE: DATA statement used (Total process time):
real time 30.95 seconds
cpu time 7.25 seconds


3874! quit;
3875 data tick_out_all_ts_2(keep=); set tick_out_all_2(where=(timesec=&ts.)); run;

NOTE: There were 82194 observations read from the data set WORK.TICK_OUT_ALL_2.
WHERE timesec=93100000;
NOTE: The data set WORK.TICK_OUT_ALL_TS_2 has 82194 observations and 71 variables.
NOTE: Compressing data set WORK.TICK_OUT_ALL_TS_2 decreased size by 40.70 percent.
Compressed is 424 pages; un-compressed would require 715 pages.
NOTE: DATA statement used (Total process time):
real time 34.50 seconds
cpu time 11.07 seconds


3875! quit;

Tom
Super User Tom
Super User

So you made two datasets out of your original dataset.

One with the first date and the other with that last date.

Then you did a data step merge.

So the value of TIMESEC will be the one from the LAST date dataset since it will be the last value loaded into that variable.

 

Then you did a join between the THREE datasets.

This time the value of TIMESEC will come from the original dataset since it is the first time a variable with that name is included in the list of variables the SELECT is listing.

 

In SQL joins you have to be very careful when you are have multiple variables with the same name.  It is best to be very explicit about which version of the variable you want to keep.  Do not use * in the list of columns.  Use COALESCE() function to control the precedence for the selection.

 

Try this example where ID is the key variable and DAY and VALUE are two other variables that are common to both datasets.

data have first last;
  input id day value ;
  if day=1 then output have first;
  else output have last;
cards;
1 1 10
1 2 20
;

data test1;
  merge first last;
  by id ;
run;

data test2;
  merge last first;
  by id;
run;

proc sql;
create table test3 as select * from first,last where first.id=last.id;
create table test4 as select * from last,first where first.id=last.id;
quit;

data all ;
 length dsn dsn2 $41;
 set test: indsname=dsn2;
 dsn=dsn2;
run;
Obs       dsn        id    day    value

 1     WORK.TEST1     1     2       20
 2     WORK.TEST2     1     1       10
 3     WORK.TEST3     1     1       10
 4     WORK.TEST4     1     2       20

 

hellohere
Pyrite | Level 9

Thanks, Pal. The original dataset has 19397784 row. The row counts from both treatments show up identical(19397784, right click). 

 

BUT run subset, timesec=93100000, one has zero and the other shows up thousands(as expected). 

 

Double click the dataset, timesec=93100000 does not show up in dataset-merge treatment(remember the row count is 19397784). 

That is the myth. 

Tom
Super User Tom
Super User

You cannot do a MERGE with SQL, only JOINs.  They are not exactly the same thing, but are close when you have one to one or one to many matches.

If you have non-key common variables you need to understand how each method will select which of the two values to use for those variables.  In general with a data step merge the last instance of the variable "wins" and with SQL the first instance.

PaigeMiller
Diamond | Level 26

@hellohere wrote:

Run merge by two ways. One by dataset merge, the other by SQL. The SAS log/Dataset shows same row count(as expected). 

But when run subset of the merge dataset, the outcomes are diff(one is zero, the other is count as expected)...

 

Awkward...


What is your question?

--
Paige Miller
hellohere
Pyrite | Level 9

Right click two datasets, both show up the same row count. BUT when simply do subset(same code), one is zero and the other is not. See code and log above.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 837 views
  • 3 likes
  • 4 in conversation