Solved: Re: Any possibility that compression cause data loss?!

hellohere · Posted 03-06-2022 09:16 AM

Run merge by two ways. One by dataset merge, the other by SQL. The SAS log/Dataset shows same row count(as expected).

But when run subset of the merge dataset, the outcomes are diff(one is zero, the other is count as expected)...

Awkward...

Tom · Posted 03-06-2022 07:55 PM

So you made two datasets out of your original dataset.

One with the first date and the other with that last date.

Then you did a data step merge.

So the value of TIMESEC will be the one from the LAST date dataset since it will be the last value loaded into that variable.

Then you did a join between the THREE datasets.

This time the value of TIMESEC will come from the original dataset since it is the first time a variable with that name is included in the list of variables the SELECT is listing.

In SQL joins you have to be very careful when you are have multiple variables with the same name. It is best to be very explicit about which version of the variable you want to keep. Do not use * in the list of columns. Use COALESCE() function to control the precedence for the selection.

Try this example where ID is the key variable and DAY and VALUE are two other variables that are common to both datasets.

data have first last;
  input id day value ;
  if day=1 then output have first;
  else output have last;
cards;
1 1 10
1 2 20
;

data test1;
  merge first last;
  by id ;
run;

data test2;
  merge last first;
  by id;
run;

proc sql;
create table test3 as select * from first,last where first.id=last.id;
create table test4 as select * from last,first where first.id=last.id;
quit;

data all ;
 length dsn dsn2 $41;
 set test: indsname=dsn2;
 dsn=dsn2;
run;

Obs       dsn        id    day    value

 1     WORK.TEST1     1     2       20
 2     WORK.TEST2     1     1       10
 3     WORK.TEST3     1     1       10
 4     WORK.TEST4     1     2       20

View solution in original post

mkeintz · Posted 03-06-2022 10:28 AM

Insufficient problem description.

Show the code and corresponding logs.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hellohere · Posted 03-06-2022 07:29 PM

%let ds=tick_out_all;

proc sort data=&ds.; by tick tradingday timesec; run;quit;

data &ds._first(keep=tick tradingday timesec lp_ rename=(lp_=lp_first)); set &ds.; by tick tradingday; if first.tradingday;run;quit;
data &ds._last (keep=tick tradingday timesec lp_ rename=(lp_=lp_last)); set &ds.; by tick tradingday; if last.tradingday;run;quit;

/*get ret and alike, merge outcome timesec=93100000 is gone, evne SAS log says not!?
*/
data &ds._; merge &ds. &ds._first &ds._last;
by tick tradingday;
run;quit;
proc sql;
create table &ds._2 as
select a.*, b.lp_first, c.lp_last
from &ds. as a
left join &ds._first as b
on a.tick=b.tick and a.tradingday=b.tradingday
left join &ds._last as c
on a.tick=c.tick and a.tradingday=c.tradingday
order by a.tick, a.tradingday, a.timesec;
quit;

proc sort data=&ds.; by tick tradingday timesec; run;quit;

%let ts=93100000;

data tick_out_all_ts_(keep=); set tick_out_all_(where=(timesec=&ts.)); run;quit;
data tick_out_all_ts_2(keep=); set tick_out_all_2(where=(timesec=&ts.)); run;quit;

______________________________________________________________________

3816 data &ds._; merge &ds. &ds._first &ds._last;
3817 by tick tradingday;
3818 run;

NOTE: There were 19397784 observations read from the data set WORK.TICK_OUT_ALL.
NOTE: There were 82194 observations read from the data set WORK.TICK_OUT_ALL_FIRST.
NOTE: There were 82194 observations read from the data set WORK.TICK_OUT_ALL_LAST.
NOTE: The data set WORK.TICK_OUT_ALL_ has 19397784 observations and 71 variables.
NOTE: Compressing data set WORK.TICK_OUT_ALL_ decreased size by 29.16 percent.
Compressed is 119484 pages; un-compressed would require 168677 pages.
NOTE: DATA statement used (Total process time):
real time 1:06.84
cpu time 1:02.64

3855 proc sql;
3856 create table &ds._2 as
3857 select a.*, b.lp_first, c.lp_last
3858 from &ds. as a
3859 left join &ds._first as b
3860 on a.tick=b.tick and a.tradingday=b.tradingday
3861 left join &ds._last as c
3862 on a.tick=c.tick and a.tradingday=c.tradingday
3863 order by a.tick, a.tradingday, a.timesec;
NOTE: Compressing data set WORK.TICK_OUT_ALL_2 decreased size by 29.16 percent.
Compressed is 119484 pages; un-compressed would require 168677 pages.
NOTE: Table WORK.TICK_OUT_ALL_2 created, with 19397784 rows and 71 columns.

3864 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 1:46.56
cpu time 1:25.90

3873
3874 data tick_out_all_ts_(keep=); set tick_out_all_(where=(timesec=&ts.)); run;

NOTE: There were 0 observations read from the data set WORK.TICK_OUT_ALL_.
WHERE timesec=93100000;
NOTE: The data set WORK.TICK_OUT_ALL_TS_ has 0 observations and 71 variables.
NOTE: DATA statement used (Total process time):
real time 30.95 seconds
cpu time 7.25 seconds

3874! quit;
3875 data tick_out_all_ts_2(keep=); set tick_out_all_2(where=(timesec=&ts.)); run;

NOTE: There were 82194 observations read from the data set WORK.TICK_OUT_ALL_2.
WHERE timesec=93100000;
NOTE: The data set WORK.TICK_OUT_ALL_TS_2 has 82194 observations and 71 variables.
NOTE: Compressing data set WORK.TICK_OUT_ALL_TS_2 decreased size by 40.70 percent.
Compressed is 424 pages; un-compressed would require 715 pages.
NOTE: DATA statement used (Total process time):
real time 34.50 seconds
cpu time 11.07 seconds

3875! quit;

Tom · Posted 03-06-2022 07:55 PM

So you made two datasets out of your original dataset.

One with the first date and the other with that last date.

Then you did a data step merge.

So the value of TIMESEC will be the one from the LAST date dataset since it will be the last value loaded into that variable.

Then you did a join between the THREE datasets.

This time the value of TIMESEC will come from the original dataset since it is the first time a variable with that name is included in the list of variables the SELECT is listing.

In SQL joins you have to be very careful when you are have multiple variables with the same name. It is best to be very explicit about which version of the variable you want to keep. Do not use * in the list of columns. Use COALESCE() function to control the precedence for the selection.

Try this example where ID is the key variable and DAY and VALUE are two other variables that are common to both datasets.

data have first last;
  input id day value ;
  if day=1 then output have first;
  else output have last;
cards;
1 1 10
1 2 20
;

data test1;
  merge first last;
  by id ;
run;

data test2;
  merge last first;
  by id;
run;

proc sql;
create table test3 as select * from first,last where first.id=last.id;
create table test4 as select * from last,first where first.id=last.id;
quit;

data all ;
 length dsn dsn2 $41;
 set test: indsname=dsn2;
 dsn=dsn2;
run;

Obs       dsn        id    day    value

 1     WORK.TEST1     1     2       20
 2     WORK.TEST2     1     1       10
 3     WORK.TEST3     1     1       10
 4     WORK.TEST4     1     2       20

hellohere · Posted 03-06-2022 08:21 PM

Thanks, Pal. The original dataset has 19397784 row. The row counts from both treatments show up identical(19397784, right click).

BUT run subset, timesec=93100000, one has zero and the other shows up thousands(as expected).

Double click the dataset, timesec=93100000 does not show up in dataset-merge treatment(remember the row count is 19397784).

That is the myth.

Tom · Posted 03-06-2022 11:27 AM

You cannot do a MERGE with SQL, only JOINs. They are not exactly the same thing, but are close when you have one to one or one to many matches.

If you have non-key common variables you need to understand how each method will select which of the two values to use for those variables. In general with a data step merge the last instance of the variable "wins" and with SQL the first instance.

PaigeMiller · Posted 03-06-2022 11:32 AM

@hellohere wrote:

Run merge by two ways. One by dataset merge, the other by SQL. The SAS log/Dataset shows same row count(as expected).

But when run subset of the merge dataset, the outcomes are diff(one is zero, the other is count as expected)...

Awkward...

What is your question?

--
Paige Miller

hellohere · Posted 03-06-2022 07:32 PM

Right click two datasets, both show up the same row count. BUT when simply do subset(same code), one is zero and the other is not. See code and log above.

Registration is open

SAS Training: Just a Click Away