Re: Randomization error occurs while joining unsorted dataset with sor...

ramesh_it · Posted 11-12-2014 05:50 AM

Hi,

Actually I’m trying to replace the permanent dataset in my code with the data from DB2. Also i'm using SQL pass through to fetch the data from DB2 and creating a work table.

In this,I’m facing one issue while joining an unsorted dataset with a sorted dataset.

Issue : The observation order getting differ in the outputs while joining same set of tables from different locations.

i.e. While joining unsorted dataset from a permanent library with sorted dataset and joining unsorted dataset from work library with sorted dataset, the o/ps are different in terms of observation order. And how this observation order affecting is , I'm going to fetch the last observation of each primary key from the above result.

For eg.:

Let us take two tables "EMP" and "DET".

Here EMP is sorted table based on Id and DET is unsorted table. The join used herer is LEFT Join (EMP left join DET).

EMP

Id	name
101	Sam

DET

Id	Date	Amt
101	11/12/2014	200
101	11/12/2014	100

The DET in Permanent library and Work Library has the same observation order.

Output of join: (DET from permanent library).

Id	Date	Name	Amt
101	11/12/2014	Sam	200
101	11/12/2014	Sam	100

Output of join: (DET from work library)

Id	Date	Name	Amt
101	11/12/2014	Sam	100
101	11/12/2014	Sam	200

In the above o/ps the observation order is different for Amt variable. Beacuse of this I'm getting different o/p in my next step where I'm using last.Id to fetch the last observation from each ID.

Note: If I convert the permament dataset to Work dataset then the observation order is matching as like the dataset from DB2. But I want to match the order that I'm getting when using permanent dataset.

Please help me to resolve this issue.

Thanks in advance!

-Ramesh

RW9 · Posted 11-12-2014 06:07 AM

Is your Permanent DET on the database? I am thinking that what is happening is to do with where the code is being executed. The DB2 setup may not need to sort the DET dataset to do the join, however the SAS SQL might add a sort element in (something like an index) to get faster merging.

TBH though I think your logic may also not help. The EMP dataset is a codelist, so you would be better off left joining EMP onto DET, rather than the way you have it currently. It may also remove the need for sorting as the primary dataset would be DET which is unsorted.

ramesh_it · Posted 11-12-2014 07:55 AM

No. the permanenet DET is not from database.

I just used these tables to explain the issue and it is not the exact scenario. Actually more tables are being LEFT joined with the first table and if I change the joining order the business logic would get affected. Also only the columns from the unsorted dataset is not in proper order.

for eg.: Take the below queries.

Query 1:

libname perm "<path>";

proc sql;

create table output as select * from EMP left join perm.DET on EMP.Id=DET.Id left join <dataset> <conditions>....;

quit;

Query 2:

proc sql;

connect to db2(<connection string>);

create table DET as select * from connection to db2

( select * from <schema>.DET);

disconnect from db2;

quit;

proc sql;

create table output as select * from EMP left join DET on EMP.Id=DET.Id left join <dataset> <conditions>....;

quit;

Query 3:

libname perm "<path>";

data DET;

set perm.DET;

run;

proc sql;

create table output as select * from EMP left join DET on EMP.Id=DET.Id left join <dataset> <conditions>....;

quit;

/*********************************************************************************************************/

Here query 1 is the actual code and i converted the code as query 2. But the observation order from the both queries are not same for few columns that are from DET.

If I try the query 3, I'm getting the observation order as query2 and it is perfectly matching for all columns. But this not the case, I want to match the outputs of query 1 and query 2.

So i think the problem is not in db2, it should be in some other place.

-Ramesh

RW9 · Posted 11-12-2014 08:38 AM

Add into each of your statements:

proc sql _method _tree;

The above optiosn will show you the elements which are created by the given SQL. What happens behind the scenes is that all tables are looked at in order, indexes are pulled out for fast merging, sorts done etc. Also, have a look at the metadata table - SASHELP.VTABLE for each of the datasets in each event, particularly the variables: indxtype, sortname, sortchar. I.e. try to identify if there is an indicator in the metadata which would suggest to the compiler to use a sort method when it interprets the SQL.

Oh, to add, I agree with KurtBremser, you should be specifying a sort in your code, not leaving up to guesswork by the compiler. In the same way, avoid using * in select, it may work, but if tables have the same variables then you don't know what you will end up with.

Kurt_Bremser · Posted 11-12-2014 08:03 AM

If you _want_ a certain order, you have to _force_ it. If you don't make a determination, then don't expect a certain order.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Randomization error occurs while joining unsorted dataset with sorted dataset

Re: Randomization error occurs while joining unsorted dataset with sorted dataset

Re: Randomization error occurs while joining unsorted dataset with sorted dataset

Re: Randomization error occurs while joining unsorted dataset with sorted dataset

Re: Randomization error occurs while joining unsorted dataset with sorted dataset

Catch up on SAS Innovate 2026