Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Randomization error occurs while joining unsorted dataset with sorted dataset

Reply
New Contributor
Posts: 2

Randomization error occurs while joining unsorted dataset with sorted dataset

Hi,

Actually I’m trying to replace the permanent dataset in my code with the data from DB2. Also i'm using SQL pass through to fetch the data from DB2 and creating a work table.

In this,I’m facing one issue while joining an unsorted dataset with a sorted dataset.

Issue : The observation order getting differ in the outputs while joining same set of tables from different locations.

i.e.  While joining unsorted dataset from a permanent library with sorted dataset and joining unsorted dataset from work library with sorted dataset, the o/ps are different in terms of observation order. And how this observation order affecting is , I'm going to fetch the last observation of each primary key from the above result.

For eg.:

Let us take two tables "EMP" and "DET".

Here EMP is sorted table based on Id and DET is unsorted table. The join used herer is LEFT Join (EMP left join DET).

EMP                                            

Id

name

101

Sam

DET

Id

Date

Amt

101

11/12/2014

200

101

11/12/2014

100

The DET in Permanent library and Work Library has the same  observation order.

Output of join: (DET from permanent library).

Id

Date

Name

Amt

101

11/12/2014

Sam

200

101

11/12/2014

Sam

100

Output of join: (DET from work library)

Id

Date

Name

Amt

101

11/12/2014

Sam

100

101

11/12/2014

Sam

200

In the above o/ps the observation order is different for Amt variable. Beacuse of this I'm getting different o/p in my next step where I'm using last.Id to fetch the last observation from each ID.

Note: If I convert the permament dataset to Work dataset then the observation order is matching as like the dataset from DB2. But I want to match the order that I'm getting when using permanent dataset.

Please help me to resolve this issue.

Thanks in advance!

-Ramesh

Super User
Super User
Posts: 7,417

Re: Randomization error occurs while joining unsorted dataset with sorted dataset

Is your Permanent DET on the database?  I am thinking that what is happening is to do with where the code is being executed.  The DB2 setup may not need to sort the DET dataset to do the join, however the SAS SQL might add a sort element in (something like an index) to get faster merging.

TBH though I think your logic may also not help.  The EMP dataset is a codelist, so you would be better off left joining EMP onto DET, rather than the way you have it currently.  It may also remove the need for sorting as the primary dataset would be DET which is unsorted.

New Contributor
Posts: 2

Re: Randomization error occurs while joining unsorted dataset with sorted dataset

No. the permanenet  DET is not from database.

I just used these tables to explain the issue and it is not the exact scenario. Actually more tables are being LEFT joined with the first table and if I change the joining order the business logic would get affected. Also only the columns from the unsorted dataset is not in proper order.

for eg.: Take the below queries.

Query 1:

libname perm "<path>";

proc sql;

create table output as select * from EMP left join perm.DET on EMP.Id=DET.Id left join <dataset> <conditions>....;

quit;

Query 2:

proc sql;

connect to db2(<connection string>);

create table DET as select * from connection to db2

     ( select * from <schema>.DET);

disconnect from db2;

quit;

proc sql;

create table output as select * from EMP left join DET on EMP.Id=DET.Id left join <dataset> <conditions>....;

quit;

Query 3:

libname perm "<path>";

data DET;

set perm.DET;

run;

proc sql;

create table output as select * from EMP left join DET on EMP.Id=DET.Id left join <dataset> <conditions>....;

quit;

/*********************************************************************************************************/

Here query 1  is the actual code and i converted the code as query 2. But the observation order from the both queries are not same for few columns that are from DET.

If I try the query 3, I'm getting the observation order as query2 and it is perfectly matching for all columns. But this not the case, I want to match the outputs of query 1 and query 2.

So i think the problem is not in db2, it should be in some other place.

-Ramesh

Super User
Super User
Posts: 7,417

Re: Randomization error occurs while joining unsorted dataset with sorted dataset

Add into each of your statements:

proc sql _method _tree;

The above optiosn will show you the elements which are created by the given SQL.  What happens behind the scenes is that all tables are looked at in order, indexes are pulled out for fast merging, sorts done etc.  Also, have a look at the metadata table - SASHELP.VTABLE for each of the datasets in each event, particularly the variables: indxtype, sortname, sortchar.  I.e. try to identify if there is an indicator in the metadata which would suggest to the compiler to use a sort method when it interprets the SQL.

Oh, to add, I agree with KurtBremser, you should be specifying a sort in your code, not leaving up to guesswork by the compiler.  In the same way, avoid using * in select, it may work, but if tables have the same variables then you don't know what you will end up with.

Super User
Posts: 6,964

Re: Randomization error occurs while joining unsorted dataset with sorted dataset

If you _want_ a certain order, you have to _force_ it. If you don't make a determination, then don't expect a certain order.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Ask a Question
Discussion stats
  • 4 replies
  • 304 views
  • 0 likes
  • 3 in conversation