05-22-2012 10:32 AM
I am wondering if there is anyone out there who has used DI Studio to build a data warehouse in SQL Server.
The specifics of my situation are that I am using DI Studio 4.4 (with SAS 9.3) to build a warehouse which is stored in SQL Server 2008. We are using SAS/ACCESS Interface to OLE DB to connect to SQL Server from SAS (because 1. it was recommended by Tech Support, and 2. it's our only licensed option). We are using the SQLNCLI10.1 provider as recommended here.
We have hit on a number of issues doing this, mostly to do with SQL Server data types, particularly for DATE / TIME / DATETIME columns.
We have got around quite a lot of these by using explicit pass-thru SQL, in DI User Written Code, to "manually" create the SQL Server table headers with the data types that we want before loading any data into them. These boil down to issuing explicit CREATE TABLE statements through the OLE DB connection, with FLOAT columns to hold the raw SAS date / time / datetime numbers.
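A minimal sketch of the kind of pass-through step involved (library, schema, column names and connection options are all illustrative, not our actual code):

```sas
/* Connect to SQL Server via the OLE DB engine and create the table
   header with explicit data types before any load step runs.
   Dates / times / datetimes are stored as FLOAT, i.e. as raw SAS
   date/time/datetime numbers. */
proc sql;
  connect to oledb as sqlsrv
    (provider=SQLNCLI10 /* remaining connection options omitted */);
  execute (
    create table dw.customer_dim (
      customer_sk   int          not null,
      customer_id   varchar(20)  not null,
      valid_from_dt float        not null,  /* SAS datetime number */
      valid_to_dt   float        not null
    )
  ) by sqlsrv;
  disconnect from sqlsrv;
quit;
```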
We then amend the SAS metadata for the SQL Server tables to include appropriate DATE / TIME / DATETIME formats so we see formatted values when we view the tables from DI Studio (which is a neat trick). Potentially the SAS data set options DBTYPE and SASDATEFMT could also be of use, but we've not used them.
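For reference, the two data set options mentioned can be sketched as follows; this is untested in our setup and the table and column names are illustrative:

```sas
/* When creating a DBMS table, DBTYPE overrides the default SQL Server
   data type that SAS/ACCESS would otherwise choose for a column. */
data sqlsrv.customer_dim (dbtype=(valid_from_dt='datetime2(3)'));
  set work.customer_dim;
run;

/* When reading a DBMS table, SASDATEFMT tells SAS/ACCESS which SAS
   date/datetime format to associate with a column. */
proc print data=sqlsrv.customer_dim
  (sasdatefmt=(valid_from_dt='DATETIME22.3'));
run;
```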
However we still have a few issues that we are working on, particularly with the SCD Type 2 Loader. We are keen to push as much of the processing as possible to SQL Server, as we will have a fairly short window for running what will eventually be a daily incremental load process. The SCD Type 2 Loader supports pass-thru via some of its options; however, we keep getting errors.
The errors are related to temporary tables created by the Loader in SQL Server. The Loader creates these tables using a SAS Data Step, which means that our datetime variables (for valid from and valid to) are created with a SQL Server data type of DATETIME2 (rather than FLOAT, which is what we've used elsewhere). Unfortunately, when SAS tries to append data into these tables an OLE DB error is thrown, because the SQL Server DATETIME2 columns cannot cope with the fractional part of the datetime values SAS is trying to write. The exact error is:
ERROR: Execute error: IRowsetFastLoad::InsertRow failed. : The fractional part of the provided time value overflows the scale of the corresponding SQL Server parameter or column. Increase bScale in DBPARAMBINDINFO or column scale to correct this error.
Unfortunately there is nothing in the Loader options which allows me to control how these temporary tables are created, so I can't use the techniques described above. I am also not keen to amend the code generated by the SCD Type 2 Loader transform by hand to create the SQL Server temporary tables correctly (although I have done this with success: a job which took 45 minutes without pass-thru was reduced to 13 seconds), or to create a new custom SQL Server SCD Type 2 Loader transform.
I have a track open with SAS Technical Support but I wondered if anyone had seen this before.
If anyone is interested enough I have attached a few files to this post which should allow you to recreate the error using DI Studio and SQL Server. The SAS program creates the tables required and the package file contains the job (you'll need to register the libraries and tables in metadata first).
Thanks in advance!
05-22-2012 11:10 AM
Yes, we are using DI and SQL Server (DI 4.2 though).
And yes, we have had some problems with the Type 2 Loader, but not the same ones. In our case, the temporary tables used are loaded to SQL Server using their own libname statement, which in our case was totally corrupt: all parameters in the connection string were duplicated, which led to syntax errors.
We resolved the problem by unchecking the option to issue SQL pass-thru code. That works fine in our case since our data volumes are small.
05-22-2012 11:27 AM
Thanks for your reply. Unfortunately our data volumes are large and our load window relatively tight, so switching off pass-thru SQL is not an option for us.
In terms of the issue you encountered, my first thought is to check the metadata for the libraries, and also to avoid using INIT_STRING in the metadata for an OLE DB library; this has caused me similar issues in the past, albeit in a different context.
05-24-2012 06:59 AM
We have found a feasible workaround for this Type 2 Loader issue. If you strip all formats from the SAS metadata of the SQL Server target table then you can successfully use pass-through with the Type 2 Loader. This is not ideal but is a better option than having to change the generated code.
In order to see the table with formats, simply register the same physical table a second time in metadata and add the formats to this second metadata object. Do not use this second object in any jobs, though.
SAS Institute have informed me a hot fix is planned for this issue, but no schedule is available yet.
09-13-2012 10:44 AM
A follow-up to this posting about things to be wary of when using some of the workarounds detailed above.
Effectively, by stripping all the formats from the SAS table metadata to get the DI Studio loader transforms to work, we have "tricked" SQL Server into storing date / time / datetime values as raw SAS date / time / datetime numbers in FLOAT columns. This causes issues further down the line when, for example, you run a query against the SQL Server tables from something like Enterprise Guide. Where it can, EG optimises processing by using implicit pass-through SQL to run queries in the database. For example, the following query is likely to be implicitly passed to SQL Server to execute (the code is similar to that generated by the EG Query Builder task):
create table work.summary as
select year(date) as year, count(*) as count
from dblib.mytable   /* illustrative source table */
where date between '01JAN2011'd and '31DEC2012'd
group by 1;
This query uses the YEAR function in the SELECT clause to extract the year from a date value, and SAS date literals in the WHERE clause. Neither of these prevents the query from being passed through to the database, as SAS is able to translate the syntax into the SQL Server equivalents before passing the query on. The issue arises after the query has been passed through: the SQL Server equivalents of the YEAR function and the SAS date literals expect to operate on a SQL Server column of an appropriate date type (there are several date types in SQL Server), but of course we've stored the dates as numbers (in FLOAT columns, to be precise), so SQL Server cannot run the query and it fails with type errors. The date literals could be replaced by the SAS date numbers, but so far the only way we have found around the YEAR function issue is to bring the required data into a SAS table (a view does not suffice) and then run the query of interest against the SAS table.
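If you do need to query the FLOAT columns directly in SQL Server via explicit pass-through, one option is to convert the raw SAS date number back to a native date in the query, since SAS date numbers count days from 01JAN1960. A sketch (table and column names hypothetical):

```sql
/* Convert a raw SAS date number stored in a FLOAT column back to a
   SQL Server date, after which the native YEAR function works. */
select year(dateadd(day, cast(order_dt as int), '19600101')) as order_year,
       count(*) as order_count
from dw.orders
group by year(dateadd(day, cast(order_dt as int), '19600101'));
```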
Another issue we have noticed is that when the results of a query are returned to Enterprise Guide, the formats in the metadata for the SQL Server table are sometimes ignored, so the raw data values are returned instead of the formatted values. We suspect this happens when the query is implicitly passed to SQL Server (which has no knowledge of the formats in the SAS metadata) and the results are passed back to Enterprise Guide without the formats being picked up from metadata.
Registering the tables in metadata a second time with formats also proved not quite to work. When we first did this we left the library the same for both sets of metadata, and found that in EG the formats were not applied even when accessing the metadata object that had them. The resolution was to register a separate library in SAS metadata (pointing to the same location) and to change the physical storage properties of the table metadata with formats to use this library instead.
Hopefully we will see tighter integration between SQL Server and SAS in the future.
09-14-2012 08:02 AM
I have encountered similar issues in the past and never found a nice fix. We basically circumvented the whole issue with type conversions and used character variables.
On the projects that used DI and similar tools, we ended up doing it in stages. The first question is whether DI is the primary interface for loading data going forward, or just for design. In all but one case, we used DI and similar tools for design and built separate standard loading processes. The standard loading process would first create a SAS WORK data set with all variables as character strings and then load that data set into temporary staging tables. A SQL Server stored procedure would then perform all the type transformations and derivations when loading the main tables. As the projects matured, the teams added data validation in both the SAS loader processes and the database procedures.
If DI will be used for loading as well, just have the stored procedure execute at the end.
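A hedged sketch of that staging pattern, with all object names hypothetical and the conversions reduced to the bare minimum:

```sql
/* The staging table is loaded from SAS with every column as character;
   this procedure performs the type conversions into the main table. */
create procedure dw.load_orders_from_stage
as
begin
  insert into dw.orders (order_id, order_dt, amount)
  select cast(order_id as int),
         convert(date, order_dt, 112),   -- expects YYYYMMDD strings
         cast(amount as decimal(12, 2))
  from stage.orders_char;

  truncate table stage.orders_char;
end;
```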
09-14-2012 08:35 AM
Thank you for your reply, you make some interesting suggestions which I will certainly bear in mind for the future.
In our case we are using SAS DI Studio as the primary tool for loading the data. We have a complex data warehouse updated in a daily incremental manner, employing slowly changing dimension (types 1 and 2) techniques and other complex transformations. As such I would not be keen to try to implement stored procedures in SQL Server to replicate what DI Studio should be able to do anyway (in my opinion).
11-07-2012 09:35 AM
We've encountered the exact same error message. It turned out that if we used the DATE9. and DATETIME22.3 SAS formats for our date and datetime variables, everything went smoothly. This might be down to our specific time variables, though.
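In case it helps anyone reproduce this, attaching those formats in the step that builds the source table can be sketched like so (data set and column names illustrative):

```sas
/* Assign the formats that the loader tolerates before the table is
   registered and used as an SCD Type 2 Loader source. */
data work.customer_src;
  set work.customer_raw;
  format valid_from_dt valid_to_dt datetime22.3
         birth_dt date9.;
run;
```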
We are also keen to push as much of the processing as possible to SQL Server. By selecting pass-thru on the Options tab of the SCD Type 2 Loader, only the SQL table update is submitted to the database. A lot of pre-processing (e.g. work.etls_newrcds) is still done in SAS (DI Studio 4.21), compromising performance. The only option we can see would be to manually amend the code generated by the SCD Type 2 Loader transform, and that's several steps too far, in our opinion.
Did you find some way to avoid using SAS processing in the SCD2 loader to bring your job from 45 minutes to 13 seconds?
11-07-2012 12:52 PM
Thanks for your feedback. I was actually about to update this topic, as following a lot of trial and error we have eventually found something that works!
Our first findings match yours: using DATE9. for dates and DATETIME22.3 for datetimes certainly smooths things over. We've just been asked to add a time variable to the warehouse, so we're figuring out the best time format to use for that; TIME5. looks like a good candidate.
Of course these are not the formats our clients want to view the data with, so we're still registering the metadata for the tables twice, in different metadata folders, each with different formats on the columns. To make that work we've also found that you need to register two separate libraries in metadata (pointing to the same physical location in SQL Server) and use one library for each set of metadata; otherwise the formats on one set of metadata seem to be ignored.
In order to push as much of the processing as possible to SQL Server, make sure you specify both the "Use SQL Passthru" option and the "Upload library before target table update" option in the SCD2 Loader, ensuring that for the latter you use the same library that the target table is in. It would probably pay to set the "Use BULKLOAD=YES when adding records to temporary database tables" option to yes as well, although we haven't yet (I think I might change that!).
I agree this should be a lot easier to do! I'm hoping SAS Institute will build in extra support for SQL Server in future versions.
11-07-2012 02:40 PM
Mark, are you aware that Microsoft have announced they will not be supporting OLE DB beyond SQL Server 2012? In a recent SAS install we switched to using ODBC as a consequence of that announcement. So far our experience with ODBC is positive but like you would be keen to see extra support for SQL Server in the future.
Our main problem at the moment is the lack of support for table and column names more than 32 characters long. If you are doing the loading yourself from SAS then this is not a problem, as you just make sure you stay within the 32-character limit. However, if you are reading existing SQL Server databases it is very common practice to go over 32 characters. SAS's ability to deal with this is very rudimentary: pass-thru is the only option with long table names.
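For anyone hitting this, explicit pass-through for a long table name can be sketched like this (the data source and table names are made up for illustration):

```sas
/* Explicit pass-through is the only way to reference SQL Server
   object names longer than SAS's 32-character name limit. */
proc sql;
  connect to odbc as sqlsrv (datasrc=dwh);  /* DSN illustrative */
  create table work.extract as
    select * from connection to sqlsrv (
      select *
      from dbo.Customer_Account_Transaction_History_Detail
    );
  disconnect from sqlsrv;
quit;
```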
Thanks for the very useful post.
11-08-2012 02:58 AM
Another option to speed up processing is to use permanent cross-reference tables. This can be beneficial if you have many updates for your business key. This is DB independent, but you could store the table in the same SQL Server schema.
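A sketch of such a cross-reference table (all names hypothetical):

```sql
/* Permanent business-key to surrogate-key lookup, maintained across
   loads so that repeated updates to the same business key can be
   resolved without scanning the full dimension table. */
create table dw.customer_xref (
  customer_id varchar(20) not null primary key,  -- business key
  customer_sk int         not null               -- current surrogate key
);
```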