Re: How to revisit observations in DS2

JacobSimonsen · Posted 10-17-2014 07:53 AM

Hello everyone,

I wonder whether it is possible to revisit observations in DS2. The catch is that I don't know beforehand how many times I will need to revisit an observation. This is the case, for instance, when I use a Newton-Raphson procedure to maximize a function and don't know how many iterations are needed to reach convergence. In each iteration I need to go through all observations. The reason I want to use DS2 is to get access to the matrix package.

In the ordinary data step I know of two ways to get random access to observations: the open() function and the point= option. Simple examples of their use:

data test2; do obs=1 to 3;output;end;run;

data _NULL_;

dsid=open('test2');

call set(dsid);

do i=1 to 2;

do j=1 to attrn(dsid,'nobs');

obsid=fetchobs(dsid,j);

put obs=;

end;

end;

run;

data _NULL_;

do i=1 to 2;

do j=1 to 3 ;

set test2 point=j ;

put obs= ;

end;

end;

stop;

run;

Unfortunately, neither of these two methods is available in DS2.

FriedEgg · Posted 10-17-2014 08:10 AM

In DS2 you could get set key= functionality using the sqlstmt package.

Patrick · Posted 10-17-2014 08:28 AM

I don't have a lot of experience with DS2 - but isn't SQLSTMT only used in a Federation Server context?

Patrick · Posted 10-17-2014 08:25 AM

You could use the hash or hash iterator package. That should allow you to iterate through the data as many times as you need.
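A minimal, untested sketch of that idea, assuming the test2 table from the first post (one numeric column, obs) lives in WORK. The names h, hi, pass and rc are only illustrative. The table is loaded into a hash package once, and a hash iterator then walks it as many times as the outer loop asks for:

```sas
proc ds2;
data _null_;
  declare double obs;                /* host variable for the hash key and data */
  declare package hash h();
  declare package hiter hi('h');     /* iterator bound to the hash named h */

  method init();
    declare int pass rc;

    /* Load test2 into memory once. */
    h.keys([obs]);
    h.data([obs]);
    h.dataset('test2');
    h.ordered('a');                  /* ascending key order */
    h.defineDone();

    /* Two passes here; any loop condition (for example a convergence
       flag) could be used instead of a fixed count. */
    do pass = 1 to 2;
      rc = hi.first();
      do while (rc = 0);
        put pass= obs=;
        rc = hi.next();
      end;
    end;
  end;
enddata;
run;
quit;
```

Because the data sits in memory after defineDone(), the number of passes does not have to be known in advance.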

FriedEgg · Posted 10-17-2014 10:16 AM

You are confusing two things that are not necessarily related: the SAS Federation Server product and the SAS FedSQL language. Given their names, I don't blame you.

FedSQL is an implementation, by SAS, of ANSI SQL:1999 that attempts to be a 'vendor-neutral SQL' compliant with almost any DBMS that has a SQL interpreter. FedSQL in SAS is used as the SQL language in PROC FEDSQL and PROC DS2. These procedures and the FedSQL language are in no way dependent on the Federation Server product.

Patrick · Posted 10-17-2014 10:35 AM

Thanks. Good to know. I have definitely some more reading to do.

I've seen the value of DS2 for CI implementations using the Federation Server, and I am starting to understand some of the other areas where it could add value (proc hpds2). I haven't found a real-life application in my own line of work yet, but that may be only because I don't fully understand it. Still hoping for some exciting white papers showing how it's actually used in practice (I know that some solutions use it "in the background").

FriedEgg · Posted 10-17-2014 01:48 PM

One of the most common tasks I utilize PROC DS2 for is model scoring. Have seen significant benefits from the threading for these high compute tasks. Semantically, I also really like DS2. While it may take a few more lines of code to duplicate simple data steps, in many situations I find myself being able to overall reduce complexities, especially with highly recursive routines.

Some additional stream of consciousness thought:

From what I gathered at SGF2014, a lot of focus in DS2 is being placed on integration with Hadoop.

I am very excited about the recent release of the HTTP package and hope that non-fixed, traversable data types will come soon, so that I can start dynamically ingesting and creating JSON and XML data to interact with web service APIs more simply.

PGStats · Posted 10-17-2014 02:42 PM

Hi Fried,

what is a non fixed and traverse-able data type?

PG

FriedEgg · Posted 10-17-2014 03:29 PM

A better word may be non-sized rather than non-fixed. I mean a hierarchical, self-defining data structure, such as JSON or XML, that could be received in DS2 without prior knowledge of its size or components and then traversed as an iterable object to get the key/data components. Similar to a hash object, in essence, except that instead of putting a dataset into the object, I put a JSON/XML document into it...

JacobSimonsen · Posted 11-10-2014 12:18 PM

I couldn't get the hash-object solution to work as intended. Defining the hash object is not a problem, and reading data into it is easy enough. But how do I let each thread read a different part of the hash object? If I pass an argument to the thread, every thread gets the same argument.

The goal is that the threads should work on different parts of the data - in a way so data can be revisited.

jakarman · Posted 11-10-2014 01:35 PM

Jacob, it makes sense that the hash object is not shared between the different threads. It would only work with a SAS dataset structure, as threads running in an RDBMS are out of reach for DS2.

---->-- ja karman --<-----

gergely_batho · Posted 11-10-2014 03:21 PM

Maybe the only way to do it is to load the hash object with the hash.add() method inside the thread (do not use "datasource" when you declare the hash object!):

thread my_thread;

method run();

set my_input;

my_hash.add();

...

end;

endthread;

run;

This way all your hash objects in different threads will be different.

And probably you will use the by statement as well. I think all this makes sense only if you do by group processing.

At the beginning of a group (first.group_var=1) you clear your hash object ( my_hash.clear() ).

At every read you just store the data in the hash object (my_hash.add()).

At the end of a group (last.group_var=1) you start to analyze data in the hash object.
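The three steps above could be sketched as follows. This is untested and rests on assumptions that are not in the thread: my_input is sorted by group_var and has a numeric column x, and id, my_iter, pass, rc and the thread count of 4 are made-up names and values for illustration.

```sas
proc ds2;
thread my_thread / overwrite=yes;
  declare int id;                    /* hypothetical row counter, used as hash key */
  retain id;
  declare package hash my_hash();    /* no datasource: each thread fills its own */
  declare package hiter my_iter('my_hash');

  method init();
    my_hash.keys([id]);
    my_hash.data([id x]);            /* x is a hypothetical column of my_input */
    my_hash.defineDone();
  end;

  method run();
    declare int rc pass;
    set my_input;
    by group_var;

    if first.group_var then do;      /* start of a group: empty the hash */
      my_hash.clear();
      id = 0;
    end;

    id = id + 1;
    my_hash.add();                   /* store the current row */

    if last.group_var then do;       /* end of a group: revisit its rows */
      do pass = 1 to 3;              /* or loop until convergence */
        rc = my_iter.first();
        do while (rc = 0);
          /* analysis of the row now held in id and x goes here */
          rc = my_iter.next();
        end;
      end;
    end;
  end;
endthread;

data _null_;
  declare thread my_thread t;
  method run();
    set from t threads=4;
  end;
enddata;
run;
quit;
```

Each thread instance owns a private my_hash, so nothing is shared between threads and every group is analyzed from that thread's own copy of the rows.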

jakarman · Posted 11-10-2014 03:28 PM

Gergely, the question would be whether all threads can share the same memory set up by one of them (a hash lives in memory).

That would require a synchronisation process with locking, as in an RDBMS, and a shared entry point for the shared memory. I do not expect it to be there. These are system programming techniques close to kernel level. The multi-user address spaces were dependent on those.

Of course you can define a hash in every thread, but that forfeits the benefit of using a hash (collecting all random events) while putting pressure on the memory resource.

---->-- ja karman --<-----

gergely_batho · Posted 11-10-2014 03:43 PM

Hi Jaap,

I agree with you. DS2 threads do not share memory.

But my reply was just a short answer/idea on this topic: "The goal is that the threads should work on different parts of the data".

JacobSimonsen · Posted 11-15-2014 06:24 PM

I have now succeeded in writing a program where I can revisit the observations. I used a hash object that I define in each single thread, as discussed above.

Unfortunately the method is not time-efficient, but it was a fun experiment anyway. I think the problem is that all threads use the same memory resource (as Jaap points out). I observe that all four of my CPUs are in use during the process, but in total they use only about 25% of total CPU. The log shows that real time is only slightly smaller than CPU time.

Another problem is that the method is not very stable. Most often it runs without problems, but crashes are not rare.

I have enclosed the program for those who are interested. It is a simple estimation of parameters in a logistic regression where I have a by-group variable.

(only made for test purpose, my real problem is far more complicated than logistic regression)
