FreelanceReinh
Jade | Level 19

This is indeed a situation where the suggestion "Allow Topics to have More Than One Response marked as the Solution" would be applicable.

 

As far as I can see, @PGStats's SQL approach is very elegant and correct, and @hashman's hash approach (with either of my corrections applied) is very efficient and correct.

 

Example: With the HAVE dataset containing the 6**6=46656 combinations of '1', ..., '6' (analogous to the 5**5=3125 in one of my posts), the PROC SQL step took more than two minutes on my computer, whereas the (corrected) DATA step using the hash object finished in under 0.1 seconds and, of course, produced the same WANT dataset containing 7826 observations, which is the number to be expected according to the OEIS (see the link in the other post).
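
For reference, one way such a test dataset could be built (a sketch only; the exact layout of NAME used in the earlier posts is not reproduced here, so plain six-character digit strings are assumed, with NUM added as the row counter the SQL approach relies on):

data have;
  length name $ 6;
  do i1 = 1 to 6; do i2 = 1 to 6; do i3 = 1 to 6;
  do i4 = 1 to 6; do i5 = 1 to 6; do i6 = 1 to 6;
    num + 1;                              /* monotonically increasing row id */
    name = cats(i1, i2, i3, i4, i5, i6);  /* e.g. "111111" ... "666666"      */
    output;
  end; end; end; end; end; end;
  keep num name;
run;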

PGStats
Opal | Level 21

@FreelanceReinh your assessment is fair. There is no doubt in my mind that the hash algorithm is more efficient than the SQL query. But I would like to point out that the 6**6 data represents a worst-case scenario for the SQL algorithm, because every string built this way has the same length, which forces every string to be searched for in half of the other strings, on average.

 

Thanks to all participants. This has been a great discussion.

PG
FreelanceReinh
Jade | Level 19

@PGStats wrote:

But I would like to point out that the 6**6 data represents a worst-case scenario for the SQL algorithm, because every string built this way has the same length, which forces every string to be searched for in half of the other strings, on average.


Thanks, this is interesting and a good point. I hadn't analyzed the algorithms in terms of their sensitivity to different types of input data. The "n**n" datasets were just an attempt to quickly create large input data covering a range of possible cases (not enough cases, as I had to realize :-), not to mention that substring matches within words are obviously impossible there) and with a somewhat predictable outcome, thanks to combinatorics/OEIS.

hashman
Ammonite | Level 13

@FreelanceReinh:

Thank you for having tested the case.

 

Another thing distinguishing the DATA step from SQL vis-a-vis this task is that SQL relies on the distinct, monotonically increasing variable NUM (it cannot work without it), while the hash approach couldn't care less. Of course, if NUM were absent, it could be added beforehand at the expense of another pass through the data (or by creating a view, but in this case I'm afraid that could hinder SQL performance rather heavily). Hence I wonder, and perhaps @PGStats could answer this best, whether the MONOTONIC() function could be used in this query to simulate NUM, and if so, how.
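
To make the two alternatives concrete, a sketch (not from the original posts; the view name is hypothetical):

/* option 1: add NUM in an extra pass over the data */
data have_num ;
  set have ;
  num = _n_ ;
run ;

/* option 2: the same logic as a DATA step view (no extra pass until the view is read) */
data have_v / view=have_v ;
  set have ;
  num = _n_ ;
run ;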

 

Kind regards

Paul D.  

PGStats
Opal | Level 21

I wouldn't touch the monotonic() function with a 10-foot pole. If all we have is the name field, then we could use:

 

proc sql;
  /* keep NAME unless some other NAME of the same length (and lexicographically */
  /* smaller) contains it as a substring of that name doubled, with a "."       */
  /* placed between the two copies                                              */
  create table want as
  select unique name
  from have as a
  where not exists (
    select * from have as b
    where a.name > b.name and length(a.name) = length(b.name) and
          index(cats(b.name, ".", b.name), trim(a.name)) > 0 );
quit;

Otherwise, if we must handle other input fields (possibly non-identical for matching names), then I would drop the unique keyword and follow the query with proc sort nodupkey.
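
A sketch of what that variant might look like (WANT0 is just a hypothetical name for the intermediate table):

proc sql;
  create table want0 as
  select *
  from have as a
  where not exists (
    select * from have as b
    where a.name > b.name and length(a.name) = length(b.name) and
          index(cats(b.name, ".", b.name), trim(a.name)) > 0 );
quit;

proc sort data=want0 out=want nodupkey;
  by name;
run;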

PG
hashman
Ammonite | Level 13

@PGStats: Thanks for the amended query; I see.

 

>I wouldn't touch the monotonic() function with a 10-foot pole.<

 

In this particular case or in general?

IMO, it has a number of good uses - and not necessarily in proc SQL.

 

PGStats
Opal | Level 21

>I wouldn't touch the monotonic() function with a 10-foot pole.<

 

In this particular case or in general?

IMO, it has a number of good uses - and not necessarily in proc SQL.

 


In general, as long as it is surrounded by rumors of undocumented flaws. I prefer reliability to performance.

PG
hashman
Ammonite | Level 13

@PGStats:

Hm. I've heard rumors that monotonic() may do something untoward in a complex SQL query, such as returning nulls or zeros. However, outside SQL I've used it a gazillion times over at least 20 years and have never detected any "flaws", i.e. anything indicating it had done something I hadn't expected. Surely, calling monotonic() in a DATA step flawlessly returns the next integer up by 1 every time it's called (tested at least up to 1E12), and the counter is reset at the next step boundary. It's true, though, that it's undocumented. Suum cuique.
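
For what it's worth, a quick DATA step illustration of the behavior described above (MONOTONIC() is undocumented, so this reflects observed behavior rather than a documented guarantee):

data _null_ ;
  do i = 1 to 3 ;
    m = monotonic() ;   /* returns 1, 2, 3 within this step */
    put m= ;
  end ;
run ;

data _null_ ;
  m = monotonic() ;     /* per the observation above, the counter restarts here at 1 */
  put m= ;
run ;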

hashman
Ammonite | Level 13

@FreelanceReinh:

>No, only N and n are the same. _n_ (=_N_) is an automatic variable (see documentation) in every DATA step. It counts DATA step iterations (but can also be used for other purposes by programmers who know what they're doing) and is dropped automatically.<

 

I'd add that the reason _N_ can be used for any other purpose between the top and the bottom of the step is that it is always set to the correct value at the top of the implied loop from an internal counter; i.e., it doesn't depend on whether or how _N_ is modified within the body of the implied loop. A simple example:

data _null_ ;
  put "Top  of implied loop: " _n_= ;
  set sashelp.class (obs=3 drop=_all_) ;
  _n_ = _n_ * 101 ;
  put "Body of implied loop: " _n_= ;
run ;
-----------------------------
Top  of implied loop: _N_=1
Body of implied loop: _N_=101
Top  of implied loop: _N_=2
Body of implied loop: _N_=202
Top  of implied loop: _N_=3
Body of implied loop: _N_=303
Top  of implied loop: _N_=4

Another very good "unused" auto (and auto-dropped) variable is _IORC_. Moreover, it is set to 0 by the compiler and auto-retained. In any step where it is not used for its direct purpose, i.e. receiving the return code from a KEY= index search, it can be used for anything else. In particular, I find it quite useful as a receptacle for the return code when a hash object method or attribute is called.
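
Just to make that concrete, a sketch (not from the thread) using SASHELP.CARS as stand-in data, with _IORC_ receiving the ADD method's return code:

data first_per_make ;
  if _n_ = 1 then do ;
    dcl hash h() ;
    h.definekey("make") ;
    h.definedata("make") ;
    h.definedone() ;
  end ;
  set sashelp.cars ;
  _iorc_ = h.add() ;   /* 0 when MAKE is new, nonzero when it is already in the table */
  if _iorc_ = 0 ;      /* keep the first row per MAKE; _IORC_ is dropped automatically */
run ;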

 

But that's not all! There are other truly unused numeric auto variables: FIRST.X and LAST.X, as long as X is not in the BY statement. All of them are set to 0 by the compiler (if initialized, otherwise to missing) and auto-dropped. Curiously, such seemingly odd variables as FIRST._N_, LAST._IORC_, and FIRST._ERROR_ are fully usable, too (the latter, along with LAST._ERROR_, is particularly good for creating confusion).

 

As for _ERROR_ itself, though usable, I never use it for anything other than its original purpose (except for resetting it to 0 after an unsuccessful KEY= index search), for obvious reasons.
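
For illustration, a sketch of that KEY= idiom with hypothetical datasets TRANS and MASTER (MASTER is assumed to have a simple index on ID):

data matched ;
  set trans ;
  set master key=id / unique ;
  if _iorc_ ne 0 then do ;  /* ID not found in MASTER                            */
    _error_ = 0 ;           /* reset so the failed lookup doesn't dump the PDV   */
    delete ;                /* discard the unmatched row                         */
  end ;
run ;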

 

Lastly, the elements of temporary arrays offer a virtually unlimited number of retained auto-dropped temporary variables - both numeric and character - if need be.    
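
A small illustration of that last point (SASHELP.CLASS is used only as handy demo data):

data _null_ ;
  array nscr [2]     _temporary_ ;   /* numeric scratch slots, retained and auto-dropped */
  array cscr [1] $ 8 _temporary_ ;   /* character scratch slot                           */
  set sashelp.class end=eof ;
  nscr[1] = sum(nscr[1], age) ;      /* running total of AGE */
  nscr[2] = sum(nscr[2], 1) ;        /* row counter          */
  cscr[1] = name ;                   /* last NAME seen       */
  if eof then put "total age=" nscr[1] " rows=" nscr[2] " last name=" cscr[1] ;
run ;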

 

Kind regards

Paul D.

 

 

hashman
Ammonite | Level 13

@Ivan555:

>When writing the topic I did not expect to get such a detailed response.<

 

You presented a curious logical/programming puzzle, and things of this nature always lure a number of folks here who love to have their minds challenged.

 

>I am a little confused which one I should mark as a solution. I think it might be fair ...< 

 

What I think would be "fair" (and I've already stated it on this forum earlier) is being able to mark more than one reply as a solution. It would be even fairer if not only the OP but others, too, could do that, perhaps with the OP's mark given somewhat more weight. This way, people searching this forum for solutions could see different approaches more readily and choose those more to their liking. Under the existing system, only one approach is highly visible, while others that also lead to a correct solution are not, which may even create the false impression that only the approach marked as the solution is correct.

 

Kind regards

Paul D. 

 

 

ChrisNZ
Tourmaline | Level 20

Thank you all for the great discussion. It's fantastic to be in such good company! 🙂

hashman
Ammonite | Level 13

@ChrisNZ: I second your notion enthusiastically ;).

