Solved: Re: Is there a LIMIT function in PROC SQL

caveman529 · Posted 11-02-2013 06:07 PM

I hope to achieve something below in SAS. The syntax is based on MySQL I believe:

SELECT name FROM id_name WHERE number BETWEEN number_start AND number_end LIMIT 1

The things is that the number_start and number_end are big. And there are millions of entries in the database. So without the limit function, the process is extremely slow. Thank you!

Patrick · Posted 11-02-2013 07:40 PM

I'm not aware of something like a Limit clause in the SAS SQL flavour.

You can limit the number of rows processed and returned by using the INOBS= and OUTOBS= options in PROC SQL.

INOBS= restricts the number of rows that PROC SQL retrieves from any single data source.

OUTOBS= restricts the number of rows that PROC SQL includes in the output.

So if you want to consider all rows in your query but limit the number returned, use OUTOBS=.

If the source table is SAS then I would use a data step. Examples follow.

data have;
  number_start=5; number_end=7;
  do number=1 to 10;
    output;
  end;
  stop;
run;
 
data want;
  set have(WHERE=(number BETWEEN number_start AND number_end));
  output;
  stop;
run;

proc sql outobs=1;
  select *
  from have
  WHERE number BETWEEN number_start AND number_end
  ;
quit;

View solution in original post

Patrick · Posted 11-02-2013 07:40 PM

I'm not aware of something like a Limit clause in the SAS SQL flavour.

You can limit the number of rows processed and returned by using the INOBS= and OUTOBS= options in PROC SQL.

INOBS= restricts the number of rows that PROC SQL retrieves from any single data source.

OUTOBS= restricts the number of rows that PROC SQL includes in the output.

So if you want to consider all rows in your query but limit the number returned, use OUTOBS=.

If the source table is SAS then I would use a data step. Examples follow.

data have;
  number_start=5; number_end=7;
  do number=1 to 10;
    output;
  end;
  stop;
run;
 
data want;
  set have(WHERE=(number BETWEEN number_start AND number_end));
  output;
  stop;
run;

proc sql outobs=1;
  select *
  from have
  WHERE number BETWEEN number_start AND number_end
  ;
quit;

caveman529 · Posted 11-02-2013 08:05 PM

Thank you so much. It is just that it is taking too long, and I'm sure what is the best way to deal with this. The issue is also discussed in detail in the following website, but it does not offer concrete solution other than I need to "optimize..."

SAS ANALYSIS: Why sometimes use DATA Step instead of PROC SQL

Patrick · Posted 11-02-2013 09:15 PM

If your source data is stored in a data base and you know the SQL syntax there to speed up things then use pass-through SQL.

I believe there is no "one fits all" best solution but you need to decide on a case by case basis what works best.

jakarman · Posted 11-03-2013 03:36 AM

I needed to go back to your intention of your question. SQL SELECT TOP, LIMIT, ROWNUM As reading that it looks to return the number of observations specified by limits.
That is the same functionality as Patrick has given with the use of: outobs=

The outobs and more is describe at: SAS(R) 9.4 SQL Procedure User's Guide (Using PROC SQL Options to Create and Debug Queries)

Your link in SAS Analyses is a nice one as it describes very nice the difference in all methods. It does not tell you the backgrounds an why to think about those.
Patrick is right again: "I believe there is no "one fits all" best solution"

One thing to think about tis start with those differences.

Proc SQL:

The proc SQL is multihtreading and not aware of the order of the dataset.

Joining/merging can occur by an external DBMS when coding SQL (implict or explicit).

SAS datasteps:
The SAS-datasteps are single line executing processing and can take all the advantages of the order of the datasets.

Joining/merging must occur in the SAS execution environment

Back to you question. I did not see the complete SQL you are working on, or description on the problem your are working on.

If you are joining/merging in the SQL and wat 1 record back as output. -> USE proc SQL. But remember this one could be a random one as you are not sure of the input order.

If you are just subsetting a big dataset on some condition and the dataset is of the SAS type. Think about using the SAS data-step.

If you are just subsetting a big dataset in DBMS think about subsetting it in DBMS (where) first and do in next step more of your analyses.

---->-- ja karman --<-----

caveman529 · Posted 11-05-2013 04:40 PM

Hi, Jaap:

The dataset are all small, about 50 MB in size. But somehow the the SQL is trying to do Cartesian product. For example, the my identifiers are numeric like below:

number

1077857536

1074971904

1264943360

3236621056

1658826240

2927696128

2927735552

645046016

The numbers are large, apparently. I hope to know whether these numbers are between. Because there are millions of entry in my data, it takes about 12 days to complete the computation. Do you think there is some method to specify format so that the program will be more efficient? The only thing I care is whether the number fall between the number_start and number_end. Thanks -

number_start, number_end,

16843264,16844799,

16850944,16859135,

16859136,16875519,

16875520,16908287,

16909056,16909311,

16909312,16909567,

16912384,16941055,

16941056,16973823,

Vince28_Statcan · Posted 11-05-2013 05:29 PM

I fail to see the cartesian product...or do you mean that you have a second dataset with a list of number_start, number_end pairs for which if "number" is between any of the pair in the second dataset, you want to retrieve that name?

Like could you provide a little more context to your data? You said you had small datasets in the 50mb range yet there are millions of entry. It feels incompatible.

Tom · Posted 11-05-2013 06:49 PM

You could generate a format from the start/end pairs.

data format ;

fmtname='MYFMT';

label = '1';

input start end ;

cards;

16843264 16844799

16850944 16859135

16859136 16875519

16875520 16908287

16909056 16909311

16909312 16909567

16912384 16941055

16941056 16973823

run;

proc format cntlin=format;

run;

data x;

set mydata ;

where put(number,myfmt.)='1' ;

run;

Tom · Posted 11-05-2013 06:43 PM

That really looks like the type of code you might use in a Web or other interactive application where you want to display a little bit of data and wait for the user to click on something. I normally think of data processing with SAS as something that is applied to the entire set of observations in one operation.

jakarman · Posted 11-06-2013 01:35 AM

I have the same question as Vince28. I like to understand the real issue/question and the limitations you are running on

What we know now is:

- The input - dataset(s) one or more?

Are relative small ca 50MB. How many/record/observations? Size of the record? is there a unique key?

It is a having a number-field number.

- There must by a big dataset containing interval for the numbers number-start number-end

How big is that dataset MB's, records? What type is it SAS or external DBMS? If it is an external DBMS what type is it? Are there indexes defined or can they be defined?

Are there unique keys present?

On what system you are running. The interesting part is whether data fits easy on that or are there limitations make life more difficult. (work size CPU-s? memory etc.

By that choices can be made for optimal approaches, like:

1/ get to batch processing with a SAS-datastep apporach.

It is different to SQL in supporting first/last logic and having retain functionality. Tis is optimal for an ordered analyses on data not possible with SQL

Have the datasets ordered sorted first and use the first/last logic of the SAS language. Even with several 10's Gb's of data it should able to run within some hours.

2/ having unique keys present everywhere isolate the described selection process on temporary datasets just having those numbers and keys.

With the resulting datasets go back with an indicator on the key. This is splitting your process to more easy to handle partial steps.

This is an approach wiht some programming overhead that can dramatically reduce needed computerresourcers

3/ The numberstart-numberend can be logical wanted in the dataset. the disadvantage is having a combination of values in one record.

A possible idea is to have a dataset for that wiht number / unique (foreign key) and an indicator start/end (ordered).

4/ You do not need to go for an external DBMS when working wiht keys/indexes and constraints. SAS and SAS dataset also are having all those things.

5/ In the case you have to keep it to an external DBMS process, try to reverse some of your logic or look for special functions on those side.

Most DBMS-system are designed for single record processing (OLTP).

This is very very slow or analytics when using all data but as is designed for small number of records it could fit your question.

When there is a function for 1 record that works fine, repeat this for all your input records.

The cartesian product is not direct visible. I think you got a note on that.

It could be possible been caused by the between function. is no joins with a between are possible it must be solved in other way.

The SQL solution to something not been able to solve in join/set logic is creating a cartesian product as it will logical solve the question. The disadvantage is a tremendous resource usage.

Still many more ideas and options to think about wiht mor detailed information. SAS has that lot of choices arround you can be occupied years on that.

---->-- ja karman --<-----

caveman529 · Posted 11-12-2013 04:01 PM

Wow, thanks ! I need to think more about this following our discussion !

Classroom Training Available!