Kleber_Augusto
Fluorite | Level 6

Hello,

My English is not very good, but I will try to communicate...

This is about the creation of composite vs. simple indexes.

My questions:

1 - If I create a composite index and then, in a join, I cross tables using only one of the fields present in that composite index, performance will not be good, correct? In the future I will use the other fields in the composite index, but for one particular join that I have to do right now, I only need one of them. Would it be better to create a simple index for the field I need now, and a second, composite index for the fields that will be used in the future?
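For example, what I have in mind is something like this (dataset and field names are just placeholders):

proc datasets library=work nolist;
   modify customers;
   /* simple index on the field needed for today's join */
   index create cpf;
   /* composite index for the fields planned for later use */
   index create cpf_period = (cpf period);
quit;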

2 - The second question involves the following situation: I have a table of 30,000 records that will be the left side of a left join; the other table, the right side of this left join, has no less than 187 million records. Both contain CPF (it's the name of the field we use in Brazil to identify a person, like an ID). My question is: the customer does not yet know what information they want to extract. I think that instead of joining against a giant table like this, I should use a copy of it divided into periods (base1, base2, base3, ...), or even a copy with inactive customers already deleted. But, as I said, my problem is that the customer is still deciding what information they prefer to work with, for example for a future marketing campaign. Even so: joining 30,000 records (my left table) with 187 million records (my right table), even using an index, will take an eternity, correct?
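For illustration, a minimal sketch of an indexed lookup for this scenario, assuming hypothetical names SMALL (30,000 rows) and BIG (187 million rows, with an index on CPF), and STATUS standing in for whatever variable is kept from BIG; with this pattern SAS probes the index once per SMALL row instead of scanning all of BIG:

data matched;
   set small;                    /* drive with the 30,000-row table */
   set big key=cpf / unique;     /* one index probe per driver row  */
   if _iorc_ ne 0 then do;       /* CPF not found in BIG            */
      _error_ = 0;               /* clear the automatic error flag  */
      call missing(status);      /* blank the BIG-side variable(s)  */
   end;
run;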


Thanks all.

Kurt_Bremser
Super User

Indexes improve performance only when a small subset of a dataset is needed. If the whole dataset has to be read anyway, other methods (depending on the needed result) will be better.

Regarding your 187 million record dataset:

I'd try to store each time period in a separate dataset, sorted by CPF. That way you can do a direct merge.

If more than one period is needed, do

data several_periods;
set
  period_1
  period_2
  period_3
  /* ... remaining period datasets ... */
  period_n
;
by CPF;   /* interleave: the sorted inputs stay in CPF order */
run;

and several_periods will also be sorted by CPF.
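And a minimal sketch of the direct merge itself, assuming a hypothetical SMALL dataset with the 30,000 records, also sorted by CPF:

data result;
   merge small           (in=a)
         several_periods (in=b);
   by cpf;
   if a;   /* behave like a left join: keep every SMALL record */
run;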

ChrisHemedinger
Community Manager

Good advice from @Kurt_Bremser. To learn more about SAS data sets and when indexes (or indices) can make a difference, read one of the many excellent conference papers. Here's one from PharmaSUG 2014:

Indexing: A powerful technique for improving efficiency

Kleber_Augusto
Fluorite | Level 6

Good morning Kurt,

I agree with your suggestion, but it will take a long time to split this giant table into period tables, and the problem is that the client wants to start creating reports and graphs in a short time. I forgot to mention this other problem: the time the alternatives would take. Your idea could be used if the client tells me they already have these tables split by period; if that is possible, it would resolve a great part of my problem. I'll check with the client whether they have tables separated by period.

Thanks for your suggestion, I wish you a great day!

Kurt_Bremser
Super User

Dear Customer,

you want

1) a good solution
2) fast
3) for a reasonable price

?

Select any two.

😉

Separating the big dataset should not take more than two passes through the whole dataset: one to determine the periods present, and one to split it. The second pass both reads and writes, so your time will be roughly size_of_dataset / storage_throughput * 3 (plus the time needed to write the splitting data step).
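A minimal sketch of those two passes (BIG and PERIOD are hypothetical names; this assumes a numeric PERIOD whose values are usable as dataset name suffixes, e.g. 201801):

/* Pass 1: collect the distinct periods into a macro variable */
proc sql noprint;
   select distinct period
      into :periods separated by ' '
      from big;
quit;

/* Pass 2: one read of BIG, writing one dataset per period */
%macro split;
   %local i p;
   data
      %do i = 1 %to %sysfunc(countw(&periods));
         period_%scan(&periods, &i)
      %end;
   ;
      set big;
      %do i = 1 %to %sysfunc(countw(&periods));
         %let p = %scan(&periods, &i);
         if period = &p then output period_&p;
      %end;
   run;
%mend split;
%split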

Once you have your small chunks, sorting them one by one will be a breeze, compared to handling the whole beast.

What is the physical size of your dataset, anyway?

Kleber_Augusto
Fluorite | Level 6

Hello Kurt,

In my opinion, I'd like a reasonable price and a good solution, but my client surely wants a fast solution...

I talked with other SAS specialists in Brazil and they suggested breaking this giant table into smaller ones (by period, for example), or pressing my client to have their database team create these smaller tables (if they don't already exist) so that I receive those extracts. What do you think about this?

The physical size of my table (with 187 million records) is 200 GB, without using compress=binary.

Again, thank you for your contribution!

Astounding
PROC Star

If CPF is anything like a social security number, it would be character and contain a set of digits. (If not, you can always adapt this suggestion.) Have you considered creating your own version of parallel processing? Split up the 187M records according to the first character of CPF. Create 10 data sets, each with roughly 19M records, where the first has all CPFs that begin with "0", the second has all CPFs that begin with "1", etc. When a request comes in, split it similarly, generating 10 requests. Then match each of the 10 requests to the matching 19M data set. With some work, this might also let you run the 10 requests simultaneously rather than sequentially. Of course, you have to aggregate the 10 results later ... the most difficult part of that might be determining when all 10 requests are finished.
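A minimal sketch of that split, assuming a hypothetical dataset BIG and a character, zero-padded CPF variable:

data cpf0 cpf1 cpf2 cpf3 cpf4 cpf5 cpf6 cpf7 cpf8 cpf9;
   set big;
   select (substr(cpf, 1, 1));
      when ('0') output cpf0;
      when ('1') output cpf1;
      when ('2') output cpf2;
      when ('3') output cpf3;
      when ('4') output cpf4;
      when ('5') output cpf5;
      when ('6') output cpf6;
      when ('7') output cpf7;
      when ('8') output cpf8;
      when ('9') output cpf9;
      otherwise;   /* ignore malformed keys */
   end;
run;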

Kleber_Augusto
Fluorite | Level 6

Good morning,

Your suggestion is very relevant, but my problem involves the time to process this sort (to split the data set) against the time in which my client expects to start creating reports and graphs.

CPF is an ID for each Brazilian citizen... the field does not contain characters, only digits.

Talking with my client, we are thinking about some options, like:

1 - Go to the source database (we don't know which database it is at the moment), extract the 30,000 records from the 187M there, and I receive this extract, or

2 - Verify with the architecture team whether extracting the 30,000 records from the 187M directly on the mainframe would be faster than extracting from a database, or

3 - Today, for security reasons, I cannot connect SAS to a database; if that becomes possible, I can try to extract the 30,000 from the 187M before the join with the other table containing the same 30,000. But I don't know yet whether running this select in SAS would also take a long time...

jklaverstijn
Rhodochrosite | Level 12

Hi @Kleber_Augusto,

Is this a SAS dataset? Have you thought of using the SPD engine? Index performance could very well benefit, especially if you have multiple CPUs.
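A minimal sketch of such a library, with hypothetical paths; SPDE spreads the data and index files across directories and can use multiple threads when scanning:

libname speedy spde '/sasdata/meta'
   datapath=('/sasdata/data1' '/sasdata/data2')
   indexpath=('/sasdata/index')
   partsize=2g;

/* copy the big table (and its indexes) into the SPDE library */
proc copy in=work out=speedy;
   select big;
run;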

 

Also, it sounds like you are afraid performance will be bad but have not tested this yet to be sure. I suggest you do that, preferably together with the users; you may be surprised. I find that SAS datasets in general perform remarkably well. Also, some people might not consider 187 million rows a giant anymore 😉 Modern hardware and a well-designed storage system will make short work of these amounts of data.

 

Do keep in mind that a compound index will still be used if you use a subset of its variables starting from the left. So if the index is (a,b,c), it will be used if only (a,b) are involved, but not in the case of (b) or (b,c).
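To illustrate with hypothetical names, given a dataset BIG and a composite index ABC defined on (a b c):

data want;
   set big;
   where a = 1 and b = 2;   /* leading subset (a, b): ABC can be used */
run;

/* where b = 2 and c = 3 would NOT use ABC, because the   */
/* leading index variable A is missing from the condition */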

 

Hope this helps,

- Jan.

Kleber_Augusto
Fluorite | Level 6

Hi @jklaverstijn,

I agree with you; my project is still only a pilot, and because of that we don't have an appropriate server for this yet. But I agree that this table may not be so big in front of good hardware. That may become a reality for me in the next months, but for now it is only a pilot, and in this pilot the client wants to do a big join. As I said in my previous post, we decided to present three suggestions, and my client's architecture team (Bradesco Bank) will tell us what they can do in this case, for example trying to run this select directly on the mainframe or in the database.

Thanks for your help! Good morning!

TomKari
Onyx | Level 15

Hi,

Please email me at tom.kari.consulting@bell.net so we can have a private discussion.

Thanks!

   Tom

LinusH
Tourmaline | Level 20
Maybe I'm missing something here, but have you verified that the index is not used? options msglevel=i; and the PROC SQL option _method will help you verify how the SQL optimizer handles your query.
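A minimal sketch of that diagnostic, reusing the hypothetical SMALL/BIG names from above:

options msglevel=i;   /* log INFO messages, including index usage */

proc sql _method;     /* print the join strategy the optimizer picks */
   create table check as
   select a.*, b.status
      from small as a
           left join big as b
           on a.cpf = b.cpf;
quit;

The log then shows INFO lines about index selection and a _method tree (codes such as sqxjm for a merge join or sqxjhsh for a hash join) describing how the query was executed.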
Also, I don't think SPDE itself will trigger an indexed join strategy. But yes, its indexes are more efficient when created, and in some cases during queries.
If we are talking about SAS technology, you could look at SPD Server, since it has its own optimizer AND a specific join index!
Either way, data of this size needs a better storage strategy if speed is required.
But as @Kurt_Bremser put it: choose any two...
Data never sleeps
Kleber_Augusto
Fluorite | Level 6

Hi @LinusH,

I forgot to use those commands; I'll try them in my tests.

About SPD Server: maybe we'll have one in the coming months, because at the moment this is only a pilot project; that's the reason we still have only a small environment.

Many thanks for your ideas. Good afternoon!
