05-18-2016 07:45 AM - last edited on 05-18-2016 08:07 AM by ChrisHemedinger
My English is not very good, but I'll try to communicate.
I have two questions about creating composite versus simple indexes.
1 - If I create a composite index and then, in a join, I cross the tables using only one of the fields present in that composite index, the performance will not be good, correct? I will use the other fields of the composite index in the future, but for the join I have to do right now, I only need one of them. Would it be better to create a simple index on just the field I need now, and a second, composite index for the fields that will be used later?
2 - My second doubt involves the following situation: I have a base of 30,000 records that will be the left side of a left join; the right side of the join has no fewer than 187 million records. Both contain CPF (the name of the field we use in Brazil to identify a person, like an ID). My question is: the customer does not yet know what information they want to extract. I think that instead of crossing a giant base like this, I should use a copy of it divided into periods (base1, base2, base3, ...), or a copy with inactive customers already deleted. But, as I said, my problem is that the customer is still deciding what information they prefer to work with, for example for a future marketing campaign. Even using an index, crossing 30,000 records (the left base of my left join) with 187 million records (the right base) will take an eternity, am I correct?
05-18-2016 08:17 AM
Indexes improve performance when only a small subset of a dataset is needed. If the whole dataset has to be read anyway, other methods (depending on the required result) will perform better.
Regarding your 187 million record dataset:
I'd try to store each time period in a separate dataset, sorted by CPF. That way you can do a direct merge.
If more than one period is needed, do
data several_periods; set period_1 period_2 period_3 ... period_n; by CPF; run;
and several_periods will also be sorted by CPF.
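A minimal sketch of this approach, assuming hypothetical names: small_base is the 30,000-record driver table and period_1 through period_3 are the pre-sorted chunks.

```sas
/* Assumes period_1-period_3 are each already sorted by CPF.
   Interleaving with SET + BY produces output sorted by CPF. */
data several_periods;
   set period_1 period_2 period_3;
   by CPF;
run;

/* Direct left-join-style merge against the 30,000-record base,
   which must also be sorted by CPF */
data matched;
   merge small_base (in=inSmall) several_periods (in=inBig);
   by CPF;
   if inSmall;   /* keep every record of the small base */
run;
```

The merge reads each dataset only once, in order, so no index lookups against the 187M-row table are needed.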
05-18-2016 08:21 AM
Good advice from @KurtBremser. To learn more about SAS data sets and when indexes (or indices) can make a difference, read one of the many excellent conference papers. Here's one from PharmaSUG 2014:
05-18-2016 08:29 AM
Good morning Kurt,
I agree with your suggestion, but it will take a long time to separate this giant base into others by period, and the problem is that the client wants to start creating reports and graphs in a short time. I forgot to mention this other problem: the time the alternatives take. I think your idea can be used if the client tells me they already have these bases split by period; if that is possible, it would resolve the greater part of my problem. I'll check with the client whether they have separate bases by period.
Thanks for your suggestion, I wish you a great day!
05-18-2016 08:40 AM
) a fast solution
) a good solution
) for a reasonable price
Select any two.
Separating the big dataset should not take more than two scans through the whole dataset: one to determine the periods present in the dataset, and one to split it. The second run involves both reading and writing, so your time will be roughly size_of_dataset / storage_throughput * 3 (plus the time needed to write the splitting data step).
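Those two scans could look like the sketch below, assuming the big dataset is called bigbase and has a character variable named period whose values form valid parts of dataset names (all names are assumptions):

```sas
/* Scan 1: collect the distinct periods into a macro variable */
proc sql noprint;
   select distinct period into :plist separated by ' '
   from bigbase;
quit;

/* Scan 2: a single pass that writes each period to its own dataset */
%macro split;
   %local i n;
   %let n = %sysfunc(countw(&plist));
   data %do i = 1 %to &n; base_%scan(&plist, &i) %end; ;
      set bigbase;
      %do i = 1 %to &n;
         if period = "%scan(&plist, &i)" then output base_%scan(&plist, &i);
      %end;
   run;
%mend split;
%split
```

The macro just generates one data step with an OUTPUT branch per period, so the 200 GB table is read only once during the split.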
Once you have your small chunks, sorting them one by one will be a breeze, compared to handling the whole beast.
What is the physical size of your dataset, anyway?
05-18-2016 09:22 AM
In my opinion, I'd like a reasonable price and a good solution, but my client wants a fast solution, for sure...
I talked with other SAS specialists in Brazil and they suggested breaking this giant base into others (by period, for example), or asking (and pushing) my client's database area to create these smaller bases (if they don't already exist) so that I receive the extractions. What do you think about this?
The physical size of my base (with 187 million records) is 200 GB, without using compress=binary.
Again, thanks for your contribution!
05-18-2016 09:53 AM
If CPF is anything like social security number, it would be character and contain a set of digits. (If not, you can always adapt this suggestion.) Have you considered creating your own version of parallel processing? Split up the 187M records according to the first character in CPF. Create 10 data sets, each with 19M records where the first has all CPFs that begin with "0", the second has all CPFs that begin with "1", etc. When a request comes in, split it similarly, generating 10 requests. Then match each of the 10 requests to the matching 19M data set. With some work, this might also let you run the 10 requests simultaneously rather than sequentially. Of course, you have to aggregate the 10 results later ... the most difficult part of that might be determining when all 10 requests are finished.
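A sketch of that split, assuming the big dataset is called bigbase (a hypothetical name) and that CPF is stored as character, as the suggestion above assumes; the comment shows how to adapt it for a numeric CPF:

```sas
/* Split 187M rows into 10 chunks by the first character of CPF.
   Assumes CPF is character; for a numeric CPF, use
   substr(put(CPF, z11.), 1, 1) instead. */
data cpf0 cpf1 cpf2 cpf3 cpf4 cpf5 cpf6 cpf7 cpf8 cpf9;
   set bigbase;
   select (substr(CPF, 1, 1));
      when ('0') output cpf0;
      when ('1') output cpf1;
      when ('2') output cpf2;
      when ('3') output cpf3;
      when ('4') output cpf4;
      when ('5') output cpf5;
      when ('6') output cpf6;
      when ('7') output cpf7;
      when ('8') output cpf8;
      when ('9') output cpf9;
      otherwise;
   end;
run;
```

Each incoming request can then be split the same way, and each of the ten pieces matched only against its corresponding ~19M-row chunk.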
05-18-2016 10:46 AM
Your suggestion is very relevant, but my problem involves the time to process this sort (to separate the dataset) versus the time in which my client expects to start creating reports and graphs.
CPF is an ID for each Brazilian citizen; the field does not contain characters, only numbers.
Talking with my client, we are considering some options:
1 - Bring the data to a database (we don't know which database yet), extract the 30,000 records from the 187M there, and receive that extraction; or,
2 - Check with the architecture team whether extracting the 30,000 records from the 187M directly on the mainframe would be faster than extracting from a database; or,
3 - Today, for a security reason, I cannot connect SAS to a database. If that becomes possible, I can try to extract the 30,000 from the 187M before the join with the other base of the same 30,000, but I don't yet know whether running that select in SAS would also take a long time.
05-18-2016 10:38 AM
Is this a SAS dataset? Have you thought of using the SPD engine? Index performance could very well benefit, especially if you have multiple CPUs.
Also, it sounds like you are afraid performance will be bad but have not tested this yet to be sure. I suggest you do that, preferably together with the users; you may be surprised. I find that SAS datasets in general perform remarkably well. Also, some people might no longer consider 187 million rows giant ;-) Modern hardware and a well-designed storage system will make short work of these amounts of data.
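As a hedged sketch of what that could look like, an SPD Engine library can be declared as below; all paths, the library name, and the partition size are placeholders to adapt:

```sas
/* SPD Engine library: metadata in one path, data and index
   partitions spread over several paths/disks for parallel I/O */
libname speedy spde '/sasdata/meta'
   datapath=('/sasdata/data1' '/sasdata/data2')
   indexpath=('/sasdata/idx1')
   partsize=512m;

/* Copying the big table into it partitions it automatically,
   and the index is built with the engine's threaded support */
data speedy.bigbase (index=(CPF));
   set work.bigbase;
run;
```

Spreading DATAPATH= locations over separate physical disks is what lets the engine read partitions in parallel.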
Do keep in mind that a compound index will still be used if you query a subset of its variables starting from the left. So if the index is (a,b,c), it will be used when only (a,b) are involved, but not in the case of (b) or (b,c).
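To illustrate that leading-subset rule with a hypothetical table big and a compound index on (a, b, c):

```sas
/* Create a compound index named abc on (a, b, c) */
proc datasets library=work nolist;
   modify big;
   index create abc = (a b c);
quit;

/* WHERE on a leading subset (a, b): the index can be used */
data use_idx;
   set big;
   where a = 1 and b = 2;
run;

/* WHERE on (b, c) only: not a leading subset, index is not used */
data no_idx;
   set big;
   where b = 2 and c = 3;
run;
```

Whether SAS actually chose the index for a given WHERE clause can be checked in the log with `options msglevel=i;`.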
Hope this helps,
05-18-2016 11:06 AM
I agree with you; my project is still only a pilot, so we don't yet have an appropriate server for this. But I agree that this base may not be so big in front of good hardware. That may become a reality for me in the coming months, but for now it is only a pilot. The problem is that in this pilot the client wanted to do a big join. As I said in my previous post, we decided to present three suggestions, and my client's architecture team (Bradesco Bank) will tell us what they can do in this case, for example trying to run this select directly on the mainframe or in a database.
Thanks for your help! Good morning!
05-18-2016 11:48 AM
05-18-2016 01:49 PM
I forgot to use these commands; I'll try them in my tests.
About the SPD server: maybe we'll have one in the coming months, because at the moment this is only a pilot project; that's the reason we still have only a small amount of space.
Many thanks for your ideas. Good afternoon!