07-22-2013 09:50 AM
If I have a loop code, that says the following:
if (mod(_n_,100)=0 then count_loops = (int(_n_/100));
else count_loops = sum(1, int(_n_/100));
call symput ('loops needed', count_loops); run;
My concern with this is it looks at 100 records at a time until it is complete. Most of the projects we work on are based on pulling data from our claims system that is for an entire year. We process approx. 286-300 million claims per year. I am reluctant to do a loop that looks at 100 records at a time because that years worth of data might result in over 8 millions records and reading 100 at a time I would think would take forever. If I change the loop to look at 1 million records at a time and loop, will that cause my PC to crash, or my server? When I query our claims system, I do it remotely. I telecommuter and connect through VPN. So, another concern is the system might timeout. Not VPN but on the server side. I guess another concern is if I am looping at 1 million at a time, since it is through VPN, would it be slower than molasses?
07-22-2013 11:00 AM
There is actually a lot of extra work going on here. For every observation in the data set, this DATA step computes COUNT_LOOPS, and executes CALL SYMPUT. There is no concept of 100 observations at a time in this logic. At a bare minimum, this replacement would be possible:
set pc_temp end=done;
if done then call symputx('loops_needed', ceil(_n_/100));
That's all the program is doing ... calculating how many loops would be needed to process 100 observations at a time in a later program. Each CALL SYMPUT replaces the value assigned by the previous observation and the DATA step processes all 300M observations.
In most cases, there will be even quicker ways. Extracting the number of observations in a SAS data set is often speedy. With 300M observations, though, it is possible that your data is stored on tape rather than disk and that the speedier methods won't work. If the data is on disk, you could use:
call symput('loops_needed', ceil(_nobs_/100));
set pc_temp nobs=_nobs_;
There are complications however. If it is possible that observations have been marked for deletion but not physically removed from the data set, the count will be off. There are ways to compensate, if you need to, extracting the NLOBS attribute of the data set. But that's getting into problems that you may not need to face, so let's cut it short for the moment.
07-22-2013 11:10 AM
This is just a snippet of the entire code. The entire code is 19 1/2 pages. Basically, the analyst that created the code is hitting 1 table in our claims system at a time because when he completes the joins based our our DD, his results are duplicated. This is not because there are duplicates in our tables. It is because he has coded something incorrectly. So, he spent 2 months building this 19 1/2 page (so far) code that hits 1 table at a time. He has a total of 11 tables in his code he is hitting one at a time and uses macros and a loop to look at 100 records at a time and it continues until it has looked at everything. Most our projects look at 1 years worth of data, so he is looping through millions of records. Using a loop of 100 would take forever. Sure, we can increase and say look at 1 million at a time but still that would take forever when you are querying one table at a time.
07-23-2013 01:52 AM
Whether it is efficient or not depends on how he is doing the access. If the program is using FIRSTOBS or POINT to directly access individual records from a SAS dataset then it would not necessarily be any more inefficient (in terms of run time) than processing the file in one step.
But given the program that is setting the macro variable I doubt that is the case.
Why not just determine the source of the duplication?
If you are processing medical claims data then there is a much higher probability that there are natural duplicates in the data. Some one might fill two identical pharmacy scripts for the same drug, dose, days supply on the exact same date. Some one might enter two orders for a single unit of plasma rather than create one order with two units.
07-23-2013 10:32 AM
There are always a lot of very frustrating situations coming up regardless of what job you do. However, as a friendly advice, you should probably delete your last post and go take a coffee or a beer with a friend to cooldown. I haven't read SAS forums guidelines but I suspect this is out of the line of conduct but far beyond code of conduct is the inherent risk of a co-worker to identify you and the manager you are talking about.
As for the risk of losing your job to your upcoming new boss, while you may have started the relationship on the wrong track, if you can control your frustrations, it is often easier to make them change their decision with a calm rationale. In your reply, the way you've approached the whole "tell me where you have duplicates from your query and I will tell you why based on our DD" should be quite eye opening to a less DB knowledgeable manager and with some iterations over time and little teaching, I'm sure he will adopt your way.
07-22-2013 11:01 AM
The SET statement is an implied loop and to do the count you show you've already read through the entire data set anyway. I would look at the dictionary table to determine the number of records in the dataset instead of this approach.
It might be a good idea to explain what you want to do as there are many things to limit numbers of records and variables to improve effeciencies.
07-22-2013 01:17 PM
An even larger concern is that this snippet's loop is running through an entire table to effectively only find the largest _N_ and divide that by 100 to set a macro variable's value thats going to be used further. This is incredibly ineficient. Breaking down a larger job in sets of X records can be worthwhile but the above is not an appropriate way to figure out how many subsets of 100 records you will have to run through.
As for crashing your computer going from 100 to 1M records - SAS processes a single record at a time by default as it is built to handle larger jobs than whatever ram you have availible. So unless throughout your code, the breakdown by 100 records is doing some absurd manual merging with hash tables or arrays (both fully run in memory), 1M records shouldn't cause any memory crash issues.