Re: PROC SQL UPDATE taking forever to run

gdaymte · Posted 12-14-2015 11:45 AM

I am trying to update a master table shell with some demographics. The code I am using worked on a dataset of 32 million records, but now I am having to deal with a dataset that has almost 100 million records. I had to stop the code after 21 days of it running. If anyone has any suggestions about how I can improve the processing time, I am all ears. Someone suggested removing the select statements and replacing with inner joins, but I can't seem to get that working or put in the proper order. Here is the original code I was using. It's nothing fancy.

PROC SQL;
	UPDATE T.MASTER_PATIENT_RX A
		SET
			PERSONID=(SELECT PERSONID FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),	
			PT_LASTNAME=(SELECT (UPCASE(LastName)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_FIRSTNAME=(SELECT (UPCASE(FirstName)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_MIDDLENAME=(SELECT (UPCASE(MiddleName)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_DOB=(SELECT DOB FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_AGEYR=(
				SELECT 
					(
					CASE
						WHEN DOB NE . AND A.RX_DTFILLED NE . THEN FLOOR((INTCK('MONTH',DOB,RX_DTFILLED)-(DAY(RX_DTFILLED)<DAY(DOB)))/12)
						ELSE .
					END
					) 
				FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_AGEMO=(
				SELECT 
					(
					CASE
						WHEN PT_AGEYR EQ 0 AND DOB NE . AND A.RX_DTFILLED NE . THEN FLOOR((INTCK('MONTH',DOB,RX_DTFILLED)-(DAY(RX_DTFILLED)<DAY(DOB))))
						ELSE .
					END
					) 
				FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_GENDER=(SELECT PT_GENDER FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_GENDERID=(SELECT GenderID FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_SPECIES=(
				SELECT
					(
					CASE
						WHEN SpeciesCode=1 THEN 'HUMAN'
						WHEN SpeciesCode=2 THEN 'VETERINARY'
						ELSE 'UNKNOWN'
					END
					)
				FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_SPECIESID=(SELECT SpeciesCode FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_ADDR1=(SELECT (UPCASE(Street)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_ADDR2=(SELECT (UPCASE(Street2)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_CITY=(SELECT (UPCASE(City)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_STATE=(SELECT (UPCASE((SUBSTR(State,1,2)))) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_ZIP=(SELECT (SUBSTR(Zip,1,5)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_COUNTYFIPS=(SELECT CountyFIPS FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD)
		WHERE PATIENTID IN (SELECT PatientiD FROM ZW.MASTER_PATIENT)
		;
QUIT;

Reeza · Posted 12-14-2015 11:56 AM

Is your data on a server? Can you use Pass Through SQL? If so, a native SQL query in that language may be faster.

gdaymte · Posted 12-14-2015 11:58 AM

The data is on an MS SQL Server, but the folks that own the data will not let us have access to it. I've asked for that, but was denied. Instead they are giving us copies of the tables in SAS.

Reeza · Posted 12-14-2015 12:05 PM

Ok, Process changes: I would suggest they give you read access to the server. You can read tables but not change them. If this is a regular process I would also suggest you ask if they can do it for you.

It looks like you're only joining to one table so you may want to consider a left join, but that would create a new table. Another option is a bunch of formats, but I'm not sure that would be any more efficient in the long run.

ballardw · Posted 12-14-2015 12:15 PM

If this were my project, working on a local copy of data and there is only one patientid in the master record set, I would be very tempted to 1) rename variables in the transaction data set to match the master data, 2) do the transformations in the transaction data before merging and 3) use as data set update, which does require sorting the data by the same variables used in a by statement to align things..

I have a sneaking suspicion that having 19 select -where clauses (especially since they are essentially the same clause) can't be very efficient.

Have you done anything like look at system memory and/or cpu usage while this is running?

Haikuo · Posted 12-14-2015 08:50 PM

Echo with @ballardw, SQL runs faster when it is simple, complex code confuses its compiler thus weaken its optimizer. Break it down to 19+ separated procedures, you will be surprised how fast it goes.

Ksharp · Posted 12-15-2015 04:56 AM

For big table. SQL is not efficient . Consider about using Hash Table ?

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!