BookmarkSubscribeRSS Feed
gdaymte
Obsidian | Level 7

I am trying to update a master table shell with some demographics. The code I am using worked on a dataset of 32 million records, but now I am having to deal with a dataset that has almost 100 million records. I had to stop the code after 21 days of it running. If anyone has any suggestions about how I can improve the processing time, I am all ears. Someone suggested removing the select statements and replacing with inner joins, but I can't seem to get that working or put in the proper order. Here is the original code I was using. It's nothing fancy.

 

PROC SQL;
	UPDATE T.MASTER_PATIENT_RX A
		SET
			PERSONID=(SELECT PERSONID FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),	
			PT_LASTNAME=(SELECT (UPCASE(LastName)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_FIRSTNAME=(SELECT (UPCASE(FirstName)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_MIDDLENAME=(SELECT (UPCASE(MiddleName)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_DOB=(SELECT DOB FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_AGEYR=(
				SELECT 
					(
					CASE
						WHEN DOB NE . AND A.RX_DTFILLED NE . THEN FLOOR((INTCK('MONTH',DOB,RX_DTFILLED)-(DAY(RX_DTFILLED)<DAY(DOB)))/12)
						ELSE .
					END
					) 
				FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_AGEMO=(
				SELECT 
					(
					CASE
						WHEN PT_AGEYR EQ 0 AND DOB NE . AND A.RX_DTFILLED NE . THEN FLOOR((INTCK('MONTH',DOB,RX_DTFILLED)-(DAY(RX_DTFILLED)<DAY(DOB))))
						ELSE .
					END
					) 
				FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_GENDER=(SELECT PT_GENDER FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_GENDERID=(SELECT GenderID FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_SPECIES=(
				SELECT
					(
					CASE
						WHEN SpeciesCode=1 THEN 'HUMAN'
						WHEN SpeciesCode=2 THEN 'VETERINARY'
						ELSE 'UNKNOWN'
					END
					)
				FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_SPECIESID=(SELECT SpeciesCode FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_ADDR1=(SELECT (UPCASE(Street)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_ADDR2=(SELECT (UPCASE(Street2)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_CITY=(SELECT (UPCASE(City)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_STATE=(SELECT (UPCASE((SUBSTR(State,1,2)))) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_ZIP=(SELECT (SUBSTR(Zip,1,5)) FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD),
			PT_COUNTYFIPS=(SELECT CountyFIPS FROM ZW.MASTER_PATIENT AS B WHERE A.PATIENTID=B.PatientiD)
		WHERE PATIENTID IN (SELECT PatientiD FROM ZW.MASTER_PATIENT)
		;
QUIT;
6 REPLIES 6
Reeza
Super User
Is your data on a server? Can you use Pass Through SQL? If so, a native SQL query in that language may be faster.
gdaymte
Obsidian | Level 7

The data is on an MS SQL Server, but the folks that own the data will not let us have access to it. I've asked for that, but was denied. Instead they are giving us copies of the tables in SAS.

Reeza
Super User
Ok, Process changes: I would suggest they give you read access to the server. You can read tables but not change them. If this is a regular process I would also suggest you ask if they can do it for you.

It looks like you're only joining to one table so you may want to consider a left join, but that would create a new table. Another option is a bunch of formats, but I'm not sure that would be any more efficient in the long run.

ballardw
Super User

If this were my project, working on a local copy of data and  there is only one patientid in the master record set, I would be very tempted to 1) rename variables in the transaction data set to match the master data, 2) do the transformations in the transaction data before merging and 3) use as data set update, which does require sorting the data by the same variables used in a by statement to align things..

 

I have a sneaking suspicion that having 19 select -where clauses (especially since they are essentially the same clause) can't be very efficient.

 

Have you done anything like look at system memory and/or cpu usage while this is running?

 

Haikuo
Onyx | Level 15

Echo with @ballardw, SQL runs faster when it is simple, complex code confuses its compiler thus weaken its optimizer. Break it down to 19+ separated procedures, you will be surprised how fast it goes.

Ksharp
Super User

For big table. SQL is not efficient .  Consider about using Hash Table ?

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1788 views
  • 1 like
  • 5 in conversation