
I noticed that PROC SQL UPDATE has a big problem, and I do not understand why.

If I want to use one dataset to amend another, PROC SQL seems to run endlessly when the dataset to be changed is very big. The same thing does not happen when the dataset is small.

The example I am bringing is as follows:

/* Create the dataset to be modified ('zipcode', 2,088,050 rows), made big through a loop: */

%macro append_ds();

data Zipcode;
  set sashelp.Zipcode(obs=0);
run;

%do i=1 %to 50;
  %put iteration number=&i;
  data zipcode;
    set zipcode sashelp.Zipcode;
  run;
%end;

data zipcode;
  set zipcode;
  format zip z5.;
  zip=_n_;
run;

%mend append_ds;

%append_ds;

/* Dataset with the changes to be made (10 rows): */

data zipcode_mod;
  set zipcode (firstobs=1000 obs=1009);
  keep zip alias_city;
  alias_city = "XXXXXXXX";
run;

/* PROC SQL that, for me, seems to run endlessly: */

proc sql;
  update zipcode as A
  set alias_city = (select alias_city
                    from zipcode_mod as n
                    where A.zip=n.zip)
  where A.zip in (select zip from zipcode_Mod);
quit;


12 REPLIES
Doc_Duke
Rhodochrosite | Level 12

Mario,

 

I think that this is behaving as designed.  When you have a small dataset, SAS does the same looping; it is just doing it in memory.  When you have a large data set, it has to page the data in and out of memory.  That, obviously, takes much longer to get done.

 

You may want to look at making your update code more efficient.  I suspect that adding an index might help.
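For example, something along these lines might be worth trying (a sketch only, using the WORK.ZIPCODE dataset from the example above; whether it actually helps here would need testing):

proc sql;
  create index zip on work.zipcode(zip);   /* simple index on the lookup column */
quit;

/* ...run the UPDATE, then drop the index if it is no longer needed... */
proc sql;
  drop index zip from work.zipcode;
quit;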

 

Doc Muhlbaier

Duke

mariopellegrini
Quartz | Level 8

Thanks Doc_Duke.
In fact I tried another solution that gets the best result by using an index, namely:

 

/* create an index on ZIP */
Proc Datasets library=work nolist;
  modify zipcode;
  index create zip / nomiss;
quit;

data zipcode;
  set zipcode_mod (rename=(alias_city=alias_city_B));
  do until (_iorc_=%sysrc(_dsenom)); /* loop */
    modify zipcode key=zip;
    select (_iorc_);
      when (%sysrc(_sok)) do;
        alias_city = alias_city_B;
        replace;
      end;
      when (%sysrc(_dsenom)) do;
        _error_ = 0;
      end;
      otherwise do;
        stop;
      end;
    end;
  end;
run;

/* delete the index */
Proc Datasets library=work nolist;
  modify zipcode;
  index delete zip;
quit;

 

I just wanted to better understand the potential of PROC SQL, and I do not understand why, even though the update is restricted to only the rows that need it (in the WHERE condition I use: where A.zip in (select zip from zipcode_Mod)), the processing times are very high, seemingly endless...

RW9
Diamond | Level 26

TBH I think you would be better off with a join in this instance:

proc sql;
  create table WANT as
  select A.*,
         coalesce(B.ALIAS_CITY, A.ALIAS_CITY) as NEW_ALIAS_CITY
  from WORK.ZIP_CODE A
  left join WORK.ZIP_CODE_MOD B
  on A.ZIP = B.ZIP;
quit;

 

Whilst the above puts the output in a new variable, I have only done this so I don't have to type out each variable in A 🐵

The COALESCE does the main work here: if B's value is missing it uses A's, otherwise it uses B's.

 

mariopellegrini
Quartz | Level 8

Thanks RW9.
The same considerations I gave to Doc_Duke apply here as well.

Tom
Super User

Why make updates at all?  SAS is not really a transactional system.  Why not just think in terms of creating a new dataset based on the input datasets?

 

If you really do want to treat SAS as a transactional system and update data in place, then look at using the MODIFY statement instead of SQL.
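A minimal sketch of the transaction form of MODIFY (master plus transaction dataset with a BY statement), using the dataset names from this thread; it assumes ZIP uniquely identifies observations in the master, as in the example, and note that by default missing values in the transaction dataset do not overwrite the master:

data zipcode;                  /* the DATA statement names the master that is modified in place */
  modify zipcode zipcode_mod;  /* master first, then the transaction dataset */
  by zip;                      /* matching observations are located by ZIP */
run;                           /* matched observations are rewritten in place */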

hbi
Quartz | Level 8 (Accepted Solution)

I did a few tests and Tests 1 and 2 did not help or ended up hurting performance.

 

Test 3 performed very well since it used a macro variable list in place of the subquery in the "IN" clause.

 

Test 4 (though it is not PROC SQL) performs very well.

 

 

/* TEST 1: add distinct - comparable performance to your original query */
proc sql;
  update zipcode as A
  set alias_city = (select alias_city 
                    from zipcode_mod as n
                    where A.zip=n.zip)
  where A.zip in (select DISTINCT zip from zipcode_Mod);
quit;


/* TEST 2: change to "exists" statement - performs poorly */
proc sql;
  update zipcode as A
  set alias_city = (select alias_city 
                    from zipcode_mod as n
                    where A.zip=n.zip)
  where exists (select 1
                from zipcode_Mod as nn
                where A.zip = nn.zip);
quit;


/* TEST 3: performs well - change "IN" statement to a macro variable */
proc sql;  
  SELECT zip INTO :list_of_zip SEPARATED BY ", "
  from zipcode_mod; 

  update zipcode as A
  set alias_city = (select alias_city 
                    from zipcode_mod as n
                    where A.zip=n.zip)
  where A.zip in (&list_of_zip);
quit;
%put &list_of_zip;


/* TEST 4: performs well - uses a SAS Hash Object */
DATA zipcode;
  SET zipcode;

  /* load the lookup table into a hash keyed on ZIP on the first iteration */
  IF _N_ = 1 THEN DO;
    DECLARE HASH T(dataset: 'WORK.zipcode_mod');
    T.definekey('zip');
    T.definedata('alias_city');
    T.definedone();
  END;

  /* FIND looks up ZIP; on a match it returns 0 and overwrites ALIAS_CITY with
     the value from the hash; the subsetting IF keeps the observation whenever
     the return code is non-negative */
  IF T.find() >= 0;

RUN;

 

 

 

mariopellegrini
Quartz | Level 8

Thanks hbi.
Your test 3 made the problem clear to me. I think it is a limitation of the PROC SQL WHERE condition: when it is given a ready-made list of values, processing times drop drastically; otherwise, when it has to evaluate a subquery, PROC SQL does not respond well. In your opinion, why?
It is also useful to know that using the hash object (test 4) is always a good option.

hbi
Quartz | Level 8

I believe that the processing time is high due to zipcode_mod being referenced twice in the same update statement.

 

In your example, it is being used in both the "set" clause (to return a value) and again in the "where" clause (to limit the number of observations that are updated). I do not know the exact impact (in terms of processing time), but it may differ from, let's say, referencing the same dataset twice in a set clause, or referencing the same dataset twice in a where clause.

 

Having said that, ensuring that zipcode_mod is used only once in the update statement may be worth a shot; create a copy of it and see if it reduces processing time. I think test 3 would still likely be a clear winner over test 5.

 

/* TEST 5: create a copy of zipcode_mod and call it zipcode_mod_copy */
proc sql;
  create table zipcode_mod_copy as
  select * from zipcode_mod;

  update zipcode as A
  set alias_city = (select alias_city
                    from zipcode_mod as n
                    where A.zip=n.zip)
  where A.zip in (select zip from zipcode_mod_copy);
quit;

 

I did not mention this in my previous post, but beware that the macro variable created in test 3 should not exceed 1,000 data values, since that is supposed to be the upper limit of SQL "in" clauses.

 

The version of SAS that I am using allows me to exceed 1,000 data values, but SQL standards generally prohibit list items that exceed the 1,000 cap. 

 

proc sql;
  SELECT zip        INTO :list_of_zip SEPARATED BY ", " FROM zipcode_mod;
  SELECT count(zip) INTO :count_zip FROM zipcode_mod;
quit;

data _null_;
  if &count_zip > 1000 then abort abend;
run;

 

mariopellegrini
Quartz | Level 8

Perfect, thanks hbi. So:
either a macro variable list (when there are at most a thousand observations) or a copy of the dataset can help avoid the double extraction of data from zipcode_Mod.
It remains a strange thing; I think this problem is limited to cases where very few changes are made relative to a very large dataset.

Tom
Super User

I would say the problem is attempting to use PROC SQL for something it is not good at.  If you really want to do in-place modification of datasets, then use data step code, as you will have control over what happens.

 

I find that in-place modification of tables is very rarely the most logical and straightforward way to describe or code a problem. Just combine the old table with the new information and make a new table that reflects the changes. Only in extreme cases would I resort to using the UPDATE statement or other in-place table manipulations.
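As a rough sketch of that approach, using the dataset names from this thread (it assumes ZIP is unique in the master and that both inputs are in ZIP order; here the master already is, since ZIP was assigned from _N_):

proc sort data=zipcode_mod;    /* the transaction dataset must be in ZIP order */
  by zip;
run;

data zipcode_new;              /* a new table; the original is left untouched */
  update zipcode zipcode_mod;  /* master first, then the transactions */
  by zip;                      /* by default, missing transaction values do not overwrite the master */
run;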

 

 

hbi
Quartz | Level 8

I think one of the themes of the back-and-forth discussion (among community members) is that SAS is not a substitute for an OLTP (online transaction processing) system. If you have a desire to have PROC SQL behave like an OLTP system, then it might be time to migrate your datasets to an RDBMS like Oracle or Teradata.

Though SAS datasets superficially look like RDBMS tables, under the hood SAS's treatment of them is very different. (Note that PROC SQL is a much later incarnation of what had been almost the exclusive domain of data step programming for eons.) PROC SQL, probably for good reason, does not provide transactional database commands such as "begin trans", "commit trans", "rollback", etc., nor does it log transactions in a separate (DB transaction log) file. Table locking and record locking are handled very differently too.

Additionally, parameters such as "fill factor", "percent free", etc., which are commonplace in the RDBMS world, do not have an analogue with SAS datasets.

A PROC SQL update must operate within these confines and still stay true to the fact that it is manipulating a bona fide dataset rather than a table. The lack of empty space between records means that lengthening an updated observation [let's say a given field goes from varchar(1) to varchar(20)] likely requires (though don't quote me on this) that everything that follows be shifted downwards so that each observation's position remains unchanged.

mariopellegrini
Quartz | Level 8

I noticed that the problem remains even when the number of rows to be changed increases.
For example, starting from the following modification dataset:

 

data zipcode_mod;
  set zipcode (firstobs=1 obs=100000);
  keep zip alias_city;
  alias_city = "XXXXXXXXXX";
run;
