Hello,
I am updating huge table using update statement in proc sql.
Its taking very long time.
I need to just blank/null few char/num fields in the tables based of the dataset that provides the fields to be blanked.
Is there an efficient way to deal with this?
Thanks,
Mushy
This summarizes the results of the original code and responses received as of this date when executed against this table:
Data Set Name | LIBA.HAVE | Observations | 500000 |
---|---|---|---|
Member Type | DATA | Variables | 50 |
Engine | V9 | Indexes | 1 |
Created | 04/29/2024 09:02:45 | Observation Length | 400 |
Last Modified | 04/29/2024 09:02:45 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Original Code:
proc sql;
update liba.have set country1='', country3='', pop1=., pop2=., pop1_date=.;
quit;
Results:
NOTE: PROCEDURE SQL used (Total process time): real time 0.61 seconds cpu time 0.23 seconds
So, .61 seconds was the time to beat!
@Tom's solution
data liba.have(index=(ID));
set liba.have;
call missing(of country1 country3 pop1 pop2 pop1_date);
run;
and @Ksharp's solution
proc sql;
alter table have
drop country1, country3,pop1,pop2,pop1_date
add country1 char(20), country3 char(3),pop1 num,pop2 num,pop1_date num;
quit;
tied for first place
real time 0.40 seconds cpu time 0.20 seconds
With my solution:
data liba.have ;
modify liba.have;
call missing(country1, country3,pop1,pop2,pop1_date);
run;
coming in third,
real time 0.44 seconds cpu time 0.32 seconds
and data_null__'s solution:
data liba.have (index=(ID));
if 0 then set liba.have;
set liba.have (drop=country1 country3 pop1 pop2 pop1_date);
run;
still beating the original post:
real time 0.58 seconds cpu time 0.26 seconds
Of course, your mileage may vary depending on the size of your data.
To enable others to help more effectively, please share your SQL code, the log, and a sample of the table data (or at least a good description of the columns you are working with and the number of rows involved). Is your data in base SAS or in some external database?
I am working with SAS datasets:
In general for my case do you know a better and efficient way to update columns other than proc sql update statement?
The query I am using is:
Proc sql;
update liba.geography set country1='', country3='', pop1=., pop2=., pop1_date=.;
quit;
it is a table with 10m + records and 40+ fields.
real time 11:49.60
user cpu time 11:25.46
Why do you need to update an existing dataset?
In general with SAS datasets I never use SQL UPDATE. Instead I just create a program to MAKE the new dataset I need.
Your example code:
Proc sql;
update liba.geography set country1='', country3='', pop1=., pop2=., pop1_date=.;
quit;
I would normally run as :
data liba.geography;
set liba.geography;
call missing(of country1 country3 pop1 pop2 pop1_date);
run;
@Tom Indeed its faster , but we lose index after the set.
@Mushy wrote:
@Tom Indeed its faster , but we lose index after the set.
And why does the index matter?
If it is important then you can simple create the index also, easiest is to just include its definition in the DATA statement.
data new(index=(.....));
set old;
...
run;
Or if speed is an issue it is frequently much faster to make the dataset first and then create the index later. You can use PROC DATASETS or PROC SQL to create an index on an existing dataset.
But I am still curious what application is that requires you to update an existing dataset.
You could use a DATA step with a MODIFY statement:
data liba.geography ;
modify liba.geography;
call missing(country1, country3,pop1,pop2,pop1_date);
run;
That should run faster for you. If you are nulling values for variables involved in the index, your index isn't going to be worth much after the update anyway. In that case, @Tom's advice is best all-around: re-create the data set, do any additional data manipulation, and then rebuild the index in a separate step.
You could skip reassignment by dropping
data class(index=(name/unique));
if 0 then set sashelp.class;
set sashelp.class(drop=age weight);
run;
@Mushy wrote:
Hello,
I am updating huge table using update statement in proc sql.
Its taking very long time.
I need to just blank/null few char/num fields in the tables based of the dataset that provides the fields to be blanked.
Is there an efficient way to deal with this?
Thanks,
Mushy
You could drop these variables firstly, after that adding these variables in this table .
Here is an example:
data have;
set sashelp.class;
run;
proc sql;
alter table have
drop name,age
add name char(20),age num;
quit;
This summarizes the results of the original code and responses received as of this date when executed against this table:
Data Set Name | LIBA.HAVE | Observations | 500000 |
---|---|---|---|
Member Type | DATA | Variables | 50 |
Engine | V9 | Indexes | 1 |
Created | 04/29/2024 09:02:45 | Observation Length | 400 |
Last Modified | 04/29/2024 09:02:45 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Original Code:
proc sql;
update liba.have set country1='', country3='', pop1=., pop2=., pop1_date=.;
quit;
Results:
NOTE: PROCEDURE SQL used (Total process time): real time 0.61 seconds cpu time 0.23 seconds
So, .61 seconds was the time to beat!
@Tom's solution
data liba.have(index=(ID));
set liba.have;
call missing(of country1 country3 pop1 pop2 pop1_date);
run;
and @Ksharp's solution
proc sql;
alter table have
drop country1, country3,pop1,pop2,pop1_date
add country1 char(20), country3 char(3),pop1 num,pop2 num,pop1_date num;
quit;
tied for first place
real time 0.40 seconds cpu time 0.20 seconds
With my solution:
data liba.have ;
modify liba.have;
call missing(country1, country3,pop1,pop2,pop1_date);
run;
coming in third,
real time 0.44 seconds cpu time 0.32 seconds
and data_null__'s solution:
data liba.have (index=(ID));
if 0 then set liba.have;
set liba.have (drop=country1 country3 pop1 pop2 pop1_date);
run;
still beating the original post:
real time 0.58 seconds cpu time 0.26 seconds
Of course, your mileage may vary depending on the size of your data.
@SASJedi Thanks for the comparision and helping out with the efficient way to update. Indeed modify satatement is the better option.
Thanks
Mushy
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.