I am trying to read from a Hive database that happens to have a single column consisting of 60k characters. Since SAS supports at most 32,767 characters per variable, as per forum advice, I split the column into 3 different columns using a pass-through SQL statement. Then, in the next data step, I concatenate these 3 columns and process them. However, my output table is always blank.
What the script does is read GPS coordinates from one column in Hive; my SAS script then splits the coordinates into rows, with one LNG/LAT pair per row tied to one specific location.
What could be wrong here? Why is it always blank?
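For readers following along, here is the intended transformation sketched in Python (not the poster's SAS; the geometry format and the divide-by-1000 step are taken from the code and samples later in this thread, and `geometry_to_rows` is a hypothetical helper name):

```python
# Python illustration of the intended result: one long Esri-style "rings"
# geometry string in, one LNG/LAT pair per output row.

def geometry_to_rows(area, geometry):
    """Return (area, seq, lng, lat) tuples, one per coordinate pair."""
    rows = []
    seq = 0
    # Pairs look like [794533591,649756392]; split on '[' and strip decoration
    for token in geometry.split('['):
        token = token.strip('],{}" ')
        if ',' not in token or not token[0].isdigit():
            continue
        lng_s, lat_s = token.split(',')[:2]
        seq += 1
        # The thread's SAS code divides by 1000 to recover decimal coordinates
        rows.append((area, seq, int(lng_s) / 1000, int(lat_s) / 1000))
    return rows

sample = '{"rings":[[[794533591,649756392],[794460411,643714587]]]}'
for row in geometry_to_rows('6224 - Bagan Serai', sample):
    print(row)
```

The whole thread is about getting this per-pair output when the input string is longer than the SAS 32,767-byte variable limit.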
My Code:
proc sql;
connect to hadoop as mydb
(server=<removed>
database=<removed>
);
create table a as
select * from connection to mydb(
select cgis_area
,substr(geometry,1,30000) as geometry1
,substr(geometry,30001,30000) as geometry2
,substr(geometry,60001,30000) as geometry3
from GIS where cgis_area = '6224 - Bagan Serai' );
disconnect from mydb;
quit;
proc sort data=ncpdm.ncp_dim_ba
(keep=region zone state_code state business_area_code business_area gis_ba)
out=business_area nodupkeys;
by gis_ba business_area_code;
run;
data business_area;
set business_area;
by gis_ba business_area_code;
if first.business_area_code;
run;
proc sort data=ncpdm.ncp_dim_ba
(keep=region zone state_code state business_area_code business_area gis_ba)
out=business_area nodupkeys;
by gis_ba business_area_code;
run;
data business_area;
set business_area;
by gis_ba business_area_code;
if first.business_area_code;
run;
data ncp_dim_ba_geo (drop=excp_code cgis_area)
work.excp_ncp_dim_ba_geo;
length excp_code $50;
length gis_ba dotpos $50 zone $2 state_code $5
region state business_area_code $30 business_area $50 xval yval 8;
if _n_=1 then do;
declare hash h(dataset:"business_area");
h.definekey('gis_ba');
h.definedata('region','zone','state_code','state','business_area_code','business_area');
h.definedone();
call missing(region,zone,state_code,state,business_area_code,business_area);
end;
set a;
array geo {3} geometry1-geometry3;
** CREATE AND POPULATE VARCHAR FIELD **;
length longgeo varchar(160000);
do i = 1 to 3;
/* longgeo = cat(longgeo,geo(i));*/
longgeo = cat(geo(i));
end;
/* longgeo=cat(geo1,geo2,geo3);*/
/* set hpsncp.GIS (keep=cgis_area geometry cgis_label*/
/* dbsastype=(cgis_area='CHAR(50)' geometry='CHAR(32767)' cgis_label='CHAR(50)'));*/
gis_ba=upcase(cgis_area);
rc2=h.find();
if (rc2 ne 0) then do;
excp_code='Exception: Unable to map business area';
output work.excp_ncp_dim_ba_geo;
end;
if (rc2=0);
longgeo=substr(longgeo,15);
seq=1;
do until(dotpos eq '');
dotpos=compress(scan(longgeo, seq, '[]'),',','nk');
if dotpos not in ('' '{' '}' '[' ']' ',') then do;
xval=input(put(scan(dotpos,1,','),20.),20.)/1000;
yval=input(put(scan(dotpos,2,','),20.),20.)/1000;
if xval ne . and yval ne . then do;
output ncp_dim_ba_geo;
end;
end;
seq+1;
end;
keep excp_code cgis_area region zone state_code state business_area_code business_area seq xval yval;
run;
Am I doing something wrong? Mind you that before changing to the longchar type in the data step, it was working fine. The reason I changed to varchar(160000) is the character limit. If I revert to the original code without longchar and concatenation, it produces data perfectly.
Hi @imdickson
First thing:
You split the geometry string into 3 variables to get around the SAS 32,767-byte limit. And it makes no sense to concatenate the 3 variables back into one in the data step, because the resulting variable would be longer than 32,767 bytes, which is impossible. You need to process the 3 strings one by one and handle geometry pairs broken across variable boundaries.
You will see in the SAS log that the statement length longgeo varchar(160000); gives the following note, and the longgeo variable is set to 32,767 bytes in length:
NOTE: One or more variables were converted because the data type is not supported by the V9 engine.
For more details, run with options MSGLEVEL=I.
Second:
You use the CAT function with too few arguments, so you always exit the loop with the value of geo(3) in longgeo, and because most geometries are probably shorter than 60k characters, geometry3 (and therefore longgeo) will always be blank.
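The effect of the overwrite bug can be reproduced in any language; here is a small Python analogue (hedged: Python stands in for the SAS data step, and the chunk values are made up):

```python
# Python analogue of the CAT bug discussed above: the loop body assigns
# instead of appending, so only the last chunk survives.
chunks = ["AAA", "BBB", ""]   # geometry3 is blank when geometry <= 60k chars

# Buggy version: longgeo = cat(geo(i))  -- overwrites on every iteration
longgeo = ""
for c in chunks:
    longgeo = c               # overwrite
assert longgeo == ""          # ends up blank, like the OP's output

# Fixed version: longgeo = cat(longgeo, geo(i))  -- appends
longgeo = ""
for c in chunks:
    longgeo = longgeo + c     # append
assert longgeo == "AAABBB"
```

Because the last chunk (geometry3) is empty whenever the source string is under 60k characters, the buggy loop always leaves longgeo blank.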
----
Geographical data is not my main work area, but as far as I remember the data base contains functions to do the unpackaging, so you should be able to get your final result from the pass-thru.
I was told by SAS Pre-Sales to split and then concatenate back to overcome the limit.
Thanks for pointing out that this won't work. I will go back to SAS Pre-Sales & technical support about this.
You overwrite longgeo in your do loop instead of concatenating to it.
Change this statement
longgeo = cat(geo(i));
to this
longgeo = cat(longgeo,geo(i));
Hi @Kurt_Bremser I tried
longgeo = cat(longgeo,geo(i));
However, output is still showing 0 observations. 😞
Could it be longchar is not "working" in this case?
@imdickson wrote:
Hi @Kurt_Bremser I tried
longgeo = cat(longgeo,geo(i));
However, output is still showing 0 observations. 😞
Could it be longchar is not "working" in this case?
Wait, you get zero observations?
Please post the log from your code.
Hi @imdickson
We are not giving you much help so far.
You are trying to read from a Hive database that happens to have a single column of 60k characters. But you select 3 x 30k from it, = 90k. And then you try the impossible: concatenating these 3 x 30k into a 160k string, which cannot be done in SAS, because the longest string is 32,767 bytes. What is the actual length of the input column? Have you tried a select of max(length(geometry)), or whatever the correct syntax in Hive is?
If the max. used length is actually bigger than 32,767, you need more than one variable, and a longitude/latitude pair is broken at each variable boundary. But because each variable is close to the SAS limit, you cannot do the concatenation up front; you must wait until there is room for it.
Start with length longgeo $32767; longgeo = geometry1;
When you break longgeo up into longitude/latitude pairs, output a pair and then remove it from the longgeo variable, so longgeo gets shorter until you come to the last pair, which will be incomplete if geometry2 is not blank.
But once longgeo has shrunk far enough, there is room to concatenate geometry2 [ longgeo = catt(longgeo,geometry2) ]. And you can proceed taking out pairs until the last pair, then concatenate geometry3 (if needed), and so on.
This means that your algorithm for splitting the string must change: you have to know the position of your delimiter, so you can pick a pair as a substring up to the delimiter and continue in the next iteration with the substring after the delimiter.
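The steps above can be sketched as follows (a minimal Python sketch, not the poster's SAS; `pairs_from_chunks` is a hypothetical helper, and the sample chunks simulate a pair broken across a 30k-substring boundary):

```python
# Sketch of the algorithm described above: keep a working buffer under the
# 32,767-byte SAS limit, emit complete [lng,lat] pairs from its front so the
# buffer shrinks, and only then append the next 30k chunk.

def pairs_from_chunks(chunks):
    """chunks: the 30k substrings in order. Returns (lng, lat) string pairs."""
    buf = ""
    pairs = []
    for chunk in chunks + [""]:        # trailing "" flushes the buffer
        buf += chunk                   # in SAS, wait until there is room first
        while True:
            start = buf.find('[')
            end = buf.find(']', start)
            if start == -1 or end == -1:
                break                  # incomplete pair; wait for next chunk
            token = buf[start + 1:end].strip('[')
            buf = buf[end + 1:]        # consume the pair -> buffer shrinks
            if ',' in token and token.replace(',', '').isdigit():
                lng, lat = token.split(',')[:2]
                pairs.append((lng, lat))
    return pairs

geometry = '{"rings":[[[111,222],[333,444],[555,666]]]}'
# Simulate a pair broken across a chunk boundary:
chunks = [geometry[:20], geometry[20:]]
print(pairs_from_chunks(chunks))
# → [('111', '222'), ('333', '444'), ('555', '666')]
```

The key design point is exactly what the post says: pairs are consumed from the front by delimiter position, so the buffer length falls back below the limit before the next chunk is appended.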
Why aren't you using the solution here https://communities.sas.com/t5/SAS-Programming/SAS-loading-from-Hive-is-truncated-to-32-767-characte... ?
Hi @ChrisNZ. I tried using the script and I have 2 issues here.
1st, I tried
lateral view outer explode(split(geometry,']\\,\\[')) t1 as exploded_sellers
I want to split on every complete LNG/LAT pair. However, something doesn't feel right about my code.
2nd, I am using pass-through in SAS to run this script. I don't see any new column created by using it.
BTW, I do not have admin rights in the Hive database. Must I run this script in Hive before loading into SAS?
Mmm. Did you try much?
I don't know much at all about Teradata, but a bit of poking around and I got this to work:
proc sql;
connect using TERA;
select * from connection to TERA(
select *
from table ( strtok_split_to_table('a','"{""rings"":[[[794533591,649756392],[794460411,643714587],[797817321,640072798],[798129358,63669524', ']')
returns ( OUTKEY varchar(1) character set unicode
, TOKENNUM integer
, TOKEN varchar(50) character set unicode)
) as tab
);
OUTKEY | TOKENNUM | TOKEN |
---|---|---|
a | 1 | "{""rings"":[[[794533591,649756392 |
a | 2 | ,[794460411,643714587 |
a | 3 | ,[797817321,640072798 |
a | 4 | ,[798129358,63669524 |
You can easily extract from these pairs in SAS. Like so:
proc sql;
connect using NWP;
select scan(TOKEN ,2,'[') as COORDS
, scan(calculated COORDS,1,',') as COORD1
, scan(calculated COORDS,2,',') as COORD2
, * from connection to NWP(
select *
from table ( strtok_split_to_table('a','"{""rings"":[[[794533591,649756392],[794460411,643714587],[797817321,640072798],[798129358,63669524', ']')
returns ( OUTKEY varchar(1) character set unicode
, TOKENNUM integer
, TOKEN varchar(50) character set unicode)
) as tab
);
COORDS | COORD1 | COORD2 | OUTKEY | TOKENNUM | TOKEN |
---|---|---|---|---|---|
794533591,649756392 | 794533591 | 649756392 | a | 1 | "{""rings"":[[[794533591,649756392 |
794460411,643714587 | 794460411 | 643714587 | a | 2 | ,[794460411,643714587 |
797817321,640072798 | 797817321 | 640072798 | a | 3 | ,[797817321,640072798 |
798129358,63669524 | 798129358 | 63669524 | a | 4 | ,[798129358,63669524 |
Just replace the long hard-coded string with TABLE.COLUMN and it should all work.
Thanks for the code. I tried again on my side with my own table and settings as below, BUT I don't get any output table at all. Something must not be right in my code.
proc sql;
connect to hadoop as mydb
(server=REMOVED
database=REMOVED
);
create table a as
select * from connection to mydb(
select cgis_area
,substr(geometry,1,30000) as geometry1
,substr(geometry,30001,30000) as geometry2
,substr(geometry,60001,30000) as geometry3
from GIS(
select * from a (
strtok_split_to_table('a','"{""rings"":[[[794533591,649756392],[794460411,643714587],[797817321,640072798],[798129358,63669524', ']')
returns ( OUTKEY varchar(1) character set unicode
, TOKENNUM integer
, TOKEN varchar(50) character set unicode)
) as tab
);
);
disconnect from mydb;
quit;
BTW, what is strtok?
select *
from table ( strtok_split_to_table('a','"{""rings"":[[[794533591,649756392],[794460411,643714587],[797817321,640072798],[798129358,63669524', ']')
returns ( OUTKEY varchar(1) character set unicode
, TOKENNUM integer
, TOKEN varchar(50) character set unicode)
) as tab
);
I don't quite understand this part. I want to tweak your script to fit into mine, not hard-code the coordinate value like in your strtok call. I assume strtok is just an alias. Meanwhile, I am still trying to figure out how to tweak your code to make mine work. If you see this message, I hope you can guide me further.
1. What is strtok?
I searched the web to find that solution. I am sure you can too.
2. I don't get any output table at all.
It looks like you did not read my reply, and that you did not try to understand what it does.
In particular the words:
Just replace the long hard-coded string with TABLE.COLUMN and it should all work.
Just as a thought: what if you transposed the table within Hive (using explicit Hive SQL) so that you get only a single longitude/latitude pair per row? This would make the problem of a >32KB column go away. It would also push processing to Hadoop and execute in parallel.
I haven't had a chance to use it yet, but it appears Hive has functions that allow for some sort of array processing and for such a transpose - something along the lines of this discussion: https://stackoverflow.com/questions/49885323/how-to-split-delimited-string-to-multiple-rows-in-hive-...
....just realized I'm proposing the same as @ChrisNZ - so that must be good 🙂
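The shape of that Hive-side transpose can be illustrated like this (a sketch in Python, since the exact HiveQL depends on your Hive version; `explode_split` is a hypothetical helper mimicking `lateral view explode(split(...))`):

```python
import re

# Mimic Hive's split(geometry, '\\],\\[') + lateral view explode:
# one wide row becomes many narrow rows before SAS ever sees the data.

def explode_split(area, geometry):
    """Return (area, pos, token) rows, one per '],['-delimited token."""
    tokens = re.split(r'\],\[', geometry)
    return [(area, i + 1, t) for i, t in enumerate(tokens)]

rows = explode_split('6224', '{"rings":[[[111,222],[333,444],[555,666]]]}')
for r in rows:
    print(r)
```

Each resulting row carries one (still slightly decorated) pair, well under 32,767 bytes, so the SAS side only needs per-row SCAN/COMPRESS cleanup.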