I am trying to read from a Hive database that happens to have a single column consisting of 60k characters. Since SAS supports at most 32,767 characters per variable, as per forum advice, I split the column into 3 different columns using a pass-through SQL statement. Then, in the next data step, I concatenate these 3 columns and process them. However, my output table is always blank.
What the script does is read GPS coordinates from one column in Hive; my SAS script then splits the coordinates into rows, with one LNG/LAT pair per row tied to one specific location.
What could be wrong here? Why is it always blank?
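For readers following along, here is the intended transformation sketched in Python (not the poster's SAS; the geometry format and the divide-by-1000 step are taken from the code and samples later in this thread, and `geometry_to_rows` is a hypothetical helper name):

```python
# Python illustration of the intended result: one long Esri-style "rings"
# geometry string in, one LNG/LAT pair per output row.

def geometry_to_rows(area, geometry):
    """Return (area, seq, lng, lat) tuples, one per coordinate pair."""
    rows = []
    seq = 0
    # Pairs look like [794533591,649756392]; split on '[' and strip decoration
    for token in geometry.split('['):
        token = token.strip('],{}" ')
        if ',' not in token or not token[0].isdigit():
            continue
        lng_s, lat_s = token.split(',')[:2]
        seq += 1
        # The thread's SAS code divides by 1000 to recover decimal coordinates
        rows.append((area, seq, int(lng_s) / 1000, int(lat_s) / 1000))
    return rows

sample = '{"rings":[[[794533591,649756392],[794460411,643714587]]]}'
for row in geometry_to_rows('6224 - Bagan Serai', sample):
    print(row)
```

The whole thread is about getting this per-pair output when the input string is longer than the SAS 32,767-byte variable limit.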
My Code:
proc sql;
connect to hadoop as mydb
(server=<removed>
database=<removed>
);
create table a as
select * from connection to mydb(
select cgis_area
,substr(geometry,1,30000) as geometry1
,substr(geometry,30001,30000) as geometry2
,substr(geometry,60001,30000) as geometry3
from GIS where cgis_area = '6224 - Bagan Serai' );
disconnect from mydb;
quit;
proc sort data=ncpdm.ncp_dim_ba
(keep=region zone state_code state business_area_code business_area gis_ba)
out=business_area nodupkeys;
by gis_ba business_area_code;
run;
data business_area;
set business_area;
by gis_ba business_area_code;
if first.business_area_code;
run;
proc sort data=ncpdm.ncp_dim_ba
(keep=region zone state_code state business_area_code business_area gis_ba)
out=business_area nodupkeys;
by gis_ba business_area_code;
run;
data business_area;
set business_area;
by gis_ba business_area_code;
if first.business_area_code;
run;
data ncp_dim_ba_geo (drop=excp_code cgis_area)
work.excp_ncp_dim_ba_geo;
length excp_code $50;
length gis_ba dotpos $50 zone $2 state_code $5
region state business_area_code $30 business_area $50 xval yval 8;
if _n_=1 then do;
declare hash h(dataset:"business_area");
h.definekey('gis_ba');
h.definedata('region','zone','state_code','state','business_area_code','business_area');
h.definedone();
call missing(region,zone,state_code,state,business_area_code,business_area);
end;
set a;
array geo {3} geometry1-geometry3;
** CREATE AND POPULATE VARCHAR FIELD **;
length longgeo varchar(160000);
do i = 1 to 3;
/* longgeo = cat(longgeo,geo(i));*/
longgeo = cat(geo(i));
end;
/* longgeo=cat(geo1,geo2,geo3);*/
/* set hpsncp.GIS (keep=cgis_area geometry cgis_label*/
/* dbsastype=(cgis_area='CHAR(50)' geometry='CHAR(32767)' cgis_label='CHAR(50)'));*/
gis_ba=upcase(cgis_area);
rc2=h.find();
if (rc2 ne 0) then do;
excp_code='Exception: Unable to map business area';
output work.excp_ncp_dim_ba_geo;
end;
if (rc2=0);
longgeo=substr(longgeo,15);
seq=1;
do until(dotpos eq '');
dotpos=compress(scan(longgeo, seq, '[]'),',','nk');
if dotpos not in ('' '{' '}' '[' ']' ',') then do;
xval=input(put(scan(dotpos,1,','),20.),20.)/1000;
yval=input(put(scan(dotpos,2,','),20.),20.)/1000;
if xval ne . and yval ne . then do;
output ncp_dim_ba_geo;
end;
end;
seq+1;
end;
keep excp_code cgis_area region zone state_code state business_area_code business_area seq xval yval;
run;
Am I doing something wrong? Mind you that before changing to the longchar type in the data step, it was working fine. The reason I changed to varchar(160000) is the character limit. If I revert to the original code without longchar and concatenation, it produces data perfectly.
Hi @imdickson
First thing:
You split the geometry string into 3 variables to get around the SAS 32,767-byte limit. And it makes no sense to concatenate the 3 variables back into one in the data step, because the resulting variable would be longer than 32,767 bytes, which is impossible. You need to process the 3 strings one by one and handle geometry pairs broken across variable boundaries.
You will see in the SAS log that the statement length longgeo varchar(160000); gives the following note, and the longgeo variable is set to 32,767 bytes in length:
NOTE: One or more variables were converted because the data type is not supported by the V9 engine.
For more details, run with options MSGLEVEL=I.
Second:
You use the CAT function with too few arguments, so you always exit the loop with the value of geo(3) in longgeo, and because most geometries are probably shorter than 60k characters, geometry3 (and therefore longgeo) will always be blank.
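The effect of the overwrite bug can be reproduced in any language; here is a small Python analogue (hedged: Python stands in for the SAS data step, and the chunk values are made up):

```python
# Python analogue of the CAT bug discussed above: the loop body assigns
# instead of appending, so only the last chunk survives.
chunks = ["AAA", "BBB", ""]   # geometry3 is blank when geometry <= 60k chars

# Buggy version: longgeo = cat(geo(i))  -- overwrites on every iteration
longgeo = ""
for c in chunks:
    longgeo = c               # overwrite
assert longgeo == ""          # ends up blank, like the OP's output

# Fixed version: longgeo = cat(longgeo, geo(i))  -- appends
longgeo = ""
for c in chunks:
    longgeo = longgeo + c     # append
assert longgeo == "AAABBB"
```

Because the last chunk (geometry3) is empty whenever the source string is under 60k characters, the buggy loop always leaves longgeo blank.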
----
Geographical data is not my main work area, but as far as I remember the data base contains functions to do the unpackaging, so you should be able to get your final result from the pass-thru.
I was told by SAS Pre-Sales to split and then concatenate back to overcome the limit.
Thanks for pointing out that this won't work. I will go back to SAS Pre-Sales & technical support about this.
You overwrite longgeo in your do loop instead of concatenating to it.
Change this statement
longgeo = cat(geo(i));
to this
longgeo = cat(longgeo,geo(i));
Hi @Kurt_Bremser I tried
longgeo = cat(longgeo,geo(i));
However, output is still showing 0 observations. 😞
Could it be longchar is not "working" in this case?
@imdickson wrote:
Hi @Kurt_Bremser I tried
longgeo = cat(longgeo,geo(i));
However, output is still showing 0 observations. 😞
Could it be longchar is not "working" in this case?
Wait, you get zero observations?
Please post the log from your code.
Hi @imdickson
We are not giving you much help so far.
You are trying to read from a Hive database that happens to have a single column of 60k characters. But you select 3 x 30k from it, = 90k. And then you try the impossible: concatenating these 3 x 30k into a 160k string, which cannot be done in SAS, because the longest string is 32,767 bytes. What is the actual length of the input column? Have you tried a select of max(length(geometry)), or whatever the correct syntax in Hive is?
If the max. used length is actually bigger than 32,767, you need more than one variable, and a longitude/latitude pair is broken at each variable boundary. But because each variable is close to the SAS limit, you cannot do the concatenation up front; you must wait until there is room for it.
Start with length longgeo $32767; longgeo = geometry1;
When you break longgeo up into longitude/latitude pairs, output a pair and then remove it from the longgeo variable, so longgeo gets shorter until you come to the last pair, which will be incomplete if geometry2 is not blank.
But once longgeo has shrunk far enough, there is room to concatenate geometry2 [ longgeo = catt(longgeo,geometry2) ]. And you can proceed taking out pairs until the last pair, then concatenate geometry3 (if needed), and so on.
This means that your algorithm for splitting the string must change: you have to know the position of your delimiter, so you can pick a pair as a substring up to the delimiter and continue in the next iteration with the substring after the delimiter.
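The steps above can be sketched as follows (a minimal Python sketch, not the poster's SAS; `pairs_from_chunks` is a hypothetical helper, and the sample chunks simulate a pair broken across a 30k-substring boundary):

```python
# Sketch of the algorithm described above: keep a working buffer under the
# 32,767-byte SAS limit, emit complete [lng,lat] pairs from its front so the
# buffer shrinks, and only then append the next 30k chunk.

def pairs_from_chunks(chunks):
    """chunks: the 30k substrings in order. Returns (lng, lat) string pairs."""
    buf = ""
    pairs = []
    for chunk in chunks + [""]:        # trailing "" flushes the buffer
        buf += chunk                   # in SAS, wait until there is room first
        while True:
            start = buf.find('[')
            end = buf.find(']', start)
            if start == -1 or end == -1:
                break                  # incomplete pair; wait for next chunk
            token = buf[start + 1:end].strip('[')
            buf = buf[end + 1:]        # consume the pair -> buffer shrinks
            if ',' in token and token.replace(',', '').isdigit():
                lng, lat = token.split(',')[:2]
                pairs.append((lng, lat))
    return pairs

geometry = '{"rings":[[[111,222],[333,444],[555,666]]]}'
# Simulate a pair broken across a chunk boundary:
chunks = [geometry[:20], geometry[20:]]
print(pairs_from_chunks(chunks))
# → [('111', '222'), ('333', '444'), ('555', '666')]
```

The key design point is exactly what the post says: pairs are consumed from the front by delimiter position, so the buffer length falls back below the limit before the next chunk is appended.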
Why aren't you using the solution here https://communities.sas.com/t5/SAS-Programming/SAS-loading-from-Hive-is-truncated-to-32-767-characte... ?
Hi @ChrisNZ. I tried using the script and I have 2 issues here.
1st, I tried
lateral view outer explode(split(geometry,']\\,\\[')) t1 as exploded_sellers
I want to split on every complete LNG/LAT pair. However, something doesn't feel right about my code.
2nd, I am using pass-through in SAS to run this script. I don't see any new column created by using it.
BTW, I do not have admin rights in the Hive database. Must I run this script in Hive before loading into SAS?
Mmm. Did you try much?
I don't know much at all about Teradata, but a bit of poking around and I got this to work:
proc sql;
connect using TERA;
select * from connection to TERA(
select *
from table ( strtok_split_to_table('a','"{""rings"":[[[794533591,649756392],[794460411,643714587],[797817321,640072798],[798129358,63669524', ']')
returns ( OUTKEY varchar(1) character set unicode
, TOKENNUM integer
, TOKEN varchar(50) character set unicode)
) as tab
);
OUTKEY | TOKENNUM | TOKEN |
---|---|---|
a | 1 | "{""rings"":[[[794533591,649756392 |
a | 2 | ,[794460411,643714587 |
a | 3 | ,[797817321,640072798 |
a | 4 | ,[798129358,63669524 |
You can easily extract from these pairs in SAS. Like so:
proc sql;
connect using NWP;
select scan(TOKEN ,2,'[') as COORDS
, scan(calculated COORDS,1,',') as COORD1
, scan(calculated COORDS,2,',') as COORD2
, * from connection to NWP(
select *
from table ( strtok_split_to_table('a','"{""rings"":[[[794533591,649756392],[794460411,643714587],[797817321,640072798],[798129358,63669524', ']')
returns ( OUTKEY varchar(1) character set unicode
, TOKENNUM integer
, TOKEN varchar(50) character set unicode)
) as tab
);
COORDS | COORD1 | COORD2 | OUTKEY | TOKENNUM | TOKEN |
---|---|---|---|---|---|
794533591,649756392 | 794533591 | 649756392 | a | 1 | "{""rings"":[[[794533591,649756392 |
794460411,643714587 | 794460411 | 643714587 | a | 2 | ,[794460411,643714587 |
797817321,640072798 | 797817321 | 640072798 | a | 3 | ,[797817321,640072798 |
798129358,63669524 | 798129358 | 63669524 | a | 4 | ,[798129358,63669524 |
Just replace the long hard-coded string with TABLE.COLUMN and it should all work.
Thanks for the code. I tried again on my side with my own table and settings as below, BUT I don't get any output table at all. Something must not be right in my code.
proc sql;
connect to hadoop as mydb
(server=REMOVED
database=REMOVED
);
create table a as
select * from connection to mydb(
select cgis_area
,substr(geometry,1,30000) as geometry1
,substr(geometry,30001,30000) as geometry2
,substr(geometry,60001,30000) as geometry3
from GIS(
select * from a (
strtok_split_to_table('a','"{""rings"":[[[794533591,649756392],[794460411,643714587],[797817321,640072798],[798129358,63669524', ']')
returns ( OUTKEY varchar(1) character set unicode
, TOKENNUM integer
, TOKEN varchar(50) character set unicode)
) as tab
);
);
disconnect from mydb;
quit;
BTW, what is strtok?
select *
from table ( strtok_split_to_table('a','"{""rings"":[[[794533591,649756392],[794460411,643714587],[797817321,640072798],[798129358,63669524', ']')
returns ( OUTKEY varchar(1) character set unicode
, TOKENNUM integer
, TOKEN varchar(50) character set unicode)
) as tab
);
I don't quite understand this part. I want to tweak your script to fit into mine, not hard-code the coordinate value like in your strtok call. I assume strtok is just an alias. Meanwhile, I am still trying to figure out how to tweak your code to make mine work. If you see this message, I hope you can guide me further.
1. What is strtok?
I searched the web to find that solution. I am sure you can too.
2. I don't get any output table at all.
It looks like you did not read my reply, and that you did not try to understand what it does.
In particular the words:
Just replace the long hard-coded string with TABLE.COLUMN and it should all work.
Just as a thought: what if you transposed the table within Hive (using explicit Hive SQL) so that you get only a single longitude/latitude pair per row? This would make the problem of a >32KB column go away. It would also push processing to Hadoop and execute in parallel.
I haven't had a chance to use it yet, but it appears Hive has functions that allow for some sort of array processing and for such a transpose - something along the lines of this discussion: https://stackoverflow.com/questions/49885323/how-to-split-delimited-string-to-multiple-rows-in-hive-...
....just realized I'm proposing the same as @ChrisNZ - so that must be good 🙂
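The shape of that Hive-side transpose can be illustrated like this (a sketch in Python, since the exact HiveQL depends on your Hive version; `explode_split` is a hypothetical helper mimicking `lateral view explode(split(...))`):

```python
import re

# Mimic Hive's split(geometry, '\\],\\[') + lateral view explode:
# one wide row becomes many narrow rows before SAS ever sees the data.

def explode_split(area, geometry):
    """Return (area, pos, token) rows, one per '],['-delimited token."""
    tokens = re.split(r'\],\[', geometry)
    return [(area, i + 1, t) for i, t in enumerate(tokens)]

rows = explode_split('6224', '{"rings":[[[111,222],[333,444],[555,666]]]}')
for r in rows:
    print(r)
```

Each resulting row carries one (still slightly decorated) pair, well under 32,767 bytes, so the SAS side only needs per-row SCAN/COMPRESS cleanup.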