BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

Yahoo Finance HTMLs contain JSON data in the middle as follows (I somewhat truncated the HTML to show how it looks):

 

<!DOCTYPE html><html id="atomic" class="NoJs desktop" lang="en-US"><head prefix="og: http://ogp.me/ns#"><script>window.performance && window
</script><script>!function(e,s,f,p){var a=[],t={_version:"3.11.4",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},
if (!window.YAHOO || !window.YAHOO.i13n || !window.YAHOO.i13n.Rapid) { return; }
var rapidConfig = {"async_all_clicks":true,"click_timeout":300,"client_only":1,"compr_type":"deflate","keys":{"ver":"ydotcom","navtype":"ser
window.rapidInstance = new window.YAHOO.i13n.Rapid(rapidConfig);
})();</script></head><body><div id="app"><div class="" data-reactroot="" data-reactid="1" data-react-checksum="1106760844"><div data-reactid
window._adPerfData = []; 
window._adPosMsg = []; 
window._perfMark = function _perfMark (name) {if (window.performance && window.performance.mark){try {if (window.performance.getEntriesByNam
window._perfMeasure = function _perfMeasure (name, start, end) {if (window.performance && window.performance.measure){try {if (window.perfor
window._pushAdPerfMetric = function _pushAdPerfMetric(key) {if (window.performance && window.performance.now) {_adPerfData.push([key, Math.r
window._fireAdPerfBeacon = function _fireAdPerfBeacon(eventName) {try {if (window && window.rapidInstance && window.performance) {var navCli
window.rapidInstance.beaconPerformanceData(perfData);}} catch (e) {console.warn('Could not send the beacon:',e);}};
window.DARLA_CONFIG = {"debug":false,"dm":1,"autoRotation":10000,"rotationTimingDisabled":true,"k2":{"res":{"rate":5,"pos":["BTN","BTN-1","B
window.DARLA_CONFIG.servicePath = window.location.protocol + "//fc.yahoo.com/sdarla/php/fc.php";window.DARLA_CONFIG.dm = 1;window.DARLA_CONF
window.DARLA_CONFIG.onFinishRequest = function() {window._perfMark('DARLA_REQEND');};
window.DARLA_CONFIG.onStartParse = function() {window._perfMark('DARLA_PSTART');};
window.DARLA_CONFIG.onSuccess = function(eventName) {if (eventName === 'AUTO') {return;}if (window._DarlaEvents) {window._DarlaEvents.emit("
window.DARLA_CONFIG.onStartPosRender = function(posItem) {var posId = posItem && posItem.pos;window._perfMark('DARLA_ADSTART_' + posId);if (
window.DARLA_CONFIG.onFinishPosRender = function(posId, reqList, posItem) {var ltime;window._perfMark('DARLA_ADEND_' + posId);window._perfMe
window.DARLA_CONFIG.onBeforePosMsg = function(msg, posId) {var maxWidth = 970, maxHeight = 600;var newWidth, newHeight, pos;if (window._Darl
window.DARLA_CONFIG.onFinishParse = function(eventName, response) {try {if (eventName !== "AUTO") {var positionlist = response.ps();var foun
};
window.DARLA_CONFIG.onStartPrefetchRequest = function(eventName) {window._perfMark('DARLA_PFSTART');};
window.DARLA_CONFIG.onFinishPrefetchRequest = function(eventName, status) {window._perfMark('DARLA_PFEND');try {window._DarlaEvents.emit('fi
};
window.DARLA_CONFIG.onPosMsg = function(cmd, pos, msg) {try {if (window._DarlaEvents && cmd === "cmsg") {var posmsg = {pos: pos,msg: msg};wi
};
(function () {var _onloadEvt = function _onloadEvtHandler() {window._loadEvt = true;if (window._darlaSuccessEvt) {window._fireAdPerfBeacon(w
window.$sf = window.sf = {};
window.$sf.host = {onReady: function (autorender, deferrender, firstRenderPos, deferRenderDelay) {window._perfMark('DARLA_ONREADY');window._
};
window.sf_host = window.$sf.host;
document.onreadystatechange = function () {if (document.readyState == "interactive") {window._perfMark('DOM_INTERACTIVE');}};</script>
<script type="text/x-safeframe" id="fc" _ver="4-9-0">{"positions":[{"id":"FB2A","html":"<!-- SpaceID=0 robot -->\n","lowHTML":"","meta":{"y"

<!-- bf1-sdarlaws-046.adx.bf1.yahoo.com Sat Sep 18 16:26:34 UTC 2021 -->
<script type="text/javascript">if (typeof DARLA !== "undefined" && DARLA) {DARLA.config(window.DARLA_CONFIG);window.sf_host.onReady(true,tru
<style>
.hide-lrec-ad .darla-lrec-ad {display: none;}.hide-non-lrec34 .darla-nonlrec34-ad {display: none;}.hide-non-lrec34 .sticky-outer-wrapper .td
</div><script>
(function (root) {
/* -- Data -- */
root.App || (root.App = {});
root.App.now = 1631982394842;
root.App.main = {"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"quote","currentEvent":{"eventName":"NEW_PAGE_SUCCESS"},"
}(this));
</script><script>
    (function(win) {
      win.vzm = win.vzm || {};

 

The sixth line from the bottom above contains a JSON value root.App.main. I wanted to isolate these JSON data, create a new JSON file WANT.json as follows (a trouble occurred at this point), and then read the created JSON using libname JSON.

filename HAVE temp;

proc http url="https://finance.yahoo.com/quote/KO/financials?p=KO" out=HAVE;
run;

data _null_;
	infile HAVE;
	input;
	file "/home/junyong0/WANT.json";
	if _infile_=:"root.App.main" then put _infile_;
run;

The problem is the 32,767 LRECL limit applied to _INFILE_. The resultant WANT.json contains only the first 32,767 letters truncating all the remainders. Is there any way to handle this long line to get an appropriate JSON file? Or I wonder if there is a simpler way to read HTML-embedded JSON.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

That is why I said to use RECFM=N in the first step so that the end of line character(s) are not added.

file "!USERPROFILE\Desktop\WANT.json" lrecl=3276700 recfm=n;

If you can find the size of the file, perhaps using operating system command like ls or dir, then you could just do something like this to read in the file byte by byte and only on the last three bytes eliminate any semi-colons, CR or LF characters.

%let filesize=123456;
data _null_;
  infile step1 lrecl=1 recfm=f ;
  file want recfm=n;
  do i=1 to &filesize;
    input ch $char1. ;
    if I<&filesize-3 then put ch $char1.;
    else if ch not in (';','0D'x,'0A'x) then put ch $char1.;
  end;
run;

View solution in original post

8 REPLIES 8
sbxkoenk
SAS Super FREQ

Hello,

 

SAS variables are limited to length of 32767, but LRECL of an INFILE is not.  So if your real file has record length longer than max-length of a character variable then you cannot read it into a single variable.  Should be easy to read into multiple variables instead.

 

See here:

Losing characters during reading data from file.
Posted 08-21-2012 03:29 AM
https://communities.sas.com/t5/SAS-Procedures/Losing-characters-during-reading-data-from-file/m-p/99...

 

Koen

Tom
Super User Tom
Super User

SAS can use much larger LRECL than 32,767.  You just cannot treat _INFILE_ as if it was a normal variable if the LRECL is longer than the limit for a variable.  But you only need to read the beginning of the line to implement your logic.

data _null_;
  infile HAVE lrecl=2000000 truncover;
  file "/home/junyong0/WANT.json" lrecl=2000000;
  input beginning $char20.;
  if beginning=:"root.App.main" then put _infile_;
run;

You could even use some more smarts to find multiple line values and not include that root.App.main stuff .

Let's make a sample file.

filename have temp;
data _null_;
  file have lrecl=1000000 ;
  put '</div><script>'
    / '(function (root) {'
    / '/* -- Data -- */'
    / 'root.App || (root.App = {});'
    / 'root.App.now = 1631982394842;'
    / 'root.App.main = {"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"quote",'
    / '"currentEvent":{"eventName":"NEW_PAGE_SUCCESS"},"'
    / '}(this));'
    / '</script><script>'
    / '    (function(win) {'
    / '     win.vzm = win.vzm || {};'
  ;
run;

Now we can try to pull out those lines of JSON.  We can use the START= variable to start the output with the opening { .  We can retain a flag to let us know which lines to write.  We can stop when we see the </script> tag.  We could even try to force them into a single line be adding trailing @ to the PUT statement.

filename copy temp;
data _null_;
  infile have lrecl=1000000 truncover start=s;
  file copy lrecl=1000000;
  retain found 0;
  s=1;
  input beginning $char40.;
  if beginning=:'root.App.main' then do;
     s=index(beginning,'{');
     put _infile_ @;
     found=1;
     s=1;
  end;
  else if found then do;
     if beginning=:'</script>' then stop;
     else put _infile_ @;
  end;
run;

data _null_;
  infile copy ;
  input;
  put _infile_;
run;

Results:

{"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"quote","currentEvent":{"eventName":"NEW_PAGE_SUCCESS"},"}(this));
NOTE: 1 record was read from the infile COPY.
      The minimum record length was 133.
      The maximum record length was 133.
Junyong
Pyrite | Level 9

Thanks for these details. Here's my modified code,

data _null_;
	infile "https://finance.yahoo.com/quote/KO/financials?p=KO" url lrecl=3276700 truncover start=S;
	input HEAD $80.;
	file "!USERPROFILE\Desktop\WANT.json" lrecl=3276700;
	if HEAD=:"root.App.main" then do;
		S=index(HEAD,"{");
		put _infile_;
	end;
run;

and the outcome ending with

 

:{"strings":1},"yahoodotcom-layout":{"strings":1}}},"options":{"defaultBundle":"td-app-finance"}}}};

 

I wanted to also remove the very last semicolon in the outcome JSON. The problem then is that an END and LAST approach, unlike the START and HEAD approach above, cannot be used when deleting the right-end semicolon (for example, END=INDEX(_INFILE_,";")-1 is impossible). How should I do in this case?

Tom
Super User Tom
Super User

Process the file again.

Write the first version using RECFM=N so that end of line characters are not inserted.

Then read it back byte by byte and skip writing back out the last byte.

Example:

filename step1 temp;
data _null_;
  file step1 recfm=n;
  put '{...:"td-app-finance"}}}};';
run;

filename want temp;
data _null_;
  infile step1 lrecl=1 recfm=f end=eof;
  file want recfm=n;
  input ch $char1. ;
  if not eof then put ch $char1.;
run;

Results:

660   data _null_;
661    infile want;
662    input;
663    put _infile_;
664   run;

NOTE: The infile WANT is:
      Filename=...\#LN00152,
      RECFM=V,LRECL=32767,File Size (bytes)=25,
      Last Modified=19Sep2021:20:10:19,
      Create Time=19Sep2021:20:10:19

{...:"td-app-finance"}}}}
NOTE: 1 record was read from the infile WANT.
      The minimum record length was 25.
      The maximum record length was 25.
Junyong
Pyrite | Level 9

Sorry for this repeated bothering, but what if I need to erase the last two or three characters? I found that my previous code attaches either CRLF or LF right after the last semicolon.

data _null_;
	infile "https://finance.yahoo.com/quote/KO/financials?p=KO" url lrecl=3276700 truncover start=S;
	input HEAD $80.;
	file "!USERPROFILE\Desktop\WANT.json" lrecl=3276700;
	if HEAD=:"root.App.main" then do;
		S=index(HEAD,"{");
		put _infile_;
	end;
run;

The last three characters (LF, CR, and then the semicolon) can be deleted by repeating your code three times, but I wonder if there is a solution smarter than this.

Tom
Super User Tom
Super User

That is why I said to use RECFM=N in the first step so that the end of line character(s) are not added.

file "!USERPROFILE\Desktop\WANT.json" lrecl=3276700 recfm=n;

If you can find the size of the file, perhaps using operating system command like ls or dir, then you could just do something like this to read in the file byte by byte and only on the last three bytes eliminate any semi-colons, CR or LF characters.

%let filesize=123456;
data _null_;
  infile step1 lrecl=1 recfm=f ;
  file want recfm=n;
  do i=1 to &filesize;
    input ch $char1. ;
    if I<&filesize-3 then put ch $char1.;
    else if ch not in (';','0D'x,'0A'x) then put ch $char1.;
  end;
run;
Patrick
Opal | Level 21

You could investigate if sample code Example: Reading JSON Text from HTTP and Creating a SAS Data Set is any good for your use case.

Ksharp
Super User

You can make it as a stream file which don't have line break character by option " recfm=n ". and import it .

data want ;
infile HAVE recfm=n dlm='; ={}';
input temp : $200. ;
retain found 0 ;
if temp=:"root.App.main" then found=1;
if found then do;
input json : $200. @@;
output;
end;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1130 views
  • 3 likes
  • 5 in conversation