BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
France
Quartz | Level 8

dear all, 

 

could you please give me some suggestion to import the below text file into sas. the sample of it has been added in the attachment.

 

three of the information is below,

 

{"company_number":"09145694","data":{"address":{"address_line_1":"St. Andrews Road","country":"England","locality":"Henley-On-Thames","postal_code":"RG9 1HP","premises":"2"},"ceased_on":"2018-05-14","country_of_residence":"England","date_of_birth":{"month":2,"year":1977},"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d","kind":"individual-person-with-significant-control","links":{"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"},"name":"Mrs Nga Thanh Wildman","name_elements":{"forename":"Nga","middle_name":"Thanh","surname":"Wildman","title":"Mrs"},"nationality":"Vietnamese","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}

 

{"company_number":"08581893","data":{"address":{"address_line_1":"High Street","address_line_2":"Wendover","country":"England","locality":"Aylesbury","postal_code":"HP22 6EA","premises":"14a","region":"Buckinghamshire"},"ceased_on":"2016-07-01","country_of_residence":"England","date_of_birth":{"month":9,"year":1947},"etag":"45f9c9e5494b574eb52abc3990a49bd96fe09df3","kind":"individual-person-with-significant-control","links":{"self":"/company/08581893/persons-with-significant-control/individual/RgR9Zhc7yGhV0SBys8_WJ6H9O1o"},"name":"Mr Stephen Robert Charles Davies","name_elements":{"forename":"Stephen","middle_name":"Robert Charles","surname":"Davies","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","ownership-of-shares-25-to-50-percent-as-firm"],"notified_on":"2016-06-30"}}

 

{"company_number":"08581893","data":{"address":{"address_line_1":"High Street","address_line_2":"Wendover","country":"England","locality":"Aylesbury","postal_code":"HP22 6EA","premises":"14a","region":"Buckinghamshire"},"ceased_on":"2016-07-01","country_of_residence":"England","date_of_birth":{"month":6,"year":1965},"etag":"d55168c49f85ab1ef38a12ed76238d68f79f5a01","kind":"individual-person-with-significant-control","links":{"self":"/company/08581893/persons-with-significant-control/individual/-6HQmkhiomEBXJI2rgHccU67fpM"},"name":"Mr Quentin Colin Maxwell Solt","name_elements":{"forename":"Quentin","middle_name":"Colin Maxwell","surname":"Solt","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2016-06-30"}}

 

thanks in advance

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Agreed. Seems like it's designed to be used via an API and someone is just appending results in every day for each company. Interesting method. 

 

@France this should get you started. It'll get you each line into a separate observation and you can work on separating it out using SCAN(). 

 

Note the OBS option at the moment which is limiting the number of records. You'll want to remove that, but for testing, it should get you started. 

 

filename psc '/folders/myfolders/psc-snapshot-2019-01-24_1of13.json' termstr=LF;

data psc1;
length myVar $30000. companyName $100.;
infile psc lrecl=30000 obs=1000;
input;

myVar = _infile_;

companyName = dequote(scan(myVar, 2, ":,"));


run;

View solution in original post

12 REPLIES 12
Reeza
Super User
Is it identified as XML or JSON originally?
France
Quartz | Level 8
Dear Reeza,

thanks for your message,

sorry, I do not understand the meaning of XML and JSON. besides, the dataset is download by following the website (http://download.companieshouse.gov.uk/en_output.html)
Reeza
Super User

When I download the zip file, that's been separated I get a clean well formatted CSV file, though it maybe in a different encoding. 

 

data demo;
infile "..\BasicCompanyData-2019-01-01-part1_6\BasicCompanyData-2019-01-01-part1_6.csv" lrecl=32000 obs=100;
file 'c:\_localdata\demo.csv';
input ;
put _infile_;
run;

This creates another text file with the first 100 rows for testing, because the file is too big to open manually. 

The first 100 records are attached. 

Is this from a specific file in the link?

 

 

 


@France wrote:
Dear Reeza,

thanks for your message,

sorry, I do not understand the meaning of XML and JSON. besides, the dataset is download by following the website (http://download.companieshouse.gov.uk/en_output.html)

 

France
Quartz | Level 8
Dear Reeza,

apology for my mistake. the dataset is downloaded from the below website.
http://download.companieshouse.gov.uk/en_pscdata.html
Reeza
Super User
France
Quartz | Level 8

dear Reeza,

 

Thanks for your kind advice.

 

But I do not understand the meaning of it. Could you please explain it for me by using the following example.

 

{

"company_number":"09145694",

"data":

{

"address":

{

"address_line_1":"St. Andrews Road",

"country":"England"

,"locality":"Henley-On-Thames",

"postal_code":"RG9 1HP",

"premises":"2"

},

"ceased_on":"2018-05-14",

"country_of_residence":"England",

"date_of_birth":

{

"month":2,

"year":1977

},

"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d",

"kind":"individual-person-with-significant-control",

"links":

{

"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"

},

"name":"Mrs Nga Thanh Wildman",

"name_elements":

{

"forename":"Nga",

"middle_name":"Thanh",

"surname":"Wildman",

"title":"Mrs"

},

"nationality":"Vietnamese",

"natures_of_control":

["ownership-of-shares-50-to-75-percent"],

"notified_on":"2016-04-06"

}

}

 

this is a group of record in my dataset. could you please give me some suggestion about this?

 

ChrisNZ
Tourmaline | Level 20

@Reeza pointed you to the answer.

Time for you to:

1. Read about JSON files.

2. Read the link provided.

One line in that page is 

libname space JSON fileref=resp;

Start from there.

 

 

Reeza
Super User

JSON is a pain and the files there are too big for the JSON libname in my example. Because of that you have to manually parse the file. It's not easy, I agree, and it is a significant amount of work to test and get it right. When I did this for a different open data set, I think it ended up taking two days to get it fully working, testing and automated. Unfortunately I don't have that amount of time to spend on this. I'm not saying it's going to take that long  - I was trying the JSON libname here, in R and python - and was trying to use a more 'modern' method. In the end, I gave up and used SCAN() , FIND() and INDEX() to get the data I needed. 

 

You do have a learning curve here for sure. Sorry I can't help further. Good Luck. 

 


@France wrote:

dear Reeza,

 

Thanks for your kind advice.

 

But I do not understand the meaning of it. Could you please explain it for me by using the following example.

 

{

"company_number":"09145694",

"data":

{

"address":

{

"address_line_1":"St. Andrews Road",

"country":"England"

,"locality":"Henley-On-Thames",

"postal_code":"RG9 1HP",

"premises":"2"

},

"ceased_on":"2018-05-14",

"country_of_residence":"England",

"date_of_birth":

{

"month":2,

"year":1977

},

"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d",

"kind":"individual-person-with-significant-control",

"links":

{

"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"

},

"name":"Mrs Nga Thanh Wildman",

"name_elements":

{

"forename":"Nga",

"middle_name":"Thanh",

"surname":"Wildman",

"title":"Mrs"

},

"nationality":"Vietnamese",

"natures_of_control":

["ownership-of-shares-50-to-75-percent"],

"notified_on":"2016-04-06"

}

}

 

this is a group of record in my dataset. could you please give me some suggestion about this?

 


Background on JSON: https://www.w3schools.com/js/js_json_intro.asp

 

ChrisNZ
Tourmaline | Level 20

@Reeza 

Interesting. I did't know the libname engine had size limitations.

Do you know if PROC DS2's JSON package has similar constraints?

 

Reeza
Super User

After further checking that doesn't seem to be the issue. It seems to be a non standard JSON file. R only read the first two rows for some reason and that's the same place SAS is complaining about. I'm guessing a different EOL or encoding, since it's European data?

 


@ChrisNZ wrote:

@Reeza 

Interesting. I did't know the libname engine had size limitations.

Do you know if PROC DS2's JSON package has similar constraints?

 


 

ChrisNZ
Tourmaline | Level 20

After checking, each record is an independent JSON structure that looks like the below, with a LF to delimit the record.

So manual parsing seems like the way to go.

{"company_number":"03100413"
,"data":{"etag":"e0c7cf72c430d379d849c89225a4d6bad8face98"
,"kind":"persons-with-significant-control-statement"
,"links":{"self":"/company/03100413/persons-with-significant-control-statements/E9L3Ktzi07T3zgnWCSSuZ5-OjJE"}
,"notified_on":"2016-08-16"
,"statement":"no-individual-or-entity-with-signficant-control"}
} {"company_number":"SC287552"
,"data":{"ceased_on":"2017-10-13"
,"etag":"5aba35df221cc2bfc1e5fa3f571390522098d755"
,"kind":"persons-with-significant-control-statement"
,"links":{"self":"/company/SC287552/persons-with-significant-control-statements/VxNqbL-lSWIGfhgeaHSy2qYV1hw"}
,"notified_on":"2016-07-15"
,"statement":"psc-details-not-confirmed"}
}

 

Reeza
Super User

Agreed. Seems like it's designed to be used via an API and someone is just appending results in every day for each company. Interesting method. 

 

@France this should get you started. It'll get you each line into a separate observation and you can work on separating it out using SCAN(). 

 

Note the OBS option at the moment which is limiting the number of records. You'll want to remove that, but for testing, it should get you started. 

 

filename psc '/folders/myfolders/psc-snapshot-2019-01-24_1of13.json' termstr=LF;

data psc1;
length myVar $30000. companyName $100.;
infile psc lrecl=30000 obs=1000;
input;

myVar = _infile_;

companyName = dequote(scan(myVar, 2, ":,"));


run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 5294 views
  • 3 likes
  • 3 in conversation