dear all,
could you please give me some suggestion to import the below text file into sas. the sample of it has been added in the attachment.
three of the information is below,
{"company_number":"09145694","data":{"address":{"address_line_1":"St. Andrews Road","country":"England","locality":"Henley-On-Thames","postal_code":"RG9 1HP","premises":"2"},"ceased_on":"2018-05-14","country_of_residence":"England","date_of_birth":{"month":2,"year":1977},"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d","kind":"individual-person-with-significant-control","links":{"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"},"name":"Mrs Nga Thanh Wildman","name_elements":{"forename":"Nga","middle_name":"Thanh","surname":"Wildman","title":"Mrs"},"nationality":"Vietnamese","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}
{"company_number":"08581893","data":{"address":{"address_line_1":"High Street","address_line_2":"Wendover","country":"England","locality":"Aylesbury","postal_code":"HP22 6EA","premises":"14a","region":"Buckinghamshire"},"ceased_on":"2016-07-01","country_of_residence":"England","date_of_birth":{"month":9,"year":1947},"etag":"45f9c9e5494b574eb52abc3990a49bd96fe09df3","kind":"individual-person-with-significant-control","links":{"self":"/company/08581893/persons-with-significant-control/individual/RgR9Zhc7yGhV0SBys8_WJ6H9O1o"},"name":"Mr Stephen Robert Charles Davies","name_elements":{"forename":"Stephen","middle_name":"Robert Charles","surname":"Davies","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","ownership-of-shares-25-to-50-percent-as-firm"],"notified_on":"2016-06-30"}}
{"company_number":"08581893","data":{"address":{"address_line_1":"High Street","address_line_2":"Wendover","country":"England","locality":"Aylesbury","postal_code":"HP22 6EA","premises":"14a","region":"Buckinghamshire"},"ceased_on":"2016-07-01","country_of_residence":"England","date_of_birth":{"month":6,"year":1965},"etag":"d55168c49f85ab1ef38a12ed76238d68f79f5a01","kind":"individual-person-with-significant-control","links":{"self":"/company/08581893/persons-with-significant-control/individual/-6HQmkhiomEBXJI2rgHccU67fpM"},"name":"Mr Quentin Colin Maxwell Solt","name_elements":{"forename":"Quentin","middle_name":"Colin Maxwell","surname":"Solt","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2016-06-30"}}
thanks in advance
Agreed. Seems like it's designed to be used via an API and someone is just appending results in every day for each company. Interesting method.
@France this should get you started. It'll get you each line into a separate observation and you can work on separating it out using SCAN().
Note the OBS option at the moment which is limiting the number of records. You'll want to remove that, but for testing, it should get you started.
filename psc '/folders/myfolders/psc-snapshot-2019-01-24_1of13.json' termstr=LF; data psc1; length myVar $30000. companyName $100.; infile psc lrecl=30000 obs=1000; input; myVar = _infile_; companyName = dequote(scan(myVar, 2, ":,")); run;
When I download the zip file, that's been separated I get a clean well formatted CSV file, though it maybe in a different encoding.
data demo;
infile "..\BasicCompanyData-2019-01-01-part1_6\BasicCompanyData-2019-01-01-part1_6.csv" lrecl=32000 obs=100;
file 'c:\_localdata\demo.csv';
input ;
put _infile_;
run;
This creates another text file with the first 100 rows for testing, because the file is too big to open manually.
The first 100 records are attached.
Is this from a specific file in the link?
@France wrote:
Dear Reeza,
thanks for your message,
sorry, I do not understand the meaning of XML and JSON. besides, the dataset is download by following the website (http://download.companieshouse.gov.uk/en_output.html)
dear Reeza,
Thanks for your kind advice.
But I do not understand the meaning of it. Could you please explain it for me by using the following example.
{
"company_number":"09145694",
"data":
{
"address":
{
"address_line_1":"St. Andrews Road",
"country":"England"
,"locality":"Henley-On-Thames",
"postal_code":"RG9 1HP",
"premises":"2"
},
"ceased_on":"2018-05-14",
"country_of_residence":"England",
"date_of_birth":
{
"month":2,
"year":1977
},
"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d",
"kind":"individual-person-with-significant-control",
"links":
{
"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"
},
"name":"Mrs Nga Thanh Wildman",
"name_elements":
{
"forename":"Nga",
"middle_name":"Thanh",
"surname":"Wildman",
"title":"Mrs"
},
"nationality":"Vietnamese",
"natures_of_control":
["ownership-of-shares-50-to-75-percent"],
"notified_on":"2016-04-06"
}
}
this is a group of record in my dataset. could you please give me some suggestion about this?
@Reeza pointed you to the answer.
Time for you to:
1. Read about JSON files.
2. Read the link provided.
One line in that page is
libname space JSON fileref=resp;
Start from there.
JSON is a pain and the files there are too big for the JSON libname in my example. Because of that you have to manually parse the file. It's not easy, I agree, and it is a significant amount of work to test and get it right. When I did this for a different open data set, I think it ended up taking two days to get it fully working, testing and automated. Unfortunately I don't have that amount of time to spend on this. I'm not saying it's going to take that long - I was trying the JSON libname here, in R and python - and was trying to use a more 'modern' method. In the end, I gave up and used SCAN() , FIND() and INDEX() to get the data I needed.
You do have a learning curve here for sure. Sorry I can't help further. Good Luck.
@France wrote:
dear Reeza,
Thanks for your kind advice.
But I do not understand the meaning of it. Could you please explain it for me by using the following example.
{
"company_number":"09145694",
"data":
{
"address":
{
"address_line_1":"St. Andrews Road",
"country":"England"
,"locality":"Henley-On-Thames",
"postal_code":"RG9 1HP",
"premises":"2"
},
"ceased_on":"2018-05-14",
"country_of_residence":"England",
"date_of_birth":
{
"month":2,
"year":1977
},
"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d",
"kind":"individual-person-with-significant-control",
"links":
{
"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"
},
"name":"Mrs Nga Thanh Wildman",
"name_elements":
{
"forename":"Nga",
"middle_name":"Thanh",
"surname":"Wildman",
"title":"Mrs"
},
"nationality":"Vietnamese",
"natures_of_control":
["ownership-of-shares-50-to-75-percent"],
"notified_on":"2016-04-06"
}
}
this is a group of record in my dataset. could you please give me some suggestion about this?
Background on JSON: https://www.w3schools.com/js/js_json_intro.asp
Interesting. I did't know the libname engine had size limitations.
Do you know if PROC DS2's JSON package has similar constraints?
After further checking that doesn't seem to be the issue. It seems to be a non standard JSON file. R only read the first two rows for some reason and that's the same place SAS is complaining about. I'm guessing a different EOL or encoding, since it's European data?
@ChrisNZ wrote:
Interesting. I did't know the libname engine had size limitations.
Do you know if PROC DS2's JSON package has similar constraints?
After checking, each record is an independent JSON structure that looks like the below, with a LF to delimit the record.
So manual parsing seems like the way to go.
{"company_number":"03100413"
,"data":{"etag":"e0c7cf72c430d379d849c89225a4d6bad8face98"
,"kind":"persons-with-significant-control-statement"
,"links":{"self":"/company/03100413/persons-with-significant-control-statements/E9L3Ktzi07T3zgnWCSSuZ5-OjJE"}
,"notified_on":"2016-08-16"
,"statement":"no-individual-or-entity-with-signficant-control"}
} {"company_number":"SC287552"
,"data":{"ceased_on":"2017-10-13"
,"etag":"5aba35df221cc2bfc1e5fa3f571390522098d755"
,"kind":"persons-with-significant-control-statement"
,"links":{"self":"/company/SC287552/persons-with-significant-control-statements/VxNqbL-lSWIGfhgeaHSy2qYV1hw"}
,"notified_on":"2016-07-15"
,"statement":"psc-details-not-confirmed"}
}
Agreed. Seems like it's designed to be used via an API and someone is just appending results in every day for each company. Interesting method.
@France this should get you started. It'll get you each line into a separate observation and you can work on separating it out using SCAN().
Note the OBS option at the moment which is limiting the number of records. You'll want to remove that, but for testing, it should get you started.
filename psc '/folders/myfolders/psc-snapshot-2019-01-24_1of13.json' termstr=LF; data psc1; length myVar $30000. companyName $100.; infile psc lrecl=30000 obs=1000; input; myVar = _infile_; companyName = dequote(scan(myVar, 2, ":,")); run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.