I want to read a JSON file. Right now I am using the following logic, which is not dynamic:
df = spark.read.option("multiline", True).json(loc)
df = df.select("data.*", "event.*", "resource_id", "resource_kind", "resource_uri")
I would have to write column.* many times, because the file is deeply nested and contains multiple StructType fields.
Its schema is shown below:
root
|-- data: struct (nullable = true)
| |-- accounts: struct (nullable = true)
| | |-- accounting_reference_date: struct (nullable = true)
| | | |-- day: string (nullable = true)
| | | |-- month: string (nullable = true)
| | |-- last_accounts: struct (nullable = true)
| | | |-- made_up_to: string (nullable = true)
| | | |-- period_end_on: string (nullable = true)
| | | |-- period_start_on: string (nullable = true)
| | | |-- type: string (nullable = true)
| | |-- next_accounts: struct (nullable = true)
| | | |-- due_on: string (nullable = true)
| | | |-- overdue: boolean (nullable = true)
| | | |-- period_end_on: string (nullable = true)
| | | |-- period_start_on: string (nullable = true)
| | |-- next_due: string (nullable = true)
| | |-- next_made_up_to: string (nullable = true)
| | |-- overdue: boolean (nullable = true)
| |-- can_file: boolean (nullable = true)
| |-- company_name: string (nullable = true)
| |-- company_number: string (nullable = true)
| |-- company_status: string (nullable = true)
| |-- confirmation_statement: struct (nullable = true)
| | |-- last_made_up_to: string (nullable = true)
| | |-- next_due: string (nullable = true)
| | |-- next_made_up_to: string (nullable = true)
| | |-- overdue: boolean (nullable = true)
| |-- date_of_creation: string (nullable = true)
| |-- etag: string (nullable = true)
| |-- has_charges: boolean (nullable = true)
| |-- is_community_interest_company: boolean (nullable = true)
| |-- jurisdiction: string (nullable = true)
| |-- last_full_members_list_date: string (nullable = true)
| |-- links: struct (nullable = true)
| | |-- charges: string (nullable = true)
| | |-- filing_history: string (nullable = true)
| | |-- officers: string (nullable = true)
| | |-- persons_with_significant_control: string (nullable = true)
| | |-- persons_with_significant_control_statements: string (nullable = true)
| | |-- registers: string (nullable = true)
| | |-- self: string (nullable = true)
| |-- previous_company_names: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- ceased_on: string (nullable = true)
| | | |-- effective_from: string (nullable = true)
| | | |-- name: string (nullable = true)
| |-- registered_office_address: struct (nullable = true)
| | |-- address_line_1: string (nullable = true)
| | |-- address_line_2: string (nullable = true)
| | |-- country: string (nullable = true)
| | |-- locality: string (nullable = true)
| | |-- po_box: string (nullable = true)
| | |-- postal_code: string (nullable = true)
| | |-- region: string (nullable = true)
| |-- registered_office_is_in_dispute: boolean (nullable = true)
| |-- sic_codes: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subtype: string (nullable = true)
| |-- type: string (nullable = true)
|-- event: struct (nullable = true)
| |-- published_at: string (nullable = true)
| |-- timepoint: long (nullable = true)
| |-- type: string (nullable = true)
|-- resource_id: string (nullable = true)
|-- resource_kind: string (nullable = true)
|-- resource_uri: string (nullable = true)
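Because a few fields share the same name, I need to capture each field name starting from the root.
For example, the field period_start_on exists in both last_accounts and next_accounts.
So I need the column names to look like this:
data.accounts.last_accounts.period_start_on
data.accounts.next_accounts.period_start_on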
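To make the target naming concrete, here is a minimal sketch of how such root-qualified names could be derived by walking the schema recursively. It assumes the schema is available as a plain dict in the shape returned by df.schema.jsonValue(); the helper name leaf_paths and the cut-down sample schema are hypothetical and only for illustration. Arrays are treated as leaves and are not exploded.

```python
def leaf_paths(schema, prefix=""):
    """Return the dotted path of every non-struct field in a schema dict.

    `schema` is assumed to have the shape produced by
    `df.schema.jsonValue()`: {"type": "struct", "fields": [...]}.
    """
    paths = []
    for field in schema["fields"]:
        path = prefix + field["name"]
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            # Nested struct: recurse and keep the parent path as a prefix.
            paths.extend(leaf_paths(ftype, path + "."))
        else:
            # Primitive or array: treat it as a leaf column.
            paths.append(path)
    return paths


# Hypothetical, cut-down version of the schema shown above.
sample = {
    "type": "struct",
    "fields": [
        {"name": "data", "type": {"type": "struct", "fields": [
            {"name": "accounts", "type": {"type": "struct", "fields": [
                {"name": "last_accounts", "type": {"type": "struct", "fields": [
                    {"name": "period_start_on", "type": "string"},
                ]}},
                {"name": "next_accounts", "type": {"type": "struct", "fields": [
                    {"name": "period_start_on", "type": "string"},
                ]}},
            ]}},
            {"name": "sic_codes",
             "type": {"type": "array", "elementType": "string", "containsNull": True}},
        ]}},
        {"name": "resource_id", "type": "string"},
    ]
}

paths = leaf_paths(sample)
for p in paths:
    print(p)

# With a real DataFrame, the same paths could drive a single select.
# Underscores are used in the aliases here because dots inside column
# names need backtick quoting in Spark (a design assumption, not a rule):
#
#   from pyspark.sql import functions as F
#   flat = df.select([F.col(p).alias(p.replace(".", "_"))
#                     for p in leaf_paths(df.schema.jsonValue())])
```

The two period_start_on fields come out under distinct paths, which is the disambiguation I am looking for. This sketch assumes no field name itself contains a dot.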
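I think the approach I am taking, with a hard-coded select, will take far too long. Can you suggest an efficient way to read this JSON? Also, how can we tell apart two fields that have the same name?
Thanks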