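I'm pulling some JSON-formatted log data out of my SIEM and loading it into a pandas DataFrame. I can easily convert the JSON into multiple DataFrame columns, but the JSON contains a "message" field that holds a quoted CSV string, as shown below.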
import json
import pandas as pd

# dummy data
dfMyData = pd.DataFrame({"_raw": [
"""{"timestamp":1691096387000,"message":"20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,10673010,type,,'I am a, quoted, string, with commas,',0,,","logstream":"Blah1","loggroup":"group 1"}""",
"""{"timestamp":1691096386000,"message":"20230803 21:00:47,ip-456-456-456-456,mouse,654.654.654.654,222222,10673010,type,,'I am another quoted string',0,,","logstream":"Blah2","loggroup":"group 2"}"""
]})
# Column names for the _raw.message field that is generated.
MessageColumnNames = ["Timestamp","dest_host","username","src_ip","port","number","type","who_knows","message_string","another_number","who_knows2","who_knows3"]
# Convert column to json object/dict
dfMyData['_raw'] = dfMyData['_raw'].map(json.loads)
# convert JSON into columns within the dataframe
dfMyData = pd.json_normalize(dfMyData.to_dict(orient='records'))
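I have seen str.split() used to split a column and then join the result back onto the original DataFrame, but str.split does not handle quoted values inside the CSV. pd.read_csv handles quoted CSV correctly, but I don't know how to apply it across the whole DataFrame and expand its output into new DataFrame columns.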
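To make the problem concrete, here is a small sketch comparing a naive comma split with a quote-aware parse on one of the sample messages above. It assumes the quoted value is always wrapped in single quotes, as in the dummy data; the names `sample`, `naive`, and `parsed` are only illustrative.

```python
import csv

import pandas as pd

# One sample message copied from the dummy data (single-quoted field with commas).
sample = (
    "20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,"
    "10673010,type,,'I am a, quoted, string, with commas,',0,,"
)

# Naive split: every comma is treated as a delimiter, including those in the quotes.
naive = pd.Series([sample]).str.split(",", expand=True)

# Quote-aware parse: quotechar="'" keeps the quoted value together as one field.
parsed = next(csv.reader([sample], quotechar="'"))

naive_count = naive.shape[1]   # more columns than there are column names
parsed_count = len(parsed)     # matches the 12 names in MessageColumnNames
```

The naive split yields more pieces than there are column names, which is why the quoted field needs a real CSV parser.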
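Also, when I split dfMyData['_raw.message'] into new columns, I would like to supply a list of column names for the data and have the new columns created with those names.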
Does anyone know a simple way to split a quoted CSV string in a DataFrame into new, named columns in that DataFrame?
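One possible approach, offered as a sketch rather than a definitive answer: join all the message strings into a single in-memory CSV document, let pd.read_csv parse it with `quotechar="'"`, and join the result back on the index. This assumes no message contains an embedded newline and no message is empty (read_csv would skip blank lines and the rows would no longer line up). The choice of `dtype=str` and `keep_default_na=False` is also an assumption, made so that values stay as plain strings.

```python
import io
import json

import pandas as pd

# Same dummy data and column names as in the question.
dfMyData = pd.DataFrame({"_raw": [
    """{"timestamp":1691096387000,"message":"20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,10673010,type,,'I am a, quoted, string, with commas,',0,,","logstream":"Blah1","loggroup":"group 1"}""",
    """{"timestamp":1691096386000,"message":"20230803 21:00:47,ip-456-456-456-456,mouse,654.654.654.654,222222,10673010,type,,'I am another quoted string',0,,","logstream":"Blah2","loggroup":"group 2"}""",
]})
MessageColumnNames = ["Timestamp", "dest_host", "username", "src_ip", "port", "number",
                      "type", "who_knows", "message_string", "another_number",
                      "who_knows2", "who_knows3"]

dfMyData["_raw"] = dfMyData["_raw"].map(json.loads)
dfMyData = pd.json_normalize(dfMyData.to_dict(orient="records"))

# Treat the whole column as one CSV document: one message per line.
csv_text = "\n".join(dfMyData["_raw.message"])

dfMessage = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                # the messages carry no header row
    names=MessageColumnNames,   # use the supplied column names
    quotechar="'",              # the quoted field uses single quotes
    dtype=str,                  # keep every value as a string
    keep_default_na=False,      # leave empty fields as "" instead of NaN
)

# Align on the original index, then replace the raw message with the parsed columns.
dfMessage.index = dfMyData.index
dfResult = dfMyData.drop(columns="_raw.message").join(dfMessage)
```

If the messages can contain newlines, parsing each row separately with the standard `csv` module and building a DataFrame from the resulting lists would be a safer variation.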