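Assuming the CIDR blocks in the network DataFrame's IpCidr column do not overlap, you can convert the IPv4 addresses to pl.Int64 and compute the maximum address value within each CIDR block.
Here is a function that converts an IPv4 address to pl.Int64 using only pl.Expr: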
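import polars as pl


def ip_addr4_int64_expr(ipv4_str_expr: pl.Expr) -> pl.Expr:
    # Split "a.b.c.d" into octets, weight each octet by 2 ** (8 * position
    # counted from the right), and sum the results into a single Int64.
    return (
        ipv4_str_expr.str.split(".")
        .list.eval(
            pl.element().cast(pl.Int64)
            * (2 ** (8 * (pl.element().cumcount(reverse=True)))).cast(pl.Int64)
        )
        .list.sum()
    )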
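As a sanity check on the arithmetic, the same octet weighting can be written in plain Python and compared against the standard library's ipaddress module. This is only a sketch: the helper name ipv4_to_int is made up for illustration and is not part of the Polars code above.

```python
import ipaddress


def ipv4_to_int(addr: str) -> int:
    """Hypothetical helper: weight each octet by 2 ** (8 * position from the right)."""
    octets = [int(part) for part in addr.split(".")]
    # The leftmost octet is the most significant: shifts of 24, 16, 8 and 0 bits.
    return sum(octet << (8 * shift) for octet, shift in zip(octets, (3, 2, 1, 0)))


for addr in ("1.0.0.14", "99.96.1.5", "192.168.0.1"):
    # int(IPv4Address) is the stdlib's own 32-bit integer form of the address.
    assert ipv4_to_int(addr) == int(ipaddress.IPv4Address(addr))

print(ipv4_to_int("1.0.0.14"))  # 16777230
```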
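The range of addresses covered by a CIDR block can be derived from its prefix: take the number of addresses in the block and add it to the Int64 representation of the base IPv4 address.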
# "a.b.c.d/n" -> base address "a.b.c.d" and prefix length n
cidr_split_ipv4_expr = pl.col("IpCidr").str.split("/").list.get(0)
cidr_prefix_expr = pl.col("IpCidr").str.split("/").list.get(1).cast(pl.Int64)

ip_cidr_df = ip_cidr_df.with_columns(
    # First address of the block
    ip_addr4_int64_expr(cidr_split_ipv4_expr).alias("ip_addr4_int64"),
    # Last address of the block: base + 2 ** (32 - prefix) - 1
    (
        ip_addr4_int64_expr(cidr_split_ipv4_expr)
        - 1
        + ((2 ** (32 - cidr_prefix_expr)).cast(pl.Int64))
    ).alias("cidr_ip_max"),
)

client_df = client_df.with_columns(
    ip_addr4_int64_expr(pl.col("ClientIP")).alias("ip_addr4_int64"),
)
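To make the block arithmetic concrete, here is a minimal plain-Python sketch of the same calculation, cross-checked against ipaddress.ip_network. The helper name cidr_bounds is an assumption introduced for this example only.

```python
import ipaddress


def cidr_bounds(cidr: str) -> tuple[int, int]:
    """Hypothetical helper: (first, last) address of an IPv4 CIDR block as ints."""
    base, prefix = cidr.split("/")
    first = int(ipaddress.IPv4Address(base))
    # A /prefix block holds 2 ** (32 - prefix) addresses; the last is first + size - 1.
    last = first + 2 ** (32 - int(prefix)) - 1
    return first, last


first, last = cidr_bounds("1.0.0.0/24")
print(first, last)  # 16777216 16777471

# Cross-check against the stdlib's view of the same network.
net = ipaddress.ip_network("1.0.0.0/24")
assert (first, last) == (int(net.network_address), int(net.broadcast_address))
```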
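join_asof can then perform the range lookup: each client IP is matched to the nearest CIDR block that starts at or below it. Any match where the client IP is above that block's maximum address is then set to null.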
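client_df = (
    # Both sides of an as-of join must be sorted on the join key.
    client_df.sort("ip_addr4_int64")
    .join_asof(ip_cidr_df.sort("ip_addr4_int64"), on="ip_addr4_int64")
    .select(
        "ClientIP",
        "Timestamp",
        # Keep Info only when the IP falls inside the matched block;
        # with no otherwise(), unmatched rows become null.
        pl.when(pl.col("ip_addr4_int64") <= pl.col("cidr_ip_max"))
        .then(pl.col("Info"))
        .alias("Info"),
    )
)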
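The lookup logic can be illustrated without Polars. The sketch below mimics a backward as-of match with bisect and then applies the same upper-bound check; the function name lookup and the tuple layout are assumptions for this example, and it relies on the blocks being sorted and non-overlapping.

```python
import bisect


def lookup(ip: int, blocks: list[tuple[int, int, str]]):
    """Hypothetical helper. blocks: (start, end, info), sorted by start, non-overlapping."""
    starts = [start for start, _, _ in blocks]
    # Backward as-of match: the last block whose start is <= ip.
    idx = bisect.bisect_right(starts, ip) - 1
    if idx < 0:
        return None  # ip is below every block
    start, end, info = blocks[idx]
    # The nearest block may end before ip, so the upper bound is checked too.
    return info if ip <= end else None


blocks = [
    (16777216, 16777471, "CLOUDFLARENET"),           # 1.0.0.0/24
    (167772160, 184549375, "The 10.0.0.0/8 Range"),  # 10.0.0.0/8
]
print(lookup(16777230, blocks))  # 1.0.0.14 -> CLOUDFLARENET
print(lookup(16777472, blocks))  # 1.0.1.0  -> None (past the end of 1.0.0.0/24)
```

The second call shows why the extra comparison is needed: the as-of step alone would still return the 1.0.0.0/24 row for an address just past its end.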
For example:
ip_cidr_df = pl.DataFrame(
    {
        "IpCidr": [
            "99.96.0.0/13", "99.88.0.0/13", "1.0.136.0/22", "1.0.128.0/21",
            "1.0.0.0/24", "10.0.0.0/8", "127.0.0.0/8", "172.16.0.0/12",
            "192.168.0.0/16",
        ],
        "Info": [
            "ATT-INTERNET4", "ATT-INTERNET4", "TOT-NET TOT Public Company Limit",
            "TOT-NET TOT Public Company Limit", "CLOUDFLARENET", "The 10.0.0.0/8 Range",
            "The 127.0.0.0/8 Range", "The 172.16.0.0/12 Range", "The 192.168.0.0/16 Range",
        ],
    }
)

client_df = pl.DataFrame(
    {
        "Timestamp": [
            "2023-06-01 00:00:00", "2023-06-01 00:00:00", "2023-06-01 00:00:00",
            "2023-06-01 00:00:00", "2023-06-30 23:59:00", "2023-06-30 23:59:00",
            "2023-06-30 23:59:00",
        ],
        "ClientIP": [
            "1.0.0.14", "99.96.1.5", "99.87.29.96", "10.0.0.1", "127.0.0.1", "172.16.0.1", "192.168.0.1",
        ],
    }
)
Output:
shape: (7, 3)
┌─────────────┬─────────────────────┬──────────────────────────┐
│ ClientIP    ┆ Timestamp           ┆ Info                     │
│ ---         ┆ ---                 ┆ ---                      │
│ str         ┆ str                 ┆ str                      │
╞═════════════╪═════════════════════╪══════════════════════════╡
│ 1.0.0.14    ┆ 2023-06-01 00:00:00 ┆ CLOUDFLARENET            │
│ 10.0.0.1    ┆ 2023-06-01 00:00:00 ┆ The 10.0.0.0/8 Range     │
│ 99.87.29.96 ┆ 2023-06-01 00:00:00 ┆ null                     │
│ 99.96.1.5   ┆ 2023-06-01 00:00:00 ┆ ATT-INTERNET4            │
│ 127.0.0.1   ┆ 2023-06-30 23:59:00 ┆ The 127.0.0.0/8 Range    │
│ 172.16.0.1  ┆ 2023-06-30 23:59:00 ┆ The 172.16.0.0/12 Range  │
│ 192.168.0.1 ┆ 2023-06-30 23:59:00 ┆ The 192.168.0.0/16 Range │
└─────────────┴─────────────────────┴──────────────────────────┘
Note: this answer assumes the DataFrames contain only IPv4 addresses and that ip_cidr_df has no overlapping CIDR blocks.
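One way to check the no-overlap assumption before relying on it is a pairwise test with the standard library's ipaddress module. This is a minimal sketch; overlapping_pairs is a hypothetical helper, and the pairwise comparison is quadratic, so it is only meant for modestly sized CIDR tables.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs: list[str]) -> list[tuple[str, str]]:
    """Hypothetical helper: every pair of CIDR blocks that share at least one address."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for a, b in combinations(nets, 2) if a.overlaps(b)]


print(overlapping_pairs(["10.0.0.0/8", "10.1.0.0/16", "192.168.0.0/16"]))
# [('10.0.0.0/8', '10.1.0.0/16')]
```

An empty list means the blocks are disjoint and the as-of lookup described above can be applied as-is.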
The same logic can be applied to IPv6 addresses by converting them into a pl.Struct made up of pl.Int64 fields.
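As a rough illustration of the IPv6 idea, a 128-bit address can be split into a high and a low 64-bit half, which could then serve as the two struct fields. The sketch below uses plain Python only; ipv6_to_halves is a hypothetical helper, and how the halves are stored and compared in Polars is left open.

```python
import ipaddress

MASK64 = (1 << 64) - 1


def ipv6_to_halves(addr: str) -> tuple[int, int]:
    """Hypothetical helper: split a 128-bit IPv6 address into (high, low) 64-bit halves."""
    value = int(ipaddress.IPv6Address(addr))
    return value >> 64, value & MASK64


high, low = ipv6_to_halves("2001:db8::1")
print(hex(high), hex(low))  # 0x20010db800000000 0x1

# The two halves recombine into the original 128-bit value.
assert (high << 64) | low == int(ipaddress.IPv6Address("2001:db8::1"))
```

Note that each half is an unsigned 64-bit value; a half with its top bit set exceeds the signed 64-bit maximum, so storing it would presumably need an unsigned type such as pl.UInt64 or some offset.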