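I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM, and DEMANDQTY. I want to filter the DataFrame based on a specific condition.
The condition is: if the DESCRIPTION column contains the word "Kids" or "Baby", I want the filtered DataFrame to include every row for that INVOICE_ID. In other words, a whole transaction should be included when at least one item in it belongs to the kids or baby category.
I tried using the str.contains() method with a regular expression pattern, but I ran into problems getting the desired result.
Here is my code: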
import pandas as pd
# Assuming the DataFrame is named 'df'
# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]
# Print the filtered DataFrame
filtered_df
However, this code does not give the expected result. It filters the DataFrame row by row instead of considering the whole transaction.
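To make the goal concrete, here is a minimal sketch of the invoice-level filtering I have in mind. It uses a small hypothetical frame (`toy`, with made-up rows; all variable names here are my own, not from the real data). It flags the matching rows, then spreads that flag across every row of the same INVOICE_ID with `groupby(...).transform("any")`. I am not sure this is the most idiomatic approach.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "INVOICE_ID": [1, 1, 2, 3, 3],
    "DESCRIPTION": [
        "Kids Socks", "Men's Hat",    # invoice 1: contains a kids item
        "Women's Dress",              # invoice 2: no kids/baby item
        "Baby Bib", "Men's Pants",    # invoice 3: contains a baby item
    ],
})

# Row-level flag: True where the description mentions "kids" or "baby"
is_child_item = toy["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True)

# Invoice-level flag: True for every row whose invoice has at least one flagged row
invoice_has_child_item = is_child_item.groupby(toy["INVOICE_ID"]).transform("any")

# Keep whole transactions, including their non-kids rows
filtered_toy = toy[invoice_has_child_item]
print(filtered_toy)
```

On this toy frame, invoices 1 and 3 should be kept in full and invoice 2 dropped. An alternative that might work equally well is to collect the matching INVOICE_ID values first and then select rows with `isin`.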
Please find the test data below:
import pandas as pd
import random
import string
import numpy as np
random.seed(42)
np.random.seed(42)
num_transactions = 100
max_items_per_transaction = 6
# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]
# Create the DataFrame
rows = []
for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i
    for j in range(num_items):
        item = items[j]
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))  # Generate a random 6-character USIM value
        demand_qty = random.randint(1, 10)
        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)
df = pd.DataFrame(rows)
# Print the DataFrame
df
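Can anyone guide me on how to correctly filter the DataFrame based on the condition described? Any help or suggestions would be greatly appreciated. Thank you!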