概述
I have a set of records in Python with an id, at least one attribute, and a set of date ranges. I would like code that takes each id, and combines all the records where the attributes match and there is no gap in the date range.
By no gap in date range, I mean that the end date of one record is greater than or equal to the next record for that id.
For instance, a record with id "10", start date "2016-01-01" and end date "2017-01-01" could be merged with another record with that id, a start date of "2017-01-01", and an end date of "2018-01-01", but it could NOT be merged with a record that started on "2017-01-10", because there'd be a gap from 2017-01-01 to 2017-01-09.
Here are some examples --
Have:
FruitID,FruitType,StartDate,EndDate
1,Apple,2015-01-01,2016-01-01
1,Apple,2016-01-01,2017-01-01
1,Apple,2017-01-01,2018-01-01
2,Orange,2015-01-01,2016-01-01
2,Orange,2016-05-31,2017-01-01
2,Orange,2017-01-01,2018-01-01
3,Banana,2015-01-01,2016-01-01
3,Banana,2016-01-01,2017-01-01
3,Blueberry,2017-01-01,2018-01-01
4,Mango,2015-01-01,2016-01-01
4,Kiwi,2016-09-15,2017-01-01
4,Mango,2017-01-01,2018-01-01
Want:
FruitID,FruitType,NewStartDate,NewEndDate
1,Apple,2015-01-01,2018-01-01
2,Orange,2015-01-01,2016-01-01
2,Orange,2016-05-31,2018-01-01
3,Banana,2015-01-01,2017-01-01
3,Blueberry,2017-01-01,2018-01-01
4,Mango,2015-01-01,2016-01-01
4,Kiwi,2016-09-15,2017-01-01
4,Mango,2017-01-01,2018-01-01
My current solution is below. It provides the results I'm looking for, but performance doesn't seem great for large datasets. Also, my impression is that you generally want to avoid iterating over individual rows of a dataframe when possible. Thank you very much for any assistance you can provide!
import pandas as pd
from dateutil.parser import parse
have = pd.DataFrame.from_items([('FruitID', [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]),
('FruitType', ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana', 'Blueberry', 'Mango', 'Kiwi', 'Mango']),
('StartDate', [parse(x) for x in ['2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-05-31',
'2017-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-09-15', '2017-01-01']]),
('EndDate', [parse(x) for x in ['2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01',
'2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01']])
])
have.sort_values(['FruitID', 'StartDate'])
rowlist = []
fruit_cur_row = None
for row in have.itertuples():
if fruit_cur_row is None:
fruit_cur_row = row._asdict()
fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)
elif not(fruit_cur_row.get('FruitType') == row.FruitType):
rowlist.append(fruit_cur_row)
fruit_cur_row = row._asdict()
fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)
elif (row.StartDate <= fruit_cur_row.get('NewEndDate')):
fruit_cur_row['NewEndDate'] = max(fruit_cur_row['NewEndDate'], row.EndDate)
else:
rowlist.append(fruit_cur_row)
fruit_cur_row = row._asdict()
fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)
rowlist.append(fruit_cur_row)
have_mrg = pd.DataFrame.from_dict(rowlist)
print(have_mrg[['FruitID', 'FruitType', 'NewStartDate', 'NewEndDate']])
解决方案
Using a nested groupby approach:
def merge_dates(grp):
# Find contiguous date groups, and get the first/last start/end date for each group.
dt_groups = (grp['StartDate'] != grp['EndDate'].shift()).cumsum()
return grp.groupby(dt_groups).agg({'StartDate': 'first', 'EndDate': 'last'})
# Perform a groupby and apply the merge_dates function, followed by formatting.
df = df.groupby(['FruitID', 'FruitType']).apply(merge_dates)
df = df.reset_index().drop('level_2', axis=1)
Note that this method assumes your dates are already sorted. If not, you'll need to use sort_values on your DataFrame first. This method may not work if you have nested date spans.
The resulting output:
FruitID FruitType StartDate EndDate
0 1 Apple 2015-01-01 2018-01-01
1 2 Orange 2015-01-01 2016-01-01
2 2 Orange 2016-05-31 2018-01-01
3 3 Banana 2015-01-01 2017-01-01
4 3 Blueberry 2017-01-01 2018-01-01
5 4 Kiwi 2016-09-15 2017-01-01
6 4 Mango 2015-01-01 2016-01-01
7 4 Mango 2017-01-01 2018-01-01
最后
以上就是现代毛豆为你收集整理的python dataframe合并行后保留最新日期_在Pandas Dataframe中合并日期范围的全部内容,希望文章能够帮你解决python dataframe合并行后保留最新日期_在Pandas Dataframe中合并日期范围所遇到的程序开发问题。
如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。
发表评论 取消回复