python dataframe索引字段相同的合并_Pandas DataFrame-将具有相同索引的一列值组合到列表中...

110 阅读 0 评论 73 点赞

我是靠谱客的博主忧虑外套，最近开发中收集的这篇文章主要介绍python dataframe索引字段相同的合并_Pandas DataFrame-将具有相同索引的一列值组合到列表中...，觉得挺不错的，现在分享给大家，希望可以做个参考。

概述

I've been at this issue for awhile to no avail. This is almost a duplicate of at least one other question on here, but I can't quite figure out how to do exactly what I'm looking for from related answers online.

I have a Pandas DataFrame (we'll call it df) that looks something like:

Name Value Value2

'A' '8.8.8.8' 'x'

'B' '6.6.6.6' 'y'

'A' '6.6.6.6' 'x'

'A' '8.8.8.8' 'x'

Where Name is the index. I want to convert this to something like that looks like:

Name Value Value2

'A' ['8.8.8.8', '6.6.6.6'] 'x'

'B' ['6.6.6.6'] 'y'

So, basically, every Value that corresponds to the same index should be combined into a list (or a set, or a tuple) and that list made to be the Value for the corresponding index. And, as shown, Value2 is the same between like-indexed rows, so it should just stay the same in the end.

All I've done (successfully) is figure out how to make each element in the Value column into a list with:

df['Value'] = pd.Series([[val] for val in df['Value']])

In the question I linked at the start of this post, the recommended way to combine columns with duplicate indices offers a solution using df.groupby(df.index).sum(). I know I need something besides df.index as an argument to groupby since the Value column is treated as special, and I'm not sure what to put in place of sum() since that's not quite what I'm looking for.

Hopefully it's clear what I'm looking for, let me know if there's anything I can elaborate on. I've also tried simply looping through the DataFrame myself, finding rows with the same index, combining the Values into a list and updating df accordingly. After trying to get this method to work for a bit I thought I'd look for a more Pandas-esque way of handling this problem.

Edit: As a follow up to dermen's answer, that solution kind of worked. The Values did seem to concatenate correctly into a list. One thing I realized was that the unique function returns a Series, as opposed to a DataFrame. Also, I do have more columns in the actual setup than just Name, Value, and Value2. But I think I was able to get around both of the issues successfully with the following:

gb = df.groupby(tuple(df.columns.difference(['Value'])))

result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)

Where the first line gives an argument to groupby of the list of columns minus the Value column, and the second line converts the Series returned by unique into a DataFrame with the same columns as df.

But I think with all of that in place (unless anyone sees an issue with this), almost everything works as intended. There does seem to be something that's a bit off here, though. When I try to output this to a file with to_csv, there are duplicate headers across the top (but only certain headers are duplicated, and there's no real pattern as to which as far as I can tell). Also, the Value lists are truncated, which is probably a simpler issue to fix. The csv output currenlty looks like:

Name Value Value2 Name Value2

'A' ['8.8.8.8' '7.7.7.7' 'x'

'B' ['6.6.6.6'] 'y'

The above looks weird, but that is exactly how it looks in the output. Note that, contrary to the example presented at the start of this post, there are assumed to be more than 2 Values for A (so that I can illustrate this point). When I do this with the actual data, the Value lists get cut off after the first 4 elements.

解决方案

I think you are looking to use pandas.Series.unique. First, make the 'Name' index a column

# Value2 Value

#Name

#A x 8.8

#B y 6.6

#A x 6.6

#A x 8.8

df.reset_index(inplace=True)

# Name Value2 Value

#0 A x 8.8

#1 B y 6.6

#2 A x 6.6

#3 A x 8.8

Next call groupby and call the unique function on the 'Value' series

gb = df.groupby(('Name','Value2'))

result = gb['Value'].unique()

result.reset_index(inplace=True) #lastly, reset the index

# Name Value2 Value

#0 A x [8.8, 6.6]

#1 B y [6.6]

Finally, if you want 'Name' as the index again, just do

result.set_index( 'Name', inplace=True)

# Value2 Value

#Name

#A x [8.8, 6.6]

#B y [6.6]

UPDATE

As a follow up, make sure you re-assign result after resetting the index

result = gb['Value'].unique()

type(result)

#pandas.core.series.Series

result = result.reset_index()

type(result)

#pandas.core.frame.DataFrame

saving as CSV (rather TSV)

You don't want to use CSV here because there are commas in the Value column entries. Rather, save as TSV, you still use the same method to_csv, just change the sep arg:

result.to_csv( 'result.txt', sep='t')

If I load result.txt in EXCEL as a TSV, I get