pyspark.pandas.merge#
- pyspark.pandas.merge(obj, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))[source]#
- Merge DataFrame objects with a database-style join. - The index of the resulting DataFrame will be one of the following:
- 0…n if no index is used for merging 
- Index of the left DataFrame if merged only on the index of the right DataFrame 
- Index of the right DataFrame if merged only on the index of the left DataFrame 
- All involved indices if merged using the indices of both DataFrames
- e.g. if left with indices (a, x) and right with indices (b, x), the result will be an index (x, a, b) 
 
 
 - Parameters
- right: Object to merge with.
- how: Type of merge to be performed.
- {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’ - left: use only keys from left frame, like a SQL left outer join; preserve key
- order. 
- right: use only keys from right frame, like a SQL right outer join; preserve key
- order. 
- outer: use union of keys from both frames, like a SQL full outer join; sort keys
- lexicographically. 
- inner: use intersection of keys from both frames, like a SQL inner join;
- preserve the order of the left keys. 
 
- on: Column or index level names to join on. These must be found in both DataFrames. If on
- is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames. 
- left_on: Column or index level names to join on in the left DataFrame. Can also
- be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns. 
- right_on: Column or index level names to join on in the right DataFrame. Can also
- be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns. 
- left_index: Use the index from the left DataFrame as the join key(s). If it is a
- MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels. 
- right_index: Use the index from the right DataFrame as the join key. Same caveats as
- left_index. 
- suffixes: Suffix to apply to overlapping column names in the left and right side,
- respectively. 
 
- Returns
- DataFrame
- A DataFrame of the two merged objects. 
 
 - Notes - As described in #263, joining string columns currently returns None for missing values
- instead of NaN. 
 - Examples - >>> df1 = ps.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [1, 2, 3, 5]}, ... columns=['lkey', 'value']) >>> df2 = ps.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [5, 6, 7, 8]}, ... columns=['rkey', 'value']) >>> df1 lkey value 0 foo 1 1 bar 2 2 baz 3 3 foo 5 >>> df2 rkey value 0 foo 5 1 bar 6 2 baz 7 3 foo 8 - Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended. - >>> merged = ps.merge(df1, df2, left_on='lkey', right_on='rkey') >>> merged.sort_values(by=['lkey', 'value_x', 'rkey', 'value_y']) lkey value_x rkey value_y ...bar 2 bar 6 ...baz 3 baz 7 ...foo 1 foo 5 ...foo 1 foo 8 ...foo 5 foo 5 ...foo 5 foo 8 - >>> left_psdf = ps.DataFrame({'A': [1, 2]}) >>> right_psdf = ps.DataFrame({'B': ['x', 'y']}, index=[1, 2]) - >>> ps.merge(left_psdf, right_psdf, left_index=True, right_index=True).sort_index() A B 1 2 x - >>> ps.merge(left_psdf, right_psdf, left_index=True, right_index=True, how='left').sort_index() A B 0 1 None 1 2 x - >>> ps.merge(left_psdf, right_psdf, left_index=True, right_index=True, how='right').sort_index() A B 1 2.0 x 2 NaN y - >>> ps.merge(left_psdf, right_psdf, left_index=True, right_index=True, how='outer').sort_index() A B 0 1.0 None 1 2.0 x 2 NaN y