What is the difference between rowsBetween and rangeBetween?

From the PySpark docs, rangeBetween:

rangeBetween(start, end)

Defines the frame boundaries, from start (inclusive) to end (inclusive).

Both start and end are relative from the current row. For example, "0" means "current row", while "-1" means one off before the current row, and "5" means the five off after the current row.

Parameters:

  • start - boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
  • end - boundary end, inclusive. The frame is unbounded if this is sys.maxsize (or higher).

New in version 1.4.

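while rowsBetween:

rowsBetween(start, end)

Defines the frame boundaries, from start (inclusive) to end (inclusive).

Both start and end are relative positions from the current row. For example, "0" means "current row", while "-1" means the row before the current row, and "5" means the fifth row after the current row.

Parameters:

  • start - boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
  • end - boundary end, inclusive. The frame is unbounded if this is sys.maxsize (or higher).

New in version 1.4.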

For example, for rangeBetween, how is "1 off" different from "1 row"?


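It is simple:

  • ROWS BETWEEN doesn't care about the exact values. It cares only about the order of the rows, and takes a fixed number of preceding and following rows when computing the frame.
  • RANGE BETWEEN considers the values when computing the frame.

Let's take an example with two window definitions:

  • ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  • ORDER BY x RANGE BETWEEN 2 PRECEDING AND CURRENT ROW

and the following data:
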
+---+
|  x|
+---+
| 10|
| 20|
| 30|
| 31|
+---+

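Assuming the current row is the one with value 31, the first window includes the following rows (the current one and the two preceding it):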

+---+----------------------------------------------------+
|  x|ORDER BY x ROWS BETWEEN 2  PRECEDING AND CURRENT ROW|
+---+----------------------------------------------------+
| 10|                                               false|
| 20|                                                true|
| 30|                                                true|
| 31|                                                true|
+---+----------------------------------------------------+

and the second window includes the following rows (the current one and all preceding rows where x >= 31 - 2):

+---+-----------------------------------------------------+
|  x|ORDER BY x RANGE BETWEEN 2  PRECEDING AND CURRENT ROW|
+---+-----------------------------------------------------+
| 10|                                                false|
| 20|                                                false|
| 30|                                                 true|
| 31|                                                 true|
+---+-----------------------------------------------------+
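The two tables above can be reproduced with a small pure-Python emulation. This is only a sketch of the frame semantics, not Spark's implementation; the helper names `rows_frame` and `range_frame` are made up for illustration.

```python
def rows_frame(values, i, preceding):
    """Emulate ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW.

    The frame is chosen purely by position in the ordered data.
    """
    ordered = sorted(values)
    start = max(0, i - preceding)
    return ordered[start:i + 1]


def range_frame(values, i, preceding):
    """Emulate RANGE BETWEEN <preceding> PRECEDING AND CURRENT ROW.

    The frame is chosen by comparing values with the current row's value.
    """
    ordered = sorted(values)
    current = ordered[i]
    return [v for v in ordered if current - preceding <= v <= current]


xs = [10, 20, 30, 31]
current_index = 3  # the row with x = 31

print(rows_frame(xs, current_index, 2))   # [20, 30, 31]
print(range_frame(xs, current_index, 2))  # [30, 31]
```

The ROWS frame always holds up to three rows here, while the RANGE frame holds however many rows have x between 29 and 31.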

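The Java Spark docs add clarity: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html#rowsBetween-long-long-

rangeBetween

A range-based boundary is based on the actual value of the ORDER BY expression(s). An offset is used to alter the value of the ORDER BY expression: for instance, if the current ORDER BY expression has a value of 10 and the lower bound offset is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This, however, puts a number of constraints on the ORDER BY expressions: there can be only one expression, and this expression must have a numerical data type. An exception can be made when the offset is unbounded, because no value modification is needed; in this case multiple and non-numeric ORDER BY expressions are allowed.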
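The arithmetic behind a range-based boundary can be sketched in a few lines of plain Python. The functions `range_bounds` and `in_range_frame` are hypothetical helpers for illustration, not part of any Spark API.

```python
def range_bounds(order_value, start_offset, end_offset):
    """Return the (lower, upper) value bounds of a range-based frame.

    The offsets are added to the current row's ORDER BY value.
    """
    return order_value + start_offset, order_value + end_offset


def in_range_frame(candidate, order_value, start_offset, end_offset):
    """True if a row whose ORDER BY value is `candidate` falls in the frame."""
    low, high = range_bounds(order_value, start_offset, end_offset)
    return low <= candidate <= high


# Current ORDER BY value 10, lower offset -3, upper offset 0 (current row).
print(range_bounds(10, -3, 0))       # (7, 10)
print(in_range_frame(7, 10, -3, 0))  # True
print(in_range_frame(6, 10, -3, 0))  # False
```

Because the bounds come from adding an offset to a value, it makes sense that a bounded offset needs a single numeric ORDER BY expression.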

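rowsBetween

A row-based boundary is based on the position of the row within the partition. An offset indicates the number of rows above or below the current row at which the frame for the current row starts or ends. For instance, given a row-based sliding frame with a lower bound offset of -1 and an upper bound offset of +2, the frame for the row with index 5 would range from index 4 to index 6 [as quoted from the 2.3.0 docs; offsets of -1 and +2 actually span index 4 to index 7].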
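A row-based frame is just index arithmetic, which a short Python sketch can make concrete. The helper `row_frame_indices` and the partition size of 10 are assumptions for illustration. Applied literally, offsets of -1 and +2 around index 5 cover indices 4 through 7.

```python
def row_frame_indices(index, start_offset, end_offset, partition_size):
    """Return the row indices in a row-based frame, clamped to the partition."""
    first = max(0, index + start_offset)
    last = min(partition_size - 1, index + end_offset)
    return list(range(first, last + 1))


# Lower bound offset -1, upper bound offset +2, partition of 10 rows.
print(row_frame_indices(5, -1, 2, 10))  # [4, 5, 6, 7]
print(row_frame_indices(0, -1, 2, 10))  # [0, 1, 2]
print(row_frame_indices(9, -1, 2, 10))  # [8, 9]
```

Near the partition edges the frame simply shrinks, since there are no rows beyond the first or last index.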


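rowsBetween: with rowsBetween you define a boundary frame of rows to calculate over, and the frame is computed independently of the row values.

The frame in rowsBetween does not depend on the values in the orderBy column, only on the position of each row.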

from pyspark.sql import Window, functions as F

# `spark` is assumed to be an existing SparkSession.
df = spark.read.csv(r'C:\Users\akashSaini\Desktop\TT.csv', inferSchema=True, header=True).na.drop()
w = Window.partitionBy('Department').orderBy('Salary').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('RowsBetween', F.sum('Salary').over(w)).show()


+----------+----------+------+-----------+
|first_name|Department|Salary|RowsBetween|
+----------+----------+------+-----------+
|     Sofia|     Sales| 20000|      20000|
|    Gordon|     Sales| 25000|      45000|
|    Gracie|     Sales| 25000|      70000|
|    Cellie|     Sales| 25000|      95000|
|    Jervis|     Sales| 30000|     125000|
|     Akash|  Analysis| 30000|      30000|
|   Richard|   Account| 12000|      12000|
|    Joelly|   Account| 15000|      27000|
|   Carmiae|   Account| 15000|      42000|
|       Bob|   Account| 20000|      62000|
|     Gally|   Account| 28000|      90000|
+----------+----------+------+-----------+
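The Sales figures in this output can be checked without Spark. The sketch below emulates a frame from unboundedPreceding to currentRow by position only; `running_sum_rows` is a hypothetical helper, and the salaries are copied from the table above.

```python
def running_sum_rows(sorted_values):
    """Sum over a positional frame: every row from the first to the current."""
    return [sum(sorted_values[:i + 1]) for i in range(len(sorted_values))]


# Sales salaries, already ordered by Salary.
sales = [20000, 25000, 25000, 25000, 30000]
print(running_sum_rows(sales))  # [20000, 45000, 70000, 95000, 125000]
```

Each of the three rows with salary 25000 gets its own running total, because the frame grows by exactly one row at a time.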

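rangeBetween: with rangeBetween you define a boundary frame of rows to calculate over, and the frame may change depending on the row values.

The frame in rangeBetween depends on the orderBy clause. rangeBetween includes all rows that have the same value in the orderBy column, so Gordon, Gracie and Cellie, who have the same salary, are all included in the current frame.

For a better understanding, see the example below: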

from pyspark.sql import Window, functions as F

# `spark` is assumed to be an existing SparkSession.
df = spark.read.csv(r'C:\Users\asaini28.EAD\Desktop\TT.csv', inferSchema=True, header=True).na.drop()
w = Window.partitionBy('Department').orderBy('Salary').rangeBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('RangeBetween', F.sum('Salary').over(w)).select('first_name', 'Department', 'Salary', 'RangeBetween').show()

+----------+----------+------+------------+
|first_name|Department|Salary|RangeBetween|
+----------+----------+------+------------+
|     Sofia|     Sales| 20000|       20000|
|    Gordon|     Sales| 25000|       95000|
|    Gracie|     Sales| 25000|       95000|
|    Cellie|     Sales| 25000|       95000|
|    Jervis|     Sales| 30000|      125000|
|     Akash|  Analysis| 30000|       30000|
|   Richard|   Account| 12000|       12000|
|    Joelly|   Account| 15000|       42000|
|   Carmiae|   Account| 15000|       42000|
|       Bob|   Account| 20000|       62000|
|     Gally|   Account| 28000|       90000|
+----------+----------+------+------------+
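The same check works for the range-based totals. In this pure-Python sketch the frame for each row is every row whose salary is less than or equal to the current salary, which is how ties end up sharing one total. `running_sum_range` is a hypothetical helper, and the salaries are copied from the table above.

```python
def running_sum_range(sorted_values):
    """Sum over a value-based frame: all rows with value <= the current value."""
    return [
        sum(v for v in sorted_values if v <= current)
        for current in sorted_values
    ]


# Salaries per department, already ordered by Salary.
sales = [20000, 25000, 25000, 25000, 30000]
account = [12000, 15000, 15000, 20000, 28000]

print(running_sum_range(sales))    # [20000, 95000, 95000, 95000, 125000]
print(running_sum_range(account))  # [12000, 42000, 42000, 62000, 90000]
```

Rows with equal salaries are peers under a range frame, so all three 25000 rows see the full 95000.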

Source: https://habr.com/ru/post/1657800/

