Monday, March 13, 2023

Why are Spark DataFrame functions' parameters (e.g. groupBy, select) designed as (col1: String, cols: String*) instead of (cols: String*)?

In Spark, there is a family of operators/functions such as 'select', 'groupBy', and 'dropDuplicates'.

The String-based overloads of those functions always take parameters of the form (col1: String, cols: String*), e.g.:

  @scala.annotation.varargs
  def dropDuplicates(col1: String, cols: String*): Dataset[T] = {
    val colNames: Seq[String] = col1 +: cols
    dropDuplicates(colNames)
  }

  @scala.annotation.varargs
  def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
    val colNames: Seq[String] = col1 +: cols
    RelationalGroupedDataset(
      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
  }

  @scala.annotation.varargs
  def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)

When the parameter type is String, these functions are always defined as (col1: String, cols: String*) instead of (cols: String*).

In every one of these functions, the first statement simply combines the two parameters, as in "val colNames: Seq[String] = col1 +: cols", and nothing more.
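
To make the shape concrete, here is a minimal sketch of how the (col1, cols*) form binds at a call site (assuming a local Spark session and a throwaway DataFrame):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "x"), (1, "x"), (2, "y")).toDF("a", "b")

// col1 binds to "a" and cols to Seq("b"), so colNames = Seq("a", "b").
df.dropDuplicates("a", "b").show()

// With a single argument, cols is empty and colNames = Seq("a").
df.dropDuplicates("a").show()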

By contrast, when the parameter type is Column, there is only a single varargs parameter, e.g.:

  @scala.annotation.varargs
  def groupBy(cols: Column*): RelationalGroupedDataset = {
    RelationalGroupedDataset(toDF(), cols.map(_.expr), RelationalGroupedDataset.GroupByType)
  }

  @scala.annotation.varargs
  def select(cols: Column*): DataFrame = withPlan {
    Project(cols.map(_.named), logicalPlan)
  }
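
One observable consequence of this difference (reusing the df from the sketch above): since the Column variant is a bare varargs, a zero-argument call compiles, whereas the String variant requires at least one name.

// Legal: groupBy(cols: Column*) matches with zero arguments and
// performs a global (single-group) aggregation.
df.groupBy().count().show()

// The String variant needs at least one name to match
// (col1: String, cols: String*).
df.groupBy("a").count().show()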

So I'm wondering: why not use (cols: String*) instead of (col1: String, cols: String*)? It seems like (cols: String*) would make more sense.

The details are listed above.
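
For what it's worth, one constraint that seems decisive here is JVM type erasure. In Scala, a varargs parameter T* desugars to Seq[T], so a hypothetical select(cols: String*) would erase to the same JVM signature as the existing select(cols: Column*), and the two overloads could not coexist in one class. The leading col1: String parameter gives the String overload a distinct erased signature. A minimal standalone sketch (MyDataset and the stub Column are hypothetical names, used only so the snippet compiles without a Spark dependency):

// Stub standing in for org.apache.spark.sql.Column.
class Column

class MyDataset {
  // Erases to select(Seq) on the JVM.
  def select(cols: Column*): Unit = ()

  // Uncommenting this fails to compile ("double definition ... have
  // same type after erasure: (cols: Seq)Unit"), because String* also
  // desugars to Seq[String], which erases to Seq.
  // def select(cols: String*): Unit = ()

  // Erases to select(String, Seq): a distinct signature, so this
  // overload can coexist with the Column one.
  def select(col1: String, cols: String*): Unit = ()
}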
