In Spark, there is a batch of operators/functions such as select, groupBy, and dropDuplicates.
When the parameters are strings, these functions always have a signature like (col1: String, cols: String*), e.g.:
@scala.annotation.varargs
def dropDuplicates(col1: String, cols: String*): Dataset[T] = {
  val colNames: Seq[String] = col1 +: cols
  dropDuplicates(colNames)
}

@scala.annotation.varargs
def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
  val colNames: Seq[String] = col1 +: cols
  RelationalGroupedDataset(
    toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
}

@scala.annotation.varargs
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
When the parameter type is String, these functions are always defined as (col: String, cols: String*) instead of (cols: String*).
In every such function, the first statement just recombines the two parameters, as in val colNames: Seq[String] = col1 +: cols, and nothing else. (A short usage sketch follows.)
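Here is a quick usage sketch of the String variants. It is my own illustration: it assumes a DataFrame df with columns "a", "b", "c", none of which come from the Spark source.

import org.apache.spark.sql.DataFrame

def stringVariants(df: DataFrame): Unit = {
  // The first column fills `col1`; any further columns ride in the varargs.
  df.dropDuplicates("a", "b").show()

  // With a dynamically built list, the head and tail must be split by hand,
  // because a Seq cannot be expanded into (col1: String, cols: String*) directly.
  val names = Seq("a", "b", "c")
  df.groupBy(names.head, names.tail: _*).count().show()
}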
By contrast, when the parameter type is Column, there is only a single varargs parameter, e.g.:
@scala.annotation.varargs
def groupBy(cols: Column*): RelationalGroupedDataset = {
  RelationalGroupedDataset(toDF(), cols.map(_.expr), RelationalGroupedDataset.GroupByType)
}

@scala.annotation.varargs
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
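And a matching usage sketch for the Column variants (again assuming a DataFrame df; the column names are mine). With a single varargs parameter, a dynamically built collection of Columns expands directly with : _*.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def columnVariants(df: DataFrame): Unit = {
  df.select(col("a"), col("b")).show()

  // No head/tail split needed here.
  val allCols = df.columns.map(col)
  df.select(allCols: _*).show()
}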
So I'm wondering: why not use (cols: String*) instead of (col1: String, cols: String*)? It seems like (cols: String*) would make more sense.
The details are listed above.
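For what it's worth, one JVM-level constraint may be relevant here. The sketch below is my own reconstruction, not anything taken from the Spark source: Scala varargs erase to Seq on the JVM, so two overloads that differ only in the varargs element type would collide after erasure. The extra leading String parameter gives the String overload a distinct erased signature (and, incidentally, forces callers to pass at least one column).

// Self-contained sketch with a stand-in Column class (names are mine, not Spark's).
final case class Column(name: String)

class Frame {
  // Erases to select(Seq): fine on its own.
  def select(cols: Column*): Frame = this

  // This would also erase to select(Seq) and clash with the overload above;
  // uncommenting it triggers a "double definition ... after erasure" error.
  // def select(cols: String*): Frame = this

  // Erases to select(String, Seq): a distinct signature, so both compile.
  def select(col1: String, cols: String*): Frame = {
    val colNames: Seq[String] = col1 +: cols
    println(colNames)
    this
  }
}

object Demo extends App {
  val f = new Frame
  f.select(Column("a"), Column("b"))
  f.select("a", "b")
}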