Some time has passed since I posted the question, and it seems that some other people would also like an answer. Below is what I have found.

So the original task was to append a column with row identifiers (basically a sequence 1 to numRows) to any given data frame, so that the rows' order/presence can be tracked (for example, when selecting). This can be achieved by something along these lines:
    import org.apache.spark.sql.Row

    val rowRDD = sc.textFile(file)
      .zipWithIndex()
      .map { case (d, i) => i.toString + delimiter + d } // prepend the row index to each line
      .map(_.split(delimiter))
      .map(s => Row.fromSeq(s.toSeq))
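For completeness, the resulting RDD[Row] still needs to be turned back into a data frame. A minimal sketch, assuming a sqlContext is available and using hypothetical column names (rowId for the generated identifier, col1/col2 standing in for the file's original columns):

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical schema; all columns kept as strings for simplicity
    val schema = StructType(
      Seq("rowId", "col1", "col2").map(name => StructField(name, StringType, nullable = true))
    )
    val dfWithId = sqlContext.createDataFrame(rowRDD, schema)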
Regarding the general case of adding any column to any data frame:
The "closest" to this functionality in the Spark API are withColumn and withColumnRenamed . According to Scala docs , the former returns a new DataFrame by adding a column. In my opinion, this is a bit confusing and incomplete definition. Both of these functions can only work with this data frame, i.e. Given two data frames df1 and df2 with col column:
    val df = df1.withColumn("newCol", df1("col") + 1) // OK
    val df = df1.withColumn("newCol", df2("col") + 1) // FAIL
Therefore, unless you can manage to transform a column of an existing data frame into the form you need, you cannot use withColumn or withColumnRenamed to attach arbitrary columns (standalone columns or columns from other data frames).
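Conversely, when the new column can be derived from columns of the same data frame, withColumn works as expected. A small sketch, assuming the hypothetical df1 from above with an integer column col:

    import org.apache.spark.sql.functions.udf

    // Derive "newCol" from an existing column of the *same* data frame
    val doubled = udf((x: Int) => x * 2)
    val df = df1.withColumn("newCol", doubled(df1("col")))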
As noted above, a workaround may be to use join: it would be rather messy, although possible, to attach unique keys (for example with zipWithIndex, as above) to both data frames or columns, and then join on those keys. Although the efficiency ...
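A sketch of that workaround, assuming both data frames have the same number of rows in a stable order; the helper withRowId and the column name rowId are hypothetical:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Attach a synthetic "rowId" key to a data frame via zipWithIndex
    def withRowId(df: DataFrame): DataFrame = {
      val schema = StructType(df.schema.fields :+ StructField("rowId", LongType, nullable = false))
      val rdd = df.rdd.zipWithIndex().map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
      df.sqlContext.createDataFrame(rdd, schema)
    }

    // Key both frames the same way, then join on the synthetic key
    val left   = withRowId(df1)
    val right  = withRowId(df2.select(df2("col").as("newCol")))
    val joined = left.join(right, left("rowId") === right("rowId"))

The extra zipWithIndex passes and the shuffle triggered by the join are exactly the efficiency cost hinted at above.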
It is clear that adding a column to a data frame is not a trivial operation in a distributed environment, and there may be no efficient, neat way to do it at all. But I still think it is very important to have this basic functionality available, even with a performance warning attached.