I have data that looks like
+--------------+---------+-------+---------+
| dataOne|OtherData|dataTwo|dataThree|
+--------------+---------+-------+---------+
| Best| tree| 5| 533|
| OK| bush| e| 3535|
| MEH| cow| -| 3353|
| MEH| oak| none| 12|
+--------------+---------+-------+---------+
and I'm trying to get it out
+--------------+---------+
| dataOne| Count|
+--------------+---------+
| Best| 1|
| OK| 1|
| Meh| 2|
+--------------+---------+
I have no problem getting dataOne into a dataframe by itself and displaying its contents to make sure I grab only the dataOne column. However, I cannot find the correct syntax to turn that SQL query into the data I need. I tried running this query against a temporary view created from the whole dataset:
Dataset<Row> dataOneCount = spark.sql("select dataOne, count(*) from dataFrame group by dataOne");
dataOneCount.show();
But the only Spark documentation I could find on this showed how to do this type of aggregation in Spark 1.6 and earlier, so any help would be appreciated.
Instead, the query fails with the error below, even though I have checked the data and don't see an indexing problem there:
java.lang.ArrayIndexOutOfBoundsException: 11
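For what it's worth, in Spark 2.x the same aggregation can also be written directly with the DataFrame API instead of a SQL string. This is a minimal sketch, assuming dataFrame is the Dataset<Row> holding the four columns shown above:

```java
// requires: import static org.apache.spark.sql.functions.col;
Dataset<Row> dataOneCount = dataFrame
        .groupBy(col("dataOne"))   // one group per distinct dataOne value
        .count();                  // adds a "count" column with each group's size
dataOneCount.show();
```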
I also tried using countDistinct from org.apache.spark.sql.functions:
Column countNum = countDistinct(dataFrame.col("dataOne"));
Dataset<Row> result = dataOneDataFrame.withColumn("count",countNum);
result.show();
where dataOneDataFrame is a dataframe created from the query select dataOne from dataFrame, but that did not give me what I need either, and I'm not sure countDistinct is even the right function here.
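Note that countDistinct answers a different question: it collapses the whole column down to a single number (how many distinct values exist), so even when it runs it cannot produce a count per value. A sketch of the difference, assuming dataFrame is the full dataset:

```java
// requires: import static org.apache.spark.sql.functions.countDistinct;
// One row, one number: how many distinct dataOne values exist (3 in the sample).
dataFrame.agg(countDistinct("dataOne")).show();
// One row per value, which is the shape the desired output table has.
dataFrame.groupBy("dataOne").count().show();
```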
edit: adding the code that creates the Dataset<Row> dataFrame in question:
public static void main(String[] args) {
    SparkSession spark = SparkSession
            .builder()
            .appName("Log File Reader")
            .getOrCreate();

    JavaRDD<String> logsRDD = spark.sparkContext()
            .textFile(args[0], 1)
            .toJavaRDD();

    String schemaString = "dataOne OtherData dataTwo dataThree";
    List<StructField> fields = new ArrayList<>();
    String[] fieldName = schemaString.split(" ");
    for (String field : fieldName) {
        fields.add(DataTypes.createStructField(field, DataTypes.StringType, true));
    }
    StructType schema = DataTypes.createStructType(fields);

    JavaRDD<Row> rowRDD = logsRDD.map((Function<String, Row>) record -> {
        // assumes every line has exactly four single-space-separated fields
        String[] attributes = record.split(" ");
        return RowFactory.create(attributes[0], attributes[1], attributes[2], attributes[3]);
    });

    Dataset<Row> dF = spark.createDataFrame(rowRDD, schema);
    dF.groupBy(col("dataOne")).count().show();

    dF.createOrReplaceTempView("view");
    dF.sparkSession().sql("select dataOne, count(*) from view group by dataOne").show();
}
Am I doing something wrong with RowFactory? Because the data itself looks fine to me:
best tree 5 533
OK bush e 3535
MEH cow - 3353
MEH oak none 12
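One thing worth checking: String.split(" ") splits on every single space, so if any log line contains a run of two or more spaces, the resulting array gets empty tokens in the middle and the fields shift position; a line with fewer than four tokens then makes attributes[3] throw an ArrayIndexOutOfBoundsException. A plain-Java illustration (the input line here is hypothetical):

```java
public class SplitCheck {
    public static void main(String[] args) {
        // Hypothetical line with a run of two spaces between the first two fields.
        String line = "MEH  oak none 12";

        // split(" ") splits on every single space, so adjacent spaces
        // produce an empty token: ["MEH", "", "oak", "none", "12"]
        String[] naive = line.split(" ");

        // Splitting on a whitespace run keeps only the real fields:
        // ["MEH", "oak", "none", "12"]
        String[] robust = line.trim().split("\\s+");

        System.out.println(naive.length + " vs " + robust.length); // prints "5 vs 4"
    }
}
```

Switching the map function to record.trim().split("\\s+") would make the parsing tolerant of extra whitespace, if that turns out to be the cause.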