Hello!
I am relatively new to Spark and couldn’t find a reference for what I need. I am using Spark 2.1 with PySpark.
I have a big dataframe: each column holds the quantities acquired of a given product, and each row corresponds to an account. The dataframe has tens of thousands of rows and columns.
Each product code, heading each column, is a four- to six-digit code (e.g., 0742).
The quantities in this dataframe are positive, zero and, in some cases, negative, as in the example table below (the first row contains the product codes):

0742    6542    5431     303837
0.0     11.61   38.22    2198.0
0.0     0.0     -637288  0.0
0.0     10.00   0.0      0.0
I need to define a function (it can be a UDF) that does the following: 1) identify the strictly positive quantities on each row; 2) codify all the product codes as integers, as below:
Table of codified products:
0742 1
6542 2
5431 3
303837 4
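To make the codified table concrete, I picture it as a plain mapping (a dict here, just to illustrate; the names are mine, not fixed):

```python
# Codified products table: product code (column header) -> integer code.
code_map = {"0742": 1, "6542": 2, "5431": 3, "303837": 4}

# Looking up a product code gives its integer code, e.g.:
print(code_map["5431"])  # -> 3
```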
Then, the function should:
1) Read the dataframe element-wise along each row and identify the strictly positive quantities.
2) For each such quantity, identify the corresponding product and its integer code.
3) Store the codes of these products in two forms:
a) a text file (to be saved on HDFS) with elements separated by single spaces. Following the example:
2 3 4
2
That is, in the .txt file, the last element on each line has no trailing space.
b) a dataframe where the data is stored as follows (sample of the desired dataframe):
1 2 3 4
2 3 4
2
Reading that dataframe should produce a list of lists, such as: [["2", "3", "4"], ["2"]]. Please note that there’s a space after each comma inside each list.
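To make the intent clearer, here is a plain-Python sketch of the whole transformation (no Spark yet; the variable names are mine, and I’ve assumed rows with no strictly positive quantity are simply dropped, since the sample .txt file has no empty line for the middle row of the example):

```python
# Example data from above: column headers are product codes,
# each row holds the quantities for one account.
columns = ["0742", "6542", "5431", "303837"]
rows = [
    [0.0, 11.61, 38.22, 2198.0],
    [0.0, 0.0, -637288, 0.0],
    [0.0, 10.00, 0.0, 0.0],
]

# Codified products table: product code -> integer code.
code_map = {"0742": 1, "6542": 2, "5431": 3, "303837": 4}

def codify_row(row):
    """Return the integer codes of the strictly positive quantities."""
    return [code_map[col] for col, qty in zip(columns, row) if qty > 0]

coded = [codify_row(r) for r in rows]   # [[2, 3, 4], [], [2]]
coded = [r for r in coded if r]         # drop empty rows -> [[2, 3, 4], [2]]

# a) lines for the .txt file: space-separated, no trailing space.
txt_lines = [" ".join(str(c) for c in row) for row in coded]
# -> ["2 3 4", "2"]

# b) the same data as a list of lists of strings.
as_strings = [[str(c) for c in row] for row in coded]
# -> [["2", "3", "4"], ["2"]]
```

What I don’t know is how to express this per-row logic in PySpark over tens of thousands of columns, and how to write `txt_lines` to HDFS from there.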
Definitely the most urgent format is the .txt file to be stored on HDFS. I’d appreciate any help or suggestions!
Regards, Alfredo