@himanshu and I work on the IDE team at Salesforce and have recently been learning about TransmogrifAI, our new AutoML library. We’ve put together this blog to help others get started and hope you enjoy it!
Can you guess how long it takes to build a machine learning application? Days? Weeks? Months?
Generally, it takes months! We asked ourselves this question and started focusing on reducing this time to days, or even hours, to increase our productivity. TransmogrifAI (pronounced trans-mog-ri-phi) was born out of the necessity to reduce this time and to take over the mundane tasks involved in machine learning. It is an end-to-end AutoML library for structured data that includes a rich type system to minimize runtime errors, along with powerful features like automated feature engineering, feature selection, model selection, and hyper-parameter tuning. For more details on the key ideas and high-level design, please refer to Open Sourcing TransmogrifAI.
In this blog, we cover the magic behind TransmogrifAI. We’ll demonstrate the level of effort needed in building a baseline machine learning application using SparkML vs. TransmogrifAI:
- Building & Training a simple real estate app using California Housing Dataset with Spark ML.
- Building the same app using TransmogrifAI and diving into the internals to reveal the magic.
The complete source code used in this blog is hosted on GitHub.
California Housing Dataset
The California Housing Dataset includes summary statistics of houses found in a given California district, based on 1990 census data. These spatial data contain 20,640 observations on housing prices with nine economic variables, listed below. The dependent variable is median house value.
- Median house value
- Longitude
- Latitude
- Housing median age
- Total rooms
- Total bedrooms
- Population
- Households
- Median income
Real Estate House Price Prediction using Apache SparkML
Let’s dive into the code and steps for building the model with Spark ML.
- Set up Spark Context: getOrCreate checks whether there is a valid thread-local or global default SparkSession and returns it if available. If not, it creates a new SparkSession and assigns the newly created SparkSession as the global default.
import org.apache.spark.sql.SparkSession

// set up the spark session
implicit val spark = SparkSession.builder
  .master("local")
  .appName("California Housing Dataset Prediction")
  .getOrCreate
- Read the Dataset from CSV as a DataFrame: A DataFrame is a Dataset organized into named columns. It’s conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
// read the data as a data frame
val df = spark.read.format("csv")
  .option("header", "true")
  .load("/path/to/file.csv")
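As an aside, Spark can also infer column types while reading; the sketch below shows the same read with the built-in inferSchema option (this blog instead casts the columns explicitly in a later step, which keeps the types under our control):

// alternative: let Spark infer column types at read time (costs an extra pass over the data)
val inferredDf = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/file.csv")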
DataFrame Output:
// Displays the content of the DataFrame to stdout
df.show()

Output:
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|houseId|    medianHouseValue|        medianIncome|    housingMedianAge|          totalRooms|       totalBedrooms|          population|          households|            latitude|           longitude|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      1|4.526000000000000...|8.325200000000000...|4.100000000000000...|8.800000000000000...|1.290000000000000...|3.220000000000000...|1.260000000000000...|3.788000000000000...|-1.22230000000000...|
|      2|3.585000000000000...|8.301399999999999...|2.100000000000000...|7.099000000000000...|1.106000000000000...|2.401000000000000...|1.138000000000000...|3.785999999999999...|-1.22220000000000...|
|      3|3.521000000000000...|7.257399999999999...|5.200000000000000...|1.467000000000000...|1.900000000000000...|4.960000000000000...|1.770000000000000...|3.785000000000000...|-1.22239999999999...|
|      4|3.413000000000000...|5.643099999999999...|5.200000000000000...|1.274000000000000...|2.350000000000000...|5.580000000000000...|2.190000000000000...|3.785000000000000...|-1.22250000000000...|
|      5|3.422000000000000...|3.846200000000000...|5.200000000000000...|1.627000000000000...|2.800000000000000...|5.650000000000000...|2.590000000000000...|3.785000000000000...|-1.22250000000000...|
|      6|2.697000000000000...|4.036800000000000...|5.200000000000000...|9.190000000000000...|2.130000000000000...|4.130000000000000...|1.930000000000000...|3.785000000000000...|-1.22250000000000...|
|      7|2.992000000000000...|3.659100000000000...|5.200000000000000...|2.535000000000000...|4.890000000000000...|1.094000000000000...|5.140000000000000...|3.784000000000000...|-1.22250000000000...|
|      8|2.414000000000000...|3.120000000000000...|5.200000000000000...|3.104000000000000...|6.870000000000000...|1.157000000000000...|6.470000000000000...|3.784000000000000...|-1.22250000000000...|
|      9|2.267000000000000...|2.080400000000000...|4.200000000000000...|2.555000000000000...|6.650000000000000...|1.206000000000000...|5.950000000000000...|3.784000000000000...|-1.22260000000000...|
|     10|2.611000000000000...|3.691199999999999...|5.200000000000000...|3.549000000000000...|7.070000000000000...|1.551000000000000...|7.140000000000000...|3.784000000000000...|-1.22250000000000...|
|     11|2.815000000000000...|3.203100000000000...|5.200000000000000...|2.202000000000000...|4.340000000000000...|9.100000000000000...|4.020000000000000...|3.785000000000000...|-1.22260000000000...|
|     12|2.418000000000000...|3.270500000000000...|5.200000000000000...|3.503000000000000...|7.520000000000000...|1.504000000000000...|7.340000000000000...|3.785000000000000...|-1.22260000000000...|
|     13|2.135000000000000...|3.075000000000000...|5.200000000000000...|2.491000000000000...|4.740000000000000...|1.098000000000000...|4.680000000000000...|3.785000000000000...|-1.22260000000000...|
|     14|1.913000000000000...|2.673600000000000...|5.200000000000000...|6.960000000000000...|1.910000000000000...|3.450000000000000...|1.740000000000000...|3.784000000000000...|-1.22260000000000...|
|     15|1.592000000000000...|1.916700000000000...|5.200000000000000...|2.643000000000000...|6.260000000000000...|1.212000000000000...|6.200000000000000...|3.785000000000000...|-1.22260000000000...|
|     16|1.400000000000000...|2.125000000000000...|5.000000000000000...|1.120000000000000...|2.830000000000000...|6.970000000000000...|2.640000000000000...|3.785000000000000...|-1.22260000000000...|
|     17|1.525000000000000...|2.774999999999999...|5.200000000000000...|1.966000000000000...|3.470000000000000...|7.930000000000000...|3.310000000000000...|3.785000000000000...|-1.22270000000000...|
|     18|1.555000000000000...|2.120200000000000...|5.200000000000000...|1.228000000000000...|2.930000000000000...|6.480000000000000...|3.030000000000000...|3.785000000000000...|-1.22270000000000...|
|     19|1.587000000000000...|1.991100000000000...|5.000000000000000...|2.239000000000000...|4.550000000000000...|9.900000000000000...|4.190000000000000...|3.784000000000000...|-1.22260000000000...|
|     20|1.629000000000000...|2.603299999999999...|5.200000000000000...|1.503000000000000...|2.980000000000000...|6.900000000000000...|2.750000000000000...|3.784000000000000...|-1.22270000000000...|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
- Cast to DoubleType: Since all the input features are numeric, we cast each String column to Double.
import org.apache.spark.sql.types.DoubleType

// cast all the string columns in the data frame to Double
val castedDF = df.columns.foldLeft(df)((current, c) =>
  current.withColumn(c, current(c).cast(DoubleType)))
castedDF.describe().show()
Output:
+-------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+-------------------+
|summary|          houseId|  medianHouseValue|      medianIncome|  housingMedianAge|        totalRooms|    totalBedrooms|        population|       households|         latitude|          longitude|
+-------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+-------------------+
|  count|            20640|             20640|             20640|             20640|             20640|            20640|             20640|            20640|            20640|              20640|
|   mean|          10320.5|206855.81690891474|3.8706710029070246|28.639486434108527|2635.7630813953488|537.8980135658915|1425.4767441860465|499.5396802325581| 35.6318614341087|-119.56970445736148|
| stddev|5958.399113856003|115395.61587441359| 1.899821717945263| 12.58555761211163|2181.6152515827944| 421.247905943133|  1132.46212176534|382.3297528316098|2.135952397457101|  2.003531723502584|
|    min|              1.0|           14999.0|            0.4999|               1.0|               2.0|              1.0|               3.0|              1.0|            32.54|            -124.35|
|    max|          20640.0|          500001.0|           15.0001|              52.0|           39320.0|           6445.0|           35682.0|           6082.0|            41.95|            -114.31|
+-------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+-------------------+
- Create Feature and Label Set: Divide the DataFrame into features and labels. We are using VectorAssembler to create the feature set. VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector in order to train ML models like logistic regression and decision trees.
import org.apache.spark.ml.feature.VectorAssembler

// create features
val featureCols = Array("medianIncome", "housingMedianAge", "totalRooms",
  "totalBedrooms", "population", "households", "latitude", "longitude")
val vectorAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")

// create label and features
val featuresDf = vectorAssembler.transform(castedDF).select("medianHouseValue", "features")

// print label/feature
featuresDf.show()
Output:
+----------------+--------------------+
|medianHouseValue|            features|
+----------------+--------------------+
|        452600.0|[8.3252,41.0,880....|
|        358500.0|[8.3014,21.0,7099...|
|        352100.0|[7.2574,52.0,1467...|
|        341300.0|[5.6431,52.0,1274...|
|        342200.0|[3.8462,52.0,1627...|
|        269700.0|[4.0368,52.0,919....|
|        299200.0|[3.6591,52.0,2535...|
|        241400.0|[3.12,52.0,3104.0...|
|        226700.0|[2.0804,42.0,2555...|
|        261100.0|[3.6912,52.0,3549...|
|        281500.0|[3.2031,52.0,2202...|
|        241800.0|[3.2705,52.0,3503...|
|        213500.0|[3.075,52.0,2491....|
|        191300.0|[2.6736,52.0,696....|
|        159200.0|[1.9167,52.0,2643...|
|        140000.0|[2.125,50.0,1120....|
|        152500.0|[2.775,52.0,1966....|
|        155500.0|[2.1202,52.0,1228...|
|        158700.0|[1.9911,50.0,2239...|
|        162900.0|[2.6033,52.0,1503...|
+----------------+--------------------+
- Scaling the features using Standard Scaler:
import org.apache.spark.ml.feature.StandardScaler

// use StandardScaler to scale the feature set
val standardScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(true)
val scaler = standardScaler.fit(featuresDf)
val scaledFeatures = scaler.transform(featuresDf)

// print the output
scaledFeatures.show()
Output:
+----------------+--------------------+--------------------+
|medianHouseValue|            features|      scaledFeatures|
+----------------+--------------------+--------------------+
|        452600.0|[8.3252,41.0,880....|[2.34470895611761...|
|        358500.0|[8.3014,21.0,7099...|[2.33218146484030...|
|        352100.0|[7.2574,52.0,1467...|[1.78265621721384...|
|        341300.0|[5.6431,52.0,1274...|[0.93294490759373...|
|        342200.0|[3.8462,52.0,1627...|[-0.0128806838430...|
|        269700.0|[4.0368,52.0,919....|[0.08744451941137...|
|        299200.0|[3.6591,52.0,2535...|[-0.1113636089684...|
|        241400.0|[3.12,52.0,3104.0...|[-0.3951270773548...|
|        226700.0|[2.0804,42.0,2555...|[-0.9423363181905...|
|        261100.0|[3.6912,52.0,3549...|[-0.0944672866994...|
|        281500.0|[3.2031,52.0,2202...|[-0.3513861309202...|
|        241800.0|[3.2705,52.0,3503...|[-0.3159091178071...|
|        213500.0|[3.075,52.0,2491....|[-0.4188135104422...|
|        191300.0|[2.6736,52.0,696....|[-0.6300964935813...|
|        159200.0|[1.9167,52.0,2643...|[-1.0285022981105...|
|        140000.0|[2.125,50.0,1120....|[-0.9188604311750...|
|        152500.0|[2.775,52.0,1966....|[-0.5767230643578...|
|        155500.0|[2.1202,52.0,1228...|[-0.9213869840377...|
|        158700.0|[1.9911,50.0,2239...|[-0.9893407287394...|
|        162900.0|[2.6033,52.0,1503...|[-0.6670999657155...|
+----------------+--------------------+--------------------+
- Divide the dataset into training and test sets
// divide into training and test sets
val split = scaledFeatures.randomSplit(Array(.8, .2))
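Note that randomSplit is not deterministic across runs; if you need a reproducible split, Spark’s randomSplit also accepts a seed. A minimal variant:

// same 80/20 split, with a fixed seed for reproducibility
val split = scaledFeatures.randomSplit(Array(.8, .2), seed = 42L)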
- Build a Linear Regression Model with the Scaled Features
import org.apache.spark.ml.regression.LinearRegression

// linear regression model
val lr = new LinearRegression()
  .setLabelCol("medianHouseValue")
  .setFeaturesCol("scaledFeatures")
  .setMaxIter(100)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// use the training set for model building
val lrModel = lr.fit(split(0))

// print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
Output:
Coefficients: [76738.94522230534,14903.054273122021,-18332.757718162775,50003.22137242165,-42713.72451462165,16547.521499666713,-90676.93313902401,-85365.66935383904] Intercept: 206772.90073196447
numIterations: 101
objectiveHistory: [0.5,0.429025021421256,0.26855114619613496,0.22652891956525684,0.21499857945908582,0.20773178001678289,0.19514784970174676,0.1885232671505026,0.1859671425971781,0.1853561326312991,0.18457399881973188,0.18441537651204076,0.18373103793910683,0.18350208412916771,0.1823597273673587,0.18229316947052004,0.18218622679968952,0.18215858300437648,0.18213798185173202,0.1821246023017618,0.18211459711626019,0.18209961828343316,0.18208837731785227,0.18205588604904863,0.1820434687407733,0.1820208677601439,0.18200624291405224,0.18200467173854357,0.18199278217947065,0.18199099118402143,0.18198050302769553,0.1819789044800163,0.18196944209901966,0.1819675578082704,0.18195935806645552,0.18195583674748167,0.18194791834626967,0.1819391408347191,0.1819311038867398,0.1819225377989266,0.18191655895966014,0.18190851174762848,0.18190363033484025,0.1818959409176114,0.18189156441553492,0.18188376965718034,0.18187948424747338,0.18187360174546469,0.18186895573232031,0.1818635331831013,0.1818584295438534,0.1818534353328115,0.1818477940447873,0.18184338131598446,0.1818370520009538,0.18183019469423445,0.18182623652061064,0.18181992818323542,0.18181479669876457,0.18180980523170026,0.18180828991921147,0.18180289998437538,0.1818009106789837,0.1817973011799926,0.18179518581366086,0.18179204915301264,0.18178992998002239,0.18178753203578832,0.18178555677441152,0.18178259277789513,0.18178126191707356,0.1817752061445554,0.18177052282864672,0.1817656813431612,0.18175784283297136,0.1817571920872549,0.18175514410840932,0.18175445013903024,0.18175257430793876,0.1817519142927131,0.1817502604701024,0.18174969009013336,0.18174822566214843,0.18174793601506734,0.18174738819241323,0.18174606666583532,0.1817457064933114,0.1817448402686882,0.18174442260634774,0.18174371450337473,0.18174341969111288,0.1817427069503691,0.18174243139117785,0.18174180325170708,0.18174152008293612,0.18174083826101903,0.1817405847778775,0.18173984443252902,0.18173934292707547,0.18173882279925963,0.18173828427045613]
RMSE: 69593.10829328775
r2: 0.6365386822854913
The model can be used for predictions now.
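As a quick sanity check, we can also score the held-out 20% split and compute the same metric there. A minimal sketch using Spark’s RegressionEvaluator (variable names follow the code above):

import org.apache.spark.ml.evaluation.RegressionEvaluator

// score the held-out test set with the trained model
val predictions = lrModel.transform(split(1))

// evaluate RMSE on the test set
val evaluator = new RegressionEvaluator()
  .setLabelCol("medianHouseValue")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
println(s"Test RMSE: ${evaluator.evaluate(predictions)}")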
It takes a fair amount of time and effort to create the simple app in the Apache Spark example above. In contrast, we can quickly build an app using the CLI module that ships out of the box with TransmogrifAI and explore the power of AutoML.
Real Estate House Price Prediction using TransmogrifAI
For the purpose of this blog, we are going to demonstrate how we can quickly generate a real estate housing price prediction application and train it using the California Housing dataset described above.
Code Generation
The fundamental idea behind code generation is type inference. Before going into further detail on how type inference works, let’s quickly go over the steps required to generate the app and train the model, diving into the internal details of the code generation.
- Clone TransmogrifAI & Build CLI module
Clone the TransmogrifAI project from GitHub:
# Cloning TransmogrifAI Repo
git clone https://github.com/salesforce/TransmogrifAI.git
Check out the latest release branch (in this example, 0.4.0):

cd ./TransmogrifAI
git checkout 0.4.0
Build the TransmogrifAI CLI:
./gradlew cli:shadowJar
alias transmogrifai="java -cp `pwd`/cli/build/libs/\* com.salesforce.op.cli.CLI"
- Fetch Dataset & Generate Real Estate App
Create the required directories and download the dataset:
cd .. && mkdir -p blog/data && cd blog/data
wget https://raw.githubusercontent.com/ajayborra/TransmogrifAI-CaliforniaHousing/master/data/cadata.csv && cd ..
Generate and build the real estate app using the California housing dataset:
transmogrifai gen --input data/cadata.csv --id houseId --response medianHouseValue --overwrite --auto HouseObject RealEstateApp
cd realestateapp && ./gradlew compileTestScala installDist
- Train the Model
Run training on the dataset with the generated app. This step takes about five to ten minutes:
./gradlew sparkSubmit -Dmain=com.salesforce.app.RealEstateApp -Dargs="--run-type=train --model-location=./house-model --read-location HouseObject=`pwd`/../data/cadata.csv"
After running the above steps, we get an output that looks similar to the one below:
"bestModelUID" : "gbtr_ee53956069aa", "testSetEvaluationResults" : { "(regEval)_R2" : 0.7763428813232043, "(regEval)_RootMeanSquaredError" : 54308.675029060905, "(regEval)_MeanAbsoluteError" : 36619.733388656015, "(regEval)_MeanSquaredError" : 2.9494321834121437E9 }, "bestModelName" : "gbtr_ee53956069aa_13", "trainingSetEvaluationResults" : { "(regEval)_R2" : 0.8184847478105526, "(regEval)_RootMeanSquaredError" : 49200.74279966309, "(regEval)_MeanAbsoluteError" : 33837.91117985914, "(regEval)_MeanSquaredError" : 2.420713092038599E9 } 18/09/11 16:37:57 INFO OpWorkflowRunner: Total run time: 6m54.662s
This particular result means that we successfully trained our model and identified that a gradient boosted tree (gbtr_ee53956069aa) is a good fit for the dataset, with the training and test scores shown above.
Type Inference
So far we have seen that, using the TransmogrifAI CLI module, we can quickly get an application off the ground and train it without writing any code. Now let’s dive a bit deeper and explore type inference.
In the example above, we leveraged the automatic type inference capability of TransmogrifAI, which assigns a type to each field in the CSV file and detects the schema of the dataset. In this example, type assignment happens in two phases when we invoke transmogrifai gen --input data/cadata.csv with the --auto flag. In the first phase, the input dataset is passed to the Spark CSV module, which scans the entire dataset and assigns a primitive Spark type (string, long, double) to each field to construct the schema of the dataset. This schema is passed to the second phase, which maps these primitive types to the rich data types supported by TransmogrifAI, described in detail here. For the curious mind, here’s the link to the code that handles automatic type inference.
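To make the two-phase flow concrete, here is a hypothetical sketch of what the second phase conceptually does; the names below are illustrative only, not TransmogrifAI’s actual API (the real logic lives in the type inference code linked above):

import org.apache.spark.sql.types._

// Illustrative only: map a primitive Spark type (from phase one) to the name
// of a rich TransmogrifAI feature type (for phase two)
def toRichTypeName(dt: DataType): String = dt match {
  case IntegerType | LongType => "Integral"
  case FloatType | DoubleType => "Real"
  case BooleanType            => "Binary"
  case StringType             => "Text" // could be refined further, e.g. to a picklist type
  case _                      => "Text"
}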
In general, if you want more control over which type gets assigned to a field during code generation, provide the --schema flag to the transmogrifai gen command. This flag allows users to pass in an Apache Avro schema file with the field-to-primitive-type mappings. In this case, the Avro schema passed to the CLI module is used to infer the primitive types of the fields, and these primitive types are then used to infer the rich types in the TransmogrifAI type hierarchy.
In both cases mentioned above, we covered the most common scenario, where the schema of the dataset is flat, without nested objects. There can, however, be use cases where the data-reading part of the pipeline has to deal with complex, nested data structures and nested schemas. For these scenarios, we recommend using the Custom Reader capabilities of the framework with a customized Apache Avro schema file; you can tailor the dataset parsing to your needs and map the resulting types to the TransmogrifAI type hierarchy.
The resulting fields, composed of TransmogrifAI types, are used to generate a feature abstraction for each field in the dataset. This set of feature abstractions is what drives the AutoML pipeline.
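To give a feel for what the generated application builds on top of these feature abstractions, here is a rough, hand-written sketch in the spirit of TransmogrifAI’s public API; it is simplified and written from memory, so treat the names and details (especially the model selector configuration) as assumptions rather than the exact generated code:

import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.regression.RegressionModelSelector

// case class matching the dataset schema (the CLI generates one like this)
case class House(houseId: Long, medianHouseValue: Double, medianIncome: Double,
  housingMedianAge: Double, totalRooms: Double, totalBedrooms: Double,
  population: Double, households: Double, latitude: Double, longitude: Double)

// feature abstractions: one response and several predictors
val medianHouseValue = FeatureBuilder.RealNN[House].extract(_.medianHouseValue.toRealNN).asResponse
val medianIncome = FeatureBuilder.Real[House].extract(_.medianIncome.toReal).asPredictor
val housingMedianAge = FeatureBuilder.Real[House].extract(_.housingMedianAge.toReal).asPredictor
// ... remaining predictors elided for brevity

// automated feature engineering: vectorize all predictors in one call
val features = Seq(medianIncome, housingMedianAge).transmogrify()

// automated model selection over candidate regressors
val prediction = RegressionModelSelector().setInput(medianHouseValue, features).getOutput()

// with a data reader attached, workflow.train() fits the whole pipeline
val workflow = new OpWorkflow().setResultFeatures(prediction)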
The real estate house price prediction use case, with its handful of numeric features, is a simple illustration of the power of TransmogrifAI. TransmogrifAI shines even more with diverse feature types that need sophisticated feature engineering, and with real-world data that suffers from hindsight bias or data leakage. Stay tuned for future blog posts, where we uncover different aspects of the automated data pipeline capabilities TransmogrifAI offers, including feature engineering, feature selection, model selection, and hyper-parameter tuning.