TransmogrifAI: Building ML Apps simplified with AutoML

Ajay Borra
Nov 05 - 8 min read

@himanshu and I work on the IDE team at Salesforce and have been learning recently about TransmogrifAI, our new AutoML library. We’ve put together this blog to help others get started and hope you enjoy it!

Can you guess how long it takes to build a machine learning application? Days? Weeks? Months?

Generally, it takes months! We asked ourselves this question and set out to shrink that time to days, or even hours, to increase our productivity. TransmogrifAI (pronounced trans-mog-ri-phi) was born out of the necessity to reduce this time and to automate the mundane tasks involved in machine learning. It is an end-to-end AutoML library for structured data that includes a rich type system to minimize runtime errors, along with powerful features like automated feature engineering, feature selection, model selection, and hyper-parameter tuning. For more details on the key ideas and high-level design, please refer to Open Sourcing TransmogrifAI.

In this blog, we cover the magic behind TransmogrifAI. We'll demonstrate the level of effort needed to build a baseline machine learning application with Spark ML vs. TransmogrifAI:

  • Building & Training a simple real estate app using California Housing Dataset with Spark ML.
  • Building the same app using TransmogrifAI and diving into the internals to reveal the magic.

The complete source code used in this blog is hosted on GitHub.

California Housing Dataset

The California Housing Dataset contains summary statistics of houses in each California district, based on 1990 census data. These spatial data comprise 20,640 observations on housing prices with 9 economic variables, where the dependent variable is median house value:

  • Median house value
  • Longitude
  • Latitude
  • Housing median age
  • Total rooms
  • Total bedrooms
  • Population
  • Households
  • Median income

Real Estate House Price Prediction using Apache Spark ML

Let’s dive into the code and steps for building the model with Spark ML.

  1. Set up the SparkSession: getOrCreate checks whether there is a valid thread-local or global default SparkSession and returns it if one is available. If not, it creates a new SparkSession and assigns it as the global default.
import org.apache.spark.sql.SparkSession

// set up the Spark session
implicit val spark = SparkSession.builder.master("local")
      .appName("California Housing Dataset Prediction").getOrCreate
  2. Read the dataset from CSV as a DataFrame: A DataFrame is a Dataset organized into named columns. It's conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
//read the data as data frame
val df = spark.read.format("csv").option("header", "true")
    .load("/path/to/file.csv")

DataFrame Output:

// Displays the content of the DataFrame to stdout
df.show()
Output:
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|houseId|    medianHouseValue|        medianIncome|    housingMedianAge|          totalRooms|       totalBedrooms|          population|          households|            latitude|           longitude|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      1|4.526000000000000...|8.325200000000000...|4.100000000000000...|8.800000000000000...|1.290000000000000...|3.220000000000000...|1.260000000000000...|3.788000000000000...|-1.22230000000000...|
|      2|3.585000000000000...|8.301399999999999...|2.100000000000000...|7.099000000000000...|1.106000000000000...|2.401000000000000...|1.138000000000000...|3.785999999999999...|-1.22220000000000...|
|      3|3.521000000000000...|7.257399999999999...|5.200000000000000...|1.467000000000000...|1.900000000000000...|4.960000000000000...|1.770000000000000...|3.785000000000000...|-1.22239999999999...|
|      4|3.413000000000000...|5.643099999999999...|5.200000000000000...|1.274000000000000...|2.350000000000000...|5.580000000000000...|2.190000000000000...|3.785000000000000...|-1.22250000000000...|
|      5|3.422000000000000...|3.846200000000000...|5.200000000000000...|1.627000000000000...|2.800000000000000...|5.650000000000000...|2.590000000000000...|3.785000000000000...|-1.22250000000000...|
|      6|2.697000000000000...|4.036800000000000...|5.200000000000000...|9.190000000000000...|2.130000000000000...|4.130000000000000...|1.930000000000000...|3.785000000000000...|-1.22250000000000...|
|      7|2.992000000000000...|3.659100000000000...|5.200000000000000...|2.535000000000000...|4.890000000000000...|1.094000000000000...|5.140000000000000...|3.784000000000000...|-1.22250000000000...|
|      8|2.414000000000000...|3.120000000000000...|5.200000000000000...|3.104000000000000...|6.870000000000000...|1.157000000000000...|6.470000000000000...|3.784000000000000...|-1.22250000000000...|
|      9|2.267000000000000...|2.080400000000000...|4.200000000000000...|2.555000000000000...|6.650000000000000...|1.206000000000000...|5.950000000000000...|3.784000000000000...|-1.22260000000000...|
|     10|2.611000000000000...|3.691199999999999...|5.200000000000000...|3.549000000000000...|7.070000000000000...|1.551000000000000...|7.140000000000000...|3.784000000000000...|-1.22250000000000...|
|     11|2.815000000000000...|3.203100000000000...|5.200000000000000...|2.202000000000000...|4.340000000000000...|9.100000000000000...|4.020000000000000...|3.785000000000000...|-1.22260000000000...|
|     12|2.418000000000000...|3.270500000000000...|5.200000000000000...|3.503000000000000...|7.520000000000000...|1.504000000000000...|7.340000000000000...|3.785000000000000...|-1.22260000000000...|
|     13|2.135000000000000...|3.075000000000000...|5.200000000000000...|2.491000000000000...|4.740000000000000...|1.098000000000000...|4.680000000000000...|3.785000000000000...|-1.22260000000000...|
|     14|1.913000000000000...|2.673600000000000...|5.200000000000000...|6.960000000000000...|1.910000000000000...|3.450000000000000...|1.740000000000000...|3.784000000000000...|-1.22260000000000...|
|     15|1.592000000000000...|1.916700000000000...|5.200000000000000...|2.643000000000000...|6.260000000000000...|1.212000000000000...|6.200000000000000...|3.785000000000000...|-1.22260000000000...|
|     16|1.400000000000000...|2.125000000000000...|5.000000000000000...|1.120000000000000...|2.830000000000000...|6.970000000000000...|2.640000000000000...|3.785000000000000...|-1.22260000000000...|
|     17|1.525000000000000...|2.774999999999999...|5.200000000000000...|1.966000000000000...|3.470000000000000...|7.930000000000000...|3.310000000000000...|3.785000000000000...|-1.22270000000000...|
|     18|1.555000000000000...|2.120200000000000...|5.200000000000000...|1.228000000000000...|2.930000000000000...|6.480000000000000...|3.030000000000000...|3.785000000000000...|-1.22270000000000...|
|     19|1.587000000000000...|1.991100000000000...|5.000000000000000...|2.239000000000000...|4.550000000000000...|9.900000000000000...|4.190000000000000...|3.784000000000000...|-1.22260000000000...|
|     20|1.629000000000000...|2.603299999999999...|5.200000000000000...|1.503000000000000...|2.980000000000000...|6.900000000000000...|2.750000000000000...|3.784000000000000...|-1.22270000000000...|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
  3. Cast to DoubleType: The CSV reader loads every column as a String, so cast all columns to Double since the features are numeric:
import org.apache.spark.sql.types.DoubleType

// cast all the string columns in the DataFrame to Double
val castedDF = df.columns.foldLeft(df)((current, c) =>
  current.withColumn(c, current(c).cast(DoubleType)))
castedDF.describe().show()

Output:

+-------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+-------------------+
|summary|          houseId|  medianHouseValue|      medianIncome|  housingMedianAge|        totalRooms|    totalBedrooms|        population|       households|         latitude|          longitude|
+-------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+-------------------+
|  count|            20640|             20640|             20640|             20640|             20640|            20640|             20640|            20640|            20640|              20640|
|   mean|          10320.5|206855.81690891474|3.8706710029070246|28.639486434108527|2635.7630813953488|537.8980135658915|1425.4767441860465|499.5396802325581| 35.6318614341087|-119.56970445736148|
| stddev|5958.399113856003|115395.61587441359| 1.899821717945263| 12.58555761211163|2181.6152515827944| 421.247905943133|  1132.46212176534|382.3297528316098|2.135952397457101|  2.003531723502584|
|    min|              1.0|           14999.0|            0.4999|               1.0|               2.0|              1.0|               3.0|              1.0|            32.54|            -124.35|
|    max|          20640.0|          500001.0|           15.0001|              52.0|           39320.0|           6445.0|           35682.0|           6082.0|            41.95|            -114.31|
+-------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+-------------------+

  4. Create Feature and Label Set: Divide the DataFrame into features and labels. We use VectorAssembler to create the feature set. VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector in order to train ML models like logistic regression and decision trees.
import org.apache.spark.ml.feature.VectorAssembler

// create the feature columns
val featureCols = Array("medianIncome", "housingMedianAge", "totalRooms", "totalBedrooms",
      "population", "households", "latitude", "longitude")
val vectorAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
//create Label and Features
val featuresDf = vectorAssembler.transform(castedDF).select("medianHouseValue", "features")
//print label/feature
featuresDf.show()

Output:

+----------------+--------------------+
|medianHouseValue|            features|
+----------------+--------------------+
|        452600.0|[8.3252,41.0,880....|
|        358500.0|[8.3014,21.0,7099...|
|        352100.0|[7.2574,52.0,1467...|
|        341300.0|[5.6431,52.0,1274...|
|        342200.0|[3.8462,52.0,1627...|
|        269700.0|[4.0368,52.0,919....|
|        299200.0|[3.6591,52.0,2535...|
|        241400.0|[3.12,52.0,3104.0...|
|        226700.0|[2.0804,42.0,2555...|
|        261100.0|[3.6912,52.0,3549...|
|        281500.0|[3.2031,52.0,2202...|
|        241800.0|[3.2705,52.0,3503...|
|        213500.0|[3.075,52.0,2491....|
|        191300.0|[2.6736,52.0,696....|
|        159200.0|[1.9167,52.0,2643...|
|        140000.0|[2.125,50.0,1120....|
|        152500.0|[2.775,52.0,1966....|
|        155500.0|[2.1202,52.0,1228...|
|        158700.0|[1.9911,50.0,2239...|
|        162900.0|[2.6033,52.0,1503...|
+----------------+--------------------+

  5. Scale the features using StandardScaler: Standardizing the features to zero mean and unit variance puts them on a comparable scale, which helps the regression converge:
import org.apache.spark.ml.feature.StandardScaler

// use StandardScaler to scale the feature set
val standardScaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithStd(true)
      .setWithMean(true)
val scaler = standardScaler.fit(featuresDf)
val scaledFeatures = scaler.transform(featuresDf)
//print the output
scaledFeatures.show()

Output:

+----------------+--------------------+--------------------+
|medianHouseValue|            features|      scaledFeatures|
+----------------+--------------------+--------------------+
|        452600.0|[8.3252,41.0,880....|[2.34470895611761...|
|        358500.0|[8.3014,21.0,7099...|[2.33218146484030...|
|        352100.0|[7.2574,52.0,1467...|[1.78265621721384...|
|        341300.0|[5.6431,52.0,1274...|[0.93294490759373...|
|        342200.0|[3.8462,52.0,1627...|[-0.0128806838430...|
|        269700.0|[4.0368,52.0,919....|[0.08744451941137...|
|        299200.0|[3.6591,52.0,2535...|[-0.1113636089684...|
|        241400.0|[3.12,52.0,3104.0...|[-0.3951270773548...|
|        226700.0|[2.0804,42.0,2555...|[-0.9423363181905...|
|        261100.0|[3.6912,52.0,3549...|[-0.0944672866994...|
|        281500.0|[3.2031,52.0,2202...|[-0.3513861309202...|
|        241800.0|[3.2705,52.0,3503...|[-0.3159091178071...|
|        213500.0|[3.075,52.0,2491....|[-0.4188135104422...|
|        191300.0|[2.6736,52.0,696....|[-0.6300964935813...|
|        159200.0|[1.9167,52.0,2643...|[-1.0285022981105...|
|        140000.0|[2.125,50.0,1120....|[-0.9188604311750...|
|        152500.0|[2.775,52.0,1966....|[-0.5767230643578...|
|        155500.0|[2.1202,52.0,1228...|[-0.9213869840377...|
|        158700.0|[1.9911,50.0,2239...|[-0.9893407287394...|
|        162900.0|[2.6033,52.0,1503...|[-0.6670999657155...|
+----------------+--------------------+--------------------+
  6. Divide the dataset into training and test sets:
//divide in test and train
val split = scaledFeatures.randomSplit(Array(.8, .2))
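Note that randomSplit draws a fresh random split on every run. For a reproducible 80/20 split, you can pass an explicit seed:

// pass a seed to make the split reproducible across runs
val split = scaledFeatures.randomSplit(Array(0.8, 0.2), seed = 42L)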
  7. Build a linear regression model with the scaled features:
import org.apache.spark.ml.regression.LinearRegression

// linear regression model
val lr = new LinearRegression().setLabelCol("medianHouseValue").setFeaturesCol("scaledFeatures")
      .setMaxIter(100)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
// Using Training set for model building
val lrModel = lr.fit(split(0))
    
// Print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")

Output:

Coefficients: [76738.94522230534,14903.054273122021,-18332.757718162775,50003.22137242165,-42713.72451462165,16547.521499666713,-90676.93313902401,-85365.66935383904] Intercept: 206772.90073196447
numIterations: 101
objectiveHistory: [0.5,0.429025021421256,0.26855114619613496,0.22652891956525684,0.21499857945908582,0.20773178001678289,0.19514784970174676,0.1885232671505026,0.1859671425971781,0.1853561326312991,0.18457399881973188,0.18441537651204076,0.18373103793910683,0.18350208412916771,0.1823597273673587,0.18229316947052004,0.18218622679968952,0.18215858300437648,0.18213798185173202,0.1821246023017618,0.18211459711626019,0.18209961828343316,0.18208837731785227,0.18205588604904863,0.1820434687407733,0.1820208677601439,0.18200624291405224,0.18200467173854357,0.18199278217947065,0.18199099118402143,0.18198050302769553,0.1819789044800163,0.18196944209901966,0.1819675578082704,0.18195935806645552,0.18195583674748167,0.18194791834626967,0.1819391408347191,0.1819311038867398,0.1819225377989266,0.18191655895966014,0.18190851174762848,0.18190363033484025,0.1818959409176114,0.18189156441553492,0.18188376965718034,0.18187948424747338,0.18187360174546469,0.18186895573232031,0.1818635331831013,0.1818584295438534,0.1818534353328115,0.1818477940447873,0.18184338131598446,0.1818370520009538,0.18183019469423445,0.18182623652061064,0.18181992818323542,0.18181479669876457,0.18180980523170026,0.18180828991921147,0.18180289998437538,0.1818009106789837,0.1817973011799926,0.18179518581366086,0.18179204915301264,0.18178992998002239,0.18178753203578832,0.18178555677441152,0.18178259277789513,0.18178126191707356,0.1817752061445554,0.18177052282864672,0.1817656813431612,0.18175784283297136,0.1817571920872549,0.18175514410840932,0.18175445013903024,0.18175257430793876,0.1817519142927131,0.1817502604701024,0.18174969009013336,0.18174822566214843,0.18174793601506734,0.18174738819241323,0.18174606666583532,0.1817457064933114,0.1817448402686882,0.18174442260634774,0.18174371450337473,0.18174341969111288,0.1817427069503691,0.18174243139117785,0.18174180325170708,0.18174152008293612,0.18174083826101903,0.1817405847778775,0.18173984443252902,0.18173934292707547,0.18173882279925963,0.18173828427045613]
RMSE: 69593.10829328775
r2: 0.6365386822854913

The model can now be used to make predictions.
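To sanity-check the model on the held-out 20% split, we can score split(1) with the trained model and compute the test RMSE, here sketched with Spark's standard RegressionEvaluator:

import org.apache.spark.ml.evaluation.RegressionEvaluator

// score the held-out test set with the trained model
val predictions = lrModel.transform(split(1))

// compute RMSE on the test set; setMetricName("r2") would report R-squared instead
val evaluator = new RegressionEvaluator()
  .setLabelCol("medianHouseValue")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
println(s"Test RMSE: ${evaluator.evaluate(predictions)}")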

It takes a fair amount of time and effort to create even the simple app in the Apache Spark example above. In contrast, we can quickly build the same app using the TransmogrifAI CLI module, which ships with TransmogrifAI out of the box, and explore the power of AutoML.

Real Estate House Price Prediction using TransmogrifAI

For the purpose of this blog, we are going to demonstrate how we can quickly generate a real estate housing price prediction application and train it using the California Housing dataset described above.

Code Generation

The fundamental idea behind code generation is type inference. Before going into further detail on how type inference works, let's quickly go over the steps required to generate the app and train the model, diving into the internals of the code generation along the way.

  1. Clone TransmogrifAI & Build CLI module
    Clone the TransmogrifAI project from GitHub:
# Cloning TransmogrifAI Repo
git clone https://github.com/salesforce/TransmogrifAI.git

Check out the latest release branch (in this example, 0.4.0):

cd ./TransmogrifAI
git checkout 0.4.0

Build the TransmogrifAI CLI:

./gradlew cli:shadowJar
alias transmogrifai="java -cp `pwd`/cli/build/libs/\* com.salesforce.op.cli.CLI"

  2. Fetch Dataset & Generate Real Estate App
    Create the required directories and download the dataset:
cd .. && mkdir -p blog/data && cd blog/data
wget https://raw.githubusercontent.com/ajayborra/TransmogrifAI-CaliforniaHousing/master/data/cadata.csv && cd ..

Generate and build the real estate app using the California housing dataset:

transmogrifai gen --input data/cadata.csv --id houseId --response medianHouseValue --overwrite --auto HouseObject RealEstateApp 
cd realestateapp && ./gradlew compileTestScala installDist

  3. Train the Model

Run training on the dataset with the generated app. This step takes about five to ten minutes:

./gradlew sparkSubmit -Dmain=com.salesforce.app.RealEstateApp -Dargs="--run-type=train --model-location=./house-model --read-location HouseObject=`pwd`/../data/cadata.csv"

After running the above steps, we get output that looks similar to the one below:

"bestModelUID" : "gbtr_ee53956069aa",
"testSetEvaluationResults" : {
  "(regEval)_R2" : 0.7763428813232043,
  "(regEval)_RootMeanSquaredError" : 54308.675029060905,
  "(regEval)_MeanAbsoluteError" : 36619.733388656015,
  "(regEval)_MeanSquaredError" : 2.9494321834121437E9
},
"bestModelName" : "gbtr_ee53956069aa_13",
"trainingSetEvaluationResults" : {
  "(regEval)_R2" : 0.8184847478105526,
  "(regEval)_RootMeanSquaredError" : 49200.74279966309,
  "(regEval)_MeanAbsoluteError" : 33837.91117985914,
  "(regEval)_MeanSquaredError" : 2.420713092038599E9
}
18/09/11 16:37:57 INFO OpWorkflowRunner: Total run time: 6m54.662s

This result means that we successfully trained our model and identified that a gradient boosted tree (gbtr_ee53956069aa) is a good fit for the dataset, with the above training and test scores.
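With the trained model saved at ./house-model, the same generated app can also score data by switching the run type. A sketch of a scoring invocation is below; the flag set follows the standard TransmogrifAI app runner, but the exact syntax (in particular the write location for the scores) may vary by version, so check the generated app's documentation:

# score the dataset with the saved model; --run-type=score and the
# --write-location for scores are assumptions based on the standard runner
./gradlew sparkSubmit -Dmain=com.salesforce.app.RealEstateApp -Dargs="--run-type=score --model-location=./house-model --read-location HouseObject=`pwd`/../data/cadata.csv --write-location=./house-scores"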

Type Inference

So far we have seen that, using the TransmogrifAI CLI module, we can quickly get an application off the ground and train it without writing any code. Now let's dive a bit deeper and explore type inference.

In the example above, we leveraged the automatic type inference capability of TransmogrifAI, which assigns a type to each field in the CSV file and detects the schema of the dataset. In this example, type assignment happens in two phases when we invoke transmogrifai gen --input data/cadata.csv with the --auto flag. In the first phase, the input dataset is passed to the Spark CSV module, which scans the entire dataset and assigns a primitive Spark type (string, long, double) to each field to construct the schema of the dataset. This schema is passed on to the second phase, which maps these primitive types to the rich data types supported by TransmogrifAI, described in detail here. For the curious mind, here's the link to the code that handles automatic type inference.
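To get a feel for what phase one produces, you can approximate it with Spark's own CSV reader, which performs the same kind of full-scan primitive type inference when inferSchema is enabled. This is an illustrative sketch, not TransmogrifAI's internal code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").getOrCreate()

// inferSchema triggers a scan of the data to assign a primitive type
// (string, long, double, ...) to each column
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/cadata.csv")

// prints the inferred primitive schema, e.g. medianIncome: double
inferred.printSchema()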

In general, if you want more control over which type gets assigned to a field during code generation, provide the --schema flag to the transmogrifai gen command. This flag allows users to pass in an Apache Avro schema file with the field-to-primitive-type mappings. In this case, the Avro schema passed to the CLI module is used to infer the primitive types of the fields, and these primitive types are then mapped to the rich types in the TransmogrifAI type hierarchy.
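Here is what such an Avro schema might look like for the housing dataset; the field names come from the dataset above, while the record name and type choices are illustrative assumptions:

{
  "type": "record",
  "name": "HouseObject",
  "fields": [
    {"name": "houseId", "type": "long"},
    {"name": "medianHouseValue", "type": "double"},
    {"name": "medianIncome", "type": "double"},
    {"name": "housingMedianAge", "type": "double"},
    {"name": "totalRooms", "type": "double"},
    {"name": "totalBedrooms", "type": "double"},
    {"name": "population", "type": "double"},
    {"name": "households", "type": "double"},
    {"name": "latitude", "type": "double"},
    {"name": "longitude", "type": "double"}
  ]
}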

Both cases above cover the most common scenario, where the schema of the dataset is flat, without any nested objects. But there are use cases where the data-reading part of the pipeline must deal with complex, nested data structures and nested schemas. For these scenarios, we recommend using the Custom Reader capabilities of the framework with a customized Apache Avro schema file: you can tailor the dataset parsing to your needs and map the resulting types to the TransmogrifAI type hierarchy.

The resulting fields, typed with TransmogrifAI types, are used to generate a feature abstraction for each field in the dataset. This set of feature abstractions is what drives the AutoML pipeline.
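For a concrete picture, here is roughly the shape of the feature definitions the CLI generates for this dataset. This is a simplified sketch of TransmogrifAI's FeatureBuilder API; the actual generated code may differ in detail:

import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._

// the response feature: the non-nullable real number we want to predict
val medianHouseValue = FeatureBuilder.RealNN[HouseObject]
  .extract(_.medianHouseValue.toRealNN).asResponse

// a predictor feature: each field in the dataset gets a strongly typed
// feature definition like this one
val medianIncome = FeatureBuilder.RealNN[HouseObject]
  .extract(_.medianIncome.toRealNN).asPredictor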

The real estate house price prediction use case, with its handful of numeric features, is a simplistic illustration of the power of TransmogrifAI. TransmogrifAI shines even more with diverse feature types that need sophisticated feature engineering, and with real-world data prone to hindsight bias or data leakage. Stay tuned for future blog posts where we uncover different aspects of the automated data pipeline capabilities offered by TransmogrifAI, including feature engineering, feature selection, model selection, and hyper-parameter tuning.
