Creating big data products: Shiny’s Spark journey



I have long been interested in how to develop and deploy Shiny-SparkR applications. This article shows how to use SparkR to drive a Shiny application.

What is SparkR

SparkR is an R package that provides a lightweight front end for using Apache Spark from R. It provides a distributed data frame, which removes the limitation that an R data frame can only be used on a single machine. Like an R data frame, it supports many operations, such as select, filter, and aggregate (similar to dplyr), and so addresses R's big data bottleneck. SparkR also supports distributed machine learning through Spark's MLlib library.

What is Shiny

Shiny is an open source R package that provides an elegant and powerful web framework for building web applications with R. Shiny helps you turn a data analysis into an interactive web application without requiring front-end knowledge.

Use cases

You may ask yourself, “Why do I need SparkR to run my application?” This is a reasonable question. To answer it, we need to understand the different types of big data problems.

Classification of big data problems

Recently, in a Reddit AMA, Hadley Wickham (chief scientist at RStudio) gave a clear definition of “big data”. His insights help us define use cases for SparkR and Shiny.

I think big data problems should be classified into three main categories:

  • Big data and small analysis: Data scientists start from a large raw data set and slice or sample it to answer a specific business or research question.
    In most projects the sampled result is a small data set, so SparkR is not needed to drive the Shiny application.

  • Piecewise aggregation analysis: Data scientists need distributed, parallel computation across multiple machines. Wickham calls this a trivially parallelizable problem. One example is when you need to fit one model per individual for thousands of individuals. In this case SparkR is a good choice, but you can also solve the problem with R packages such as foreach.

  • Large scale data analysis: Data scientists need all of the data, perhaps because they are fitting a complex model. One example is a recommender system, which benefits from a large amount of data because it has to capture users’ sparse interactions. When developing Shiny applications, SparkR is a good fit for this kind of problem.
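
The second category above, fitting many independent models, can be sketched in base R without Spark. This is only an illustration under stated assumptions: the built-in iris data stands in for a large data set, each species stands in for one "individual", and fit_one is a hypothetical helper name.

```r
# Split the built-in iris data into one data frame per species.
groups <- split(iris, iris$Species)

# Fit an independent linear model to each group. Each fit only touches its
# own partition, so lapply() could be swapped for a parallel map.
fit_one <- function(df) lm(Sepal.Length ~ Petal.Length, data = df)
models <- lapply(groups, fit_one)

# Aggregate step: collect the coefficients of every model into one table.
coef_table <- do.call(rbind, lapply(models, coef))
print(coef_table)
```

Because no model depends on another, the same pattern carries over to foreach or to a Spark cluster once the groups no longer fit on one machine.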

Memory considerations

In addition, when you plan to build such an application, it is important to consider how much memory is available. There are two cases:

  • If the application server you are running has enough memory to meet your big data needs, you may not need sparkr at all. Now there are cloud providers like Amazon AWS that provide computing memory on t.

  • If your data cannot fit on one machine, you may need to distribute it across several machines. SparkR suits this problem because it provides distributed algorithms that process the data on different nodes and return the results to the master node.
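
A rough way to decide between the two cases is to extrapolate memory use from a small sample. The sketch below is a heuristic, not a rule from the original article: both helper names are hypothetical, and the headroom factor of 3 is an arbitrary assumption to allow for the copies R makes while modelling.

```r
# Hypothetical helper: extrapolate the in-memory size (in GB) of a data set
# with `n_rows` rows from a small sample that has the same columns.
estimate_size_gb <- function(sample_df, n_rows) {
  bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
  bytes_per_row * n_rows / 1024^3
}

# Hypothetical rule of thumb: require `headroom` times the raw size in RAM.
fits_on_one_machine <- function(sample_df, n_rows, ram_gb, headroom = 3) {
  estimate_size_gb(sample_df, n_rows) * headroom < ram_gb
}

# iris stands in for a sample of a much larger table.
print(fits_on_one_machine(iris, n_rows = 1e6, ram_gb = 16))
print(fits_on_one_machine(iris, n_rows = 1e12, ram_gb = 16))
```

The estimate is approximate because object.size() includes fixed per-object overhead, so it is best read as an order-of-magnitude check.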

A simple example

Before looking at how each part of the application works, let’s download and run this simple Shiny-SparkR application. The example is available at the project address, under the directory “shiny-sparkr-demo-1”.

Prerequisites

  • Install Spark 1.5 or later.

  • Install Java 1.7 or later, and configure the environment variables.
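
Before launching the app, it can help to confirm from R that the environment variables are visible to the session. The following is a minimal sketch: check_env is a hypothetical helper, and JAVA_HOME is an assumed variable name, since the article does not say which Java variables to configure.

```r
# Report which of the named environment variables are set to a non-empty value.
# JAVA_HOME is an assumption; adjust the names to match your installation.
check_env <- function(vars = c("SPARK_HOME", "JAVA_HOME")) {
  values <- Sys.getenv(vars, unset = "", names = FALSE)
  setNames(nzchar(values), vars)
}

status <- check_env()
for (name in names(status)) {
  message(name, ": ", if (status[[name]]) "set" else "NOT set")
}
```

Note that the example application sets SPARK_HOME itself inside “server.R”, so a missing value here is only a problem if you remove that line.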

Launching the application

Once you have downloaded the app folder, open the project in RStudio and open the “server.R” file.

  1. Change the SPARK_HOME environment variable to the path where Spark is installed.

  2. Run the application with the command shiny::runApp(). SparkR takes some time to initialize before the results of the analysis are displayed.

  3. This is the code for “server.R”:

#Install the shiny library first, then load it
library(shiny)

#Setting system environment variables
Sys.setenv(SPARK_HOME = "/home/emaasit/Desktop/Apache/spark-1.5.2")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

#Load the SparkR library
library(SparkR)

#Create a spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

#Create a sparkr dataframe for the "iris" dataset
iris_DF <- createDataFrame(sqlContext, iris)

#Define the back-end logic that needs to predict the sepal length
shinyServer(function(input, output) {

  #Machine learning
  model_fit <- glm(Sepal_Length ~ Species + Petal_Width + Petal_Length, data = iris_DF, family = "gaussian")
  output$summary_model <- renderPrint({summary(model_fit)})
  output$predict_new_value <- renderText({
      Species <- as.character(input$species) 
      Petal_Width <- as.double(input$petalWidth)
      Petal_Length <- as.double(input$petalLength)
      new_data_frame <- data.frame(Species = Species, 
                                 Petal_Width = Petal_Width,
                                 Petal_Length = Petal_Length)
      newDataFrame <- createDataFrame(sqlContext, new_data_frame)
      predicted_value <- predict(model_fit, newData = newDataFrame)
      unlist(head(select(predicted_value, "prediction")))
  })
})
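
The server code above reads input$species, input$petalWidth, and input$petalLength, and writes output$summary_model and output$predict_new_value. The article does not show the matching user interface, so the following “ui.R” is a hypothetical sketch: the IDs come from the server code, while the labels, default values, layout, and the choice of submitButton() are assumptions.

```r
library(shiny)

# Hypothetical ui.R. Input and output IDs match the server code; labels,
# default values, and layout are illustrative assumptions.
ui <- fluidPage(
  titlePanel("Predict sepal length with SparkR"),
  sidebarLayout(
    sidebarPanel(
      selectInput("species", "Species", choices = levels(iris$Species)),
      numericInput("petalWidth", "Petal width", value = 1.2, min = 0, step = 0.1),
      numericInput("petalLength", "Petal length", value = 4.0, min = 0, step = 0.1),
      # submitButton() holds back input changes until it is clicked.
      submitButton("Predict Sepal Length")
    ),
    mainPanel(
      verbatimTextOutput("summary_model"),
      textOutput("predict_new_value")
    )
  )
)

shinyUI(ui)
```

submitButton() is used here because it needs no extra server code; whether the original app used it or an action button is not known.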


Step 1:

When you first run the application, the user interface is displayed without the rendered text or the model summary.

Step 2:

At the same time, in the background on your computer’s node(s), Java is launched using the spark-submit file, then the SparkR library is loaded, and then SparkR is initialized.

Step 3:

Then the SparkR commands in “server.R” are executed, and finally the output is displayed in the Shiny application.

You can visit port 4040 on localhost to check the progress of task scheduling in the Spark UI.

Step 4:

When you change the input values in the app and click the “Predict Sepal Length” button, the application passes the values you entered to the Spark context, runs the prediction function, and displays the predicted value. Compared with initializing the Shiny application, this operation takes only a short time.


The purpose of this example is to illustrate the use cases for SparkR and Shiny and to show what happens when you deploy and run the application on your own computer.

If you have built such an app, please share your thoughts and experiences in the comments below.

This article was translated by HarryZhu with the authorization of the original author, Daniel Emaasit.

In the spirit of Sharism, all text and images published here are under a CC license. When reposting, please keep the author information and credit Harry Zhu’s FinanceR column. If source code is involved, please cite the GitHub address. WeChat ID: harryzhustudio
For commercial use, please contact the author.