# DataLab Getting Started in R

The Bigstep DataLab is an open data exploration service for data science, analytics, and technology experimentation. It is built on our SparkArray and DataLake services and on our highly flexible, high-performance bare-metal infrastructure.

This tutorial assumes some programming experience.

## Uploading Data

A private DataLake (an HDFS service) stores the data that the SparkArray uses. To upload data to a Bigstep DataLake, you would typically:
1. upload data to the home directory of the DataLake using commands like `-put`
2. run commands like `-ls` to check that the data was uploaded to the DataLake

```
dl -ls /
16/09/26 17:18:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxrwxrwx   - hdfs supergroup          0 2018-09-15 13:12 /data_lake/dl1234/baseball
drwxrwxrwt   - hdfs supergroup          0 2018-09-15 12:08 /data_lake/dl1234/tmp
drwxr-xr-x   - hdfs supergroup          0 2018-09-15 12:08 /data_lake/dl1234/tmp/user
```
You can also execute the same commands on the master container!

Data can also be uploaded to the DataLake with the File Browser, available in the DataLake File Browser tab of our user interface.

In [None]:
# Download a sample CSV file by running a shell command from R
system("wget http://www.exploredata.net/ftp/MLB2008.csv", intern=TRUE)

In [None]:
# Copy the downloaded file to Bigstep DataLake, using the path specified under the Spark tab in the Bigstep Control Center
system("dl -put MLB2008.csv /", intern=TRUE)
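
The two cells above use `system(..., intern=TRUE)`, which returns the command output and only raises a warning when the command exits with a non-zero status. As a minimal sketch, the upload-then-verify steps could be wrapped in a helper that stops on failure. The names `dl_args` and `run_dl` are hypothetical, and the sketch assumes the `dl` command is on the `PATH` of the notebook container, as in the cells above.

```r
# Hypothetical helper: build the argument vector for a `dl` call,
# e.g. dl_args("put", "MLB2008.csv", "/") gives c("-put", "MLB2008.csv", "/").
dl_args <- function(op, ...) {
  c(paste0("-", op), ...)
}

# Hypothetical helper: run `dl` and stop if it reports a non-zero exit status.
run_dl <- function(op, ...) {
  status <- system2("dl", dl_args(op, ...))
  if (status != 0) {
    stop("dl -", op, " failed with exit status ", status)
  }
  invisible(status)
}

# Usage (requires the `dl` command, so it is left commented out here):
# run_dl("put", "MLB2008.csv", "/")   # upload the file
# run_dl("ls", "/")                   # confirm it is listed
```

Separating the argument construction from the call keeps the command-building step easy to inspect before anything is sent to the DataLake.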

## Initialize Spark Context

For all Spark functions to be available, a Spark context has to be initialized in the current notebook.

In [None]:
library(SparkR)
sparkR.session(appName = "R", sparkConfig = list(spark.warehouse.dir=""))


## RDDs

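A Resilient Distributed Dataset (RDD) is a collection of records that is spread across multiple servers. It lets the programmer abstract away the complexity of transforming large volumes of distributed data. In SparkR, the `read.df` call used below returns a SparkDataFrame, a higher-level abstraction built on top of RDDs.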
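
To make the idea concrete, here is a minimal sketch in plain R (not the Spark API) that mimics the concept: the data is split into partitions, each partition is transformed independently, and the partial results are combined. The three-way split and the variable names are illustrative assumptions; in Spark the partitions would live on different servers.

```r
# Conceptual illustration only: base R, no Spark involved.
values <- 1:10

# Assign each element to one of three "partitions" in round-robin fashion.
partition_id <- rep(1:3, length.out = length(values))
partitions <- split(values, partition_id)

# Transform each partition independently (here: sum of squares per partition).
partial_sums <- lapply(partitions, function(p) sum(p^2))

# Combine the partial results into a single value.
total <- Reduce(`+`, partial_sums)
print(total)
```

The result is the same as computing `sum(values^2)` in one step, which is the point: the caller describes the transformation, and the partitioning stays out of sight.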

In [None]:
system("wget http://seanlahman.com/files/database/baseballdatabank-master_2016-03-02.zip", intern=TRUE)

system("apt-get install -y unzip", intern=TRUE)
system("unzip baseballdatabank-master_2016-03-02.zip", intern=TRUE)
system("rm -rf baseballdatabank-master_2016-03-02.zip", intern=TRUE)

system("dl -put baseballdatabank-master/core/AllstarFull.csv /", intern=TRUE)

In [None]:
system("dl -chmod 777 /tmp/hive", intern=TRUE)

In [None]:
Sys.getenv()

In [None]:
sc <- sparkR.session()
 
people <- read.df("/AllstarFull.csv", "csv")


In [None]:
count(people)

In [None]:
first(people)
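
The `read.df` call above passes no reader options, so Spark treats every line of the file as data and assigns default column names such as `_c0` and `_c1`. If the first row of `AllstarFull.csv` holds column names (an assumption worth checking against the output of `first(people)`), a sketch like the following would read the file with a header and inferred column types. The function name `read_allstar` is hypothetical; `header` and `inferSchema` are options of Spark's CSV data source.

```r
# Hypothetical wrapper: read the CSV with a header row and inferred types.
# Assumes the file was uploaded to the DataLake root, as in the cells above.
read_allstar <- function(path = "/AllstarFull.csv") {
  SparkR::read.df(path, source = "csv", header = "true", inferSchema = "true")
}

# Usage (needs an active Spark session, so it is left commented out here):
# allstar <- read_allstar()
# printSchema(allstar)
```

With a header in place, `count` would no longer include the header line as a record.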

## DataFrames and SparkRSQL

A SparkDataFrame can also be registered as a temporary view in Spark SQL, which lets you run SQL queries over its data. The `sql` function lets applications run SQL queries programmatically and returns the result as a SparkDataFrame.

Spark 2.3.0 has built-in readers for several formats, including CSV and JSON. The example below reads a JSON file that ships with Spark:

In [None]:
# Read a json file
dfPeople <- read.df("file:///opt/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/people.json", "json")

In [None]:
# Register the DataFrame as a SQL temporary view
createOrReplaceTempView(dfPeople, "people")

# SQL statements can be run by using the sql method
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)