Real-Time Weather Event Processing With HDF, Spark Streaming, and Solr

Blog - Apache, Government, HDF Spark

By now, we’ve all gotten well-acquainted with Hortonworks DataFlow (HDF), which collects, curates, analyzes, and delivers real-time data to data stores (with help from Apache NiFi) in a super easy and quick way without having to actually code the various components required to deliver the expected results.

My team and I have also been exploring HDF implementation in various projects and POCs and I have to say, we have grown richer by our experience of working on it. It simply is fantastic!

Today, I am sharing one of our recent uses of HDF, which I hope will let you all implement it effectively, as well.


It’s live weather reporting using HDF, Kafka, and Solr.

Here are the environment requirements for implementing:

  • HDF (for HDF 2.0, you need Java 1.8).
  • Kafka.
  • Spark.
  • Solr).
  • Banana.

Now let’s get on to the steps!

1. Create DataFlow

Start HDF with /opt/HDF- start. Open HDF https://localhost:8090/nifi. Choose the following processors to create the data flow:

  • InvokeHTTP.
  • SplitJson.
  • EvaluationJsonPath.
  • ReplaceText.

Configurations for Each of the Processes

InvokeHTTP processor properties:


With the help of the InvokeHTTP processor, we are connecting to the source and getting data with the help of

It gives JSON data as below:

JSON data

The parameters details are:

  • coord
    • coord.lon City geo location, longitude
    • City geo location, latitude
  • weather
    • Weather condition ID.
    • weather.main: Group of weather parameters (rain, snow, extreme etc.).
    • weather.description: Weather condition within the group.
    • weather.icon: Weather icon ID.
  • base: Internal parameter.
  • main
    • main.temp: Temperature (unit default: Kelvin; metric: Celsius; imperial: Fahrenheit).
    • main.pressure: Atmospheric pressure (on the sea level, if there is no sea_level or grnd_level data), hPa.
    • main.humidity: Humidity, %.
    • main.temp_min: Minimum temperature at the moment. This is a deviation from the current temperature that is possible for large cities and megalopolises geographically expanded (use these parameters optionally). Unit default: Kelvin; metric: Celsius; imperial: Fahrenheit.
    • main.temp_max: Maximum temperature at the moment. This is a deviation from the current temperature that is possible for large cities and megalopolises geographically expanded (use these parameters optionally).Unit default: Kelvin; metric: Celsius; imperial: Fahrenheit.
    • main.sea_level: Atmospheric pressure on the sea level, hPa.
    • main.grnd_level: Atmospheric pressure on the ground level, hPa.
  • wind
    • wind.speed: Wind speed (unit default: meter/sec, metric: meter/sec, imperial: miles/hour).
    • wind.deg: Wind direction, degrees (meteorological).
  • clouds
    • clouds.all: Cloudiness, %.
  • rain
    • rain.3h: Rain volume for the last three hours.
  • snow
    • snow.3h: Snow volume for the last three hours.
  • dt: Time of data calculation, unix, UTC.
  • sys
    • sys.type: Internal parameter.
    • Internal parameter.
    • sys.message: Internal parameter.
    • Country code (GB, JP etc.).
    • sys.sunrise: Sunrise time, unix, UTC.
    • sys.sunset: Sunset time, unix, UTC.
  • id: City ID.
  • name: City name.
  • cod: Internal parameter.

The JSON data is split into separate records.

SplitJson Processor Properties


With the help of the SplitJson processor, we are splitting the JSON data based on the JsonPath Expression value:


EvaluationJsonPath processor

EvaluationJsonPath properties:


With the help of the EvaluationJsonPath processor, we are getting the required column values from the JSON data:


ReplaceText Processor

ReplaceText processor properties:


Select the required fields from the list in JSON Format:


With the help of the ReplaceText processor, we are replacing the required JSON records in the required columns.

PutKafka Processor

  • Ingesting data into Kafka topic weather-in.
  • PutKafka will create a new Kafka topic if a topic does not already exist.


Finally, load data into Kafka by using the PutKafka processor. Then, Spark does the job of reading data from Kafka and processing it.

2. Spark-Scala Application to Save Kafka Data in CSV Format

Create a Maven project with following code:




Indexing Data in Solr

To start Solr, open the terminal and run the below commands:


Open the following Solr UI in a browser: https://localhost:8983/solr

Solr UI in a browser

3. Create core in Solr:


Ingest data into Solr from the CSV file generated earlier from the Spark application by using the following code:


Now, data is loaded into Solr core.

You can check data inserted into the core in the web UI:

core in web UI

Open the Banana Web Dashboard using https://localhost:8983/solr/banana/index.html.

It will show a default introduction dashboard similar to the image below:

banana dashboard

Click New in the top right corner to create a new dashboard. It will prompt you to click a type of dashboard:

  • Time-series dashboard if any timestamp column contains your core.
  • Non time-series dashboard if you don’t have any timestamp columns.

time-series dashboard
Time-series Dashboard1

Click Create to create a new dashboard. A successfully created dashboard will look something like this:

created dashboard will

Change the Time Window to select the number of rows required to be shown. Go to Table Panel to check whether your data is correctly parsed or not. If everything is fine, your data will be displayed as follows:

Table Panel

Go to Dashboard settings at the top right corner of the web page:

dashboard settings
dashboard settings1

The webpage will prompt you to create a new panel. Click on that and it will take you to the row settings.


Click on Add panel to empty row.

Select your desired panel from the list. It will show you options based on its properties.

desired panel from

Fill all the required fields to get the graph:

get a get graph

The below diagram shows temperature levels on ground level distributed by the type of weather that day:

type of weather day

Similarly, the humidity on ground level is distributed by wind speed:

distributed by wind

Pressure on sea-level distributed by wind speed:

distributed by wind speed

Follow the below configurations to get a pie chart of various weather days:

Chart of various weather days

Chart of various

You will also see the count of different types of weather days (whose data is recorded):

count of different types


I hope this gives you enough perspective on putting HDF to (very effective) use!

This blog was originally published on Dzone…


Contact MSRCosmos

Contact Us