Coursera Spark, Week 1: Wikipedia

So this week marks the official start of the 4th course of the Scala Specialization on Coursera: the Spark course. This is an outstanding event, because so many students have been waiting for it for almost a year 🙂 What does it mean for me? Firstly, I need to stay focused on the video lectures and assignments. Secondly, I decided to share my impressions of the course here.

General impressions

As you may guess, the first week was entirely about Spark basics. In the video lectures, Heather Miller (the lecturer) covered a lot of material about Spark's place in big data processing. There was a pretty comprehensive comparison of Spark and Hadoop, so as a student of this course I now have a more or less objective understanding of Spark's advantages over Hadoop.

The course content is good enough: Heather's English is pleasant, and the slides contain a lot of useful information. But sometimes the slides are blurry, and reading them becomes pretty tricky. Theory predominates over practice: you will learn how Spark works and how to use it more efficiently, but getting acquainted with the Spark API is left to students as self-study.

This is exactly the moment when the previous three courses of the Specialization come in handy! Yes, I'm talking about Scala. Knowledge of Scala is extremely important for Spark, and the Scala collections API is especially useful. Despite the differences in their internals, the Spark API and the Scala collections API look similar.
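
As a small illustration of that similarity, here is the same transformation written against a local collection and against an RDD (a toy sketch, assuming an already created SparkContext named sc):

val numbers = List(1, 2, 3, 4, 5)

// Scala collections API: evaluated eagerly on the local heap.
val localResult = numbers.filter(_ % 2 == 0).map(_ * 10)

// Spark RDD API: the same-looking calls, but lazy and distributed;
// nothing runs until an action such as collect() is triggered.
val sparkResult = sc.parallelize(numbers).filter(_ % 2 == 0).map(_ * 10).collect()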

Impressions from assignments

Well, after watching the video lectures you need to complete some practical assignments. Based on my previous experience, I prepared myself for the worst-case scenario (the tasks in the first three courses were rather hard to solve). But everything went smoothly! After reading the assignments attentively, I started browsing the Spark API and implemented all of the functions step by step.

By Sunday evening, 4 out of 5 functions had passed the tests, and on Monday morning I completed the 5th one. Three… Two… One… I submitted the task for grading. And voilà! ZERO! ZERO POINTS! The functions were running too long, and I got 0 out of 10!

So I quickly looked at the functions once again and reworked them. All the tasks were related to processing Wikipedia articles and counting how frequently programming language names occur in them. My new result was 9 / 10. Wow! Then I figured out what was still wrong and improved the code to get 10 / 10.
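
I will not claim this was exactly my bug, but a common culprit for such timeouts in this assignment is recomputing the input RDD from scratch for every language: rankLangs calls occurrencesOfLang once per language, and each call is a full pass over the data. Caching the RDD once lets the repeated passes read from memory (a sketch, assuming the articles fit into the cluster's memory):

// The articles are kept in memory after the first pass, so the
// per-language passes made by rankLangs do not re-read the input.
val cachedRdd = rdd.cache()
val ranking = rankLangs(langs, cachedRdd)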

For those of you who are not taking this course, I want to share my solutions. So be kind and do not copy-paste them 🙂

Number of occurrences of a programming language in the Wikipedia articles:

import org.apache.spark.rdd.RDD

def occurrencesOfLang(lang: String, rdd: RDD[WikipediaArticle]): Int = {
  // Keep only the articles that mention the language as a whole word;
  // count() runs on the executors, so the texts never travel to the driver.
  rdd.filter(article => article.text.split(" ").contains(lang))
    .count()
    .toInt
}

Rank programming languages based on the occurrencesOfLang function:

def rankLangs(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] = {
  // One full pass over the RDD per language; the resulting list is tiny,
  // so sorting it on the driver is cheap.
  langs.map(lang => (lang, occurrencesOfLang(lang, rdd)))
    .sortBy(pair => pair._2)
    .reverse
}

Inverted index of the Wikipedia articles:

def makeIndex(langs: List[String], rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] = {
  // Emit a (language, article) pair for every language the article mentions,
  // then group the articles by language.
  rdd.flatMap(article =>
      langs.filter(lang => article.text.split(" ").contains(lang))
        .map(lang => (lang, article)))
    .groupByKey()
}

Rank programming languages based on the inverted index:

def rankLangsUsingIndex(index: RDD[(String, Iterable[WikipediaArticle])]): List[(String, Int)] = {
  // Turn each group of articles into its size, sort by count descending,
  // and only then collect the small (language, count) pairs to the driver.
  index.mapValues(articles => articles.size)
    .sortBy(pair => pair._2, ascending = false)
    .collect()
    .toList
}

Rank programming languages using the reduceByKey function:

def rankLangsReduceByKey(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] = {
  // A single pass over the data: emit (language, 1) for every match and let
  // reduceByKey combine the counts on the map side before the shuffle.
  rdd.flatMap(article =>
      langs.filter(lang => article.text.split(" ").contains(lang))
        .map(lang => (lang, 1)))
    .reduceByKey(_ + _)
    .collect()
    .toList
    .sortWith(_._2 > _._2)
}
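
A note on the last two approaches: reduceByKey combines the (language, 1) pairs on the map side, so only small (language, count) pairs cross the network, while the groupByKey inside makeIndex shuffles whole articles. That is why the reduceByKey version is preferable when you only need the ranking. If you want to play with these functions outside of the grader, a minimal local harness could look like this (just a sketch: the case class mirrors the course's WikipediaArticle, and the sample articles are made up):

import org.apache.spark.{SparkConf, SparkContext}

// Mirrors the assignment's article type (only the text field is used above).
case class WikipediaArticle(title: String, text: String)

object Playground {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("wikipedia").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val articles = sc.parallelize(Seq(
      WikipediaArticle("a1", "Scala is great and Spark is written in Scala"),
      WikipediaArticle("a2", "Java is verbose")
    ))

    // Assumes rankLangsReduceByKey defined above is in scope.
    println(rankLangsReduceByKey(List("Scala", "Java", "Haskell"), articles))

    sc.stop()
  }
}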

Summary

The first week of the “Big Data Analysis with Scala and Spark” course is interesting. I recommend focusing on learning the Spark API and practicing on some small samples in order to see what each particular function can do. If you have no previous experience with Scala, it will definitely not be easy to move fast through the assignments.

I’m going to publish the assignments for weeks #2 and #3 in the near future.
