Review on Apache Flink

research

Abstract:

This paper presents an overview of Apache Flink, a data processing framework developed under the Apache Software Foundation. It offers insights into how Apache Flink works and how efficient it is compared with other available data processing tools, and it explains why such tools are needed and why they matter for making data-driven decisions and for learning more from Big Data. Several techniques have been developed over time to process Big Data, such as Hadoop and MapReduce. Apache Flink is an open-source stream processing framework; its core is a distributed streaming dataflow engine written in Java and Scala [1]. This paper covers what a developer or data processing manager should be aware of when using Apache Flink, or implementing it, to process huge amounts of data flowing into a system at a high rate.

Introduction

Big Data can be thought of as data at a volume that escapes the processing capacity of any conventional database system. The amount of data is so huge that storing or processing it in a traditional database becomes a problem, and the data as a whole forms a very complex structure that is difficult to iterate over and extract useful information from. The need for distributed data processing frameworks is growing tremendously with the increasing demand for data-driven analysis. Two well-known data processing tools offer APIs for both batch and stream processing: Apache Flink and Apache Spark. This paper gives insights into how Apache Flink works and compares it with tools such as Apache Spark and Apache Beam. It also presents statistics gathered by running the word count example on single-node clusters of Apache Flink and Apache Spark, and it provides an easy-to-read tabular comparison of data processing frameworks that can help anyone deciding which tool suits their application.
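The word count example used for the single-node comparison boils down to one simple computation. A minimal plain-Python sketch of that logic (an illustration of what the benchmark computes, not the Flink or Spark API) looks like:

```python
from collections import Counter

def word_count(lines):
    """Count word occurrences across an iterable of text lines --
    the same computation the Flink/Spark word count benchmark performs,
    here without any distribution or streaming machinery."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return dict(counts)

print(word_count(["to be or not to be"]))
```

In Flink or Spark the same map (tokenize) and reduce (sum per key) steps are expressed through the framework's dataflow API and executed in parallel across the cluster.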

For full paper click here

Music Genre Classification Using Lyrics

research

Classification of music is an important and heavily researched task in the field of NLP. Previous research has focused on classifying music by mood, genre, annotations, and artist. All of these approaches used either audio features, lyrics as text, or both in combination.

Genre classification by lyrics is itself a clear Natural Language Processing problem. The end goal of NLP is to extract some form of meaning from text; for music genre classification, this equates to finding features in lyrics with which to classify songs. There is a wide range of scholarly and commercial applications for automated music genre classifiers. For example, classifiers could be used to automatically analyze and sort music into large databases, music recommendation systems could analyze a user's preferences and recommend appropriate songs to listen to, and a classifier could recommend music based on the user's mood. Similarity analysis is another related part of music information retrieval.

Report

Data

  • We scraped song data from songsLyrics.com and metrolyrics.com, and also used a song dataset from kaggle.com.

  • Our data included around 390,000 songs, with attributes such as song title, lyrics, year, artist, and the target attribute, genre. For our task, we sampled 20,000 songs of each genre from the original dataset.

Pre-Processing

  • Removed instances with genres such as “not available” and “other”.
  • Removed genres that did not have many instances.
  • Removed unnecessary characters using regular expressions.
  • Removed stopwords using nltk’s English stopwords and Stanford’s stopword list.
  • Stemmed the tokens in each song using nltk’s Snowball stemmer.
  • Some songs in our dataset contained non-English words. Using ftfy, we fixed the encoding of the text, and we also removed instances that still contained non-English words after the encoding fix.
  • Removed words such as ‘Chorus’ and ‘Verse’, which mark different parts of a song.
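The cleaning steps above can be sketched as a single pipeline. This minimal version uses only the standard library for illustration: the stopword set is a tiny stand-in for nltk's and Stanford's lists, and the suffix-stripping rule is a toy stand-in for nltk's Snowball stemmer.

```python
import re

# Tiny illustrative stopword set; the project used nltk's English
# stopwords combined with Stanford's stopword list.
STOPWORDS = {"the", "a", "an", "is", "i", "you", "to", "and"}

def preprocess(lyrics):
    """Clean one song's lyrics: strip section markers like 'Chorus',
    drop non-alphabetic characters, remove stopwords, and crudely stem."""
    lyrics = re.sub(r"\b(chorus|verse)\b", " ", lyrics, flags=re.I)
    lyrics = re.sub(r"[^a-zA-Z\s]", " ", lyrics)   # unnecessary characters
    tokens = [t for t in lyrics.lower().split() if t not in STOPWORDS]
    # Toy suffix stripping in place of a real Snowball stemmer.
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t
            for t in tokens]

print(preprocess("[Chorus] I'm singing the blues, yeah!"))
```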

Target variable

For our classification task we selected four genres as the classes of our target variable. The genres are as follows:

  • Classical
  • Pop
  • Metal
  • Jazz

Features

  • Similarity with four genres: We computed the top 30 words of each genre by tf-idf and created four features named metal_similarity, pop_similarity, rock_similarity, and hip_hop_similarity. If a token appeared among the top-30 words of a genre, its tf-idf value contributed to the cosine similarity between the song and that genre’s tf-idf profile.
  • POS tags: Using the nltk tokenizer, we used a normalized count of POS tags.
  • Word2vec: We trained a word2vec model on the whole dataset and the Brown corpus, then used it to generate a word2vec vector for each token in each song.
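The genre-similarity features above reduce to a cosine similarity between two sparse tf-idf weight vectors, where only tokens shared between the song and the genre's top-word profile contribute to the dot product. A minimal sketch, with made-up weights (the words and values below are hypothetical, not from the dataset):

```python
import math

def cosine_similarity(song_tfidf, genre_tfidf):
    """Cosine similarity between a song's tf-idf weights and a genre's
    top-word tf-idf profile; only shared tokens add to the dot product."""
    dot = sum(w * genre_tfidf[t] for t, w in song_tfidf.items() if t in genre_tfidf)
    norm_song = math.sqrt(sum(w * w for w in song_tfidf.values()))
    norm_genre = math.sqrt(sum(w * w for w in genre_tfidf.values()))
    if norm_song == 0 or norm_genre == 0:
        return 0.0
    return dot / (norm_song * norm_genre)

# Hypothetical top-word profile and song vector sharing two tokens.
metal_profile = {"guitar": 0.9, "dark": 0.7, "night": 0.5}
song = {"guitar": 0.8, "love": 0.6, "night": 0.4}
metal_similarity = cosine_similarity(song, metal_profile)
```

One such score per genre profile yields the four similarity features fed to the classifiers.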

Models used:

  • Dummy Classifier
  • kNN Classifier
  • MLP Classifier
  • Gradient Boosting
  • Logistic Regression

Metrics Used:

  • Accuracy
  • F1 Score
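For a multi-class task like this one, F1 is typically averaged over the classes. A small standard-library sketch of both metrics (the label lists below are illustrative, not project results):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 from precision and recall,
    then the unweighted mean over all classes."""
    f1_scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)

y_true = ["pop", "metal", "pop", "jazz"]
y_pred = ["pop", "metal", "metal", "jazz"]
print(accuracy(y_true, y_pred))   # 0.75
```

In practice a library implementation (e.g. scikit-learn's `f1_score`) would be used; the sketch only makes the definitions concrete.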

Conclusion

After analyzing the tf-idf values and the confusion matrices, we came to see how similar rock and metal songs are: most of the classifiers mixed up labels between these two genres. For future work, additional features such as parse trees, word endings, and song length could help distinguish these two genres and further increase the accuracy of the classifiers.

  • Results and confusion matrices:

[Project Link](https://github.com/kartikprakash1993/Lyrics-Analysis)

Music Genre Classification Using Audio Signals

research

The objective of this project was to apply a machine learning approach to classify a song based on its audio features. The same approach could also be used for recommendation purposes in a large, scalable system. Music genres are hard to describe systematically and consistently due to their inherently subjective nature. This project uses a small dataset to understand how to approach this kind of problem and to develop a model that is easy to understand and use.

Following ML algorithms were applied:

  • k-Nearest neighbors
  • Neural Networks with different parameters

DataSet

We have 2,530 instances distributed among four genres (our target variable). The genres are as follows:

  • Classical
  • Jazz
  • Pop
  • Metal

We initially worked with 10 genres but, due to the huge data size, reduced our target categories to four. We selected the four genres above because each has a distinct style of music.

Data source

We collected the data from a website that provides free and legal downloads of music tracks organized by genre. Instead of downloading tracks manually, we designed a web crawler in Java using Selenium, a web automation framework. We upgraded the crawler so that it dynamically loads the web page and downloads the tracks. Because of the huge size of the data, we ran the crawler on four different computers simultaneously; each genre took roughly 7 hours to download.

Target variable

For our classification task we selected four genres as the classes of our target variable. The genres are as follows:

  • Classical
  • Pop
  • Metal
  • Jazz

Features

We chose five features:

  • Pitch (chromagram): In music, the pitch of a note means how high or low the note is.
  • RMS: The RMS (root-mean-square) value is the effective value of the total waveform.
  • Tempo: Tempo is the speed or pace of a given piece or subsection.
  • Roll-off: Roll-off is the steepness of a transmission function with frequency.
  • Zero crossing rate: The zero-crossing rate is the rate of sign changes along a signal.
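Two of these features follow directly from their definitions and can be computed from raw samples in a few lines. A standard-library sketch (in practice an audio library such as librosa would extract these from real tracks):

```python
import math

def rms(signal):
    """Root-mean-square: the effective amplitude of the waveform."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum((a < 0) != (b < 0) for a, b in zip(signal, signal[1:]))
    return crossings / (len(signal) - 1)

sig = [1.0, -1.0, 1.0, -1.0]   # alternating samples: every pair crosses zero
print(rms(sig), zero_crossing_rate(sig))   # 1.0 1.0
```

A noisy, percussive signal (typical of metal) tends to have a high zero-crossing rate, while smoother tonal material (typical of classical) tends to have a low one, which is why these simple statistics carry genre information.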

Conclusions:

  • k-nearest neighbours with different numbers of neighbours:

  • Neural networks with different layers and solvers:

  • Important features for k-NN

The most important feature for kNN: Roll-off.

  • Important features for Neural Networks

The most important features for the neural networks: Roll-off and Tempo. Negative features for the neural networks: Pitch, RMS, and Zero crossing rate.

Project Link

Big Data & Hadoop Framework

research

In this paper, we describe Big Data and the open-source Hadoop framework. Statistics are provided explaining the formation of Big Data and the importance of analyzing it. Big Data poses many problems in real-world scenarios due to its vast size, velocity, and variety. Several techniques have been developed to process Big Data; the Hadoop framework is one such technique and is described along with pseudocode for its mapper and reducer functions. Hadoop's file system, HDFS, is explained as well with the help of an appropriate diagram.
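The mapper and reducer pseudocode described in the paper follows the classic word-count pattern. A minimal Python sketch of the two phases, with the framework's shuffle step simulated in-process (an illustration, not the paper's exact pseudocode):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one key."""
    return word, sum(counts)

def run(lines):
    """Drive map -> shuffle -> reduce over an in-memory input."""
    groups = defaultdict(list)
    for line in lines:                      # map
        for word, one in mapper(line):
            groups[word].append(one)        # shuffle: group pairs by key
    return dict(reducer(w, c) for w, c in groups.items())  # reduce

print(run(["hello world", "hello hadoop"]))
```

In real Hadoop the mapper and reducer run as distributed tasks, with HDFS supplying the input splits and the framework performing the shuffle between them.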