
My learnings about Big Data

Sharing some interesting things I've learned recently

19 October 2015

Hi folks! Yesterday I joined a Big Data course on Coursera. It is taught by the University of California, San Diego. I thought it would be a good idea to write a blog post about my learnings, so you can get an idea of the kind of things you will learn if you decide to enrol as well.

I will keep posting notes here while doing the course, so don't expect a very organised article until the end!

Week 1: Welcome to Big Data

  • There's a huge need for data scientists
  • 90% of the world's data has been created in the last two years!
  • In 2009 we had 0.8 ZB of data and in 2020 we will have 20 ZB
  • It's growing very rapidly
  • The majority of Big Data out there is unstructured
  • It is generated everywhere around us (mobiles, GPS, etc.)
  • Companies need to capture information about their products, services, customers, pricing, segmentation, social networks, etc., to gain insight from the data
  • Gather, store and manipulate large amounts of data at the right speed and at the right time
  • There's a lot of untapped value in Big Data
  • Predictive and deep analytics
  • Come up with answers and improve ROI or try to understand our customers and learn their habits and predict their future behaviours
  • Functional requirements: collection, integration, organisation, analysis (statistical, summary, predictive, ...), management, taking action, making decisions
  • Big Data stack: analyse then offer some focussed services on top of those analytics
  • Tools that provide fast, scalable access to the data then push that to the analytics stack
  • Many different areas of booming new technologies. Crowded and diverse space
  • Marketing companies are at the forefront
  • Needs: real-time, scalable, high-performance analytics on large datasets
  • Bring storage capacity and computational capacity together: Hadoop
  • Apache Hadoop: open source, low cost, reliable, scalable, distributed computing. From a single server to thousands of machines
  • Fault-tolerant, flexible environment (structured or unstructured data)
  • Lower layer: Hadoop Distributed File System (HDFS)
  • Middle layer: Hadoop MapReduce, a model for large-scale data processing
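  • Top layer: we can have software like Pig, Hive, Mahout, etc. to manipulate the data through the MapReduce processes
  • Minimise data movement: run the computation where the data is stored
  • How MapReduce works: a map step turns each piece of input into key-value pairs, the pairs are grouped by key, and a reduce step aggregates each group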
  • We will learn how to submit MapReduce jobs
  • In Hadoop 2.0 we have YARN, which separates resource management from MapReduce so that other processing models can run on the same cluster
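
To make the MapReduce bullet above a bit more concrete, here is a minimal sketch of the model in plain Python, using the classic word-count example. This is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are my own, and everything runs in a single process, whereas a real job would spread the map and reduce work across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence (single process, illustration only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one input split.
docs = ["big data is big", "data is everywhere"]
counts = word_count(docs)
print(counts)
```

Each call to map_phase only looks at its own document, and each call to reduce_phase only looks at one key, which is what lets a framework like Hadoop run them in parallel next to the data.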

Week 2: Why Big Data?

  • Computers are no longer deterministic machines. Not physically available
  • Bring technologies together to find meaning in large, fast-moving, uncertain data
  • Before: relational databases. Now: clickstream
  • Machine data is very fast
  • Streaming data. IoT. Very fast too
  • Before: structured datasets. Now: raw, complex, unstructured
  • Going beyond the data warehouse. SQL? HBase, Hive,...
  • Expanded 'views' of data: behavioural, social media. The challenge: integration
  • Find meaning in the chaos: integration, transformation, load
  • Analytics: simple, advanced, statistical
  • Predictive dashboards
  • Parallelised, distributed, optimised
  • Before: sample, do machine learning, build predictive models, score larger data set. Now: just analyse all data and run models? Exploding sample size
  • Correlation vs causality: a correlation between two things does not necessarily explain what causes them
  • New methods from the research community: deep learning, moving beyond flat files to more complex data
  • Past and present. Before: white-coat PhDs with expensive tools. Now: data scientists with open source tools
  • Who are data scientists? They need to understand statistics, machine learning, databases, data mining, querying, ordering, visualisation, ...
  • Communication skills. Understand the domain.
  • Ask the right questions that will bring value to the business
  • Intellectual curiosity, intuition, communication and engagement, presentational skills, creativity, and business savvy. Interact with business analysts.
  • Data preparation, understanding, modelling
  • Need to code, create equations
  • Most successful data scientists have substantial, deep expertise in at least one aspect of data science: statistics, machine learning, Big Data, business communication
  • Data science is inherently collaborative and creative
  • Curriculum topics: Data manipulation at scale, Analytics, Communicating results
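
The correlation vs causality point above is easier to see with numbers. Below is a small sketch with made-up data (it is not from the course): temperature drives both ice cream sales and sunburn cases, so the two end up strongly correlated even though neither causes the other. The variable names, coefficients and noise levels are assumptions chosen only for illustration. The partial correlation at the end uses the standard first-order formula to control for temperature.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs (first-order formula)."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: temperature is the hidden common cause (the confounder).
temperature = [rng.uniform(10, 35) for _ in range(1000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburn = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r_raw = pearson(ice_cream, sunburn)
r_partial = partial_correlation(ice_cream, sunburn, temperature)

print(f"raw correlation:          {r_raw:.2f}")
print(f"controlling for temp:     {r_partial:.2f}")
```

The raw correlation comes out high, while the one that controls for temperature is close to zero, which is the sense in which a correlation alone does not explain the cause.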