Max Hemingway

~ Musings as I work through life, career and everything.

Category Archives: Data Science

Learning Data Science – Useful References

14 Tuesday Jul 2015

Posted by Max Hemingway in Big Data, Data Science, Machine Learning, Open Source

≈ 1 Comment

Tags

Big Data, Data, Data Science, Knowledge, Machine Learning

Firstly, thanks to Tim Osterbuhr, who prompted me to create this list of resources that I have found useful in learning about Data Science after he read my blog post on Learning Data Science. Tim has also provided some of the links below.

Here is the list of Useful References for Learning Data Science. (This list is by no means exhaustive.)

From my Blog

  • Learning Data Science
  • Data Science in the Cloud ebook
  • Data Science and Information Theory
  • Data Mining Courses
  • Open Source, Open Human, Open Data, Open Sesame!
  • Data Scientist Skill Set
  • R {swirls} – Learning R by doing
  • Correlation does not imply causation
  • Statistical Inference Resources

From Around the Web

  • 6 checkpoints to ensure regression model validity for analytics
  • Algorithms: Design and Analysis
  • Analyzing Big Data with Twitter
  • Big Data Analytics: Descriptive Vs. Predictive Vs. Prescriptive
  • Data Analysis
  • Data Mining for the Masses
  • Data Science Course
  • Google Visualization API Reference
  • k-means clustering
  • Occam’s Razor
  • PCA Step by Step
  • Regression Equation: What it is and How to use it
  • Using JavaScript visualization libraries with R

Public Data Sets

  • http://www.cs.cmu.edu/~./enron/
  • http://www.secviz.org/content/the-davix-live-cd
  • http://www.caida.org/data/overview/
  • http://www.secviz.org/content/visual-analytics-workshop-with-worlds-leading-security-visualization-expert-0
  • http://snap.stanford.edu/data/
  • http://analytics.ncsu.edu/
  • https://code.google.com/p/google-refine/

Data Science Books

  • 9 Free Books for Learning Data Mining & Data Analysis
  • 16 Free Data Science Books
  • 27 free data mining books

Happy to add other links from readers to this list.

The R Consortium

01 Wednesday Jul 2015

Posted by Max Hemingway in Data Science, Open Source, Programming

≈ Leave a comment

Tags

Coding, Data Science, Development, Open Source

The R Consortium has been founded and recently launched under an Open Source Governance and Foundation model. This is in response to the growing use of R and its communities.

The initial members of the R Consortium include:

  • Microsoft
  • RStudio
  • TIBCO
  • Alteryx
  • Google
  • HP
  • Ketchum Trading LLC
  • Mango Solutions
  • Oracle

The mission statement is listed as:

The central mission of the R Consortium is to work with and provide support to the R Foundation and to the key organizations developing, maintaining, distributing and using R software through the identification, development and implementation of infrastructure projects.

They have also listed a number of potential projects they will be involved with:

  • strengthening the R Forge infrastructure;
  • assisting the Stanford University group running useR! 2016;
  • developing documentation; and
  • encouraging increased communication and collaboration among users and developers of the R language.

One to watch going forward for influencing the R Community.

Source: https://www.r-consortium.org

Getting to Grips with Git

08 Monday Jun 2015

Posted by Max Hemingway in Cloud, Data Science, DevOps/OpsDev, Open Source, Programming

≈ Leave a comment

Tags

Cloud, Coding, Data Science, DevOps, Open Source, OpsDev, Programming

If you are new to Git or want to refresh your skills and knowledge, a good way of learning is the Learn Git Branching simulator, which takes you through the commands and techniques.

Welcome to Learn Git Branching

Interested in learning Git? Well you’ve come to the right place! “Learn Git Branching” is the most visual and interactive way to learn Git on the web; you’ll be challenged with exciting levels, given step-by-step demonstrations of powerful features, and maybe even have a bit of fun along the way.

After this dialog you’ll see the variety of levels we have to offer. If you’re a beginner, just go ahead and start with the first. If you already know some Git basics, try some of our later more challenging levels.

The simulator covers:

Introduction Sequence
– Introduction to the majority of Git commands

Ramping Up
– Additional Git commands

Moving Work Around
– Modifying the source tree

A Mixed Bag
– Git techniques, tricks and tips

Advanced Topics

Open Data Handbook Revised

21 Thursday May 2015

Posted by Max Hemingway in Data Science, IoT, Open Source

≈ Leave a comment

Tags

Data Science, IoT, Open Source

Originally published in 2012, the “Open Data Handbook” has been revamped to inspire Open Data newcomers.

The new version of the online site allows access to:

  • The Open Data Book
  • Value Stories
  • Resource Library

For those who have not yet been introduced to the Open Data Handbook, the extract below is a good overview of what it's about.

This handbook discusses the legal, social and technical aspects of open data. It can be used by anyone but is especially designed for those seeking to open up data. It discusses the why, what and how of open data – why to go open, what open is, and the how to ‘open’ data.

There are already a lot of “Open” sources, such as Open Source Web Crawlers and Data Sets, and the number is set to grow in the wake of the Internet of Things (IoT) and other data creation solutions.

The Handbook/Guide provides a good place to start if you are considering making your data Open. However, the data created will also be a huge revenue generator for companies who produce the IoT devices. The amount of data that will be made Open remains to be seen as more things come online.

Open Source Web Crawlers and Data Sets

15 Friday May 2015

Posted by Max Hemingway in Big Data, Data Science

≈ 1 Comment

Tags

Big Data, Data, Data Science

A great list of 50 Open Source Web Crawlers has been produced by Baiju NT on a Big Data blog.

Web Crawlers are useful for gathering data from other sites when performing research, although caution should be used: with today’s levels of protection, some sites’ defences may treat your data gathering as an attack.
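One way to exercise that caution is to check a site's robots.txt before crawling it. The following is a minimal sketch in Python using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the "example-research-bot" user agent are all made up for illustration so that the sketch runs offline; a real crawler would read the file from the target site and should also respect that site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch),
# inlined so the example runs without a network connection.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent name for the crawler.
USER_AGENT = "example-research-bot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a robots.txt parser from text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules allow our user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# A page outside the disallowed path, and one inside it.
allowed_public = may_fetch(parser, "https://example.com/data/index.html")
allowed_private = may_fetch(parser, "https://example.com/private/report.html")

# Seconds the site asks crawlers to wait between requests (None if unset).
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

Honouring the crawl delay and skipping disallowed paths will not guarantee a site is happy with the traffic, but it makes the crawler look far less like an attack.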

It’s probably best to check first whether any data sets already exist with the data you are looking for.

https://www.quandl.com/ is a search engine for data sets that has listed 12 million data sets.

There are lots of data sets available from governments such as http://data.gov.uk/ in the UK.

If a smaller list of good data sources is needed, have a look at http://www.kdnuggets.com/datasets/index.html

Sources:

  • https://www.quandl.com/
  • http://www.kdnuggets.com/datasets/index.html
  • http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/

Data Mining Courses

28 Tuesday Apr 2015

Posted by Max Hemingway in Big Data, Data Science

≈ 1 Comment

Tags

Big Data, Data, Data Science, learning

Via Coursera, the University of Illinois at Urbana-Champaign is running a specialisation on Data Mining. As with all Coursera courses, you don’t have to take the specialisation, but can take the courses individually or one after another. Taking the courses outside of the specialisation means that you won’t get to complete the capstone project and earn your certificate at the end.

This track is made up of 5 courses covering:

Pattern Discovery in Data Mining

  • Introduction to data mining
  • Concepts and challenges in pattern discovery and analysis
  • Scalable pattern discovery algorithms
  • Pattern evaluation
  • Mining flexible patterns in multi-dimensional space
  • Mining sequential patterns
  • Mining graph patterns
  • Pattern-based classification
  • Application examples of pattern discovery

Text Retrieval and Search Engines

  • Introduction to text data mining
  • Basic concepts in text retrieval
  • Information retrieval models
  • Implementation of a search engine
  • Evaluation of search engines
  • Advanced search engine technologies

Cluster Analysis in Data Mining

  • Basic concept and introduction
  • Partitioning methods
  • Hierarchical methods
  • Density-based methods
  • Probabilistic models and EM algorithm
  • Spectral clustering
  • Clustering high dimensional data
  • Clustering streaming data
  • Clustering graph data and network data
  • Constraint-based clustering and semi-supervised clustering
  • Application examples of cluster analysis

Text Mining and Analytics

  • Overview of text analytics and applications
  • Extending a search engine to support text analytics (text categorization, text clustering, text summarization)
  • Topic mining and analysis with statistical topic models
  • Opinion mining and summarization
  • Integrative analysis of text and structured data

Data Visualization

  • Visualization Infrastructure (graphics programming and human perception)
  • Basic Visualization (charts, graphs, animation, interactivity)
  • Visualizing Relationships (hierarchies, networks)
  • Visualizing Information (text, databases)

These courses would complement the courses from Johns Hopkins on Data Science.

Source: https://www.coursera.org/specialization/datamining/20

Big Data – 4V’s + Verification

27 Monday Apr 2015

Posted by Max Hemingway in Big Data, Data Science, IoT

≈ Leave a comment

Tags

Big Data, Data, Data Science, Infographic, IoT

IBM have released an Infographic on the “Four V’s of Big Data” which covers:

  • Volume – Scale of Data
  • Variety – Different forms of Data
  • Velocity – Analysis of Streaming Data
  • Veracity – Uncertainty of Data

There should be another V for “Verification”, which covers the questions you ask of the data in order to obtain the results. A check should also be made on the inferences drawn from the results, as different views of the data, or questions asked in a slightly different way, could produce completely different outcomes.
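To show how the way a question is asked can flip the answer, here is a small Python sketch of Simpson's paradox. The counts are illustrative only and are not taken from the infographic; the names option_a, option_b, small and large are placeholders chosen for this example. Asking "which option has the best success rate overall?" gives one answer, while asking the same question within each segment gives the opposite one.

```python
# Illustrative counts only (an assumption for this sketch):
# (successes, trials) for each option within each segment.
RESULTS = {
    "option_a": {"small": (81, 87), "large": (192, 263)},
    "option_b": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes: int, trials: int) -> float:
    """Success rate for a single segment."""
    return successes / trials


def overall_rate(segments: dict) -> float:
    """Success rate with all segments pooled together."""
    total_successes = sum(s for s, _ in segments.values())
    total_trials = sum(n for _, n in segments.values())
    return total_successes / total_trials


def best_overall(results: dict) -> str:
    """Question 1: which option wins when the data is pooled?"""
    return max(results, key=lambda name: overall_rate(results[name]))


def best_in_segment(results: dict, segment: str) -> str:
    """Question 2: which option wins inside one segment?"""
    return max(results, key=lambda name: rate(*results[name][segment]))


overall_winner = best_overall(RESULTS)
segment_winners = {
    segment: best_in_segment(RESULTS, segment)
    for segment in ("small", "large")
}

print("Pooled view:   ", overall_winner)
print("Segmented view:", segment_winners)
```

With these counts the pooled view favours option_b, yet option_a has the higher rate in both segments, because the two options were tried on very different mixes of segments. Verification in the sense above would mean checking both views before acting on either.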

Having the right data is important, as is ensuring that the data gathered is relevant to the business questions you are asking. Two stats in the infographic stick out for me on this:

  • $3.1 Trillion a year lost to poor data quality
  • 40 Zettabytes of data created by 2020

Perhaps with the right Verification there would be less uncertainty (Veracity), and businesses could make a huge saving by reducing the money and time lost to incorrect data.

R {swirls} – Learning R by doing

16 Thursday Apr 2015

Posted by Max Hemingway in Data Science, Programming

≈ 1 Comment

Tags

Coding, Data Science, Programming, R

A swirl is an interactive way of learning R by installing a package called {swirl} into R and then installing a course.

I have used swirls in the Data Science Courses on Coursera and found them a useful way of learning and testing your knowledge.

swirl is installed as a package into R using the following command in R (internet connection required).

> install.packages("swirl")

Then load the swirl library and run it.

> library("swirl")
> swirl()

To see the help page on locating and installing swirl courses, use the following command.

?InstallCourses

Sources: Swirlstats

There is a list of courses available in the swirl repository on GitHub. There are 3 levels of courses available.

Beginner

  • R Programming: The basics of programming in R
  • R Programming Alt: Same as the original, but modified slightly for in-class use
  • Data Analysis: Basic ideas in statistics and data visualization
  • Mathematical Biostatistics Boot Camp: One- and two-sample t-tests, power, and sample size
  • Open Intro: A very basic introduction to statistics, data analysis, and data visualisation

Intermediate

  • Regression Models: The basics of regression modeling in R
  • Getting and Cleaning Data: dplyr, tidyr, lubridate, oh my!

Advanced

  • Statistical Inference: This intermediate to advanced level course closely follows the Statistical Inference course of the Johns Hopkins Data Science Specialization on Coursera.

To install a course, you can use the following commands in R.

library(swirl)
install_from_swirl("Course Name Here")
swirl()

Datacamp have recently released a free browser-based R learning tool. It is based on a flipcard version of swirl and teaches you R in bite-sized chunks.

Sources:

  • Swirlstats
  • GitHub swirl
  • Datacamp

Do you know Big Data?

07 Tuesday Apr 2015

Posted by Max Hemingway in Big Data, Data Science, Tools

≈ Leave a comment

Tags

Big Data, Data, Data Science, Knowledge

Whilst looking into some suitable questions to ask about Big Data, I came across an excellent poster titled “Do you know Big Data?” produced by Altamira.

The poster covers a set of questions that help you question Big Data and a Big Data project.

  • What is Big Data?
  • What types of Big Data are there?
  • How do we extract knowledge from Big Data?
  • What do we do with knowledge we extract?
  • What types of Visual Techniques are there?
  • What types of Statistical Algorithms are there?
  • How big is Big Data?
  • What is a Data Scientist?
  • How do we implement Big Data solutions?
  • How do we address privacy and ethics in Big Data?
  • How do we secure Big Data?
  • What are leading Big Data tools?
  • What questions should we ask about Databases?
  • What questions about Predictive Tools?

A useful tool as a starting place to research further elements of Big Data.

16 Ordnance Survey tools – Open Maps

24 Tuesday Mar 2015

Posted by Max Hemingway in Data Science, Programming

≈ Leave a comment

Tags

Data Science, Programming, R

The Ordnance Survey (OS) have released some more tools as part of their Open Mapping products, which are free to use. This takes the number of products available for UK geographical areas up to 16.

The new products are:

  • OS Open Map Local
  • OS Open Rivers
  • OS Open Roads

The opportunities for using the data with results from R projects and Data Science are vast. Time to start downloading to see if I can use my R skills to good effect.

License

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.