Posts Tagged ‘Data science’

Avoiding Common Mistakes with Time Series Analysis

A basic mantra in statistics and data science is “correlation is not causation,” meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.

Imbalanced Classes FAQ

Here we share some further thoughts on imbalanced classes, and offer more resources.

Techniques and Technologies: Topology and TensorFlow

In early December we hosted a meetup featuring Dr. Alli Gilmore discussing topological data analysis and Dr. Andrew Zaldivar covering practical usage of TensorFlow.

Agile Data Science Teams Deliver Real World Results

In this post, CTO John Akred looks at the practical ingredients of managing agile data science.

Embracing Experimentation at AstroHackWeek 2016

Senior Data Scientist Jonathan Whitmore talks about experimentation and agility, based on his time at the unconference.

The Venn Diagram of Data Strategy

Data strategy matters to both business and tech. It’s a problem that sits in the center of a Venn diagram, and if we get stuck thinking of those two domains as existing solely in completely separate silos, we’ll lock ourselves out of that key middle ground where the really important problems get solved.

The One Key Skill of the CDO

In this post, Julie talks about the necessary skills for a CDO, as learned through her research for the “Understanding the Chief Data Officer” report.

With Data, Ask “What” Before “How”

At the Strata + Hadoop World conference in New York last week, there were an impressive 16 tracks of session talks. A lot of them focused on the tools that everyone is excited about, but I focused on the goals people are using data science to accomplish. Here are a few of the sessions that stood out.

Noteworthy Links: September 22 2016

We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.

Jupyter Notebook Best Practices for Data Science

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. The Jupyter Notebook is a fantastic […]

Image Processing in Python

In this post, we go over the steps for creating a proof of concept for the image processing piece of our Caltrain work.

Introduction to Trainspotting

In this post we’ll start looking at the nuts and bolts of making our Caltrain work possible.

Learning from Imbalanced Classes

This post gives insight and concrete advice on how to tackle imbalanced data.

Scaling Data Science: Dream Big, Start Medium-ish

On July 13th we welcomed the Open Data Science Conference meetup series to our HQ—our speaker talked about thinking critically about the size of your data.

How I Learned to Stop Worrying and Love Ephemeral Storage

This post will show architects and developers how to set up Hadoop to communicate with S3, run Hadoop commands directly against S3, use distcp to perform transfers between Hadoop and S3, and use distcp for regular updates based only on differences.

Brain Monitoring with Kafka, OpenTSDB, and Grafana

A team of our data scientists recently won 2nd place in Confluent’s Kafka Hackathon. In this post, explore their project—streaming EEG data and visualizing it.

Noteworthy Links: Social Media Edition

In this post we share some links to interesting work being done with social media data.

Hadooponomics Interview: The Evolution of Data

VP of Strategy Edd Dumbill was recently interviewed by James Haight on the Hadooponomics podcast. Find the audio and transcript here.

One Year Later, Observations on the Big Data Market

Back in 2014, we discussed what the market looked like on our first birthday. As we hit three years, it seems like an appropriate time to look back on those observations and see where we are now.

Noteworthy Links: Hadoop Edition

Hadoop is 10 years old! Check out these related links.

Talking About the Caltrain

On May 6th, SVDS was honored to host an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on our Caltrain project.

Working Effectively in Data Science Teams

On April 21st, SVDS hosted the WWCode Silicon Valley chapter in our Mountain View office; we gave a talk titled Working Effectively in Data Science Teams.

IoT and Resilient Systems

We believe there are clearly some compelling value propositions that come from integrating the visibility from the IoT into applications that help understand and manage the state of complex systems. With the internet of things, the more things, really, the merrier.

Jupyter Notebook for Data Science Teams

Data Scientist Jonathan Whitmore has just released a screencast tutorial for Jupyter Notebooks.

Successful Data Teams are Agile and Cross-Functional

I was always struck by how the Silicon Valley startups I worked with could do so much more, with so much less. I’ve come to learn, sometimes the hard way, that there are critical elements of the “who” and the “how,” particular to those start-up teams, that contribute to their success. It’s why we named our company for Silicon Valley: a lightweight, agile approach to data-driven product development was pioneered here.

SVDS at Strata San Jose 2016

Several of our presenters were interviewed at Strata San Jose. If you missed the conference, check out these interviews below to catch up on some of the topics that were on our minds.

Why Notebooks Are Super-Charging Data Science

There is little limit to what can be done with a notebook. As well as the data science work you might expect, such as manipulating and graphing data, we’ve used them for sharing work on analytical tasks such as motion detection in video. In this post Edd takes a look at why we’re seeing notebooks everywhere.

Analyzing Caltrain Delays: What We Can Learn

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.

Crossing the Development to Production Divide

Heather knows what it’s like to deal with complex production deployments that run the gamut from infrastructure upgrades to feature deployments to data migrations, where each step threatens to derail the plan. In this post she’ll give an overview of obstacles she’s faced (you may be able to relate) and talk about ways to overcome them.

The Basics of Classifier Evaluation, Part 2

A previous blog post made the point that classifiers shouldn’t use classification accuracy as a performance metric. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.

Advanced Spark Meetup Recap

Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.

How Do You Build a Data Product?

Data products are the reason data scientists are lately treated like rockstars. Along the way at SVDS, we’ve learned a few things about data products, which we shared as we told the story of the Caltrain Rider app.

Jupyter Notebook Best Practices for Data Science

We present here some best practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.

Zero to Kaggle in 30 Minutes

We’ll walk through the steps for competing in Kaggle’s “Digit Recognizer” contest using SQL-based machine learning tools to identify hand-written digits.

Better Know the Districts

One might reasonably judge how well Congress reflects the views of the citizenry by examining the proportion of those citizens who think Congress is doing a good job.

Avoiding Common Mistakes with Time Series

A basic mantra in statistics and data science is “correlation is not causation” […]

Listening to Caltrain: Analyzing Train Whistles with Data Science

Many people who live and work in Silicon Valley depend on Caltrain for transportation. And because the SVDS headquarters are in Sunnyvale, not far from a station, Caltrain is literally in our own backyard. So, as an R&D project, we have been playing with data science techniques to understand and predict delays in the Caltrain […]

Railroad Modeling at Hadoop Scale

Data Scientist Tatsiana Maskalevich and CTO John Akred presented at this year’s Hadoop Summit in San Jose […]

Data Strategy in a World of Big Data

Silicon Valley Data Science has designed a new method to create a data strategy to overcome limitations of conventional approaches.

Successful Data Teams are Agile and Cross-Functional

I built a lot of capabilities based on emerging technologies in my years delivering enterprise data and analytical solutions.

Storing and Visualizing Time Series with Graphite

Graphite is a tool that does two things rather well: storing numeric time-series data (metric, value, epoch timestamp), and rendering graphs of this data on demand.

When Fair Isn’t Predictable: The Law of Averages

When making decisions with data, the idea that things will “even out” may ring true, but it’s not always helpful.
