Posts Tagged ‘Technical’

Avoiding Common Mistakes with Time Series Analysis

A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.

Imbalanced Classes FAQ

Here we share some further thoughts on imbalanced classes, and offer more resources.

Techniques and Technologies: Topology and TensorFlow

In early December we hosted a meetup, featuring Dr. Alli Gilmore discussing topological data analysis, and Dr. Andrew Zaldivar covering practical usage of TensorFlow.

Embracing Experimentation at AstroHackWeek 2016

Senior Data Scientist Jonathan Whitmore talks about experimentation and agility, based on his time at the unconference.

Streaming Video Analysis in Python

In this post, we discuss our Raspberry Pi streaming video analysis software, which we use to better predict Caltrain delays.

Noteworthy Links: September 22 2016

We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.

Jupyter Notebook Best Practices for Data Science

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. The Jupyter Notebook is a fantastic […]

Image Processing in Python

In this post, we go over the steps for creating a proof of concept for the image processing piece of our Caltrain work.

Introduction to Trainspotting

In this post we’ll start looking at the nuts and bolts of making our Caltrain work possible.

Predix Transform 2016

In this post, VP of Product and Innovation Tony Falco details insights learned while attending the recent Predix Transform conference.

Learning from Imbalanced Classes

This post gives insight and concrete advice on how to tackle imbalanced data.

Scaling Data Science: Dream Big, Start Medium-ish

On July 13th we welcomed the Open Data Science Conference meetup series to our HQ—our speaker talked about thinking critically about the size of your data.

How I Learned to Stop Worrying and Love Ephemeral Storage

This post will show architects and developers how to set up Hadoop to communicate with S3, run Hadoop commands directly against S3, use distcp to perform transfers between Hadoop and S3, and use distcp to update on a regular basis based only on differences.

Structured Streaming in Spark

This post gives you a quick overview of the new structured streaming feature in Spark 2.0, illustrating why it’s an exciting addition.

Brain Monitoring with Kafka, OpenTSDB, and Grafana

A team of our data scientists recently won 2nd place in Confluent’s Kafka Hackathon. In this post, explore their project—streaming EEG data and visualizing it.

Building Pipelines to Understand User Behavior

In this post, we cover what’s needed to understand user activity, and we look at some pipeline architectures that support this analysis.

Kafka Simple Consumer Failure Recovery

This post walks you through a simple failure recovery mechanism, as well as a test harness that allows you to make sure this mechanism works as expected.

Noteworthy Links: Social Media Edition

In this post we share some links to interesting work being done with social media data.

Materialized Views with Cassandra

In this screencast, Principal Engineer and Cassandra committer Gary Dusbabek provides an overview of Materialized Views.

Building Data Systems: What Do You Need?

In this post, we’re going to go over the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure.

Noteworthy Links: Hadoop Edition

Hadoop is 10 years old! Check out these related links.

Talking About the Caltrain

On May 6th, SVDS was honored to host an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on our Caltrain project.

Working Effectively in Data Science Teams

On April 21st, SVDS hosted the WWCode Silicon Valley chapter in our Mountain View office; we gave a talk titled Working Effectively in Data Science Teams.

Jupyter Notebook for Data Science Teams

Data Scientist Jonathan Whitmore has just released a screencast tutorial for Jupyter Notebooks.

Building a Prediction Engine using Spark, Kudu, and Impala

In this post, Richard walks you through a demo based on the Meetup.com streaming API to illustrate how to predict demand in order to adjust resource allocation.

Noteworthy Links

Here are some links from around the internet to get you in a Strata state of mind.

Why Notebooks Are Super-Charging Data Science

There is little limit to what can be done with a notebook. As well as the data science work you might expect, such as manipulating and graphing data, we’ve used them for sharing work on analytical tasks such as motion detection in video. In this post Edd takes a look at why we’re seeing notebooks everywhere.

Analyzing Caltrain Delays: What We Can Learn

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.

How to Choose a Data Format

It’s easy to become overwhelmed when it comes time to choose a data format. In this post Silvia gives you a framework for approaching this choice, and provides some example use cases.

Crossing the Development to Production Divide

Heather knows what it’s like to deal with complex production deployments that run the gamut from infrastructure upgrades, to feature deployments, to data migrations, where each step threatens to derail the plan. In this post she’ll give an overview of obstacles she’s faced (you may be able to relate) and talk about ways to overcome them.

Ethereum: Rise of the World Computer

The Ethereum network is a distributed economy like Bitcoin, except it is much, much more powerful. Rick Seeger dives into why you should be paying attention to its popularity.

Reshaping Data with Pivot in Spark

Andrew gives you a deep dive into pivoting data with SparkSQL. This piece was originally posted on the Databricks blog.

Data Day and Graph Day Texas Slides

Check out the slides from our recent presentations at Data Day TX and Graph Day.

Pivoting Data in SparkSQL

Andrew Ray, Senior Data Engineer, contributed to the most recent release of Spark. This post gives examples of how to use his pivot commit in PySpark.

The Basics of Classifier Evaluation, Part 2

A previous blog post made the point that classifiers shouldn’t use classification accuracy as a performance metric. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.

Advanced Spark Meetup Recap

Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.

From Impala to Hive with Love

While on paper it should be a seamless transition to run Impala code in Hive, in reality it’s more like playing a relentless game of whack-a-mole. This post provides hints to make the transition easier.

5 Things a Blockchain Needs to Succeed

Today, the currency supply supported by the Bitcoin blockchain is worth four billion dollars. So, what have we learned? There are five essential properties any good blockchain must have.

Jupyter Notebook Best Practices for Data Science

We present here some best practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.

Evaluating Microservices: Real World Lessons

Microservices are a popular topic in developer circles, because they are a means of solving problems that have plagued monolithic software projects for decades: namely, tardiness and bugs, both caused by complexity.

The Basics of Classifier Evaluation, Part 1

If it’s easy, it’s probably wrong.

Dust in the Chain

Since the blockchain is both easily accessible and immutable, it is incredibly useful for other purposes. Issuing a tiny fraction of a Bitcoin (called dust) with embedded data allows anyone to easily store data permanently and publicly.

The Venn Diagram of Data Strategy

The business community and the technical community can sometimes seem like they live on the opposite sides of the planet — or at least opposite ends of the hallway. When it comes to data strategy, many people read the “data” part and automatically dump the topic in the “technical” bucket. It can be a struggle […]

Getting from Data to Visualization

Someone recently asked me about my process from brainstorming through to delivery.

The Hardest Part of Technology is the Humans: Thoughts from Euro PyData

The PyData Conference in Berlin gave me a lot to think about this past weekend.

Visualizing the Evolution of Rock Music

Rock ’n’ roll is one of the most popular music genres today, but that wasn’t always the case.
