A basic mantra in statistics and data science is that correlation is not causation: just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.
Posts Tagged ‘Technical’
Here we share some further thoughts on imbalanced classes, and offer more resources.
In early December we hosted a meetup featuring Dr. Alli Gilmore, who discussed topological data analysis, and Dr. Andrew Zaldivar, who covered practical usage of TensorFlow.
Senior Data Scientist Jonathan Whitmore talks about experimentation and agility, based on his time at the unconference.
In this post, we discuss our Raspberry Pi streaming video analysis software, which we use to better predict Caltrain delays.
We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.
Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. The Jupyter Notebook is a fantastic […]
In this post, we go over the steps for creating a proof of concept for the image processing piece of our Caltrain work.
In this post we’ll start looking at the nuts and bolts of making our Caltrain work possible.
In this post, VP of Product and Innovation Tony Falco details insights learned while attending the recent Predix Transform conference.
This post gives insight and concrete advice on how to tackle imbalanced data.
On July 13th we welcomed the Open Data Science Conference meetup series to our HQ—our speaker talked about thinking critically about the size of your data.
This post will show architects and developers how to set up Hadoop to communicate with S3, use Hadoop commands directly against S3, use distcp to perform transfers between Hadoop and S3, and use distcp to update on a regular basis based only on differences.
This post gives you a quick overview of the new structured streaming feature in Spark 2.0, illustrating why it’s an exciting addition.
A team of our data scientists recently won 2nd place in Confluent’s Kafka Hackathon. In this post, explore their project—streaming EEG data and visualizing it.
In this post, we cover what’s needed to understand user activity, and we look at some pipeline architectures that support this analysis.
This post walks you through a simple failure recovery mechanism, as well as a test harness that allows you to make sure this mechanism works as expected.
In this post we share some links to interesting work being done with social media data.
In this screencast, Principal Engineer and Cassandra committer Gary Dusbabek provides an overview of Materialized Views.
In this post, we’re going to go over the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure.
On May 6th, SVDS was honored to host an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on our Caltrain project.
On April 21st, SVDS hosted the WWCode Silicon Valley chapter in our Mountain View office; we gave a talk titled Working Effectively in Data Science Teams.
Data Scientist Jonathan Whitmore has just released a screencast tutorial for Jupyter Notebooks.
In this post, Richard walks you through a demo based on the Meetup.com streaming API to illustrate how to predict demand in order to adjust resource allocation.
There is little limit to what can be done with a notebook. As well as the data science work you might expect, such as manipulating and graphing data, we’ve used them for sharing work on analytical tasks such as motion detection in video. In this post Edd takes a look at why we’re seeing notebooks everywhere.
In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.
It’s easy to become overwhelmed when it comes time to choose a data format. In this post Silvia gives you a framework for approaching this choice, and provides some example use cases.
Heather knows what it’s like to deal with complex production deployments that run the gamut from infrastructure upgrades to feature deployments to data migrations, where each step threatens to derail the plan. In this post she’ll give an overview of obstacles she’s faced (you may be able to relate) and talk about ways to overcome them.
The Ethereum network is a distributed economy like Bitcoin, except it is much, much more powerful. Rick Seeger dives into why you should be paying attention to its popularity.
Andrew gives you a deep dive into pivoting data with SparkSQL. This piece was originally posted on the Databricks blog.
Check out the slides from our recent presentations at Data Day TX and Graph Day.
Andrew Ray, Senior Data Engineer, contributed to the most recent release of Spark. This post gives examples of how to use his pivot commit in PySpark.
A previous blog post made the point that classifiers shouldn’t use classification accuracy as a performance metric. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.
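As a rough illustration of why accuracy alone can mislead on imbalanced classes, here is a minimal sketch. The labels are made up for the example (990 negatives, 10 positives), and the helper functions `accuracy` and `recall` are hypothetical names written for this sketch; they are not taken from the post or from any library.

```python
# Hypothetical data: a heavily imbalanced binary problem (assumed for illustration).
y_true = [0] * 990 + [1] * 10

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted, positive=1):
    """Fraction of true positives that the classifier actually found."""
    true_pos = sum(1 for a, p in zip(actual, predicted)
                   if a == positive and p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    return true_pos / actual_pos if actual_pos else 0.0


majority_accuracy = accuracy(y_true, y_pred)
majority_recall = recall(y_true, y_pred)

print(f"accuracy: {majority_accuracy:.2f}")
print(f"recall on the minority class: {majority_recall:.2f}")
```

The always-negative model scores high on accuracy while finding none of the minority cases, which is one reason to look at per-class measures or ranking-based evaluation alongside accuracy.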
Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.
While on paper it should be a seamless transition to run Impala code in Hive, in reality it’s more like playing a relentless game of whack-a-mole. This post provides hints to make the transition easier.
Today, the currency supply supported by the Bitcoin blockchain is worth four billion dollars. So, what have we learned? There are five essential properties any good blockchain must have.
We present here some best practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.
Microservices are a popular topic in developer circles, because they are a means of solving problems that have plagued monolithic software projects for decades: namely, tardiness and bugs, both caused by complexity.
Since the blockchain is both easily accessible and immutable, it is incredibly useful for other purposes. Issuing a tiny fraction of a Bitcoin (called dust) with embedded data allows anyone to easily store data permanently and publicly.
The business community and the technical community can sometimes seem like they live on the opposite sides of the planet — or at least opposite ends of the hallway. When it comes to data strategy, many people read the “data” part and automatically dump the topic in the “technical” bucket. It can be a struggle […]
Someone recently asked me about my process from brainstorming through to delivery;
The PyData Conference in Berlin gave me a lot to think about this past weekend.