Posts Tagged ‘Data science’
One way to give back to the open source community that provides us with tools is to help others evaluate and choose those tools in a way that takes advantage of our experience. We offer this analysis, along with explanations of the various criteria upon which we based our decisions.
In this post, Matt talks about using TensorFlow to detect true and false positives in our Caltrain work.
In this post we explore how data is changing the insurance industry, through the lens of auto insurance underwriting.
A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.
Here we share some further thoughts on imbalanced classes, and offer more resources.
In early December we hosted a meetup, featuring Dr. Alli Gilmore discussing topological data analysis, and Dr. Andrew Zaldivar covering practical usage of TensorFlow.
In this post, CTO John Akred looks at the practical ingredients of managing agile data science.
Senior Data Scientist Jonathan Whitmore talks about experimentation and agility, based on his time at the unconference.
Data strategy matters to both business and tech. It’s a problem that sits in the center of a Venn diagram, and if we get stuck thinking of those two domains as existing solely in completely separate silos, we’ll lock ourselves out of that key middle ground where the really important problems get solved.
In this post, Julie talks about the necessary skills for a CDO, as learned through her research for the “Understanding the Chief Data Officer” report.
At the Strata + Hadoop World conference in New York last week, there were an impressive 16 tracks of session talks. A lot of them focused on the tools that everyone is excited about, but I focused on the goals people are using data science to accomplish. Here are a few of the sessions that stood out.
We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.
Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. The Jupyter Notebook is a fantastic […]
In this post, we go over the steps for creating a proof of concept for the image processing piece of our Caltrain work.
In this post we’ll start looking at the nuts and bolts of making our Caltrain work possible.
This post gives insight and concrete advice on how to tackle imbalanced data.
On July 13th we welcomed the Open Data Science Conference meetup series to our HQ—our speaker talked about thinking critically about the size of your data.
This post will show architects and developers how to set up Hadoop to communicate with S3, use Hadoop commands directly against S3, use distcp to perform transfers between Hadoop and S3, and how distcp can be used to update on a regular basis based only on differences.
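As a rough sketch of the mechanics that post covers, the workflow boils down to a handful of commands. The bucket and cluster names below are hypothetical, and this assumes the s3a connector and AWS credentials have already been configured in core-site.xml:

```shell
# Run Hadoop filesystem commands directly against S3 via the s3a connector
hadoop fs -ls s3a://example-bucket/logs/

# Bulk copy between HDFS and S3 with distcp
hadoop distcp hdfs://namenode:8020/data/logs s3a://example-bucket/logs

# On subsequent runs, -update copies only files that are missing or differ,
# which supports the regular, difference-based sync described above
hadoop distcp -update hdfs://namenode:8020/data/logs s3a://example-bucket/logs
```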
A team of our data scientists recently won 2nd place in Confluent’s Kafka Hackathon. In this post, explore their project—streaming EEG data and visualizing it.
In this post we share some links to interesting work being done with social media data.
VP of Strategy Edd Dumbill was recently interviewed by James Haight on the Hadooponomics podcast. Find the audio and transcript here.
Back in 2014, we discussed what the market looked like on our first birthday. As we hit three years, it seems like an appropriate time to look back on those observations, and see where we are now.
On May 6th, SVDS was honored to host an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on our Caltrain project.
On April 21st, SVDS hosted the WWCode Silicon Valley chapter in our Mountain View office; we gave a talk titled Working Effectively in Data Science Teams.
We believe there are clearly some compelling value propositions that come from integrating the visibility from the IoT into applications that help understand and manage the state of complex systems. With the internet of things, the more things, really, the merrier.
Data Scientist Jonathan Whitmore has just released a screencast tutorial for Jupyter Notebooks.
I was always struck by how the Silicon Valley startups I worked with could do so much more, with so much less. I’ve come to learn, sometimes the hard way, that there are critical elements of the “who” and the “how,” particular to those start-up teams, that contribute to their success. It’s why we named our company for Silicon Valley: a lightweight, agile approach to data-driven product development was pioneered here.
Several of our presenters were interviewed at Strata San Jose. If you missed the conference, check out these interviews below to catch up on some of the topics that were on our minds.
There is little limit to what can be done with a notebook. As well as the data science work you might expect, such as manipulating and graphing data, we’ve used them for sharing work on analytical tasks such as motion detection in video. In this post Edd takes a look at why we’re seeing notebooks everywhere.
In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.
Heather knows what it’s like to deal with complex production deployments that cover the gamut from infrastructure upgrades, to feature deployments, to data migrations, where each step threatens to derail the plan. In this post she’ll give an overview of obstacles she’s faced (you may be able to relate) and talk about solutions to overcome these obstacles.
A previous blog post made the point that classification accuracy shouldn’t be used as a performance metric for classifiers. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.
Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.
Data products are the reason data scientists are treated like rockstars these days. Along the way at SVDS, we’ve learned a few things about data products, which we shared as we told the story of the Caltrain Rider app.
We present here some best-practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.
We’ll walk through the steps for competing in Kaggle’s “Digit Recognizer” contest using SQL-based machine learning tools to identify hand-written digits.
One might reasonably judge how well Congress reflects the views of the citizenry by examining the proportion of citizens who think Congress is doing a good job.
Many people who live and work in Silicon Valley depend on Caltrain for transportation. And because the SVDS headquarters are in Sunnyvale, not far from a station, Caltrain is literally in our own backyard. So, as an R&D project, we have been playing with data science techniques to understand and predict delays in the Caltrain […]
Data Scientist Tatsiana Maskalevich and CTO John Akred presented at this year’s Hadoop Summit in San Jose.
Silicon Valley Data Science has designed a new method for creating a data strategy, one that overcomes the limitations of conventional approaches.
In my years delivering enterprise data and analytical solutions, I built a lot of capabilities based on emerging technologies.
Graphite is a tool that does two things rather well: storing numeric time-series data (metric, value, epoch timestamp), and rendering graphs of this data on demand.
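As a minimal sketch of the first of those two jobs, here is how a datapoint can be formatted and sent using Graphite’s plaintext protocol. The metric name, host, and port are illustrative; carbon’s plaintext listener defaults to TCP port 2003:

```python
import socket
import time

def graphite_line(metric, value, timestamp=None):
    # Graphite's plaintext protocol is one line per datapoint:
    # "<metric.path> <value> <epoch-timestamp>\n"
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{metric} {value} {ts}\n"

def send_metric(metric, value, host="localhost", port=2003):
    # carbon-cache accepts plaintext metrics over TCP (port 2003 by default)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(graphite_line(metric, value).encode())

# Usage (assuming a carbon daemon is listening locally):
# send_metric("app.requests.count", 42)
```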