Discovering the Value of Family Data Science

Digital innovations have given rise to the Quantified Self, a movement that enables individuals to capitalize on the insight-generating power of self-tracking. People self-track to gain deeper insights into their mind, body and other aspects of their well-being. These insights can help people make better day-to-day decisions regarding their performance, health, and happiness. For example, self-tracking can allow a patient to preempt a doctor’s visit, or transform a necessary visit into an informative, data-driven discussion [1]. Others self-track to collect the data required to train personal well-being algorithms that will soon integrate with their smart home assistants.

As a husband and father, I wondered how this could benefit my family too. Could self-tracking help us become more conscientious in our day-to-day lives? Could it help improve our well-being? Three years ago, I started tracking various aspects of my family’s growth, health and happiness through a practice I refer to as Family Data Science. I wasn’t interested in tracking minute-by-minute calories, moods, or steps. Instead, I wanted a daily record of exceptional events that could be analyzed across time and other dimensions.

My experience is proving valuable in three ways. First, the data allows us to recall events over longer periods and in greater detail than memory alone allows. Second, the act of self-tracking introduces a moment of pause and reflection into the busyness of every day, and this helps boost conscientiousness. Third, applying machine learning to this data opens up unique integration opportunities with smart home assistants, for which demand is growing.

Conventional wisdom teaches us that saving for retirement by maximizing investment account contributions at a young age is a sound strategy for individuals and their future generations. Doing so allows interest to compound while also minimizing tax liability. In this era of digital innovations, Family Data Science offers similar benefits in the area of health and well-being. Starting early allows families to make better decisions through all stages of their lives, and this leads to better outcomes. The value of data compounds over time as well, providing a larger body of evidence to enable deeper and more accurate discovery of insights.

[1] Topol, E. (2015). The Patient Will See You Now: The Future of Medicine Is in Your Hands. Basic Books.

Picking the Right Neighborhood

What if I could apply a little family data science to help answer the question “Which neighborhood is right for my family?” In other words, I want to rank future neighborhoods of interest so that the ones at the top are the most likely to satisfy the needs of my family.

Seven years ago, my wife and I moved to Rome, Italy.  We moved apartments and neighborhoods quite a number of times just to keep up with the changing demands of a growing family.  In two cases, it didn’t take long for us to realize we had moved to a neighborhood that didn’t really suit us.

Looking back over this  period, a few factors emerged for those neighborhoods that did work for us, which were curiously missing in those that didn’t.

In this article, I propose a method to rank neighborhoods of interest according to criteria important to us. I then apply this method to automatically rank the past and present neighborhoods we have lived in, to see whether the ranking reflects our preferences.

Ranking Neighborhoods

I recently read “The Death and Life of Great Italian Cities: A Mobile Phone Data Perspective”, which details a study conducted by a group of Italian researchers to measure ‘city liveliness’ using metrics focused on urban diversity and urban vitality. Their work is based on the theories and concepts outlined by Jane Jacobs in her seminal book “The Death and Life of Great American Cities.” In her book, Ms. Jacobs proposes four qualities any city neighborhood must have in order to be vibrant and desirable to its residents. The research, and my ranking of neighborhoods, are both based on these four qualities.

I wanted to use some of these metrics, but I needed to apply them at the neighborhood level not just the city level as done by the researchers.   One idea for doing this is to rely on remote sensing of satellite images.   With geospatial platforms such as Google Earth Engine, this is very much doable.

In the sections below, I use Google Earth Engine to compute neighborhood-related features that are based on the “four generators of diversity” proposed by Jane Jacobs.  The final ranking for my neighborhoods of interest will be based on these features.

Urban Green

Our newborn loves to sit back in her stroller and listen to the wind move its way through the tree leaves.  Our  small girls, on the other hand, love the freedom to scoot or bike around without having to worry about cars and trucks running them over.  This means that our ideal neighborhood will have a combination of green as well as park areas closed off to traffic.

I rely on the standard Normalized Difference Vegetation Index (NDVI) to calculate a neighborhood’s ‘greenness’. NDVI works because satellites equipped with near-infrared sensors accurately record the reflection of solar radiation by green vegetation on the ground.


NDVI is a numerical indicator and can be computed  by analyzing satellite images in Google Earth Engine (see Image 1).
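To make the indicator concrete: NDVI is conventionally defined per pixel as (NIR - Red) / (NIR + Red), where NIR and Red are the reflectances in the near-infrared and red bands. The values above were computed in Google Earth Engine; the snippet below is only a minimal plain-Python sketch of the same arithmetic. The function names and the sample reflectance pairs are hypothetical and are not taken from real satellite imagery.

```python
def ndvi(nir, red):
    """NDVI for one pixel from near-infrared and red reflectance."""
    total = nir + red
    if total == 0:
        return 0.0  # no signal in either band; treat the pixel as neutral
    return (nir - red) / total


def mean_ndvi(pixels):
    """Average NDVI over a list of (nir, red) reflectance pairs."""
    values = [ndvi(nir, red) for nir, red in pixels]
    return sum(values) / len(values)


# Hypothetical (nir, red) pairs standing in for the pixels of one neighborhood.
sample_pixels = [(0.50, 0.30), (0.40, 0.40), (0.45, 0.15)]
print(round(mean_ndvi(sample_pixels), 2))
```

Averaging the per-pixel values over a neighborhood polygon gives a single ‘greenness’ score for that neighborhood, which is what makes neighborhoods comparable.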

Image 1: 2011 Avg NDVI for Rome’s Trastevere Neighborhood

Ranking our Rome neighborhoods of interest by their NDVI confirms what we already knew – Trastevere is one of the greenest and also happens to be our favorite neighborhood in Rome.

  1. Della Vittoria – .33 NDVI
  2. Trastevere – .25 NDVI
  3. Ostiense – .23 NDVI
  4. Testaccio – .19 NDVI
  5. Prati – .13 NDVI

Multiple Land Use

Another aspect of urban life important to us is a neighborhood that serves multiple functions. It is really great when we can greet the neighbors, pick our favorite flavor at the local ice-cream shop, get a haircut and take the kids to the children’s park, all within a short walk of our front door. The foot traffic generated by these types of neighborhoods creates an urban vibrancy unlike that experienced in single-use neighborhoods.

Land Use Mix is one of the variables used by the researchers in the paper I referred to earlier [1].  The underlying principle for Land Use Mix was proposed by Jacobs when she suggested a city district should serve more than one primary function, preferably more than two.

The researchers concluded that the city of Rome has a high Land Use Mix. Living here for the past seven years, however, has taught us that not all neighborhoods exhibit the same level of Land Use Mix.

In my quest to pick the perfect neighborhood, I will also calculate Land Use Mix for the neighborhoods we have lived in using the following formula:

Land Use Mix: $$LUM_i = -\sum_{j=1}^{n} \frac{P_{i,j} \log P_{i,j}}{\log n}$$

where P_{i,j} refers to the percentage of square footage with land use j in district i, and n is the number of possible land uses. For the purpose of my article, a ‘district’ is synonymous with ‘neighborhood’, which I consider to be equivalent to the Rioni of Rome or the Quarters of Rome. Just like the researchers, I will use a value of n=3 (1=residential, 2=parks/squares/water, 3=businesses/commercial/government). Dedicated single-use neighborhoods will have LUM_i equal to zero. When land use is equally divided in all n ways, LUM_i will equal one. The higher LUM_i, the more mixed the neighborhood’s land use.
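As a quick check on the two boundary cases described above, here is a minimal Python sketch. It assumes Land Use Mix is the normalized entropy of the land-use shares (the entropy divided by log n), which is consistent with a value of zero for single-use neighborhoods and one for an even split. The function name and the sample shares are hypothetical.

```python
import math


def land_use_mix(shares):
    """Normalized entropy of land-use shares: 0 = single use, 1 = evenly mixed."""
    n = len(shares)
    if n < 2:
        raise ValueError("need at least two possible land uses")
    total = sum(shares)
    probs = [s / total for s in shares]
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


# Shares are ordered as (residential, parks/squares/water, business/commercial/government).
print(land_use_mix([100, 0, 0]))            # single-use neighborhood
print(land_use_mix([1, 1, 1]))              # evenly mixed neighborhood
print(round(land_use_mix([60, 10, 30]), 2)) # hypothetical mixed neighborhood
```

Because the shares are renormalized inside the function, they can be passed as percentages, fractions or raw areas.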

Classifying Land Use in Google Earth Engine

Ranking our Rome neighborhoods of interest by their Land Use Mix begins to reveal some interesting characteristics:

  1. Trastevere .73 LUM
  2. Della Vittoria .64 LUM
  3. Testaccio .44 LUM
  4. Prati .44 LUM
  5. Ostiense .34 LUM

Neighborhood Block Size

Jacobs believed people didn’t like walking down long blocks, and instead would avoid them at all costs.   She believed short blocks offered more navigation options between Point A and B.  With short blocks, pedestrian traffic is more easily distributed and this distribution helps create more viable locations for smaller businesses to mix into residential areas.

This observation plays out on the streets of Rome every day. Neighborhoods composed mostly of small, sometimes quirkily shaped blocks are full of small mom-and-pop stores sitting at the base of multi-level residential buildings.

Average block size

Not surprisingly, applying the formula for block size in my neighborhoods of interest reveals the following:

  1. Testaccio .004
  2. Trastevere .006
  3. Prati .008
  4. Della Vittoria .011
  5. Ostiense .015
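The block-size formula itself did not survive in this post, so the following is only one plausible reading: average block size as the mean area of the block polygons in a neighborhood. The sketch uses the shoelace formula for polygon area; the function names and the two sample blocks are hypothetical, and the units of the values listed above are not stated in the post.

```python
def polygon_area(vertices):
    """Shoelace formula: area of a simple polygon from its (x, y) vertices in order."""
    twice_area = 0.0
    # Pair each vertex with the next one, wrapping around to close the polygon.
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


def average_block_size(blocks):
    """Mean area over a list of block polygons."""
    return sum(polygon_area(block) for block in blocks) / len(blocks)


# Two hypothetical blocks in arbitrary planar units.
hypothetical_blocks = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],  # 1 x 1 block
    [(0, 0), (2, 0), (2, 1), (0, 1)],  # 2 x 1 block
]
print(average_block_size(hypothetical_blocks))
```

With real data, the vertices would need to be in a projected coordinate system (meters or kilometers) rather than latitude and longitude for the areas to be meaningful.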


Many times numbers simply confirm what our intuition knew all along. In this exercise, I took the principles proposed by Jacobs and some of the metrics utilized by the researchers to rank a set of Roman neighborhoods based on characteristics important to me and my family. I then used Google Earth Engine and remote sensing of satellite images to quantify these characteristics. Having lived in all five neighborhoods of interest, we have strong opinions regarding these neighborhoods and how they compare to one another. Not surprisingly, the final ranking reflects these opinions.

Ranking Rome Neighborhoods
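The post does not spell out how the three features were combined into a final ranking. One simple possibility, shown here purely as an assumption, is to rank the neighborhoods on each feature separately and then order them by the sum of their ranks, with every feature weighted equally. The feature values are copied from the three lists in this post; the function names are hypothetical. Higher NDVI and Land Use Mix are treated as better, and smaller block size as better.

```python
# Feature values copied from the lists earlier in this post.
ndvi = {"Della Vittoria": 0.33, "Trastevere": 0.25, "Ostiense": 0.23,
        "Testaccio": 0.19, "Prati": 0.13}
lum = {"Trastevere": 0.73, "Della Vittoria": 0.64, "Testaccio": 0.44,
       "Prati": 0.44, "Ostiense": 0.34}
block_size = {"Testaccio": 0.004, "Trastevere": 0.006, "Prati": 0.008,
              "Della Vittoria": 0.011, "Ostiense": 0.015}


def ranks(scores, higher_is_better=True):
    """Competition ranking: 1 + the number of neighborhoods with a strictly better score."""
    def better(a, b):
        return a > b if higher_is_better else a < b
    return {name: 1 + sum(better(other, score) for other in scores.values())
            for name, score in scores.items()}


def combined_ranking():
    """Order neighborhoods by the sum of their per-feature ranks (equal weights assumed)."""
    per_feature = [ranks(ndvi), ranks(lum), ranks(block_size, higher_is_better=False)]
    totals = {name: sum(r[name] for r in per_feature) for name in ndvi}
    return sorted(totals, key=totals.get)


print(combined_ranking())
```

Under these assumptions the sketch places Trastevere first and Ostiense last. A different weighting of the three features could reorder the middle of the list.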


[1] R. Cervero. Land-use mixing and suburban mobility. University of California Transportation Center, 1989.



My Data Principles

In my still early journey practicing family data science, I’ve come to understand a few key areas where data can help a lot, and also other areas where data won’t help any.

First, it doesn’t really make sense to try to quantify creative endeavors, such as hours spent at dance class. At the end of the day, passion and motivation will determine how much of these activities my family engages in (and how often!).

The same applies to long-term plans. I don’t believe I will quantify my way towards achieving the big moonshots / BHAGs in life. I do believe these types of goals should be articulated, and maybe even journaled in the process of attaining them, but that’s about it from a quantification standpoint.

Instead, I think it is better to measure those things where I am confident that conformance to a set standard or level can help determine personal growth, health and/or happiness.  Examples of this include tracking sleep, nutrition, health or cash flow.

Power BI your Family Data Science

Since 2013, I have been experimenting with family data science – or the process of drawing deeper understanding and insights to help my wife and daughters grow, stay healthy and be happy.

I generate and maintain household datasets that record key aspects of my family’s day-to-day lives. I use these datasets to drive my family data science experiments, and they fall into one of two groups.

Power BI Visualizations

The first group is mostly transactional in nature. Credit card purchases, blood test results or dance practice attendance are just a few examples of datasets in this group.

The second group is mostly analytical in nature. These datasets aggregate, or count, the different aspects recorded in the day-to-day transactional events.

During the first few years of running my family data science experiments, I happily collected and aggregated this information. For example, I could tell you the number of hours my daughters spent at dance practice, how many times my wife and oldest daughter suffered from bronchitis during the winter months, or the months during which I sent the most work-related emails. None of this produced major insights, but it did bring clarity.

These analytics can be amusing for a little while, but as a dad and husband, I wanted to discover answers to more probing questions. What happens to my daughter’s respiratory issues when we cut lactose for a period of six months? How do my wife’s sleep patterns improve when she goes swimming on a regular basis? Does my standing heart rate decrease during weeks of healthy eating and intense exercise?

In recent months, I’ve been using Microsoft Excel 2016, and its powerful Get and Transform features, to automatically connect to my household datasets in the Microsoft Azure cloud. Once in Excel, I map these datasets to a Power Pivot data model. Preparing the data this way gives me the freedom to discover and correlate seemingly independent observations. For example, I can explore how weeks with increased workout activity improve the family’s sleep patterns and mood.
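The kind of question described above boils down to correlating two weekly series. The analysis in this post was done in Excel and Power Pivot; the snippet below is only a minimal Python sketch of the same idea, using a hand-written Pearson correlation. The function name is hypothetical, and the weekly numbers are made-up placeholder values, not my family’s data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only: workouts per week and average hours of sleep per night.
weekly_workouts = [0, 1, 2, 3, 4]
weekly_sleep_hours = [6.5, 6.0, 7.0, 7.5, 7.0]
print(round(pearson(weekly_workouts, weekly_sleep_hours), 2))
```

A coefficient near +1 or -1 suggests the two series move together, but with only a handful of weeks it says little about cause and effect.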

Excel’s potential for this kind of analysis is great. Unfortunately, the Get and Transform and Power Pivot features are crowded out by the traditional spreadsheet features. When they coexist in one monolithic software package, the result is slow processing and instability. Excel crashed on me quite a number of times as it connected to fourteen datasets in the Azure cloud and imported 10MB of data. The process of analyzing and refreshing the data model was simply too slow for the type of experimentation I was trying to perform.

This is where Power BI Desktop comes to the rescue. Power BI repackages Excel’s Get and Transform and Power Pivot features into an intuitive report authoring environment. Where Excel buries the discovery and visualization features among its traditional spreadsheet capabilities, Power BI brings them front and center.

The end result is that Power BI lets you quickly visualize your data in numerous and diverse ways. This is a much-needed improvement over similar approaches using Excel 2016.

April Fools from BP Monitor

Home blood pressure monitoring is something I do on a regular basis. About a year ago I purchased a Medisana wrist-based monitor, despite the fact that the American Heart Association discourages these types of wrist-based blood pressure monitors due to their inaccuracy!

Nevertheless, the price and reviews were right and I was confident I could overcome the inaccuracy risks by simply using the device as recommended by the manufacturer.

Figure 1 shows a screenshot of my recent BP readings in iPhone’s Health app. Look closely and you will see my BP readings over the past month, including multiple readings for April 1. The high reading on April 1 was 153/99 mmHg. This is by far an all-time high for me. Was I at risk for hypertension or was something else going on here?

A quick visit to my local pharmacy’s upper arm-based blood pressure monitor resulted in a more respectable 124/82 mmHg blood pressure reading. This confirmed my suspicion that my wrist-based monitor had gone out of whack. I took a few more readings at the pharmacy over the course of the day and then another one in the evening using my wrist-based monitor, and the verdict was in: my wrist-based monitor was playing an April Fools’ joke by inflating my readings by over 20%!
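The “over 20%” figure can be checked with a couple of lines of arithmetic. This sketch assumes the comparison is between the 153/99 mmHg wrist reading and the 124/82 mmHg pharmacy reading quoted above, even though the two were not taken at the same moment; the function name is hypothetical.

```python
def percent_inflation(reading, reference):
    """How much higher `reading` is than `reference`, as a percentage of `reference`."""
    return (reading - reference) / reference * 100


wrist = (153, 99)   # systolic, diastolic in mmHg, wrist-based monitor
arm = (124, 82)     # systolic, diastolic in mmHg, pharmacy upper-arm monitor

systolic_inflation = percent_inflation(wrist[0], arm[0])
diastolic_inflation = percent_inflation(wrist[1], arm[1])
print(round(systolic_inflation, 1), round(diastolic_inflation, 1))
```

Both the systolic and the diastolic values come out above 20% relative to the upper-arm reference.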

Figure 1: Blood Pressure readings in March

School night anxiety?

A recent analysis in one of my family datasets revealed an interesting observation regarding our kids and the days of the week in which they tend to experience the most symptoms (e.g. cough, runny nose, congestion).

The dataset in question is a log of symptoms I keep to track the things affecting my family over the course of the year.

Pivoting my way through this dataset, I landed on a table showing the frequency of my daughters’ symptoms by days of the week (see table 1 below).

Table 1: Symptoms by day of week

When I showed my wife this table, we couldn’t help reacting to the fact that the largest share of symptoms (thirty-two percent) occurred on Sundays. Of course there can be many reasons for this, but we wondered whether good ol’ Monday back-to-school anxiety may be stressing the kids unnecessarily.

Digging in a little bit more, table 2 reveals that 97% of these Sunday symptoms occurred in quarters when school was in session. In other words, the one quarter with almost no Sunday symptoms is the same quarter when school is mostly closed for summer!

Table 2: Sunday symptoms by quarter
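The two pivots behind these tables are simple group-and-count operations. I built mine by pivoting in a spreadsheet; the snippet below is only a minimal Python sketch of the same logic. The log rows are hypothetical placeholders rather than my family’s data, and the function names are assumptions.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical symptom log: (date recorded, symptom). Placeholder rows only.
symptom_log = [
    (date(2016, 1, 3), "cough"),        # Sunday, Q1
    (date(2016, 1, 10), "runny nose"),  # Sunday, Q1
    (date(2016, 2, 2), "congestion"),   # Tuesday, Q1
    (date(2016, 7, 14), "cough"),       # Thursday, Q3
    (date(2016, 11, 6), "cough"),       # Sunday, Q4
]


def weekday_name(day):
    return WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0, Sunday is 6


def quarter(day):
    return (day.month - 1) // 3 + 1


def share_by(log, key):
    """Fraction of log entries falling into each group produced by `key`."""
    counts = Counter(key(day) for day, _ in log)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


by_weekday = share_by(symptom_log, weekday_name)
sunday_rows = [row for row in symptom_log if weekday_name(row[0]) == "Sunday"]
sunday_by_quarter = share_by(sunday_rows, quarter)
print(by_weekday)
print(sunday_by_quarter)
```

Note that the grouping uses the date each symptom was recorded, so any delay between a symptom appearing and being logged carries straight through to the result.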

Numbers can be deceiving. Tables 1 and 2 support the school night hypothesis, but table 2 also suggests an entirely different one that has more to do with the onset of colder weather. This is because 80% of the Sunday symptoms from table 2 occurred during the mostly cold Northern Hemisphere autumn and winter months. Another factor complicating the original hypothesis has to do with the date when I recorded the symptoms. I tend to have more free time on Sunday compared to other days of the week. It is entirely possible I was capitalizing on this free time to record symptoms which, in actuality, appeared prior to Sunday.

Conclusive one way or the other? Definitely not. With only a few years of data, it is clearly too early to reach any credible conclusions on this. Nonetheless, the data does make for some interesting observations, which is just one of the benefits of practicing a little family data science.