Saturday, March 3, 2018

Welcome Back (Couch)

Welcome Back to Seattle - Abandoned Couch and Mattress
Having spent a chunk of time in Italy the last few years, I can't think of once when we saw a couch out by its lonesome, alongside a thoroughfare. However, back in Seattle, in less than the time it takes to say couch pillow shams sale, we caught a glimpse of an abandoned couch.

But why? Do Americans, or more specifically Seattleites, have so little regard for their couches? Do they think that couches can fend for themselves outside? I believe it's relatively easy to give a couch away or donate it, so I'm not sure why people think it's okay to just leave them around. This couch in particular has been on its Fremont Ave location for several days with rain and snow. At least it has company with that mattress. Two-for-one.

You don't believe me about Seattle and abandoned couches? Ha, check it out for yourself.

Abandoned furniture always brings a song to my mind. (It doesn't for you? I'm sorry.) For me, this couch conjured up the theme song to the 1970s show Welcome Back, Kotter with the lyric "welcome back, welcome back" (youtube).
Welcome back, your dreams were your ticket out
Welcome back, to that same old place that you laughed about
Well the names have all changed since you hung around
But those dreams have remained and they've turned around
Who'd have thought they'd lead ya
(Who'd have thought they'd lead ya)
Back here where we need ya
(Back here where we need ya)
Yeah we tease him a lot 'cause we got him on the spot
Welcome back, welcome back, welcome back, welcome back
Welcome back, welcome back, welcome back
The song was written by John Sebastian.

Thursday, February 22, 2018

Mapping and Comparing the Murder Rates of Italy and the USA Using R

Overview | Code | Output

Three plots visually showing countries with a smaller and greater homicide rate than Italy and the USA. Left: Map of the world depicting countries in green which have a homicide rate less than Italy. Center: Map of Europe depicting countries in green which have a homicide rate less than Italy. Right: Map of the world depicting countries in green which have a homicide rate less than the USA.


Traveling between the USA and Italy, you might be interested in how crime statistics compare between the two countries, as we were. In particular, we wanted to understand in which country we were "safer". Safe is a complex, multi-faceted concept, depending on where you are and what kind of things you do. To look at one facet of it, here we look at the murder rate per million people per country. These stats are published on the NationMaster site and cover intentional homicides, number and rate per million people.

The NationMaster site has a map where countries are colored light to dark based on their crime statistic. We wanted to look at the data in a slightly different way, specifically breaking the data into two: countries with a crime statistic greater than a reference country and countries with a crime statistic lower. Therefore, in a glance we could see for Italy or the USA (as the reference country) all the countries with the higher and lower values.
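The split described above can be sketched in a few lines of base R. This is only an illustration: the data frame below is made-up toy data, and classify_against() is a hypothetical helper name, not something from the original script. It uses the same 0 / 1 / 2 coding as the RateF column in the code further down.

```r
# Toy data; these countries and rates are made up for illustration only
rates <- data.frame(Country = c("A", "B", "C", "D"),
                    Rate = c(3.2, 8.75, 12.0, 40.1),
                    stringsAsFactors = FALSE)

# Code each rate relative to a reference rate:
# 0 = lower than the reference, 1 = equal to it, 2 = higher
classify_against <- function(rate, reference) {
  ifelse(rate < reference, 0, ifelse(rate == reference, 1, 2))
}

# Use country "B" as the reference country
reference.rate <- rates$Rate[rates$Country == "B"]
rates$RateF <- classify_against(rates$Rate, reference.rate)
print(rates)
```

Coding the three groups as numbers keeps the column easy to hand to a mapping function that expects a numeric column and a fixed number of categories.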


You should be able to paste the code below directly onto the command line of your R environment. Here is the code on GitHub. See the notes below for more information.


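library(dplyr)      # mutate()
library(rworldmap)  # joinCountryData2Map(), mapCountryData(), addMapLegend()

data.murder <- read.csv("Murder rate per million people.csv")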
colnames(data.murder)[2] <- "Rate"

# get values for Italy and USA
rate.Italy <- data.murder[data.murder$Country=="Italy","Rate"]
rate.USA <- data.murder[data.murder$Country=="United States", "Rate"]

data.murder.Italy <- data.murder
data.murder.USA <- data.murder
data.murder.Italy$RateF <- mutate(data.murder, RateF=ifelse(Rate/rate.Italy >= 1, ifelse(Rate/rate.Italy == 1.0, 1.0, 2.0),0.0))$RateF
data.murder.USA$RateF <- mutate(data.murder, RateF=ifelse(Rate/rate.USA >= 1, ifelse(Rate/rate.USA == 1.0, 1.0, 2.0),0.0))$RateF

# Plotting variables
title <- "Country homicide crimes relative to"
pal <- c("lightgreen", "lightskyblue", "mistyrose2")

# World map referenced to Italy
data.joined.Italy <- joinCountryData2Map(data.murder.Italy, joinCode="NAME", nameJoinColumn="Country")
mapCountryData(data.joined.Italy, nameColumnToPlot="RateF")
mapParams <- mapCountryData(data.joined.Italy, nameColumnToPlot="RateF", addLegend=FALSE, catMethod="fixedWidth", numCats=3, colourPalette=pal, mapTitle="")
do.call(addMapLegend, c(mapParams, legendWidth=0.5, legendMar = 2, legendLabels = "none" ))
text(x=0,y=120,labels=paste(title,"Italy"), cex=1.5)
text(c(-100,0,100),-140, labels=c("Rates less than Italy","Italy","Rates greater than Italy"))

# Europe map referenced to Italy
data.joined.Italy2 <- joinCountryData2Map(data.murder.Italy, joinCode="NAME", nameJoinColumn="Country")
mapCountryData(data.joined.Italy2, nameColumnToPlot="RateF")
mapParams <- mapCountryData(data.joined.Italy2, nameColumnToPlot="RateF", addLegend=FALSE, catMethod="fixedWidth", numCats=3, colourPalette=pal, mapTitle="", mapRegion="europe")
do.call(addMapLegend, c(mapParams, legendWidth=0.5, legendMar = 2, legendLabels = "none" ))
text(x=17,y=75,labels=paste(title,"Italy"), cex=1.5)
text(c(2,17,32),30, labels=c("Rates less than Italy","Italy","Rates greater than Italy"))

# World map referenced to USA
data.joined.USA <- joinCountryData2Map(data.murder.USA, joinCode="NAME", nameJoinColumn="Country")
mapCountryData(data.joined.USA, nameColumnToPlot="RateF" )
mapParams <- mapCountryData(data.joined.USA, nameColumnToPlot="RateF", addLegend=FALSE, catMethod="fixedWidth", numCats=3, colourPalette=pal, mapTitle="" )
do.call(addMapLegend, c(mapParams, legendWidth=0.5, legendMar = 2, legendLabels = "none" ))
text(x=0,y=120,labels=paste(title,"USA"), cex=1.5)
text(c(-100,0,100),-140, labels=c("Rates less than USA","USA","Rates greater than USA"))


  • On the NationMaster site, we selected all years to export. This gives the largest possible list of countries for comparison with the caveat that not all countries are compared for the same year. For this exercise of R code this wasn't a big concern. 
  • If you export the NationMaster stats into a CSV file into a specific directory, then make sure you use setwd() to change to that directory or modify the read.csv() statement in the code to point to the correct location. 
  • Using the rworldmap package is described here in the post Maps in R: Introduction - Drawing the map of Europe. See the rworldmap package page to get to the latest reference manual. 
  • The rworldmap::mapCountryData() function will warn about quantiles as discussed in this Stack Overflow post. If you look at the help (?mapCountryData) it says "will generate unhelpful errors in data categorisation if inappropriate options are chosen, e.g. with catMethod:Quantiles if numCats too high so that unique breaks cannot be defined."
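As a small sketch of the note about file locations, the export can be read from an explicit path instead of relying on setwd(). The helper name read_export() and the directory in the commented usage line are hypothetical; only the CSV file name comes from the code above.

```r
# Hypothetical helper: read the NationMaster export from a given directory
# and fail with a clear message if the file is not there
read_export <- function(dir, file = "Murder rate per million people.csv") {
  path <- file.path(dir, file)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path)
}

# Example call (the directory is an assumption; adjust it to your setup):
# data.murder <- read_export("~/Downloads")
```

Building the path with file.path() avoids changing the working directory for the rest of the R session.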


The code generates three plots. Two plots are referenced to Italy and one to the USA. Each plot has three colors: green, blue and orange. Green represents countries with fewer homicides than the reference country and orange represents countries with more homicides than the reference. Blue represents the reference country, either Italy or the USA.

  • Map 1: The world map referenced to Italy shows that there are many countries colored orange, i.e., having more homicides. In the NationMaster data, there are only 19 countries with less homicide. Many are European and Scandinavian countries. A few countries with less homicide are on the Arabian Peninsula and others are in Asia. The world map doesn't make it easy to see the smaller countries that are green.
  • Map 2: Zooming into Europe makes it easier to see the countries in Europe that have less homicide than Italy, including Denmark, Spain, Germany, Slovenia, Switzerland, Austria, Norway, and Iceland.
  • Map 3: Back to the world map referenced on the USA, i.e., countries with less homicide than the USA are colored green and countries with more homicide are colored orange.

In 2010, 8.75 people per million were killed due to intentional homicide in Italy. For 2010, the number was 42.01 people per million in the USA. The UNODC report Global Study on Homicide 2013 reports that for Italy, there were 0.9 homicides per 100,000 people with a total count for that year of 530. For the USA, the report gives 4.7 homicides per 100,000 people with a total count for that year of 14,827.

There will be discrepancy in the numbers because we are mixing years and different reporting sources, but it's still useful to do a sanity check.

  • In 2012, Italy's total population was about 60 million. 8.75 people per million x 60 million = 525 people, close to the UNODC report.
  • In 2012, USA's total population was about 314 million. 42.01 people per million x 314 million = 13,191, about 11% off from the UNODC number.
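The sanity check above can be reproduced in R. This is a minimal sketch using only the figures quoted in this post (rates per million, approximate 2012 populations in millions, and the report's total counts); the variable names are ours.

```r
# Figures quoted in the post
rate.per.million <- c(Italy = 8.75, USA = 42.01)    # NationMaster rates
population.millions <- c(Italy = 60, USA = 314)     # approximate 2012 populations
report.count <- c(Italy = 530, USA = 14827)         # total counts from the report

# Estimated homicide counts implied by the NationMaster rates
estimated <- rate.per.million * population.millions

# Relative gap between the report's count and the estimate, in percent
gap.percent <- 100 * (report.count - estimated) / report.count

print(round(estimated))
print(round(gap.percent, 1))
```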

The takeaway is that one is more likely to be killed by intentional homicide in the USA than in Italy. The Wikipedia article Crime in the United States is a good introduction. It states that "Overall the total crime rate of the United States is higher than developed countries, specifically Europe, with South American countries and Russia being the exceptions", which is pretty much what is shown above.

An interesting side note from the UNODC report is that "[i]n Italy, there has been a 50 per cent decline in this type of homicide since 2007, with organized crime-related rates of homicide decreasing from 0.2 to less than 0.1 per 100,000 population."

Wednesday, February 21, 2018

Using R to View the 2016 Italian Referendum Results

Overview | Code | Output

Three maps showing the 2016 Italian Referendum results. Left: Choropleth map. Center: Bubble plot. Right: Worldwide bubble plot.


We are currently working our way through the Microsoft Professional Program in Data Science. There has been a lot of opportunity to work with R, the programming language and software environment commonly used in statistical computing. In particular, the course Programming with R for Data Science gives students the opportunity to experiment with combining static maps (e.g. Google Maps) and data overlays. In honor of the upcoming election in Italy, we thought it would be interesting to look at some recent Italian election statistics using R. The code plots the 2016 Italian Referendum results inside Italy by region and worldwide. The data for this post comes from the Wikipedia article Italian constitutional referendum, 2016. Any errors transcribing the data are our own.


You should be able to paste the code below directly onto the command line of your R environment. Here is the code on Github. See the notes below for more information.


regionsEN=c("Abruzzo", "Aosta Valley", "Apulia", "Basilicata", "Calabria",
"Campania", "Emilia-Romagna", "Friuli-Venezia Giulia", "Lazio",
"Liguria", "Lombardy", "Marche", "Molise", "Piedmont", "Sardinia",
"Sicily", "Trentino-South Tyrol", "Tuscany", "Umbria", "Veneto",
"Italy", "Europe", "'South America'", "'North America'", "Asia")

regionsIT=c("Abruzzo","Valle d\'Aosta", "Puglia", "Basilicata","Calabria",
"Campania", "Emilia-Romagna", "Friuli-Venezia Giulia", "Lazio",
"Liguria", "Lombardia", "Marche", "Molise", "Piemonte", "Sardegna",
"Sicilia", "Trentino-Alto Adige", "Toscana", "Umbria","Veneto",
"Italia", "Europa", "America meridionale", "America settentrionale e centrale",
"Africa, Asia, Oceania, Antartide")


electorate=c(1052049, 99735, 3280745, 467000, 1553741,
4566905, 3326910, 952493, 4402145,
1241618, 7480375, 1189180, 256600, 3396378, 1375845,
4031871, 792503, 2854162, 675610, 3725399,
46720943, 2166037, 1291065, 374987, 220252)

percentNo=c(64.4, 56.8, 67.2, 65.9, 67.0,
68.5, 49.6, 61.0, 63.3,
60.1, 55.5, 55.0, 60.8, 56.5, 72.2,
71.6, 46.1, 47.5, 51.2, 61.9,
60.0, 37.6, 28.1, 37.8, 40.3)

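referendum=c(rep("No",6),"Yes",rep("No",9),"Yes", "Yes", rep("No", 3), rep("Yes", 4))
# TRUE for the 20 regions, FALSE for Italy as a whole and the 4 overseas constituencies
# (this definition is missing from the listing; it is inferred from the vectors above)
isRegion=c(rep(TRUE,20), rep(FALSE,5))

library(ggmap)         # get_map(), ggmap(), geocode()
library(mapIT)         # mapIT()
library(OpenStreetMap) # openmap(), openproj(), autoplot()

# NOTE: the data frame names were lost from this listing; votes, votes.regions
# and votes.world are placeholder names
votes <- data.frame(regionsEN, regionsIT, electorate, referendum, isRegion, percentNo)
votes$regionsEN <- as.character(votes$regionsEN)
latlon <- geocode(votes$regionsEN)
votes <- cbind(votes, latlon)
votes.regions <- subset(votes, isRegion==TRUE)
votes.world <- subset(votes, isRegion==FALSE)
title <- "Referendum 2016 Results"

# generate Italy map
p <- ggmap(get_map(location="Italy",zoom=6), extent="panel")
p <- p + geom_point(aes(x=lon, y=lat), data=votes.regions,
colour=ifelse(votes.regions$referendum == "No",'red','green'),
alpha=0.4, size=log(votes.regions$electorate)) # alpha and size assumed to match the world map below
p <- p + labs(title=paste(title, "by Region"))
print(p)

# generate Italy choropleth map
gp <- list(guide.label="Percent\nNo\nVote", title="2016 Referendum Results by Region", 
low="green", high="red")
mapIT(percentNo, regionsIT, data=votes.regions, graphPar=gp)

# generate world map
q <- openproj(openmap(c(70,-145), c(-70,145), zoom=1)) 
q <- autoplot(q) + geom_point(aes(x=lon, y=lat), data=votes.world,
col=ifelse(votes.world$referendum == "No",'red','green'),
alpha=0.4, size=log(votes.world$electorate))
q <- q + labs(x="lon", y="lat", title=paste(title,"Worldwide")) 
print(q)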


  • For an R environment, we use both Microsoft R Client and RStudio, but prefer the latter slightly.
  • We made some minor tweaks to the region and constituencies (outside of Italy) to make geocoding easier and place the bubble for a given constituency in a position on the map that makes sense. In particular, we simplified 'North and Central America' to 'North America' and 'Africa, Asia, Oceania, Antarctica' to 'Asia'. 
  • The get_map() function can't produce a world map as described in this Stack Overflow post, so we found other ways to create a world map with the rworldmap and openstreetmap packages. We use openstreetmap to create a world view and rworldmap along with mapIT to create a choropleth map of Italy. 
  • The mapIT package is discussed in Building a choropleth map of Italy using mapIT. Note in the code above that there are two listings of the Italian regions and constituencies, regionsEN in English and regionsIT in Italian. regionsEN was used with ggmap::geocode() and regionsIT was used with mapIT.
  • You may see over-query-limit warnings as described in the Stack Overflow post. We saw this happen in the course of running the code multiple times.
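One way to cut down on repeat queries, and so on the over-query-limit warnings mentioned in the last note, is to cache geocoding results in a local file. Below is a minimal sketch, not part of the original code: geocode_cached() and the cache file are hypothetical, and the demo uses a stand-in geocoder that returns dummy coordinates so it runs without network access. In practice you would pass ggmap::geocode as the geocoder.

```r
# Hypothetical helper: geocode only the places not already in a local cache.
# `geocoder` is any function that takes a character vector and returns a
# data frame with columns lon and lat.
geocode_cached <- function(places, geocoder, cache_file) {
  if (file.exists(cache_file)) {
    cache <- read.csv(cache_file, stringsAsFactors = FALSE)
  } else {
    cache <- data.frame(place = character(0), lon = numeric(0),
                        lat = numeric(0), stringsAsFactors = FALSE)
  }
  missing <- setdiff(places, cache$place)
  if (length(missing) > 0) {
    found <- geocoder(missing)
    cache <- rbind(cache,
                   data.frame(place = missing, lon = found$lon,
                              lat = found$lat, stringsAsFactors = FALSE))
    write.csv(cache, cache_file, row.names = FALSE)
  }
  # Return coordinates in the same order as the requested places
  cache[match(places, cache$place), c("lon", "lat")]
}

# Stand-in geocoder with dummy coordinates (for demonstration only)
fake_geocoder <- function(places) {
  data.frame(lon = seq_along(places), lat = -seq_along(places))
}

cache_file <- tempfile(fileext = ".csv")
first <- geocode_cached(c("Lazio", "Umbria"), fake_geocoder, cache_file)
second <- geocode_cached(c("Umbria", "Lazio"), fake_geocoder, cache_file) # served from the cache
```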


The code generates three plots as shown at the head of this post.

The referendum was soundly defeated. Inside Italy, 60% voted against it. Only three regions approved it: Emilia-Romagna, Trentino-Alto Adige and Tuscany. The bubble plot shows the size of the electorate (size of bubble) and their final vote (red for no and green for yes). The choropleth plot shows more clearly that the strongest no vote was more prevalent in the south of Italy.

About 10% of Italians live outside of Italy and are broken into four constituencies: Europe, South America, North and Central America, and Africa / Asia / Oceania / Antarctica. Each of these constituencies voted for the referendum as can be seen in the worldwide bubble plot. In the worldwide plot, the logarithm of the number of votes is plotted. It's not a particularly compelling visualization because the size of the green bubbles (all constituencies outside of Italy) and the red bubble (Italy) could lead a viewer to incorrectly conclude the referendum passed overall.
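To see how much the logarithm compresses the bubble sizes, here is a short sketch using the electorate figures from the code above (Italy plus the four overseas constituencies). The largest and smallest log values differ by a factor of roughly 1.4, while the electorates themselves differ by a factor of roughly 200.

```r
# Electorate figures for Italy and the four overseas constituencies (from the code above)
world.electorate <- c(Italy = 46720943, Europe = 2166037,
                      "South America" = 1291065, "North America" = 374987,
                      Asia = 220252)

# Values passed to the size argument in the worldwide plot
size.log <- log(world.electorate)

# Compare the spread of the plotted sizes with the spread of the raw counts
ratio.log <- max(size.log) / min(size.log)
ratio.raw <- max(world.electorate) / min(world.electorate)

print(round(size.log, 2))
print(round(c(log = ratio.log, raw = ratio.raw), 1))
```

A possible alternative, which we did not try here, would be to map size to the electorate inside aes() and add ggplot2's scale_size_area() so that bubble area is proportional to the number of voters.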

For more on voting in Italy, in particular how overseas voting works, see The Italian March 2018 Election for Overseas Italians – Observations and Vocabulary Lesson.

Monday, February 19, 2018

The Italian March 2018 Election for Overseas Italians – Observations and Vocabulary Lesson


Italy is holding a general election on March 4, 2018. How and why this has come to be is better left to those more able to explain the situation than I: see, for example, Italian general election, 2018, Italy dissolves parliament for March election, and Paolo Gentiloni to succeed Matteo Renzi as Italian prime minister to see why the government was dissolved. Up for grabs in this election are 315 seats in the Italian Senate (Senato) and 630 seats in the Chamber of Deputies (Camera dei deputati), which is pretty much everything. So much for easing in change.

Fun with the 2018 Italian political party symbols (insignia). Represented here are about 40 symbols.


Origins of the election aside, there are several confusing aspects we'd like to mention as seen by folks "outside" the system.

The first confusing aspect is the sheer number of parties in this election.

I lost track after counting over 30 parties registered. One source reports 98 parties. Even though not all parties may be on the ballot (la scheda elettorale) for a given citizen in a given location, the choices are nonetheless overwhelming. At least, I think so. And, boy do Italians, err, I mean political parties love their symbols. It took me a couple of hours to line up all the round colorful symbols (insignia) and understand who and what ideas were behind each one. More on symbols is discussed below.

The second confusing aspect of this election is the process for voting.

This election is the first test of a new electoral law passed in 2017 called Rosatellum bis. In this law, 36% of the seats of both the Senate and Chamber of Deputies are allocated to candidates who receive the most votes, or winner takes all (collegi uninominali). The remaining 64% of the seats in both bodies are awarded proportionally to parties (collegi plurinominali) based on the votes received by each party. On the ballot, the winner-takes-all and proportional systems are combined in such a way that there are a few ways you can vote. As explained in the Wikipedia article on this election, you can:

I. Select a candidate representing a constituency AND select a party that supports him (there may be multiple because the candidate may be in coalition with different parties). Therefore, you make two X marks on your ballot. 
II. Select just a party. In this case, your vote extends to a candidate in coalition with the party. 
III. Select just a candidate representing a constituency. In this case, your vote is proportionally extended to those parties supporting the candidate.

Diagrams of how this looks on ballots are shown here; though written in Italian, the link's images get the point across about the different choices and their implications. Some commentators are warning that voting as described in I. above is the only way to ensure your vote goes to whom you intended.

Our ballots (shown below) are simpler than you'd find in Italy because they do not contain coalitions (coalizioni), and as far as we can tell, we just write our candidates' names next to the party they are associated with as well as draw an X on the party. You can find the list of candidates for overseas voters online at the Dipartimento per gli Affari Interni e Territoriali - Elezioni trasparenti site, under Circoscrizioni Estero. For Italians in North and Central America, the We the Italians site gives a nice overview of the choices of candidates and parties.

Finally, the third confusing aspect, yet also interesting, is the existence of parties and platforms focused on concerns specific to Italians living outside the country.

We are not used to thinking about the voting bloc outside of a country. Let's back up for a second and review the Italian voting system. Italians residing overseas (all'estero) are part of a "territory" (circoscrizione estero) that is in turn broken into four subdivisions (quattro ripartizioni): Europe, including the Asian territories of the Russian Federation and Turkey; South America; North and Central America; and Africa, Asia, Australia and Antarctica. Each subdivision has different parties and candidates on the ballot. Of the seats in this general election, 6 of the 315 Senate seats and 12 of the 630 Chamber seats will be elected by Italians abroad.

In the North and Central America ripartizione, we have – at the time of writing this – two parties which from their names are obviously focused on the concerns of Italians living abroad: Associative Movement Italians Abroad (MAIE) and the Free Flights to Italy (see Update below). Two of the items on MAIE's platform are, for example:

  • Eliminating the IMU (Imposta Municipale Unica). From the MAIE site: "Italians living abroad, today pay the IMU tax on their home in Italy, due to the fact that it is not considered 'first home', an unjust discrimination that must be revoked."
  • Extending healthcare to residents abroad. Again from the MAIE site: "Provide medical care to Italian residents abroad when they return temporarily to Italy. Italian citizens must have free access to the health care system in the country even if their residence is abroad."
The Free Flights to Italy (see Update below) platform seems to be all about culture with the "... specific target for the benefit of Italian citizens living abroad: building bridges between communities through free flights to and from Italy." Sign me up for those.

However, we also took a close look at all the candidates running under the Salvini-Berlusconi-Meloni coalition - voting suggestions for mom not for us, I swear - and it was interesting to note that many of the candidates' platforms also made mention of IMU and healthcare as did MAIE. Many of the candidates representing voting blocs outside of Italy are tuned into their constituents. What a concept.

Perhaps the focus on overseas voters is less surprising when we look at the numbers. Using figures from the Italian Referendum of 2016 Numbers we can say that

- In Italy there were 46,720,943 voters. 
- Outside of Italy there were 4,052,341 voters. (This source gives a 4.3 million count of abroad voters.) 
- Together, these numbers indicate that about 8% of Italian voters live outside of Italy (4,052,341 of 50,773,284 voters in total).
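For anyone who wants to check the arithmetic, here is a short R sketch using the two voter counts above. It computes both the share of all voters who are abroad and the ratio of voters abroad to voters inside Italy, since the two are easy to mix up; the variable names are ours.

```r
# Voter counts from the 2016 referendum figures above
voters.in.italy <- 46720943
voters.abroad <- 4052341

# Share of all voters who live abroad, in percent
share.abroad <- 100 * voters.abroad / (voters.in.italy + voters.abroad)

# Voters abroad relative to voters inside Italy, in percent
ratio.abroad <- 100 * voters.abroad / voters.in.italy

print(round(c(share = share.abroad, ratio = ratio.abroad), 1))
```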

By comparison, there are about 9 million US citizens living abroad. With an estimated total USA population in 2016 of 321 million, we can estimate that about 3% of US citizens live outside the USA. Given the higher percentage of Italians abroad, it's fair to say that they represent an important voting bloc and, therefore, it makes some sense that there are specialized parties tailored to their concerns. But, really, free flights? Apparently no, see Update below.

Ballots for the Chamber of Deputies (red) and the Senate (blue) for Italians living in the USA.

Vocabulary and Grammar

Rather than tell you who to vote for, or who we voted for, we'll take this opportunity to highlight some of the new vocabulary and grammar we've encountered.


A plico is a group of papers or documents in a sealed envelope: a packet. In two years studying the Italian language, this was the first time I ever encountered this word. In our plico, there is a sheet with instructions that include this:

All'interno del plico troverete
  • 1 certificato elettorale
  • 1 o 2 liste dei candidati
  • 1 o 2 schede elettorali
  • 2 buste, una piccola di norma di colore bianco e una più grande già affrancata con l'indirizzo del competente ufficio Consolare
  • Il foglio informativo.

This translates as: "Inside the packet you will find 1 election certificate, 1 or 2 lists of candidates, 1 or 2 election ballots, 2 envelopes (one small one, normally white, and one larger, postage-paid one addressed to the competent consular office), and the information sheet."

Our plico contained 2 lists of candidates and 2 associated ballots, one for the Senate and one for the Chamber of Deputies. I'm not sure under what conditions a voter would receive just one list of candidates or one ballot.

The certificate (certificato elettorale) is a sheet of paper from which you tear off the bottom part and send back with your ballot as proof that your vote is valid.

Left: Instructions for the March 2018 election. Right: A mailer received from a candidate running with the coalition Salvini-Berlusconi-Meloni (sounds like a repackaged version of spumone).


Contrassegni are symbols that represent each party. There are rules for the creation of a symbol, such as that the symbol can't make reference to fascist or Nazi or religious themes, and it must be a circle. I immediately started wondering if they were always circles. It seems like a practical standardization. I looked around a bit for the history of political symbols but couldn't find much by way of standardization or their history, though I did stumble on to the site I simboli della discordia ("the symbols of discord"), which thoroughly describes this election's political symbols (in Italian).

Some symbols contain other symbols (pulce) of parties in a sort of coalition. Examples on the ballots we received are Civica Popolare and Salvini-Berlusconi-Meloni.

The parties that Italians living in the USA can vote for in the Chamber of Deputies and Senate.


Below are some of the slogans we could find for the parties we had on our ballot. We couldn't find a well-defined slogan for MAIE, Partito Repubblicano, or Free Flights. Though Free Flights (see Update below) does make liberal use of "Time to Say Goodbye" (Con Te Partirò) by Sarah Brightman and Andrea Bocelli, which could sort of be counted as a slogan?

  • Avanti, Insieme – "Forward, together" [Partito Democratico, PD] 
  • Per i molti, non per i pochi – "For the many, not the few" [Liberi e Uguali, LeU]
  • Onestà, Esperienza, Saggezza – "Honesty, Experience, Wisdom" [Forza Italia, FI]*
  • Più Europa, serve all'Italia – "More Europe, Italy needs it" [Più Europa, +E]
  • Il vaccino contro gli incompetenti – "The vaccine against incompetents" [Civica Popolare, CP]

* This slogan requires a comment.  Experience okay, wisdom maybe, but honesty after everything that has come out about Berlusconi?

Campaign Mailers

We received five campaign mailings. One from a MAIE candidate and four from Salvini-Berlusconi-Meloni (S-B-M) candidates. For educational purposes, let's look at one from the S-B-M camp, from Senate candidate Francesca Alderisi.  She writes:

Adesso sta a te fare la TUA scelta. Scegli con la testa. Scegli con il cuore! – "Now it's up to you to choose. Choose with your head. Choose with your heart!"

Update 3/3/2018

Alas, it turns out that the Free Flights to Italy party, a choice for North and Central American voters, is a hoax. As of today, their web site has been taken down and there is no trace of those free flights. Here are two Italian articles detailing what's known: Free Flights to Italy, il mistero del partito fake ammesso alle elezioni and Questo partito è una truffa? The erstwhile party was the idea of one Giuseppe Macario. His running mate? His mom. I guess - and if you read to this point you knew it was coming - Con Lui Partirà.

Wednesday, January 10, 2018

Bergamo – Street Sign Language Lesson XXIV

Street Sign Language Lesson 23 < Street Sign Language Lesson 24

Evviva! (hurray! in Italian), we've reached the first blog post of 2018. Over 10 years blogging and still making no sense. Some things never change. So, without further ado, let's get to this installment of Street Sign Language Lesson, where – drumroll please – we'll deal with a lost cat (dang, not again), dog poop (dang, not again), review some signs in the PAM supermarket (dang, not again), and review a prostate drug ad (ok, this really is a first).

Left: Lost cat poster in Bergamo. Center: A cake stand. Right: Atalanta fan mural: banned everywhere.

Abbiamo perso il nostro gatto – Brianna – "We lost our cat Brianna."
Just when I thought I had it all straight on cat posters and how to refer to your cat – masculine as gatto and feminine as gatta – along comes this poster to turn me upside down. Brianna is a girl – blu di Russia, femmina – but they don't say abbiamo perso la nostra gatta as I might have expected.

The poster (like all good cat posters) makes good use of the subjunctive: Per favore controllate ovunque la nostra Brianna si sia potuta nascondere – which translates as "please check everywhere our Brianna could have hidden herself." I guess you can't trust those Russian blue cats.

Alzatina – "tiered cake stand"
I saw this alzatina in a recent visit to I Giardini di Giava. It struck me as a very practical name for a tiered cake stand. In English, you have to think about tiers, cakes, and stands with no reference to an action or verb. Alzatina is the diminutive of alzata, which is a shelf or rise, which in turn comes from the verb alzare - to raise or lift. 

Diffidati Ovunque Bergamo – "Warned everywhere - Bergamo"
I had to ask a Bergamasco friend to translate this one. Diffidati refers to fans who have received injunctions (diffide) to stay away from stadiums because of rowdy behavior. Furthermore, diffidati is the past participle of the verb diffidare – to warn – so it means those who have been warned. Our friend writes of the slogan: "Vuol dire che hanno ricevuto diffide (dalla polizia/tribunale) a non frequentare gli stadi in molte città (ovunque). Noi tifosi bergamaschi siamo quelli diffidati dappertutto e siamo fieri di questo." ("It means they have received injunctions (from the police/court) not to go to stadiums in many cities (everywhere). We Bergamo fans are the ones banned everywhere, and we are proud of it.") It is a point of pride for at least some Bergamo soccer fans.

Left: Pick up after your dog plea stenciled on a sidewalk in Bergamo. Center: Sign asking customers to not use the automatic check-out for items discounted 50%. Right: Amazon lockers in Bergamo.

L'ha fatta grossa? Raccoglila! – "Made a mistake? Pick it up!"
Here we go again. If it's not cats, it's dogs*. La Repubblica (Hoepli Editore) dictionary says farla grossa means commit a serious action. This message was stenciled on a sidewalk in Bergamo Città Alta. The rest of the message reads: È un obbligo ma anche un gesto d'amore verso la tua città – "It's not only your duty, but also an act of love for your city." And so it goes to get owners to pick up after their dogs.

* Makes me think of the song "Fortune Presents Gifts Not According to the Book" from Dead Can Dance that appeared on their Aion release. The lyrics include the following lines "Because in a village a poor lad has stolen one egg / He swings in the sun and another gets away with a thousand crimes / When you expect whistles it's flutes / When you expect flutes it's whistles".

Informiamo i sig[g] clienti che la merce con bollino 50% non va battuta alle casse automatiche – "We inform our dear customers that merchandise with a 50% sticker should not be scanned at the automated checkout."
Non va battuta is the same as non deve essere battuta or "should not be scanned". Note the "a" at the end of battuta to agree with la merce. Sig. is "Mr." and the plural sigg. is "Messrs."

Ordina su Amazon, ritira qui – "Order on Amazon, and pick up here"
An indication of the slow, tireless penetration of Amazon: here we are in PAM supermarket in Bergamo, via Camozzi, and we see they have cleared precious floor space for Amazon.

Left: Prostamol drug sign - no more excuses bearded daddy. Right: Affectionate mom looking for job as babysitter.

Prostamol: contribuisce a favorire la funzionalità della prostata e delle vie urinarie. Da oggi basta scuse! – "Prostamol: helps to promote the functioning of the prostate and urinary tracts. From now on no more excuses!"
What caught my eye was the Da oggi basta scuse! La scusa is an "excuse"; plural is le scuse. Da oggi is more literally translated as "from today", but I think it reads better as "from now on". This drug ad was seen in a pharmacy on Via Torquato Tasso.

Signora Lucia, amorevole mamma italiana di 50 anni cerca lavoro come babysitter – "Ms. Lucia, affectionate Italian mother of 50 years is looking for work as a babysitter."
Amorevole caught my eye in this sign we saw walking along Via Masone. Other -evole words are piacevole – "pleasant", gradevole – "pleasant", sgradevole – "unpleasant", onorevole – "honorable", and the useful vomitevole – "nauseating" as in something that makes you vomit. Find more -evole words on

Tuesday, December 19, 2017

Bergamo – Street Sign Language Lesson XXIII

Street Sign Language Lesson 22 < Street Sign Language Lesson 23 > Street Sign Language Lesson 24

It will never end. I confess I love signs and notes. With that, let's dig into the 23rd* installment of Street Sign Language Lessons where we study a lost cat sign, learn the Italian word for caviar, learn about a type of broccoli grown in the Veneto, and review some negative graffiti about soccer.

* In Italian, 23rd is ventitreesimo, which you get to know well in Bergamo because the beloved pope Papa Giovanni XXIII was born in nearby Sotto il Monte. His name, always with the XXIII, graces many roads and institutions so that living here you soon get used to saying, vay-n-ti-tray-ay-zi-mo. No, it isn't a new Starbucks coffee drink.

Left: Sign for caviar to order at Orobica Pesca. Center: Sign for panettone to order. Right: A variety of broccoli grown in the Veneto and called broccoli fiolari.

Caviale fresco su ordinazione – "fresh caviar to order"
We were in Orobica Pesca once again, and while waiting for our number to be called, I came across this sign for caviar. I was surprised that there was not an "r" in the Italian word for caviar, which is caviale. Note that caviale is a noun ending in "e" which is masculine, for example: il caviale fresco. For more on words ending in "e" and that are masculine, see A Rule of Thumb for Predicting the Gender of Italian Language Nouns.

Panettoni su ordinazione – "panettone to order"
'Tis the season to order, from caviar to panettone. This sign was seen in a new bakery, Grani Madre, in via Santa Caterina, specializing in grani antichi e mulino a pietra – "ancient/heirloom grains, stone-mill ground".

Fioi Broccoli Fiolari – "Sons Broccoli sons?"
Broccolo fiolaro di Creazzo is a variety of broccoli from Creazzo in the Veneto region. The term fiolari (singular fiolaro) comes from the dialect term "fioi" which means "figli" or "sons", so named for the presence of shoots/buds along the trunk of the plant. In this case, Fioi is the name of the company producing this variety that we saw at our local ortofrutta.

Left: Poster for a lost male cat in Bergamo. Center: Graffiti in Bergamo equating soccer to ignorance. Right: Sign advertising a house for sale.

Gatto smarrito – "lost cat"
It's something that grabs you every time: a lost cat poster complete with a cute picture of said cat staring at you. We first dealt with this subject in Bergamo – Street Sign Language Lesson XIV, where we sweat(ed) over the difference between gatto (male cat) and gatta (female cat). This poster continues on to say gatto rosso maschio di 4 mesi – "red male cat 4 months old".

There is a good example of using the subjunctive as well in the poster: chiunque lo vedesse mi contatti – "anybody who sees him contact me". It's vedesse instead of vede, because with indefinite pronouns like anyone/anybody, the subjunctive is used. So much to learn in a lost cat poster. 

Calcio -> ignoranza – "soccer -> lack of education"
Instead of all the pro-soccer graffiti we usually find "decorating" walls of the streets leading to the stadium, it was interesting to see graffiti that was anti-soccer.  Ignoranza could mean illiterate, lack of education, rude, or incompetent. Or, maybe all of the above?

Città Alta svendita, trilocale doppia terrazza, introvabile – "Città Alta sale, three rooms, two terraces, extremely rare"
A trilocale – three rooms - on average is about 80 mq (860 ft^2), and is usually two bedrooms and a living room. The kitchen may be part of the living room or separate (but usually small) and there is at least one bathroom. I was drawn to this sign by the word introvabile, a word to keep in your back pocket and pull out when occasion warrants it.

Left: Sign asking us to stay off the bocce court. Right: A sticker on an ATM that says the bank is always smiling at us. Really?

Si prega cortesemente di non entrare nel campo da bocce – "please don't enter the bowling court"
The bowling court or bocce court was behind the Santuario della Madonna della Castagna, on the very northern edge of Bergamo. From Bergamo Città Alta, you walk about 6 km northwest to reach this little church. It's a pleasant walk through the characteristic well-to-do neighborhoods of Bergamo. As the story goes from the brochure describing the church, Mary appeared here on April 28, 1510 and asked the locals to spread the news that they should erect a temple...and a bocce court. I suspect this kind of bossy apparition wouldn't work so well these days.

La tua banca ti sorride sempre – "your bank always smiles at you"
This was a sticker on an ATM. The sticker shows the words against a background of the huge gaping jaws of a shark (squalo). What could they ever mean by this!? The sticker was gone when we went back a week later.

Monday, December 18, 2017

A Personal Information Management System: Introducing Scrapbook


Introduction and Motivation
What is Scrapbook?
What Scrapbook Isn't
Redefining Digital Scrapbooking
What's in a Name?
Scrapbook History and Milestones
Working with Scrapbook
Artificial Intelligence Tools
Natural Language Queries
Scenarios Revisited
What We've Learned
Future Directions

Introduction and Motivation

This post is about a personal information management system (PIM) we call Scrapbook. In this post we will talk about our 15+ years of experience developing and implementing Scrapbook. But before we get into details, let's set the stage with some of the scenarios that prompted us to create Scrapbook.

  • We are about to walk into a friend's house. What are the names of her three children?
  • We remember a wonderful hot chocolate we had in Cuneo. What was the name of the café? What was the name of any place we ate at in Cuneo?
  • We are planning to have a friend over for lunch and we want to prepare a dish we haven’t served her before and obviously want it to be something she’ll enjoy, so we review meals we’ve had with her in the past.
  • We are planning a dinner with friends and we are looking for an interesting red wine to serve. We remember that we drank some nice dolcettos from Piedmont last year. What were they?
  • We are about to call to make an appointment for a haircut. What are the names of the people that work at the salon so whoever answers the phone we're able to address them by name?
  • We are wondering about an upcoming wedding we have been invited to and we want to pull up the invitation to confirm the date and protocol for gifts.
  • We are writing an email to a friend and we want to recommend the last hike we did in Piedmont. What were the details of that hike?
  • We remember reading an interesting book in 2010 that had "science" in the title. What was it?
  • We want to know if we've ever received a postcard from Germany.
  • We want to know the average cocoa fat percentage of chocolate we tend to buy and what, if any, is the correlation between percentage and how we rate it.
  • We want to understand which categories in Scrapbook have the most items as well as how many items there are per year over the last 15 years.

These scenarios span a few of the types of questions that we encounter daily and that we want to be able to answer securely and quickly. And while there may be different software and tools to address one or more of these scenarios, Scrapbook for us is a one-stop personal information management system to address them all. Later in this post, we'll return to these scenarios and show a solution for each of them leveraging our Scrapbook database via diverse user experiences including a web application, natural language interaction via Cortana, Skype and web bot, as well as by various analysis and plotting tools.

We hope that after reading this post you will:
  • Understand the power of developing your own personal information management system.
  • Be inspired to create your own Scrapbook, or if you already have something similar, that you might find a few nuggets of useful perspective here.
  • Consider collecting and managing your own personal data, by whatever means is appropriate for you. Your data tells your story and is too valuable to hand over to social media platforms to be exploited for their profit, and potentially at your expense.

What is Scrapbook?

Scrapbook is a digital personal information manager. To understand what that means, let's talk about common information management tools we use each day. There are file storage services such as OneDrive, iCloud, and Dropbox, which we use to help manage photos and files, and email services like Gmail or Outlook, which we use to organize email, contacts, and calendar events. Facebook, LinkedIn, and Snapchat are social network services that facilitate communication and interaction between friends and contacts via news feeds and personal timelines. Twitter and blogging are further examples of information sharing platforms through which we can relate what we're doing and what we've been thinking about. And, let's not forget the numerous offerings for personal wikis and journaling software.

All of these platforms manage your personal information in one form or another. Each has strengths and weaknesses and attempts to address a part of the personal information management problem. Scrapbook doesn't aim to replace any of these tools. In fact, Scrapbook can take input from or output data to these tools. That's nice, you say, but then how is Scrapbook different? We created Scrapbook to deal with four concerns that we felt are not yet adequately addressed:

1. How to deal with archival data

Archival data is information that we don't need immediate access to, but may wish to review in the future. It includes, but is not limited to, year-end financial statements, old health records, newspaper clippings, postcards, brochures, tickets, invitations, and letters. Archival data is all that physical paperwork that ends up in over-stuffed folders in drawers or file cabinets. Think of the times you needed to find something, searching page by page through one bulging folder after another to finally locate (or not) what you were looking for. If your archival data happens to be already digitized, it's likely stored somewhere on a drive or in the cloud, but finding it can be tricky. The idea of not being able to locate important information at some point in the future bothered us, and we set out to address it with Scrapbook.

For years, we saved things like playbills of theater shows, mementos of places we visited, articles cut out of the paper, labels from products we liked, and other tidbits of information. We pasted the scraps of paper into blank notebooks, which eventually filled up. Our "scrapbooks" grew in number over the years. We noticed that we rarely referred back to these physical scrapbooks, in large part because, except by chronological order, it was hard to find anything. You can think of Scrapbook, in part, as a digital version of the physical scrapbooks in which we preserved these scraps of information, but far more accessible and fun.

Left: Physical Scrapbook Page from 2002. Right: Physical Scrapbook Page from 2000.

2. How to capture context about data

Metadata provides context about an item (event, place, person, or object) that we capture because it is important to understanding why the item is interesting in the first place. Contextual data gives meaning to and unlocks understanding of the event, place, person or object. But how to capture this metadata and where to persist it isn't obvious. For example, we keep personal notes on friends that include reminders about what they like to eat or are allergic to, the names of their kids, and important events in their lives. You might be thinking, why not store that in Outlook or Gmail contacts, and you'd be correct. But what about personal notes on books read? Or, thoughts about a special dinner, an epic hike, or a great concert? Where can we store all of these notes - context - consistently and in one place whether it's about an event, place, person, or object? Consistent management of contextual information (metadata) is one of the scenarios we set out to address in Scrapbook.

In the photos of pages from our physical scrapbooks, written notes can be seen around the pasted objects. These are examples of 'metadata' that provide necessary context.

3. How to find relevant information quickly

It's ironic that using Bing or Google we can call up within seconds all the details of a celebrity. But what about someone we actually know or care about? We know what you're thinking: just go to Facebook and look them up. Yes, that sort of works if the person in question is even using Facebook. However, here we are talking about information that matters to us, which is information about someone that you probably wouldn't find in a social media profile or that wouldn't even appear in Facebook. It's information that's gleaned by spending time with someone, in person, in the context of a direct relationship. How to access this kind of context quickly and on our terms intrigued us and became a guiding principle for Scrapbook.

We use people data here as an example, but our argument applies to all types of information, be it events, places, or objects. In a way, Scrapbook is a small private search engine customized for our data that can be searched quickly. How quickly? Less than 10 seconds in most cases. That might not sound that great at first, but think about that filing cabinet of over-stuffed folders. How long would it take to find something in there? Or to navigate a half dozen different apps or web sites to find what you are looking for? With that in mind, 10 seconds isn't bad at all.

4. How to own our information

Fundamentally, we don't have much trust in many of the platforms we've mentioned above, especially the current flavors of social network services. We are not keen on having our personal information exploited by algorithms to sell us products or filter our news. Yes, we tolerate this to a degree, but absolutely not for the broader categories of information we have in mind here. And while we acknowledge that these services are fun and constitute an important social component for many, it seems crazy to us that people spend time at all creating detailed personal timelines that generate ad sales for someone else.

We’re also not confident in the longevity of these services and platforms. Five years is a long time and ten years an eternity in the technology sector. MySpace, for example, is a distant memory. Facebook is already regarded as passé, with internet newbies flocking now to Snapchat. Smaller platforms come and go in a relative twinkling. Returning for a moment to the concept of an archive, we’re thinking long term.

From the start, Scrapbook was envisioned as a way we could maintain as much control as possible of our own data, and really, the story of our lives, while making it accessible in a secure, sustainable manner. That desire led us down the avenue of developing our own solutions. It's not an easy road, but it's been rewarding. We maintain that taking ownership of our data has given us a greater understanding and appreciation of what constitutes us data-wise, of the information we return to more often, and which data is important enough to save for the future.


Some definitions relative to Scrapbook that are used often in this post are:

Scrapbook item
An entry in the Scrapbook collection. Each Scrapbook item always has at minimum a title, category, date, location, assets, and notes field associated with it. In our current implementation, each Scrapbook item is a JSON object within a collection of JSON objects that make up the Scrapbook database. These JSON objects are stored in a document-oriented NoSQL database. The bold terms are represented as fields in the Scrapbook JSON object shown in the image below. A Scrapbook item is also referred to as a Scrapbook entry.
Scrapbook asset
A photo, document, link, web page, or any other digital format that is associated with the Scrapbook item and is stored on disk (locally or in the cloud). The names of the assets are associated with each Scrapbook item in an assets field. Each Scrapbook item has at least one representative photo, which by convention is named 'Folder.jpg'.
Contextual data
Additional information or metadata about a Scrapbook item that goes beyond the title, category, date, location, and assets fields. Contextual data goes into the note fields, which are called body and bodyObj. There may be several subfields defined under bodyObj depending on the item category.
Scrapbook data or collection
Refers generically to all the data managed by Scrapbook, both in the document-oriented NoSQL database and in Scrapbook assets storage.
Physical object
One or more physical entities, usually paper, but really might be anything we're interested in capturing. A physical object becomes a Scrapbook item through photography, scanning, and describing or capturing in whatever means possible relevant contextual data about the object. Thus a physical object becomes a Scrapbook item or asset.
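To make the definitions above more concrete, here is a minimal sketch in Python of what a Scrapbook item might look like and how its minimum fields could be checked. It is illustrative only and is not the Scrapbook code itself. The field names title, category, date, location, assets, body, and bodyObj come from the definitions above; the sample values, the REQUIRED_FIELDS tuple, the missing_fields helper, and the choice to treat body as the required notes field are assumptions made for this example.

```python
import json

# Top-level fields assumed to be present on every Scrapbook item.
# "body" stands in for the notes field; "bodyObj" is treated as optional here.
REQUIRED_FIELDS = ("title", "category", "date", "location", "assets", "body")


def missing_fields(item):
    """Return the required field names that are absent from an item."""
    return [name for name in REQUIRED_FIELDS if name not in item]


# A made-up item for illustration only.
example_item = {
    "title": "Example concert",
    "category": "Concert",
    "date": "2017-11-04",
    "location": "Bergamo, Italy",
    "assets": ["Folder.jpg", "program.pdf"],  # Folder.jpg is the representative photo
    "body": "Free-form notes about the evening.",
    "bodyObj": {},  # category-specific subfields would go here
}

if __name__ == "__main__":
    print(json.dumps(example_item, indent=2))
    print("Missing fields:", missing_fields(example_item))
```

A check like this could run before an item is saved, so that an entry without, say, a date or a location is caught at data entry time.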

Left: A Scrapbook item represented in JSON. Right: Recent Concert category items displayed in a browser.

What Scrapbook Isn't

A photo or large document archive

Better tools to do this might be OneDrive, Google Drive, or Dropbox. While Scrapbook does save photos and documents, the emphasis is on saving only those key assets with associated context that relate the story of the Scrapbook item. For example, for a dinner with friends, we might save a group photo, a photo or two of what we ate or drank, or something telling about the evening. Along with the photo, we'll describe who was there, interesting conversation topics, and other bits of information about the event that we may refer to in the future.

A journaling or diary platform

Tools that already do this are Day One, RedNotebook, OneNote, and Evernote to name a few. While Scrapbook deals well with large quantities of textual data, it's best when the context of a Scrapbook item is captured as concisely as possible. A Scrapbook entry about a person, for example, contains only essential contextual information: name, where and how we met, context of the relationship, and perhaps a few descriptive details. If by chance we have a biography or resume of the person, that becomes an accompanying Scrapbook asset.

A contact management tool

Better tools to do this are Outlook, Gmail, and many more. Scrapbook saves bits of information about people that tell a story about who they are and what they mean to us. Sometimes, we include information we wouldn't want to include in a traditional contact management system such as favorite color, ex-spouses, despised or loved foods, things we've done together, or perhaps health events.

Redefining Digital Scrapbooking

Digital scrapbooking as defined by a number of sites today, Smilebox being one example, is a process whereby you combine photos and text to create a one-time presentation as output. You upload photos, add text, and select a theme. The process returns a "digital scrapbook" for sharing, downloading, or printing. The result is not searchable.

We consider this a weak definition of digital scrapbooking. For us, digital scrapbooking, and Scrapbook specifically, is more than just the output of a singular presentation. Digital scrapbooking is:

  • Focused on the collection
    • A digital scrapbook is the collection of items or digitized assets themselves.
    • You can select a subset or all items and direct them to an output (video, print, etc.), but how those items are consumed is not what constitutes a digital scrapbook.
    • The collection of items is searchable.
  • More than photos
    • Any kind of printed material or object that can be scanned or photographed can be included, as can a file of any type or format.
    • A physical object, once digitized and recorded as a Scrapbook asset can often be discarded.
      • For example, we have Scrapbook items which were "once" physical objects such as awards, memorial plaques, or trinkets. And while it may sound perverse to save only digital representations of such objects, we can tell you that it is extremely liberating to let go of physical stuff this way and retain a compelling representation of it in Scrapbook for later.
    • The metadata / context of the digitized object is important. It provides the where, when, what, and why of something, which breathes the life into these digitized assets.
      • It's frustrating for us now to stumble upon a Scrapbook asset from our "early" days that doesn't have any contextual metadata. It's frustrating, because the value of having saved the item is greatly diminished without context.
  • A shift in perspective and habit
    • Storing assets digitally and sticking with it can be a significant commitment. It means that instead of shoving something you want to save into a folder or drawer, you take a moment or two instead to scan or photograph it, then to make an entry in the Scrapbook system. And perhaps then throw the original away, giving up the comfort and security of its physical presence.
    • Today, more and more Scrapbook items arrive digitized anyway. We find now that our focus is less on the input task (scanning/photographing), and more on managing, curating, and researching ways to use and query Scrapbook.
    • Discipline is key. You should think about what your future self, 30+ years from now, might be looking for in Scrapbook. More on this idea is discussed in the What We've Learned section.

What's in a Name?

We chose the name Scrapbook because it evokes the physical scrapbooks we were maintaining up to the time we went digital. The name is useful if a little unfortunate in that it conjures an image of scraps haphazardly pasted into a big binder. To avoid that association, we may change the name to MyJournal or MyDiary. With the Cortana digital assistant (explained below), we already reference Scrapbook as "My Journal" because it's easier to pronounce and easier for Cortana to recognize.

Scrapbook History and Milestones

Our Scrapbook has been over 15 years in the making. Its evolution can best be described as sporadic with significant inflection points described below. We tend to create a version of Scrapbook that we use for a long period of time before thinking about and implementing the next version.

  • 2003: Origins - Scrapbook v1
    • Transition from physical scrapbooking to digital scrapbooking.
    • Establish basic data schema, understand which data makes sense to collect and over which categories.
    • Each Scrapbook item has a unique number generated sequentially.
    • Design choices which crystallized at this time include naming asset folders YYYY-MM-DD, organized in a "normal" folder hierarchy for compatibility and traversability within any typical file system.
    • Assets associated with each Scrapbook item are stored on a physical hard drive.
    • The service is hosted on a local server, at this point under a desk at home.
    • Metadata is maintained in an XML flat file. This approach is not elegant, but useful.
    • The interface to the Scrapbook collection is via ASP.NET pages.
    • There is support for only single-user edit access because the XML flat file is cached in local web server (IIS) memory (yikes). Multiple users viewing is supported.
    • Support for basic CRUD operations on the XML flat file database.
    • Data entry is cumbersome.
  • 2009/2010: Ajax - Scrapbook v2
    • Size: 3,000 Scrapbook items in over 30 categories.
    • Continuing to add to the collection. By now almost all of our old physical records, printed material that *was* in folders and cabinets has been scanned and entered into Scrapbook.
    • Still lacking interesting query methods.
    • Focus on a new front end using JavaScript, jQuery, ASP.NET, C#, LINQ, and webservices to deal with cumbersome data entry.
    • Still based on the same basic design in terms of single-user access and an IIS in-memory database with CRUD.
    • Experiment with different ways to access data from mobile, different layout paradigms; main use continues to be through browser.
    • Lack of context and connections between entries.
  • 2012: Cloud - Scrapbook v3
    • Size:  4000 items.
    • The XML file "database": 6 MB.
    • Hoist the Scrapbook application into the cloud. Host website/front end in Azure cloud, not on-premises.
    • Host assets in Azure blob storage, syncing from blob to local hard drive to maintain "visibility" of the data in a local file system.
    • Refine the look and feel, but otherwise still using the same technology stack: XML flat file in web server memory, jQuery, ASP.NET, C#, LINQ, and webservices.
    • Using CloudBerry tool to synchronize assets between blob storage and a local file system.
  • 2015: OneDrive - Scrapbook v3.1
    • Size: 5000 items.
    • OneDrive is now mature enough such that Scrapbook asset storage (Azure blob storage) is synced to OneDrive instead of a local file system.
    • Devices (computer, phone) can now sync to OneDrive to provide local views of Scrapbook assets.
  • 2017: MVC, Document DB - Scrapbook v4
    • Size: 5700 items over 25 categories.
    • Scrapbook assets (photos, documents, etc.): 30 GB
    • Replace complicated set of ASP.NET pages and web services with MVC 5 design.
      • Started with the ASP.NET MVC ToDo List and customized the heck out of it. Here are the instructions we started with.
      • Replaced all functionality dealing with XML with JSON.
    • Replace XML flat file in-memory solution with NoSQL solution with a cloud-hosted document data structure (Azure Cosmos Document DB)
      • Single-user requirement removed. Now, Scrapbook is fully multi-user.
      • Scalability significantly increased.
    • Data schema now JSON-based:
      • Add geocode and friendly location name fields for each Scrapbook item.
      • Track update and modified times of each Scrapbook item.
      • Implement two-level schema:
        • Level 1 (top-level) fields apply to all items, e.g., title, location, date, category.
        • Level 2 (second-level) fields depend on category of item, e.g., for a Scrapbook Book item there are author and synopsis fields while for a Scrapbook Hike item there are length, duration, and elevation fields. Made possible because each document in the datastore need not have same number of fields.
      • Assets belonging to Scrapbook items (e.g., images, documents) are tracked as part of each document.
      • Scrapbook items now have full-fledged GUIDs.
      • Reduced and reorganized categories, in particular, with level 2 fields.
    • Front end remains in cloud (Azure app service).
    • Implement Windows Live login to authenticate users and a custom solution to authorize authenticated users.
    • Backing store for assets continues to be Azure blob storage with sync to OneDrive.
    • LINQ still a critical part of query in MVC design.
    • Cosmos DB Document DB allows for many more consumers, notably, intelligence services such as the Azure bot framework, and LUIS for natural language parsing.
    • Scrapbook is accessible now via multiple channels including Skype, Cortana, and embedded web chat using natural language with rich media output including voice.
    • Primary editorial interface is via web pages (MVC).
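
The v4 move from XML records with sequential numbers to JSON documents with GUIDs can be pictured with a short sketch. This is a hedged illustration in Python, not the migration we actually ran: the XML element names, the legacyId field, and the convert_item and convert_all helpers are all hypothetical, and the record shown is made up.

```python
import json
import uuid
import xml.etree.ElementTree as ET

# Hypothetical v1-style record; the element names are invented for this sketch.
LEGACY_XML = """
<items>
  <item id="42">
    <title>Example hike</title>
    <category>Hike</category>
    <date>2012-06-17</date>
    <notes>Made-up notes for illustration.</notes>
  </item>
</items>
"""


def convert_item(element):
    """Turn one legacy <item> element into a JSON-ready dict keyed by a new GUID."""
    return {
        "id": str(uuid.uuid4()),              # new GUID replaces the sequential number as the key
        "legacyId": int(element.get("id")),   # keep the old sequential number for reference
        "title": element.findtext("title", default=""),
        "category": element.findtext("category", default=""),
        "date": element.findtext("date", default=""),
        "body": element.findtext("notes", default=""),
    }


def convert_all(xml_text):
    """Convert every <item> under the root element."""
    root = ET.fromstring(xml_text)
    return [convert_item(el) for el in root.findall("item")]


if __name__ == "__main__":
    print(json.dumps(convert_all(LEGACY_XML), indent=2))
```

Keeping the old sequential number alongside the GUID is one way to preserve any references that were made to items before the switch.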

Working with Scrapbook

We went out of our way in the What is Scrapbook? section to define Scrapbook as being about the collection of context and assets, and the design and curation of that collection. That's true. But it's also true that what makes our Scrapbook stand out is how we can now consume the data stored in it and use it in our daily lives. In this section, we'll talk about different ways we use Scrapbook data.

In the Scrapbook History and Milestones section, we summarize the development of Scrapbook, and while Scrapbook from its initial implementation has had the core functionality of a database with a web front end, it was the 2017 implementation of Scrapbook that opened up the door to many more possibilities for working with Scrapbook data. Prior to 2017, Scrapbook was an XML flat file hosted in-memory. Though it supported multiple users viewing data, it only allowed single-user data entry via a web interface. Furthermore, it was not practical to share Scrapbook data outside of the web site context unless we passed around an XML file, which would always be out of sync.

In 2017, the key change we implemented was to port Scrapbook into a NoSQL database solution and to reorganize the collection. Specifically, we use a NoSQL document store provided by Microsoft Azure Cosmos DB, although other services could work just as well. More precisely, the Scrapbook collection is a document-oriented database where each document is encoded in JSON. Each chunk of JSON (a document) represents one Scrapbook item.

Document stores such as Azure Cosmos DB are schema-less databases which enable rapid development by allowing for data models not confined to a strict schema. Each JSON document in a document store can have a different structure. In practice, we have limited the amount of variation. We currently have two types of JSON documents identified by a type field: a Scrapbook category document and a Scrapbook item document. The latter make up 99.9% of the documents. A category document exists to enumerate enforced categories, category synonyms, and the specific fields associated with each category (level 2 structure). Within Scrapbook item JSON documents, the schema depends on the category selected. We call this level 2 structure. Level 1 JSON schema includes the fields (title, location, etc.) that always apply regardless of category of the Scrapbook item.
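
The two-level structure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual Scrapbook schema: the layout of CATEGORY_DOC, the synonym lists, and the resolve_category and check_item helpers are hypothetical. The level 1 field names and the Book and Hike level 2 fields are the ones mentioned in this post, and level 2 fields are assumed to live under bodyObj.

```python
# Hypothetical category document: enumerates categories, synonyms, and level 2 fields.
CATEGORY_DOC = {
    "type": "category",
    "categories": {
        "Book": {"synonyms": ["Books", "Reading"], "fields": ["author", "synopsis"]},
        "Hike": {"synonyms": ["Hikes", "Walk"], "fields": ["length", "duration", "elevation"]},
    },
}

# Level 1 fields apply to every item regardless of category.
LEVEL1_FIELDS = ["title", "location", "date", "category"]


def resolve_category(name, category_doc):
    """Map a category name or synonym to its canonical name, or None if unknown."""
    for canonical, spec in category_doc["categories"].items():
        if name == canonical or name in spec["synonyms"]:
            return canonical
    return None


def check_item(item, category_doc):
    """Return a list of problems found in an item document (empty if none)."""
    problems = [f"missing level 1 field: {f}" for f in LEVEL1_FIELDS if f not in item]
    canonical = resolve_category(item.get("category", ""), category_doc)
    if canonical is None:
        problems.append(f"unknown category: {item.get('category')}")
        return problems
    # Level 2 fields are only allowed if the item's category declares them.
    allowed = set(category_doc["categories"][canonical]["fields"])
    extra = sorted(set(item.get("bodyObj", {})) - allowed)
    problems += [f"unexpected level 2 field: {f}" for f in extra]
    return problems
```

Because the store itself does not enforce a schema, a check along these lines has to live in the application if the category rules are to be enforced at all.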

The NoSQL approach immediately opened up Scrapbook to scalable, multi-user access, that is, multiple users performing CRUD (create, read, update, and delete) operations. Furthermore, the NoSQL approach hosted in the cloud allowed us to more easily and flexibly leverage Scrapbook data, which we discuss next.

Artificial Intelligence Tools

An important development that we are exploiting is the recent availability of a new breed of Artificial Intelligence (AI) based tools. Microsoft and others are making advanced analytics and big data processing tools available to users that were not available just a few years ago. Specifically, in the Microsoft ecosystem, we are using tools that fall under several rubrics. The names may be confusing at first glance, but they all convey the same basic message: Microsoft is trying to democratize the use of artificial intelligence and make the technology available to everyone. That's a good thing.

What this means for us is that we are able to create a personal digital assistant that is very personal, flexible, and powerful by leveraging and customizing the technologies underlying Cortana. By coupling the richness of the information we’ve amassed in Scrapbook with the intelligence services that power experiences such as Google Now, Cortana, Alexa or Siri, we’ve realized a much more interesting and impactful experience.

Here are some of the key parts of the Microsoft ecosystem we are currently using.

Bot Framework

This framework allows us to build apps that interact in a conversational way. The bot framework is consumed by multiple channels including Cortana and Skype. 

Microsoft Cognitive Services (MCS)

Bing Speech API is used for Windows applications including Cortana, a personal assistant platform we leverage for spoken, natural language queries to access Scrapbook.

Language Understanding Intelligent Service (LUIS) takes sentences sent to it by a bot channel (such as Web Chat, Cortana, or a Skype bot) and interprets them by extracting the intention they convey and the key entities that are present.

Cortana Intelligence Suite

The suite is a marketing term for the orchestration of Microsoft technologies (including the ones mentioned above) used together to create end-to-end intelligent solutions, typically involving Microsoft Azure services and Microsoft Machine Learning Studio. The potential of the suite is best understood by looking at the gallery of possible solutions, which shows common patterns and practices.

The challenge of these AI-based tools is that they are, at least currently, rarely applied for personal use. Most of the demos and examples are for solving business cases like predicting sales or demand, detecting faults or anomalies, judging customer sentiment, or creating recommendations. Trying to use them with Scrapbook, in a personal data context, is an interesting challenge that we discuss more in the Future Directions section.

Below is a list of some of the resources that we leverage in our current Scrapbook implementation. We color code each resource to easily identify it later, when we revisit example scenarios and show how each resource is used.

Web interface

User access: In a browser, navigate to Scrapbook URL, authenticate with a Windows Live account, and authorize with a custom authorization system.

Key components: ASP.NET MVC 5, Visual Studio, GitHub, Web App hosted on Azure, LINQ
From the inception of Scrapbook, a Web interface (a web browser interacting with a hosted web site) has been the primary means of access. The current ASP.NET MVC site allows for CRUD operations as well as other custom pages.

Bot Framework: Skype, Cortana, and Web Chat

User access: Use Skype, Cortana, or Web Chat as you normally would (from any device); however, the user's Windows Live account must be granted access to use the bot via any of these channels.

Key components: Microsoft Bot Framework, LUIS API, Visual Studio, GitHub, LINQ
If you had asked us a year ago to imagine talking to a bot which interacts with our data collection, we wouldn't have been able to do it. In the space of a few months and some development work, we were able to implement natural language interfaces to query information which means something to us, verbally via Cortana and via messaging (Skype or Web Chat). The power to leverage natural language unlocks our data, and that's incredible. A big part of this leap forward is the Microsoft Bot Framework, which provides an underlying dialog structure, as well as the Language Understanding Intelligent Service (LUIS) API, which facilitates language parsing. In particular, LUIS enables us to send "utterances", the things we say or write in a bot conversation, to the LUIS API and, based on how we have trained the LUIS application, receive scored intentions and key entities. Examples of intentions are find, list, describe, read, and show. Examples of entities are category, date range, and location. We see a great deal of opportunity in leveraging bots and natural language queries, as discussed in the Future Directions section.
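As a rough illustration of what "scored intentions and key entities" means in practice, the Python sketch below picks the best-scoring intention and groups entities by type. The sample payload is made up and only loosely shaped like a LUIS response; the field names, the entity type names, the 0.5 threshold, and the interpret helper are assumptions for illustration, not the bot's actual code.

```python
# Made-up payload, loosely shaped like a LUIS-style result (assumed fields).
SAMPLE_RESULT = {
    "query": "show me all concerts near Bergamo last year",
    "intents": [
        {"intent": "find", "score": 0.92},
        {"intent": "list", "score": 0.05},
    ],
    "entities": [
        {"entity": "concerts", "type": "category"},
        {"entity": "bergamo", "type": "location"},
        {"entity": "last year", "type": "dateRange"},
    ],
}


def interpret(result, min_score=0.5):
    """Pick the best-scoring intent and group entity text by entity type."""
    best = max(result["intents"], key=lambda i: i["score"], default=None)
    # Ignore the top intent when its score is too low to trust.
    intent = best["intent"] if best and best["score"] >= min_score else None
    entities = {}
    for item in result["entities"]:
        entities.setdefault(item["type"], []).append(item["entity"])
    return intent, entities
```

With the sample above, the intention "find" plus the category, location, and date range entities are what a bot would need to build a query.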

Azure Portal

User access: Navigate to the Azure portal via browser and authenticate with our Windows Live accounts, which must have either administrator privilege or, at minimum, read access to the Scrapbook Cosmos DB instance containing the Scrapbook data.

Key components: NoSQL document store, DocumentDB SQL query language

We include the Azure portal here as a "tool" because we do in fact do some data analysis and grooming there. We never use the portal for entering new Scrapbook items. It is better to use the web interface for data entry because it implements all the necessary prompts and controls to ensure the integrity of data added or edited within the collection.

In the portal, there are various ways to issue queries against the document store using a "SQL-like query language with additional syntax features to handle JSON data types"[ref]. To see it in action, check out this playground.

The Cosmos DB document store is of course accessible via an SDK or REST endpoint. Access is mediated by a URI endpoint and access key. Both the Web Chat and bot framework applications leverage the SDK.

We also access the collection via the Azure Cosmos DB Data Migration tool, which can move data in or out of Cosmos DB; via ODBC (Excel); or with the Azure Machine Learning studio as discussed below.

Azure Machine Learning Studio

User access: Navigate via a browser to the Microsoft Azure Machine Learning Studio, and authenticate using your Windows Live account. You must then provide the appropriate URI and key to access the Scrapbook Cosmos DB instance.

Scrapbook data (stored in Cosmos DB) can be imported into Azure ML Studio and analyzed to achieve various interesting objectives. ML Studio facilitates data transformations, for example flattening or exporting all of the collection into CSV format, processing text fields with various standard algorithms, or passing data into a Jupyter notebook for analysis in R or Python. We have only begun to explore the capabilities available here. Some of our ideas for future exploration include sentiment analysis (via text processing), category prediction from title (classification), image analysis (metadata extraction), and detection and potentially automatic insertion of missing data (via clustering analysis).


Excel

User access: ODBC driver with URI endpoint and access keys.

Key components: Desktop or online versions of Excel.

With a special ODBC driver, you can load data directly from Cosmos DB, or you can load a CSV file which has been output from Azure ML Studio or the Migration Tool. Either way, you have data you can work with in Excel using pivot tables, analysis tools, and quick and easy graphing capabilities. It's interesting, for example, to create histograms or tree maps of category counts for an intuitive visual representation of the item collection in Scrapbook.


Search

Let's discuss search, which, as we've said, is fundamental to the utility of Scrapbook. Scrapbook items, their associated contextual data, and assets are only useful if we can find them.

Most of our queries to Scrapbook, whether they come from our natural language bot or our MVC web application, come down in the end to a SQL query. In our C# code, we use Language Integrated Query (LINQ) to construct SQL queries in some cases; in other cases (primarily the bot interface), we construct DocumentDB SQL queries directly in code. In the following examples, the bolded items correspond to a JSON string value in the Scrapbook schema. This isn't meant to be a thorough code review of how we search, just a few select examples to illustrate how we've implemented search in our applications.

Web interface

The following code was captured using the debugger in Visual Studio (MVC 5/C#) of a query for the category Books with the word "science" in the title.

.Where(f => (((True AndAlso (f.Category.ToLower() == "books"))
AndAlso (f.Type == "scrapbookItem"))
AndAlso ((False OrElse f.Title.ToLower().Contains("science"))
OrElse f.Body.ToLower().Contains("science"))))
.OrderByDescending(c => c.DateAdded).Take(40)

This is the LINQ lambda expression generated from an expression tree. Expression trees allow us to build queries at runtime. The important thing to note here is that however we use LINQ (a direct expression or an expression tree), it is ultimately converted into the following SQL query made against Cosmos DB.

{{"query":"SELECT TOP 40 * FROM root
WHERE (((true AND (LOWER(root[\"category\"]) = \"books\"))
AND (root[\"type\"] = \"scrapbookItem\"))
AND ((false OR CONTAINS(LOWER(root[\"title\"]), \"science\"))
OR CONTAINS(LOWER(root[\"body\"]), \"science\")))
ORDER BY root[\"dateAdded\"] DESC "}}

A couple of notes:
·        "root" in the SQL query refers to the collection itself.
·        In the C# LINQ expression, some bolded items start with an uppercase letter (Title instead of title) because in code we follow the standard C# naming conventions.
·        In the C# LINQ expression, ENDPOINT, DATABASE_NAME, and COLLECTION_NAME are replaced with their "real" values.
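For readers who want to experiment with the same search outside of LINQ, here is a minimal Python sketch that builds an equivalent query as SQL text plus named parameters, a form Cosmos DB accepts for parameterized queries. The function name build_search_query and its arguments are hypothetical; only the field names (category, type, title, body, dateAdded) come from the query above.

```python
def build_search_query(category, term, limit=40):
    """Build a query spec: SQL text plus named parameters."""
    sql = (
        # TOP is inlined as an integer; the search values go in parameters.
        f"SELECT TOP {int(limit)} * FROM root "
        "WHERE LOWER(root.category) = @category "
        "AND root.type = 'scrapbookItem' "
        "AND (CONTAINS(LOWER(root.title), @term) "
        "OR CONTAINS(LOWER(root.body), @term)) "
        "ORDER BY root.dateAdded DESC"
    )
    return {
        "query": sql,
        "parameters": [
            {"name": "@category", "value": category.lower()},
            {"name": "@term", "value": term.lower()},
        ],
    }
```

Calling build_search_query("Books", "science") yields a spec whose SQL text never contains the user-supplied values, which keeps quoting problems out of the query string.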

Bot Framework

This is an example of a SQL query generated in our C# code from the natural language query "Show me all concerts we went to near Bergamo last year."

"SELECT TOP 200 c.title, c.category FROM c
WHERE c.type = \"scrapbookItem\" AND CONTAINS(LOWER(c.category), \"performances\") 
AND ST_DISTANCE(c.geoLocation, {'type': 'Point', 'coordinates':[9.66950988769531, 45.6952285766602]}) < 10000
AND c.idFolder >= \"2016-01-01\" AND c.idFolder <= \"2017-01-01\"
ORDER BY c.idFolder ASC"

In our Bot Framework code, the SQL queries are assembled piece by piece by code modules, each written to extract one kind of query parameter from a user's natural language utterance. Our code currently parses the following and translates them into DocumentDB SQL query syntax: Scrapbook category or sub-category, date and date range, geographic location or region (geo-point, geo-polygon), text string search, entity type, the number of results (or an ordinal result) to return, and the sort order.
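The piece-by-piece assembly can be sketched in a few lines of Python. This is an illustration under assumptions, not the bot's C# code: the function assemble_query and its parameter names are hypothetical, and it covers only a handful of the parameter kinds listed above. Values are interpolated directly to mirror the example query; a production version would more safely pass them as query parameters.

```python
def assemble_query(category=None, text=None, near=None, radius_m=10000,
                   date_from=None, date_to=None, top=200):
    """Assemble a SQL string from whichever parameters were parsed."""
    clauses = ['c.type = "scrapbookItem"']
    if category:
        clauses.append(f'CONTAINS(LOWER(c.category), "{category.lower()}")')
    if text:
        clauses.append(f'CONTAINS(LOWER(c.title), "{text.lower()}")')
    if near:  # (longitude, latitude), matching GeoJSON coordinate order
        lon, lat = near
        point = f"{{'type': 'Point', 'coordinates':[{lon}, {lat}]}}"
        clauses.append(f"ST_DISTANCE(c.geoLocation, {point}) < {radius_m}")
    if date_from:
        clauses.append(f'c.idFolder >= "{date_from}"')
    if date_to:
        clauses.append(f'c.idFolder <= "{date_to}"')
    return (f"SELECT TOP {top} c.title, c.category FROM c WHERE "
            + " AND ".join(clauses) + " ORDER BY c.idFolder ASC")
```

Each optional argument contributes at most one clause, so adding support for a new kind of parameter means adding one more small branch rather than rewriting the query.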

Azure Portal

This is a SQL query to find Scrapbook items in the category Correspondence that concern a postcard from Australia.

SELECT * FROM c
WHERE c.type = "scrapbookItem" AND c.category = "Correspondence"
AND CONTAINS(LOWER(c.title), "postcard") 
AND (CONTAINS(LOWER(c.location), "australia") 
OR CONTAINS(LOWER(c.title), "australia"))

In fact, all of the queries shown above can be used within the portal.

Azure Machine Learning Studio

This is a SQLite query to select only a few columns from Scrapbook and use them for analysis. This code would be inside an Apply SQL Transformation module.

SELECT title,
      DATETIME(SUBSTR(idFolder, 1, 10)) AS dt,
      SUBSTR(idFolder, 1, 4) AS year
      FROM t1;
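Because the Apply SQL Transformation module uses SQLite syntax, the same query can be tried locally with Python's built-in sqlite3 module. The sketch below assumes that idFolder values begin with an ISO date (YYYY-MM-DD), as the SUBSTR calls imply; the sample rows and the suffix after the date are made up.

```python
import sqlite3

# In-memory table standing in for the module's input dataset t1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (title TEXT, idFolder TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [("Concert in Bergamo", "2016-07-04-concert"),  # made-up rows
     ("Hike near Seattle", "2013-08-15-hike")],
)
rows = conn.execute(
    """
    SELECT title,
           DATETIME(SUBSTR(idFolder, 1, 10)) AS dt,
           SUBSTR(idFolder, 1, 4) AS year
    FROM t1
    ORDER BY dt
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The ORDER BY clause is an addition for readable output; the dt column sorts correctly as text because SQLite renders it in year-first order.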

Natural Language Queries

Speaking is natural to us, so why not use it to find out about things you care about? We can speak to Cortana or simply send a message to our bot via Skype or Web Chat, for example. For purposes of demonstration here, it's easier to capture the output from one of our bot channels, in this case our Web Chat interface. But everything you see typed in the screenshots below can be requested of Cortana.

Three examples of using a bot to access Scrapbook using natural language queries.

Here are a few common interactions with Scrapbook and the corresponding phrases we use to invoke them:

  • "What can I do?", "Help"
  • "Reset"
  • "Show me the category list", "List categories"
  • "Describe category Books"
  • "Show me the last 3 concerts I went to"
  • "How many hikes did we do in 2013?"
  • "Show assets", "Show images for item 2"
  • "Tell me about the last hike we did within 50 km of Seattle"
  • "Show me everything we did in August"
  • "Show me Correspondence of type postcard from Germany with Karl in the title in 2014"

Drill Down/Change Parameters:
  • "Show me people from Seattle" followed by "what about in Bergamo"
  • "How many hikes did we do in 2013?" followed by "next page" or "page 4"

  • "Read it to me", "Read me the second one"

All of these natural language queries are made possible by the Microsoft Language Understanding Intelligent Service (LUIS) API. We use a few built-in entities in LUIS such as number, ordinal, and date-range, but most of the entities we defined and trained ourselves. One thing we found useful in working with natural language queries was to create a list of synonyms for each category. For example, queries about Scrapbook items in our category Hikes can be asked using the synonyms "walks", "treks", "trekking", "hiking", and "backpacking". These are all resolved in our bot code, via a category list look-up, to a query in category 'Hikes'.
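The synonym look-up can be illustrated with a small Python sketch. The Hikes synonyms are the ones listed above, and "concerts" mapping to Performances follows the earlier Bergamo query; the "shows" synonym, the names CATEGORY_SYNONYMS, LOOKUP, and resolve_category, and the dictionary layout are assumptions for illustration.

```python
CATEGORY_SYNONYMS = {
    "Hikes": ["walks", "treks", "trekking", "hiking", "backpacking"],
    "Performances": ["concerts", "shows"],  # "shows" is a hypothetical entry
}

# Invert to a lowercase lookup table: synonym or category name -> category.
LOOKUP = {}
for category, synonyms in CATEGORY_SYNONYMS.items():
    LOOKUP[category.lower()] = category
    for word in synonyms:
        LOOKUP[word.lower()] = category


def resolve_category(term):
    """Map a word from an utterance to its category, or None if unknown."""
    return LOOKUP.get(term.strip().lower())
```

Building the inverted table once keeps each look-up to a single dictionary access, and returning None for unknown words lets the caller fall back to a plain text search.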

Scenarios Revisited

In the Introduction and Motivation section, we listed scenarios that guided the development of Scrapbook. In this section, we revisit those scenarios to illustrate how they are addressed in the various ways we consume information via our Scrapbook solution. All communications to and from Scrapbook and supporting services are encrypted, using an authorized, authenticated account.

Let's take a look at the scenarios we discussed before and how we address them with Scrapbook.

Scenario 1: We are about to walk into a friend's house. What are the names of her three children?
  • Search over category = People with title containing friend's name.
  • Skype dialog:
    • "Show me people with Christina in the title"
    • "details"

Scenario 2: We remember a wonderful hot chocolate we had in Cuneo. What was the name of the café? What was the name of any place we ate at in Cuneo?
  • Search over category = Restaurant and location = Cuneo.
  • Skype dialog:
    • "Show me all restaurants in Cuneo"
    • "Details of 3"

Left: Screenshot for Scenario 1 – Skype bot people search. Right: Screenshot for Scenario 2 - restaurant search.

Scenario 3: We are planning to have a friend over for lunch and we want to prepare a dish we haven’t served her before. We obviously want it to be something she’ll enjoy, so we review meals we’ve had with her in the past.
  • Using the Scrapbook Web interface, search over category = Events with title containing friend's name. OR, just title containing friend's name and "lunch".

Left: Scenario 3 - Using the web interface to find an event. Right: Scenario 3 -View of recent dinner photos.

Scenario 4:  We are planning a dinner with friends and we are looking for an interesting red wine to serve. We remember that we drank some nice dolcettos from Piedmont last year. What were they?
  • Using the Scrapbook Web interface, search over category = Wine with title containing "dolcetto".

Left: Scenario 4 - Using the web interface to locate a type of wine. Right: Scenario 4 - View of wine labels for type dolcetto.

Scenario 5: We are about to call to make an appointment for a haircut. What are the names of the people who work at the salon, so that we're able to address whoever answers the phone by name?
  • Search over category = People with title containing salon's name.
  • Web Chat dialog:
    • "Show me all people with Giacomo in the title"
    • "Details 1"

Scenario 6: We are wondering about an upcoming wedding we have been invited to and we want to pull up the invitation to confirm the date and protocol for gifts.
  • Search over category = Correspondence with date = this year and title containing "invitation".
  • Web Chat dialog:
    • "Show me all correspondence this year with invitation in the title"
    • "Details"

Left: Scenario 5 - Web Chat with bot searching over the people category. Right: Scenario 6 - Web Chat with bot searching over correspondence category.

Scenario 7: We are writing an email to a friend and we want to recommend the last hike we did in Piedmont. What were the details of that hike?
  • Search over category = Hikes with date range set to include the past few months.
  • Cortana [Desktop] query dialog
    • "Ask My Journal to tell me about the last hike we did in Piedmont."
  • We use "My Journal" with Cortana because we found it to be easier to understand than "Scrapbook".

Scenario 8: We remember reading an interesting book in 2010 that had "science" in the title. What was it?
  • Search over category = Books with date set to 2010.
  • Cortana [Mobile/iPhone] query dialog:
    • "Ask My Journal about books we read in 2010 with science in the title."
    • "Give me details for the first one."

Left: Scenario 7 - Windows 10 Desktop Cortana searching Scrapbook for hikes. Right: Scenario 8 - Cortana on iPhone searching Scrapbook for books.

Scenario 9: We want to know if we've ever received a postcard from Germany. 
  • Search over category = Correspondence with title or location containing Germany.
  • In the Azure portal go to Query Explorer and use the following:

SELECT * FROM c WHERE c.type = "scrapbookItem"
AND c.category = "Correspondence" AND CONTAINS(LOWER(c.title), "postcard")
AND (CONTAINS(LOWER(c.location), "germany") OR CONTAINS(LOWER(c.title), "germany"))

Scenario 9 - Using the Azure Portal over Cosmos DB to execute queries and find Scrapbook items.

Scenario 10: We want to know the average cocoa fat percentage of chocolate we tend to buy and what, if any, is the correlation between percentage and how we rate it.
  • In Azure Machine Learning studio, import our Cosmos DB Scrapbook data and search over category = Chocolate.
  • View histogram of percentage and check out a scatter plot of percentage against rating.
  • Query for selecting data:

SELECT c.title, c.bodyObj.percentage, c.bodyObj.rating
FROM travelmarx AS c
WHERE c.type = "scrapbookItem" AND c.category = "Chocolate"

Scenario 10 - Using Scrapbook data in Azure Machine Learning Studio to analyze chocolate habits!
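The two numbers Scenario 10 asks for, the average percentage and the correlation between percentage and rating, can also be computed by hand. The Python sketch below uses made-up (percentage, rating) pairs standing in for the query results, so the values it produces say nothing about our actual chocolate habits; the helper names are hypothetical.

```python
from math import sqrt

# Made-up (percentage, rating) pairs standing in for query results.
SAMPLES = [(55, 2), (64, 3), (70, 4), (72, 4), (85, 5)]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


average_percentage = mean([p for p, _ in SAMPLES])
correlation = pearson(SAMPLES)
print(average_percentage, round(correlation, 3))
```

A coefficient near 1 would suggest that higher percentages go with higher ratings, near -1 the opposite, and near 0 no linear relationship; a scatter plot remains the better first look.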

Scenario 11: We want to understand which categories in Scrapbook have the most items as well as how many items there are per year over the last 15 years.
  • In Excel, import Cosmos DB Scrapbook data (using ODBC driver), create a pivot table, then a tree map of the data, and finally a histogram of items per year.

Left: Scrapbook data analyzed in Excel and visualized as a treemap chart. Largest category is Correspondence. Right: Scrapbook category counts per year analyzed in Excel.

We believe that the four primary objectives we wanted to address with Scrapbook (managing archival data, providing context around data, making our data searchable, and owning our data) are well illustrated in these scenario examples.

We acknowledge that it’s been a lot of work to reach this point – many hours of data entry, curation, planning and programming to assemble the machinery around Scrapbook. But we’ve achieved the objectives we set out to satisfy, and now have a personal information platform that works very well and is positioned to evolve and adapt as our needs and available technologies change over time.

What We've Learned

We've learned a few things in 15 or so years of curating Scrapbook that we'll share here. Even the guiding principles around Scrapbook took years to come into focus.

Scrapbook guiding principles:

We own our data.

We use different services and platforms to "make" Scrapbook, but we control to what ends our data is used.

Data should always have context.

A Scrapbook item without context is greatly diminished. We always include something descriptive about a Scrapbook entry. Did we like it? Who was there? Impressions? Names of people or things? Anything can provide context as long as it conveys why the item even exists in Scrapbook. We've found that visual context is of particular importance, so there is always at least one image associated with each Scrapbook item.

Scrapbook assets should be accessible via a file system metaphor.

What you say!? This may seem a bit old school, but we feel strongly that even with storage technology advances (be it on-premises, cloud, or hybrid), maintaining an intuitive organization of Scrapbook assets via a hierarchical file system metaphor is important. ‘Human readability’ is important because technology shifts are typically disruptive, and while we hope that we’ll always have the ability to migrate gracefully from one Scrapbook implementation to the next, there’s always the possibility that a service or platform we rely on fails in such a way that a ‘manual’ intervention is required. Scrapbook context is currently maintained in a NoSQL database, separate from the assets. But those records, being JSON, are reasonably human readable. Furthermore, each asset reference in a record follows our ‘file system’ organization, and asset names remain descriptive rather than just a GUID. So even in the event of a complete system failure, we would retain the ability to access our information and to reconstruct Scrapbook, if only from the raw data in backups.

Portability: avoid technology lock-in.

Leveraging the latest and greatest technology stacks is fine, but we always have an idea of what it would take to change.  Taking a longer term view on the order of decades, we expect that Scrapbook will evolve and shift between technology stacks and providers.

Digital scrapbooking is a deliberate choice.

It's one thing to collect stuff, assets in our vernacular, casually or by circumstance, and it's another thing to organize them such that they're discoverable. It takes deliberate effort, sometimes a lot of effort, to consistently capture and categorize the key moments, objects, and activities occurring day to day. We've had to learn to be okay with letting the "physical" go. Old letters, postcards, and mementos can be appreciated as much in their digital forms. This process is already much easier now than when we began, and we anticipate it becoming much easier still in the not very distant future.

Following are some additional observations:

  • Garbage in, garbage out. The quality of the data going into Scrapbook is fundamental to the quality and utility of its output.
  • Data curation is an ongoing process. We consider Scrapbook entities to be living and changing. If we have new information, context, or assets for an existing Scrapbook item, we'll update it. We aren't afraid to prune or combine data as needed. To periodically review and curate older data is a valuable exercise because it provides an opportunity to normalize certain fields or information that we perhaps treat differently now; or in fact, may inform the implementation of a new feature or capability. And to revisit existing information is valuable in itself.
  • Data tends to arrive in batches. We aren't as consistent as we would like to be so there are sometimes time-wise gaps in our data entry. That turns out to be okay, because over time, "clumps" even out. It's important though to be consistent in the long run.
  • The value of a Scrapbook entry is directly proportional to the contextual information associated with it. Date, location, category, and title fields are a good start for a Scrapbook entry. But it's more than that. It's the why, how, who, and other details depending on category that help flesh out an entry. It's frustrating to look up an item from 10 years ago to find only a terse sentence describing it. It's easiest to create entries when the events or activities are still fresh in your mind, and any details you'd want to include are readily available.
  • We maintain some statistics on our Scrapbook collection. For example, how many items do we tend to enter per year and in which categories, and how many Scrapbook items are missing data in particular fields? We use this information to refine how the collection is organized and to improve the scrapbook process.
  • It helps to be consistent when creating entries. Doing so greatly improves the future usefulness of the data, and later operations on the data (e.g., programmatically or manually) will be much easier. 
  • It has been important for us to work with our data for a while before we’ve understood how we wanted to structure  (or restructure) our schema. Yes, many pitfalls are avoided with good planning and careful design up front, but not all scenarios are easily foreseeable. We continue to identify opportunities to improve our data structures, and correct as we go. Fortunately, our current architecture permits us to do this relatively easily.  We always expect that at some point in the future, we’ll migrate to yet another platform which will suggest entirely new normalizations of our data.
  • We take a scenario-based approach when designing Scrapbook functionality. For example, our Scrapbook has information about books we've read, so we consider the type of book-related questions that a user might ask like 1) the title of the last book read, 2) the subject of the last book read, 3) the number of books read during a given time period, 4) favorite books read, or 5) books about a certain subject. With these questions in mind, we can design for the information we need to collect to support a book Scrapbook entry.
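The collection statistics mentioned above (items per year, items per category, and items missing data in particular fields) are simple to compute once the items are exported. Here is a minimal Python sketch; the sample items, the choice of required fields, and the summarize helper are hypothetical, and idFolder is assumed to begin with a four-digit year.

```python
from collections import Counter

# Hypothetical items; idFolder is assumed to start with an ISO date.
ITEMS = [
    {"category": "Books", "idFolder": "2010-03-02-book",
     "location": "Seattle, WA"},
    {"category": "Hikes", "idFolder": "2013-08-15-hike", "location": ""},
    {"category": "Hikes", "idFolder": "2013-09-01-hike"},
]


def summarize(items, required=("category", "idFolder", "location")):
    """Count items per year and per category, and missing required fields."""
    per_year = Counter(
        item["idFolder"][:4] for item in items if item.get("idFolder")
    )
    per_category = Counter(item.get("category", "(none)") for item in items)
    # A field counts as missing when it is absent or empty.
    missing = Counter(
        field for item in items for field in required if not item.get(field)
    )
    return per_year, per_category, missing
```

Running summarize(ITEMS) on an export of the collection gives three counters that can feed a pivot table or chart directly.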

Future Directions

The move to a NoSQL document database was a key change to Scrapbook which has made our data consumable in a number of new and interesting ways, in particular with AI-based tools, which are becoming increasingly accessible and easier to use. But we are just at the initial stages. The following four themes describe areas where we are focusing our development time.

Theme 1: Improved input mechanisms.

We are working to streamline the capture of new Scrapbook entries. The goal is to be able to create an entry on the fly, as the event is occurring or has just happened, in the same way you might share a photo or video via a messaging app. This is important because when we capture data in the moment, we tend to capture more of the essence of it and we avoid the chore of data entry later. (That's not to say that a Scrapbook entry could not be edited later.) Therefore, we are looking at tools and processes to make it as quick and easy as possible to create Scrapbook entries. We are targeting natural language queries via the Bot Framework as one way to do this. Currently, we use natural language queries to request information from Scrapbook using interfaces such as Cortana or Skype. But there's no reason we can't extend our natural language bot interfaces, along with AI resources on the back-end, to create and populate initial metadata for new Scrapbook entries. For example, using the Computer Vision API, we might extract location, date, faces, and even subjects and themes from a photo shared to the Scrapbook bot via Skype. The notes for the Scrapbook item could be dictated and converted to text using a service such as the Bing Speech API.

Theme 2:  Leverage automated metadata sources.

Photos and documents in file systems (locally or in the cloud) are increasingly augmented with rich auto-generated metadata produced using optical character recognition (OCR) and image processing algorithms. Below are two examples from OneDrive. One example shows a photo with text that was extracted, with the OCR going as far as to recognize a website and to surface that separately. Another example shows the location data and tag data saved with a photo. Our Scrapbook system doesn't yet attempt to interpret any of this auto-generated metadata; it still must be entered manually. It's an open question as to how much of this we should include in Scrapbook. We are investigating Cognitive Vision Services, including the Computer Vision API and the Face API, to directly tag and extract descriptive information from photos.

Theme 3: Do more with geocodes.
Example of Auto-Generated Metadata - Location and Image Tagging

We only recently started saving friendly location names with each Scrapbook item. Friendly names (e.g., Seattle, WA) are geocoded to obtain longitude and latitude. The location data has proven extremely useful for simple questions such as "show me all postcards from France" or "show me all museums we went to in Rome". We've found that location data is so important that we're updating older Scrapbook items with location.

Besides location sensitive queries, we are looking at ways to search for and view data via a map interface.
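Once items carry longitude and latitude, "within N km of a place" questions reduce to a distance test. The Python sketch below uses the haversine great-circle formula as an approximation; it is not necessarily how the database's ST_DISTANCE function computes distance. The function names and sample data are hypothetical, and points are assumed to be GeoJSON with coordinates in [longitude, latitude] order, as in the earlier Bergamo query.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def items_within(items, lon, lat, radius_m):
    """Keep items whose GeoJSON point lies within radius_m of (lon, lat)."""
    return [
        item for item in items
        if haversine_m(*item["geoLocation"]["coordinates"], lon, lat)
        <= radius_m
    ]
```

The same distance function could also back a map view, for example by sorting items by distance from wherever the map is centered.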

Example of Auto-Generated Metadata - Extracted Text

Theme 4: Apply machine learning to Scrapbook.

We've barely begun to scratch the surface in applying AI-based tools and services to interact with Scrapbook. It's also exciting to think about the possibilities of using Scrapbook data in machine learning scenarios. We have a lot more work to do here, as even the scenarios in which we'd use machine learning are not yet obvious.

Machine learning at its simplest is about transforming data into intelligent action. As described in the Introduction to Machine Learning page: "[m]achine learning is a data science technique that allows computers to use existing data to forecast future behaviors, outcomes, and trends." The steps of turning data into action can be represented by this sequence of questions:

Descriptive: What happened?
·        With Scrapbook, we currently have a good handle on this question. We have categories and we have context to describe "what".

Diagnostic: Why did it happen?
·        With Scrapbook (personal data), this is not an obviously relevant question. Scrapbook items are either something we wanted to happen, something we made happen, or something which happened to us. Perhaps we can imagine a field in the JSON schema for Scrapbook items holding 1 for a planned or expected item and 0 for an unplanned or unexpected one. The 'why' then becomes more interesting.

Predictive: What will happen?
·        This is the holy grail of machine learning. In the context of Scrapbook, it's asking if we can make predictions based on our existing data. But predictions of what? We currently don't have Scrapbook data fields (called features in machine learning) that are easily "predictable", like, say, the classic example of predicting an automobile's price given features like the size of engine, number of doors, and average mpg. Could we use Scrapbook to predict the next book we'll read? Or the friend we'll meet for lunch?

Prescriptive: What should I do?
·        Its meaning for Scrapbook is even more abstract than the predictive analytic question. What could we reasonably expect to be able to control by analyzing existing Scrapbook data?

Over 10 years ago, we started thinking about how we could put Scrapbook to good use for something relevant to us, rather than selling us ads or news. Today, we have access to the tools to at least start this analysis, even as we still have a lot to figure out in terms of what questions we need to be asking and what data we need to be collecting to answer these questions. In Scrapbook's present form, we have started to investigate some basic machine learning areas such as outlier detection (items in Scrapbook that are not categorized correctly) or sentiment analysis (analyzing our notes fields to determine sentiment, positive or negative). 

To answer the predictive and prescriptive types of machine learning questions, we are going to have to rethink how we are collecting data and will likely need to collect new types of data. Imagine, for example, that each Scrapbook item (book, film, person, event, etc.) has a utility ranking in terms of how important that item is to us. Then, it's not hard to imagine that utility values could be predicted, and from that, models made to help us make decisions about which items give us the most utility, therefore giving us prescriptive guidance. It would be doing with data and machine learning what we already do intuitively when choosing how to spend our time.