Pluto's Little Brother
A scientific priority controversy and why anonymity isn't good enough
What I learned from astronomer Mike Brown’s book How I Killed Pluto and Why It Had It Coming is that searching for new planets is, on the whole, very tedious. Brown and his colleagues at Caltech spent years photographing every inch of the night sky and then sitting in front of their computer screens squinting at pairs of images, trying to determine whether what they were looking at was in fact the first new planet to be discovered in 50 years or a scratch on the photographic plate (spoiler: it was mostly scratches).
All that painstaking scouring of the outer reaches of the solar system did ultimately pay off. The Caltech team found several large objects in the Kuiper belt, culminating in the discovery of the controversy-igniting, Pluto-sized dwarf planet Eris. Eris forced a conversation in the astronomical community about what could be considered a planet, which ended in our beloved Pluto being relegated to a lowly dwarf planet (Brown points out asteroids Ceres and Pallas used to be considered planets and everyone has pretty well managed to get over that).
Eris was the big news story that rewrote elementary school textbooks, but in my view the weirdest plot twist actually came earlier in the book. Shortly before Brown’s team discovered Eris, they discovered another large rocky object orbiting far past Neptune. The team decided to keep the discovery under wraps until they had a scientific paper ready for publication, spending months carefully observing and documenting the properties of the object. The object, nicknamed Santa, had two small moons and was about a third the size of Pluto. In the gif below Santa, formally named Haumea, is the pink circle and Neptune is the blue circle. (credit to the Wikimedia Commons).
As they neared publication, the Caltech team posted an abstract for their report on a website for an upcoming conference, using the identifier K40506A to refer to the object. About a week later, in what at first seemed like an extraordinary coincidence, Brown and his team got scooped. Dr. José Luis Ortiz Moreno and his team at a small Spanish university announced the discovery of the very same Kuiper belt object Brown had been studying for six months, referring to it as the “tenth planet” in a press release.
Brown initially set his suspicions aside and congratulated Ortiz. But the story didn’t sit well with him, and Brown soon discovered that googling K40506A would bring up database records from a telescope in Chile that his team had used, showing the exact coordinates the telescope had been pointed at and, by extension, the location of Santa. When he asked the database administrator to look at the IP addresses of users who had accessed the database, one address turned out to originate from Ortiz’s institute, days before the Spanish team’s announcement.
Ortiz admitted accessing the logs, but maintained that he had already discovered the object independently. After a long period of deliberation the International Astronomical Union made an awkward compromise: Brown’s team was allowed to name the object, but the space reserved for the name of its discoverer was left blank.
By post-Equifax standards, the data breach described in Brown’s book (accessing the telescope data) seems almost quaint. Still, I would argue that there are a couple of security angles worth exploring in a broader context.
The Spanish researchers were able to determine that K40506A was an object of note because it was mentioned in two places: in the telescope database and in the conference talk title. Either one on its own would be meaningless, but taken together they gave a clear indication that pointing a telescope at the coordinates in the database would yield something interesting.
This strategy of combining two data sources to identify records intended to be anonymous, called ‘re-identification’, is a major concern in the world of data privacy. Typically there isn’t a unique identifier like K40506A on which to join the data, but there might be a unique combination of attributes, for instance birth date, gender, and zip code.
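To make the idea concrete, here is a minimal Python sketch of that kind of join. Everything in it is made up for illustration: the records, the names, and the helper functions `quasi_key` and `reidentify` are hypothetical, and a real linkage attack would have to cope with typos, missing values, and much larger tables.

```python
# Toy data only: every name, date, and diagnosis below is invented.
# The "anonymized" table has no names, but it keeps three quasi-identifiers.
hospital_visits = [
    {"birth_date": "1961-03-14", "gender": "F", "zip": "02139", "diagnosis": "asthma"},
    {"birth_date": "1975-11-02", "gender": "M", "zip": "02139", "diagnosis": "fracture"},
    {"birth_date": "1975-11-02", "gender": "M", "zip": "02141", "diagnosis": "migraine"},
]

# A public table that carries names alongside the same three attributes.
voter_roll = [
    {"name": "A. Example", "birth_date": "1961-03-14", "gender": "F", "zip": "02139"},
    {"name": "B. Sample", "birth_date": "1975-11-02", "gender": "M", "zip": "02141"},
    {"name": "C. Person", "birth_date": "1980-01-20", "gender": "F", "zip": "02138"},
]

QUASI_IDENTIFIERS = ("birth_date", "gender", "zip")


def quasi_key(record):
    """Return the combination of attributes used as a stand-in for an ID."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymized, public):
    """Pair each anonymized row with a name when exactly one person matches."""
    names_by_key = {}
    for person in public:
        names_by_key.setdefault(quasi_key(person), []).append(person["name"])

    matches = []
    for row in anonymized:
        names = names_by_key.get(quasi_key(row), [])
        if len(names) == 1:  # keep only unambiguous matches
            matches.append((names[0], row["diagnosis"]))
    return matches


for name, diagnosis in reidentify(hospital_visits, voter_roll):
    print(name, diagnosis)
```

Neither table is sensitive on its own in this toy setup: one has diagnoses without names, the other has names without diagnoses. The join is what does the damage.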
In a highly publicized incident in the 1990s, researcher Latanya Sweeney was able to identify Massachusetts Governor Bill Weld in an anonymized dataset of hospital visit records using publicly available voter registration data for the city of Cambridge. In 2006 Netflix ran into a similar problem when it published a large dump of anonymized user movie ratings as sample data for the Netflix Prize challenge. Two researchers at the University of Texas at Austin showed that users in the dataset could be re-identified by cross-referencing IMDb, because people had rated the same combinations of movies in similar ways on both Netflix and IMDb.
Just last year a study published in Nature Communications found that more than 99% of Americans could be re-identified from 15 demographic attributes. The authors conclude that, as a result, it would be very difficult for any anonymized dataset to be released without conflicting with the high standards set for privacy by Europe’s General Data Protection Regulation.
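A rough way to build intuition for why a handful of attributes is enough: count how many records are the only one with their particular combination of values, and watch that share climb as attributes are added. The Python sketch below does this on six invented records. It is a plain uniqueness count, not the statistical model used in the study, and the `unique_fraction` helper is a hypothetical name.

```python
from collections import Counter

# Six invented people; no real data is used here.
people = [
    {"zip": "02139", "gender": "F", "birth_year": 1961, "birth_month": 3},
    {"zip": "02139", "gender": "F", "birth_year": 1961, "birth_month": 7},
    {"zip": "02139", "gender": "M", "birth_year": 1975, "birth_month": 11},
    {"zip": "02139", "gender": "M", "birth_year": 1980, "birth_month": 11},
    {"zip": "02141", "gender": "F", "birth_year": 1961, "birth_month": 3},
    {"zip": "02141", "gender": "M", "birth_year": 1975, "birth_month": 5},
]


def unique_fraction(records, attributes):
    """Share of records whose combination of `attributes` appears only once."""
    combos = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(records)


attributes = ["zip", "gender", "birth_year", "birth_month"]
for n in range(1, len(attributes) + 1):
    subset = attributes[:n]
    print(subset, round(unique_fraction(people, subset), 2))
```

With zip code alone nobody in this toy table stands out, but by the time all four attributes are combined every record is one of a kind.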
Anonymous datasets provide rich material for research in public health and AI, among other fields. But at the moment it feels like a definitive standard for de-identifying personal data is a moving target and we may have some difficult compromises ahead of us.
In a separate vein, there’s one more point that I would be remiss as a web developer not to bring up. Asking Google’s crawlers not to index a web page is as simple as adding a robots meta tag with the value noindex to your HTML. It’s a good idea if there’s even the slightest possibility that your data could be sensitive. Keep in mind that the tag only keeps the page out of search results; anyone who has the URL can still load it. Even better, just put it behind a password.
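For reference, the tag in question is `<meta name="robots" content="noindex">`, placed in the page's head. The short Python sketch below, using only the standard library's `html.parser`, checks whether a page carries it. The sample pages and the `RobotsMetaParser` and `is_noindex` names are made up for illustration; this is a quick sanity check, not a substitute for access control.

```python
from html.parser import HTMLParser

# Two invented pages: one asks crawlers not to index it, one does not.
HIDDEN_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <meta name="robots" content="noindex">
    <title>Observation log</title>
  </head>
  <body>Telescope pointing records</body>
</html>"""

PUBLIC_PAGE = """<!DOCTYPE html>
<html>
  <head><title>Observation log</title></head>
  <body>Telescope pointing records</body>
</html>"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def is_noindex(html):
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


print(is_noindex(HIDDEN_PAGE))
print(is_noindex(PUBLIC_PAGE))
```

A check like this could run as part of a deploy script for pages that were never meant to be public, though a login wall remains the safer default.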