A few days ago, CERN launched an Open Data Portal to publicly share data from the Large Hadron Collider. We are some of the scientists behind this project, working to make science more open globally. Ask Us (Almost) Anything about open data, open ...
Hi reddit!
We unveiled the CERN Open Data Portal to the world recently, releasing samples for education from all the main LHC experiments and around 27 TB of high-level and analysable LHC data from the CMS Experiment.
Following CERN’s last AMA, we’re thrilled to be here today to talk to you not only about open science but also our Open Data Portal, #cernopendata and the tools you can build on top of our data. We are:
- From CERN Information Technology:
- Tim Smith, Head of Collaboration and Information Services (tjs)
- Jamie Shiers, Project leader, Data and Knowledge Preservation in High-Energy Physics (js)
- Tibor Simko, Technology Lead for the Open Data Portal (ts)
- From CERN Scientific Information Service:
- Salvatore Mele, Head of Open Access (sm)
- Sünje Dallmeier-Tiessen, Open Science Research Fellow (sdt)
- From the CMS Experiment:
- Kati Lassila-Perini, Physicist and Co-ordinator of the CMS Data Preservation and Open Data project (klp)
- Tom McCauley, Physicist and Developer of CMS education/outreach tools (tm)
We’ll sign our posts with our initials (see above) so you know who said what. Just to be clear, we are speaking with you in our personal capacities and CERN does not necessarily support the views expressed during the AMA. Joining us are a few of our friends from CERN:
- Kate Kahle (/u/kate_kahle), CERN social-media manager
- Achintya Rao (/u/RaoOfPhysics), CMS science communicator and Science Communication doctoral student
- Patricia Herterich (/u/PHerterich), Data librarian and Open Science doctoral student
We’ll answer your questions from 16:00 CET until 17:30 CET (UTC+01).
About the CERN Open Data Portal
The CERN Open Data portal is the access point to a growing range of data produced through the research performed at CERN. It disseminates the preserved output from various research activities, including accompanying software and documentation that is needed to understand and analyse the data being shared.
The portal adheres to established global standards in data preservation and Open Science: the products are shared under open licenses; they are issued with a digital object identifier (DOI) to make them citable objects in the scientific discourse.
About CERN
CERN is the European Laboratory for Particle Physics, located in Geneva, Switzerland. Its flagship accelerator is the Large Hadron Collider (LHC), which has four main particle detectors: ALICE, ATLAS, CMS and LHCb. Two years ago, CMS and ATLAS announced the discovery of a new particle that we now believe is a Higgs boson.
In addition to the LHC experiments, we have dedicated facilities for studying antimatter, nuclear physics and climate science. Oh, and we also have a particle detector operating on the International Space Station!
For updates, news and more, head over to our unofficial home on reddit: /r/CERN!
Other CERN projects you can join
EDIT: 17:50 CET — Ok, everyone! We're logging out now. This was fun, and we hope you enjoy all of our data over on the CERN Open Data Portal.
askCERN67 karma
Please see this article by John Ellis: http://home.web.cern.ch/about/updates/2014/11/how-standard-higgs-boson-discovered-2012 (js)
TheBigBadDog150 karma
As a sysadmin for an ATLAS Tier 2 site, the launch of the data portal makes me even prouder to be a part of CERN Science.
The hardest part for me about Open Science is making sure the software, the data and the metadata are accessible forever. Do CERN and the experiments have a timeline in mind for how long they will support the software, make the data available on the portal and make sure that any bugs etc. are fixed? Will it be until at least 2030, when the current LHC is switched off?
askCERN112 karma
Yes, this is the real challenge - knowledge capture and preservation (not just "the bits").
This is definitely something that we are working on - there are a number of approaches being used, including VM technology.
The timeline we are working on is several decades - beyond the end of the LHC data taking period, preferably up to the time of a possible "precision machine" (js)
MonocleLewinsky77 karma
My nine year old brother has his heart set on becoming a nuclear physicist, loves to talk about the Higgs boson. Of course, dreams and goals change as we grow older.
What did you all want to be when you were nine?
askCERN92 karma
What did I want to be when I was nine? A particle physicist at CERN !
(sm)
askCERN58 karma
I always wanted to be a scientist but had no idea on a specific field. I did have a fondness for astronomy though. Particle astrophysics and particle physics turned out to be close enough! (tm)
Zharofun69 karma
What is the atmosphere around arguably the biggest research facility on Earth? Workaholic or jolly?
SCHLONG_SWORD57 karma
If the government funds scientific research, why isn't that science published openly and freely? Why are so many scientific articles hidden behind paywalls that make it impossible to research something without an institution supporting you? How can we change the system for the better?
askCERN65 karma
Here at CERN we believe in Open Access, and have published openly and freely all articles from the LHC experiments in peer-reviewed journals. The (c) stays with the authors, and the articles are available under a Creative Commons license for everyone to read, re-post and re-use.
We agree with you that we can change the system for the better, and together with partners in 40 countries we have been organizing for most of the results in particle physics to be now published Open Access, without paywalls, through the SCOAP3 initiative
(sm)
acaban32 karma
Hello, first of all "thank you for your service"! (yeah that's the context that phrase should be used).
I presume you have multiple architectures you operate on for dealing with that vast amount of data, do you have any standard library to deal with float rounding/cancellation/etc.. errors in various calculations, to maybe assure tests on data are reproducible, or you treat every case/algorithm as a special case?
askCERN27 karma
For quite some time we have been using primarily the x86 architecture, with IEEE floating point. This wasn't the case in the past, when we had many highly heterogeneous architectures (different word lengths, different byte ordering, different FP operations and rounding strategies). We know that the "golden days" of x86 are over and we will again face heterogeneous architectures. A validation suite is key: as you say, tests, more tests and even more tests. Reproducibility is a big challenge, and not just in our domain. (js)
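To illustrate why reproducibility needs a validation suite even on a single IEEE 754 platform, here is a minimal sketch (not CERN code; the function name and sample values are made up for illustration). It shows that ordinary double-precision addition depends on the order of the operands, and that Python's standard `math.fsum` gives an order-independent, correctly rounded result.

```python
import math

def naive_sum(values):
    """Sum left to right using ordinary IEEE 754 double addition."""
    total = 0.0
    for v in values:
        total += v
    return total

# Hypothetical sample: two large terms that cancel, plus two small ones.
values = [1e16, 1.0, -1e16, 1.0]
reordered = [1e16, -1e16, 1.0, 1.0]

forward = naive_sum(values)      # the first 1.0 is absorbed by 1e16 and lost
shuffled = naive_sum(reordered)  # cancellation happens first, nothing is lost
exact = math.fsum(values)        # correctly rounded sum, independent of order

print(forward, shuffled, exact)
```

A regression test that compares results bit for bit across machines or compilers would flag the difference between `forward` and `shuffled`, which is the kind of check a validation suite automates.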
elektrisitet30 karma
What are some of the future endeavours CERN is working on to make science more accessible and popular on a worldwide scale, especially to isolated populations (besides the open data)? And thanks for taking some time off the groundbreaking discoveries to answer a few questions, you guys rock!
askCERN32 karma
We have long been working on Open Access.
All the scientific publications from the LHC are available free to read to anyone, and are all published under a Creative Commons license.
Recently we have been teaming up with partners in over 40 countries to support Open Access publication of most scientific results in High-Energy Physics through the SCOAP3 initiative.
(sm)
gtenagli14 karma
Disclosure: I work at CERN.
All the Open Access initiatives are very interesting, and I think one of the best ways to "contribute back" to the society. I was wondering what are the main challenges you face in promoting OA for HEP?
Cheers from IT/DB.
askCERN16 karma
The main challenge is building partnerships and consensus: Open Access is something you build across research institutions, libraries, publishers. We have a few stories recounted at http://scoap3.org/webinar2014
(sm)
bernaferrari23 karma
How realistic do you think Interstellar was and how favourable (or not) are your scientists to sci-fi (or bad science) movies?
askCERN50 karma
I think it was great to see a film that took the science seriously and tried to get things correct (more-or-less). It therefore held itself up for criticism, more than a "normal" sci-fi film would get. Nice to see the problem of interstellar travel, and the time and distances involved, not "warped" or "hyperspaced" away. (tm)
askCERN23 karma
Ok, everyone, we're logging out now! This was fun, and we hope you enjoy all of our data over on the CERN Open Data Portal.
Argo_Vector19 karma
Is there anything that a normal person with little science background could do with the data? I want to explore all this open data but I am a college art school student.
askCERN18 karma
There are two sections in our OpenData.cern.ch portal. You can check the "Education" section, where there are several Learning Resources to get you started
(sm)
bwohlgemuth18 karma
Fantastic news and I hope more scientists take this approach!
Question: how are you planning to handle the 49,000,000 armchair particle physicists (who last week were 49,000,000 armchair lawyers) and do you see these questions as an opportunity to engage people into the physics world?
askCERN25 karma
That's the entire idea: release Open Data to engage "citizen scientists" alongside scientists in this field and neighboring disciplines.
The data are released under the Creative Commons CC0 waiver. This means that neither CMS nor CERN endorse any works, scientific or otherwise, produced using these data.
Anyone re-using the data will be free to write scientific articles, quoting the source of the data, and submit them for publication in scientific journals.
We hope that those who will enjoy working with the data, without writing publications, will take this opportunity to get closer to physics, and to science
(sm)
ComboForTheStorm14 karma
What kind of hobbies do you usually have in common with the people that you work with?
askCERN41 karma
Climbing mountains of rock, to take a break from our mountains of data [tjs]
Unremoved12 karma
Any question I ask would be absolutely stupid based on the crazy amounts of science you guys are performing.
So...Thanks for all your hard work, and being on the front line of open data access and transparency. Even us not-as-smart guys know that is a huge undertaking, and hopefully one that we'll see as a continued trend.
Edit: Okay, so this sub won't let me submit without asking a question. Uh. What did y'all have for breakfast this morning?
flipstables9 karma
Thanks for your efforts and contributions to open data and science!
My question: what big data technologies does CERN use?
askCERN14 karma
Big data is an overused term. Today, we have a number of in-house developed solutions to deal with the volume, rate and access patterns. At some partner sites, e.g. members of the worldwide LHC grid, a combination of home-grown and commercial solutions is used. (js)
shivan219 karma
Are there any tutorials how one can interpret and search through data? Are there any tools for it?
askCERN15 karma
We've included some basic examples for accessing and using the CMS public data. The CMS-tools collection will certainly grow with examples and tutorials. This is just a start! (klp)
MadTux8 karma
Can you recommend anything for a small school physics course learning about electromagnetism and Lorentz force?
askCERN10 karma
Have a look at the tracks of charged particles in the magnetic field inside the CMS experiment. Load an event in the event display, turn it to the x-y plane and observe the track curvature. (klp)
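For a classroom exercise, the curvature seen in the event display can be tied back to the Lorentz force with a few lines of code. The sketch below is an illustration rather than part of the CMS tools: it assumes a uniform field of 3.8 T (the nominal CMS solenoid field) and uses the standard relation R = pT / (0.3 |q| B), with pT in GeV/c, B in tesla, R in metres and q in units of the elementary charge. The function name is hypothetical.

```python
# Speed of light expressed so that pT [GeV/c] = K * |q| * B [T] * R [m]
K = 0.299792458

def curvature_radius_m(pt_gev, b_tesla=3.8, charge=1):
    """Radius of the circular track in the plane transverse to the field."""
    if b_tesla == 0 or charge == 0:
        raise ValueError("a neutral particle or zero field gives a straight track")
    return pt_gev / (K * abs(charge) * abs(b_tesla))

# Higher transverse momentum means a larger radius, i.e. a straighter track.
for pt in (1.0, 10.0, 50.0):
    print(f"pT = {pt:5.1f} GeV/c -> R = {curvature_radius_m(pt):7.2f} m")
```

Students can measure R from a track in the x-y view and invert the relation to estimate the particle's transverse momentum.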
askCERN11 karma
In the terminology of the CMS experiment, we are sharing the data at the AOD (Analysis Object Data) level. This is a part of the RECO level data, and is the format used by the CMS physicists for data analysis, and it contains the necessary information for analysis (in less volume compared to RECO data). (klp)
LuInFrance8 karma
Congratulations on the Open Data Portal. What a gift to the world! How long did it take to develop?
askCERN10 karma
Thanks! The CERN Open Data portal developments started in June 2014, so it took us about five months to build it. (ts)
Clestonlee7 karma
Why do you think some people resist open access data? And how can we make it more readily accessible?
askCERN19 karma
Researchers (in every discipline) put a lot of time and dedication into preparing their research and thus the data taking. Data are a precious good and thus need careful handling. Many are afraid to share data openly, fearing they would not get credit for the hard work they put into it. It is only recently that principles for referencing/citing data have been established (Force 11 guidelines). Such mechanisms will help establish trust in open data sharing. (sdt)
BlackOut19627 karma
How do you guys manage the massive amount of data you get from the LHC?
askCERN9 karma
"Manage" is a big word. Roughly speaking, the 4 main LHC experiments have similar computing models, where the raw data (after a significant reduction through "triggers") is stored permanently at CERN (the Tier0) with a copy spread over roughly 10 Tier1s. Reprocessing is largely done at the Tier1 sites, with analysis and Monte Carlo at the ~100 Tier2s. But this is all high-level. Funding agencies are now requiring "Data Management" plans, which should also include Data Preservation and Open Access plans / policies. (js)
GetToDaChoppa16 karma
Hello scienticians!
I am but a layman, and do not speak your language of awesome science. Therefore, I will ask but a simple question: what's the coolest thing about working at CERN?
askCERN11 karma
Among many other things - I am excited about the collaborative, international and open minded work environment here (sdt).
dukwon6 karma
It's not immediately obvious how much of the Run I CMS dataset is currently available (half of 2010 maybe means more to someone within the collaboration than outwith). I could probably look this up, but how much integrated luminosity does this correspond to?
Will the rest of Run I eventually be made available at the same 'level' of data? I assume you're going from tens of pb⁻¹ to tens of fb⁻¹, so that's a factor of ~10³ more data. Is this considered a feasible goal?
I'm looking forward to seeing data from the other experiments.
askCERN11 karma
Internally, the CMS 2010 data taking was divided into "RunA" and "RunB". CMS decided to release RunB, which is the second part of the run, with a volume of 27 TB. CMS will also gradually release the rest of Run I (i.e. the data from 2011 and 2012), with the upper limit on the amount of data being less than half of the integrated luminosity available internally to the collaboration. (klp)
dukwon3 karma
Thanks.
I've found a plot:
http://cms-service-lumi.web.cern.ch/cms-service-lumi/publicplots/int_lumi_cumulative_pp_2.png
From this, I work it out to be around 20 PB in total for Run I. Is that right?
askCERN5 karma
A single reprocessing at the level of data that we release (which is also the format that CMS members used in the analysis) is roughly 200 TB for 2011 and 800 TB for 2012. But the total data volume (including raw data and the several rounds of reprocessing) is much more. (klp)
danny56086 karma
Do you also ship the scripts that were used to analyze the data? For example, can I find the actual analysis scripts that led to the conclusion that the Higgs boson exists?
askCERN7 karma
We include some code for analyses that one can do with the data released, for example a one-Z-boson (2-lepton) and a two-Z-boson (4-lepton) analysis. The latter is a channel that can be a signature for the Higgs. However, actually finding the Higgs required much more data and work, etc. (tm)
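The core of a two-lepton Z analysis is the invariant mass of the lepton pair, which peaks near the Z mass (about 91 GeV). The sketch below is not the code shipped with the portal; it is a minimal illustration that assumes each lepton is described by its transverse momentum, pseudorapidity and azimuthal angle, and that the lepton masses are negligible. The function name and the sample values are hypothetical.

```python
import math

def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass (GeV) of two leptons treated as massless.

    Uses M^2 = 2 * pT1 * pT2 * (cosh(eta1 - eta2) - cos(phi1 - phi2)).
    """
    m_squared = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(m_squared, 0.0))

# Made-up example: two back-to-back leptons in the transverse plane.
mass = invariant_mass(45.6, 0.0, 0.0, 45.6, 0.0, math.pi)
print(f"dilepton mass = {mass:.1f} GeV")
```

Filling a histogram with this quantity for every opposite-charge lepton pair in a dataset is the usual way to make the Z peak visible.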
Gerterd5 karma
Hi! I'm looking to apply to CERN for a Summer Student Programme. Do you have any tips to maximize my chances? I'm a CompSci student finishing his 3rd year, and I'm from Poland.
askCERN5 karma
Please see previous answer: be passionate about your work, get involved in free software community, publicise your code on GitHub or Bitbucket, etc.
Edit: Fixed typo. Sorry, not all of us are native English speakers.
Maximus56845 karma
Do you expect that these data will be used outside of education or double-checking the conclusions that CERN has reached? If so, what for?
Also, it appears the CMS data that you released are only from a single run on a single day. Do you intend to release more or ever allow open access to the "firehose?"
askCERN10 karma
For the first question, there's certainly a possibility of using them outside of education. Some earlier released public data have already been used for studies of statistical methods. Double-checking may also be possible, but I would see more interest in studies which we have not yet done. LHC data are incredibly rich and, while we have studied the domain which is of most interest to high energy physics, there may still be other things buried. I'm really curious to see what!
For your second point, the released data contains the full "RunB", which is the term we are using for the second part of data taking in 2010, so it is not a single run (in the sense of the accelerator run) on a single day. (klp)
Aderyna5 karma
How would you guys like to see the work you do incorporated in modern science education?
Also, if I had the chance to tour CERN, how much would I be able to see?
askCERN6 karma
There are many educational resources which you can build on the Open Data, see for instance http://opendata.cern.ch/resources
We hope that those can be used in classrooms around the world: we know that when students can work with real scientific data they get fascinated by science
(sm)
Eunoshin5 karma
With the pure amount of data that you will be presenting to the public, do you see opportunities to influence industry direction or mindset for the long-term maintenance of big data?
askCERN11 karma
Yes, we do.
Long-term maintenance of large data volumes is certainly not trivial: check out the report from the 4C project. We (in HEP) believe that we have knowledge and skills highly relevant for affordable, sustainable massive scale archives and we are trying to influence both industry as well as possible consumers (js)
NicolasGuacamole4 karma
What's your best advice for a computer scientist hoping to do a placement year at the facility?
askCERN8 karma
Be passionate about your work, get involved in free software community, and apply for a CERN summer studentship or technical studentship programme! (ts)
______DEADPOOL______4 karma
When that damn Boson was discovered, there was a big talk about openness of the data, and I ended up in a debate with a scientist working in the field defending that data should remain closed just because people would be asking the scientists so many things on how to interpret the data and that means the data should stay locked up so the scientists can keep working on their stuff.
WHO'S LAUGHING NOW?????
askCERN7 karma
The experimental scientists who discovered the Higgs Boson with the ATLAS experiment made available some of their data for their colleagues in the theoretical physics community to verify their hypotheses.
Check http://inspirehep.net/record/1241574/data
(sm)
stax_n_stax4 karma
I'm always happy to see scientific data made openly available, but was the project approached by any commercial organisations for data collected from the project, or are we in such crazy realms of physics that it has limited market value/commercial application?
askCERN9 karma
Our Open Data has value for education, citizen science, and scientists in this field and neighboring disciplines.
So far we have not heard of a commercial re-use... but we released them just last week!
Maybe for a start someone wants to print a t-shirt out of some of the beautiful visualizations?
(sm)
aaaaaaaarrrrrgh3 karma
Is data preservation really an issue in the classic data preservation sense, i.e. beyond "make sure you have five copies of it on different continents and regularly check/recreate them"? Are you trying to preserve the data in a way it will be preserved for millennia and across civilizations?
askCERN6 karma
Data preservation - or probably bit preservation - is quite tricky when you get to the 100PB level. We do have multiple copies of the data and these are used to recover from time to time. We expect to preserve the bits for at least a few decades and have a cost model which suggests that this is possible and affordable up to about the 10EB level (10,000 PB), which we might reach around 2040.
Preserving the data AND the knowledge / environment so that they can still be used tomorrow, in ten years, in thirty years is a bigger challenge and this we are trying to address too.
There is a big difference between what some people call "observations" - e.g. of the universe, of the earth - which cannot be repeated and those that come from things like the LHC which are more "data factories". We could in principle build a new LHC in the future (it might even be done but for scientific reasons, like higher precision leading to more discovery potential) but you can't go back and repeat an observation you have lost / missed (js)
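As a rough back-of-the-envelope check on the volumes quoted above (about 100 PB now, about 10 EB around 2040), the sketch below computes the average yearly growth those two figures would imply. The start year of 2014 and the assumption of steady compound growth are ours, not part of the answer, and the function name is hypothetical.

```python
PB_PER_EB = 1000  # 1 EB = 1000 PB, so 10 EB = 10,000 PB as quoted

def implied_annual_growth(start_pb, end_pb, start_year, end_year):
    """Average compound growth rate per year between two archive sizes."""
    years = end_year - start_year
    if years <= 0 or start_pb <= 0:
        raise ValueError("need a positive time span and a positive start size")
    return (end_pb / start_pb) ** (1.0 / years) - 1.0

# Assumed inputs: 100 PB in 2014 growing to 10 EB in 2040.
rate = implied_annual_growth(100, 10 * PB_PER_EB, 2014, 2040)
print(f"implied growth: {rate:.1%} per year")
```

Under these assumptions the archive would need to grow by a little under a fifth each year, which gives a feel for the scale a long-term cost model has to cover.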
Zharofun3 karma
Do you think that Higgs boson will be the greatest ever finding by CERN? Or, are you guys planning to work or working on something even bigger?
askCERN4 karma
Please see this article too: http://home.web.cern.ch/about/updates/2014/11/how-standard-higgs-boson-discovered-2012 (js)
Tabura3 karma
Hello, just ye olde internet Science enthusiast here! I'd just like to say I greatly appreciate your work, and the fact that you take time off to inform the public about it. I have only two short questions for you.
I have read some of your previous AmA here and I thought of asking, how much more have you discovered since then?
Sort of an off-topic question, on your site for studentships in summer for non-member countries (http://jobs.web.cern.ch/join-us/studentships-summer-non-member-state-nationals) it says that one of the requirements is being a university-level undergraduate (Bachelor or Masters) at least in your third year. I assume this presumes you are a Physics major though, even though it's not stated? I'm interested in applying but am not a Physics major.
Thank you in advance!
askCERN3 karma
Physics major is not a requirement; we do a lot of computing and software development! In addition to the CERN summer student programme, you may want to check openlab summer student programme where we welcome students from all over the world. (ts)
askCERN4 karma
the collaborative, international and open minded work environment! lots of interesting colleagues and friends to work with! (sdt)
sierrazas2 karma
Is a technological leap going to happen in the near future (approx. 10 years)?
moreorlessrelevant2 karma
Hello, first of all, great initiative!
I just took a cursory glance through the descriptions but it seems there are no MC samples. Why? Size?
I didn't see whether the tools include a generator/showerer/detector simulator. Do they?
Thanks!
askCERN3 karma
For this first release of the 2010 data from CMS, we do not have a fully corresponding sample of simulated MC events (i.e. reconstructed with a compatible software release). But our intention is to include the MC samples for open data releases in the future. (klp)
seismicor2 karma
What science is behind virtual particles that are popping in and out of existence and how often do you see them in the results?
askCERN3 karma
The LHC does produce collisions between quarks and antiquarks, even though the colliding particles, protons, consist of quarks (and gluons). The antiquarks pop up as virtual quark-antiquark pairs within the proton and they can exist as long as they respect Heisenberg's uncertainty principle. But if the proton collides just when they exist, we do see them :-) (or rather what comes out in such a collision...) (klp)
omkaram2 karma
There was a report in The Economist a few months back about how many science research papers don't get published because they were unsuccessful. Honestly, I think that 'unsuccessful' experiments should get published online or offline, as they are legitimate results that need to be taken note of or reviewed. Newsworthiness of the findings should not be the top priority, but it appears that in some instances it has been the case.
The report also spoke of the difficulty in accurately replicating an experiment by a 3rd party, and the apparent mess that the whole system of peer review is in. This of course means a progressively shakier foundation for future research.
Do you guys agree that some fields of science have reached this stage? If so, can the situation be remedied? What steps might need to be taken?
Thanks!
askCERN4 karma
Important topic indeed! I believe Open Science (Open Software, Open Data... Open Access) provides a great opportunity in that regard. It is an important step towards reproducible and reusable research results. It requires good (open) documentation and discussions about standards, for example. On the publishing side, there are many new initiatives out there offering new peer review tools which happen in the open. I believe, we need to continue such paths. (sdt)
FancyFoxFive2 karma
As someone with interest in what you guys do, but not enough time to really read as much as I could on the subject, what are some of the highlights of what you do? And where could people like myself go to learn interesting things without being overwhelmed?
Nojopar2 karma
Hi, first a big Kudos to you guys for doing this! We covered this in our weekly geography podcast over at VerySpatial.com (latest episode to be released this evening EST)! I was wondering... what software did you use for your Open portal?
askCERN3 karma
The portal is built on top of Invenio digital library platform. You can check our "overlay" sources on GitHub. The development of the CERN Open Data portal has been... fully open! (ts)
observerkid2 karma
What is the ideal open data configuration and what role do the authors have in it? What challenges face the attainment of CERN's open data framework?
askCERN3 karma
The work of the Open Data Portal is just the starting point. The work on the portal started (as you can see from this AMA) as a collaborative effort between researchers/authors, our IT department and the scientific information service. This configuration worked very well, as all these skills are needed for documenting and exposing the data. However, our challenge now is to continue this work and expand to more datasets, more tools and more visualization. (sdt)
KingSix_o_Things2 karma
Hi, I'm a firm believer that the Open Data agenda generally is one of the ways to get more and more people motivated to challenge and innovate. What do you think is the next step in terms of getting even more people interested in the information that is, or could be, available?
askCERN3 karma
It is important to expand the discussion to software and documentation so that Open Data is actually (re)usable. Also, more visualization will help getting more people interested. (sdt)
sindbis1 karma
I know CERN has a lot of computer power. Any plans to open up the compute grid for non-physics related data analysis (say for example in computational biology)? Also, have you guys thought about monetizing it and charging individuals/corporations/universities for compute time?
askCERN4 karma
The grid that we use includes resources and services at around 200 sites. They are pledged for the LHC explicitly, but have in the past been made available for specific non-physics purposes for short periods. (js)
unbreathless1 karma
What do you expect people to do with this data? As in, what do you think they could discover?
Will you be considering theories put forth by the general public based on your data?
askCERN4 karma
Something unexpected I hope! The data has already been compared with a vast range of theories, but many more can be conceived. Algorithms can be compared and tuned. Analysis techniques can be learned, explored and refined. Analyses based on our data should be entered into the standard scientific publication process to undergo the rigors of the scientific review process [tjs]
AlexanderGooZH1 karma
Are you expecting to find a specific particle when you do experiments, or do you have a whole list of particles that are hypothesized?
askCERN2 karma
The so-called standard model predicts which particles should exist. Some have not yet been seen - some were discovered recently e.g. by the LHCb collaboration. Search the CERN press releases for more info (js)
orr250mph1 karma
how do you feel about elements being placed into the periodic table which only exist in highly controlled environments for extremely short periods of time, like nanoseconds?
askCERN2 karma
For the periodic table, what matters are the properties of an element, after all, and not how long they exist for. So for the understanding of physical properties it is important that they are studied and listed indeed! (sm)
arx421 karma
Hi, thanks for the AMA and thanks for the open data portal. You guys are awesome. What I want to ask is: what advice would you give to a CS undergraduate student planning a career in data science, and how did you become a data scientist at CERN? Thanks again.
askCERN4 karma
Happy to hear you like it! We are a bunch of physicists, computer scientists and librarians who work collaboratively on this project. We all took different paths before working on this portal... but we all had our hands on data at some point. Either directly with the data taking, or when putting the metadata together. So I guess I would recommend to get your hands on data asap (and maybe apply at CERN - there are lots of opportunities!). (sdt)
seismicor170 karma
Hi. After finding a Higgs particle (or a particle similar to it), what is the next biggest goal of LHC?