Hi, reddit! We are currently developer relations engineers at Deephaven. Amanda has a master's degree in astrophysics and a doctorate in computer science, and JJ has a master's degree in applied mathematics.

We work at Deephaven teaching other data scientists to work with big data, streaming data, and AI using Python and Deephaven. Our free open source projects for working with real-time, time-series and column-oriented data using our open core data query engine are available from GitHub. Check out some of our recent example projects, including using Twitter data in real time to do sentiment analysis and solve the daily wordle, using Prometheus data in a dashboard, and converting the 22GB r/place dataset to a 1.5GB Parquet file for easier analysis.

AMA from how to get started with a career in data science, to working on large data sets in Python, Apache Parquet, Apache Kafka, or using Deephaven in your wo

Proof: Here's my proof!

Comments: 316 • Responses: 53  • Date: 

DeephavenDataLabs103 karma

JJ was asked his favorite Python packages. His top 3: NumPy, SciPy, and Scikit-Learn. Honorable mention to pandas.

baineschile3 karma

Ooh no tensorflow, I am sads

DeephavenDataLabs9 karma

Have you looked at Polars? It's a new dataframe library that has an API that makes a lot more sense than pandas, and on top of that is much, much faster.

Re: TensorFlow. JJ is a big fan of that as well, and uses it in several of his example projects.

Firesequence78 karma

do you think retraining and getting into the industry at 55 is too late ?

DeephavenDataLabs66 karma

Absolutely not! This industry is growing so much and it's never too late to learn something new.

DeephavenDataLabs49 karma

Not at all. There's a huge demand for people in the software world. If someone is extremely motivated and willing to learn, there are ways to enter the industry!

cv-boardgamer50 karma

There is a free program through the city college here in my city which offers free Python classes. I'm almost 46. Am I too old to get started?

I have a BA in graphic design. I freelance in web design, editing video, and writing e-newsletters mostly. I would like to make more money.

DeephavenDataLabs52 karma

You're not too old to get started at 55 or 46 or any age. You have a desirable skillset and web development is huge. No matter where you get your education, learn on your own - read books, watch YouTube, to make sure you're learning the right things. Check out Django.

With such good front-end skills, you might want to learn React or JavaScript? See what else they offer.

jaamulberry44 karma

Maybe not a simple question but why python?

DeephavenDataLabs30 karma

can we compare it with say ksqldb ? can you touch upon similar or direct competition for deephaven ?

Its syntax is incredibly cool, and it's one of the most popular right now. Our backend is Java because it's fast and memory safe, but it doesn't have a great ecosystem for data science. Python may be slow, but it's well suited for machine learning and data science. We say more on our live feed.

LFW66239 karma

I am currently working on a MSBA (business analytics), and my courses are heavily focused around Python and SQL. I am aiming to land a job as a data scientist in about a year after I graduate. If you were interviewing a candidate for a lower level data scientist position at your firm, what are the top 5 qualities/skills you would be looking for?

DeephavenDataLabs60 karma

We hire with two steps in mind:

1) a coding test - can they write functional code? an automated test will determine if it works, and we need to be able to read that code and see if it's clean

2) what do they KNOW? you're going to be hired on your ability to figure things out. Problem-solving skills are essential. How do you approach a situation you aren't prepared for?

The new employees who excel the most are usually unafraid to ask questions.

nyteghost14 karma

What type of coding test is it? Or rather, what type of program needs to be written?

DeephavenDataLabs10 karma

We have different coding tests for different positions. In general, the problems define what the program needs to do. Then we see if the implementer can understand and implement the specification. All of the problems can be solved with a basic understanding of a programming language and a basic understanding of algorithms. We just want to see that someone understands the basics. If specific knowledge needs to be assessed, that will happen during an interview. Developer tests skew towards computer science, and quant / data science tests include some mathematical programming.

Bananape4l23 karma

Sorry this is late, but many people in the r/datascience sub have been telling everyone that data science is not a beginner's field, not an entry-level job, and that you're hopeless without either a phd, publications, or "domain knowledge" ... while i agree that a certain mathematical maturity would be required to be effective, their narrow view seemingly contradicts yours that "any age isn't too late to learn."

what level of education is required to be truly effective?
would you or your friends hire early-career data scientists? or is it truly just a field of adjacent transfers, for experts by experts?

thanks

DeephavenDataLabs3 karma

As with all jobs, there are various levels of work, which require different levels of knowledge. There are certainly some jobs that do need very specialized knowledge, degrees, domain knowledge, and experience. There are other jobs that do not have the same requirements. Every year many thousands of entry-level data science jobs are filled by new graduates, beginners with no experience. No matter what, a data scientist needs good number sense and mathematical reasoning. You do need to know something. You don't need to know everything.

B055_MU5T4NG19 karma

What’s your opinion on the R platform, particularly for analysis of datasets?

DeephavenDataLabs36 karma

When I used it in the past, I thought it was intuitive and easy to use. I haven't used it much in recent years, but I can understand why it's very popular. Per our colleague Chip - R is powerful because there are a lot of packages to go with it, but in terms of language, it's not very structured. If your program gains complexity and grows larger, it gets awkward. For large programs, I wouldn't recommend it.

Jjjohn040412 karma

What is the most challenging aspect of working with real time big data?

DeephavenDataLabs23 karma

There are a few challenges. In particular, for machine learning, whether or not your process can actually keep up with the amount of data you want to process. You need to find the happy medium between the complexity of the algorithm I'm trying to implement and the adequacy of the results I'm getting. Is it worth making my model more complex for a 2% accuracy increase? Not always.

DeephavenDataLabs6 karma

From one of our colleagues:

There are several ways to approach that. From a UI perspective, it is making the data discoverable and responsive - users do not always have an understanding of what is “expensive” or not. From a backend engine standpoint, I would say that there is a careful balancing act with having simple data structures [because simple is generally faster] vs. complex structures [because storing or linking information allows you to do less work].

randomesthinker10 karma

Can you recommend any courses or certifications that are actually valued by job recruiters in the data science field? I'm trying to break into the field, but I don't know which of the many courses/programs are viewed positively by hiring managers.

DeephavenDataLabs12 karma

When we're looking for someone to hire, we want to see what they know vs. what we can see on paper. We want to see a genuine desire to get better at their craft. Some employers do look at certifications, of course, and it depends on the industry you want to enter.

randomesthinker4 karma

Thanks for the answer! That's fair. What do you use to establish what they know? Reviewing a portfolio they've created? An internal testing process? I appreciate any insights. Just trying to figure out how I can get the initial callback to prove what I know. :-)

DeephavenDataLabs14 karma

Every interviewer has a different technique they use. Jake, one of our DevRels, likes to dive into the candidate's resume. Anything on the resume is fair game to ask, ranging from experience building with a specific language/framework, or asking about problems on a specific project. He then follows up with a technical question, either a system design or a coding question on using an external service. For him, this technique shows him that the candidate has both a technical understanding and understands previous work well enough to talk about it on both a high level and technical level.

Indi_mtz9 karma

How do you see the future development of the DS/ML job market? With so many new graduates in that field and things potentially being automated do you think there could soon be a saturation in industry?

DeephavenDataLabs9 karma

There is definitely a deficit of people to hire! At the same time, there are several interesting new technologies - machine learning to create machine learning - and some aspects of what people are doing now may become automated in the next 10 years. Nevertheless, it'll take a long time for the supply of people to catch up with demand.

daffas8 karma

I'm trying to learn Numpy and Pandas on the side and I'm having trouble finding the motivation to work on it after getting done with working a full time job. Do you have any tips on getting and keeping the energy/motivation to work on projects on the side?

DeephavenDataLabs10 karma

Amanda here. I did a lot of learning while I had a full-time job (or 1.5 jobs). I found that doing my fun learning early in the morning before work was useful, but then on a Saturday or Sunday, I would pretend like it was a workday for my dream job (using what I wanted to learn) and spend time doing that! I am lucky that my family understood these "work days" on the weekend; even though they didn't like it they appreciate the quality of life it has led to.
Another thing that helped was sneaking in a lunch break that was actually a learning break! I also listened to videos and podcasts while I drove. All of that little learning adds up!

newpua_bie6 karma

[deleted]

DeephavenDataLabs12 karma

How would you recommend someone coming from STEM/science (as in an actual scientist) and programming/Python background but no specific DS/ML experience to get into the field?

To get into CS is a matter of persistence and learning, and you can absolutely transfer your science skills into the field. Amanda comes from Astrophysics, and we've interviewed plenty of people from diverse backgrounds. More advice in our live stream, and we'll answer more fully later! Lots to say here. https://www.youtube.com/watch?v=8hmQ9DzTr-g&list=PLx68WY_F9lf5UmeE_0Dchc3xYT19D40bb

ab6244 karma

where does Deephaven fit in ETL data pipeline ?

DeephavenDataLabs3 karma

Deephaven’s rich table API allows it to be an incredible tool for ETL, and arguably the only one for streaming data.

This is part of why we support Python; any data cleanup you might normally do in Python would work in Deephaven, too. The transform piece fits well into our update/update_view mechanism, where we just build views on views on views, and do the work at runtime. If you do it that way, it becomes more of ELT. We load the data and then transform it at runtime - depending on what kinds of sources you are hooking into this might be necessary. Some of the transformations may not make sense until you have other ticking sources already at your disposal.
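
A minimal sketch of the "views on views, do the work at runtime" idea, written in plain Python rather than with Deephaven's actual API. The `LazyView` class, the column names, and the sample records are all hypothetical and only illustrate the load-then-transform (ELT) ordering described above:

```python
from typing import Callable, Dict, List


class LazyView:
    """Hypothetical stand-in for a view: columns are formulas evaluated on read."""

    def __init__(self, source, formulas: Dict[str, Callable[[dict], object]]):
        self.source = source      # a list of row dicts, or another LazyView
        self.formulas = formulas  # column name -> function of the row

    def rows(self) -> List[dict]:
        # Pull rows from the parent at read time, so newly loaded data is seen.
        parent = self.source.rows() if isinstance(self.source, LazyView) else self.source
        out = []
        for row in parent:
            new_row = dict(row)
            for name, fn in self.formulas.items():
                new_row[name] = fn(new_row)
            out.append(new_row)
        return out


# "Load" first: raw records land untouched (made-up sample data).
raw = [{"sym": "A", "price": 10.0, "size": 3}]

# "Transform" later: views stacked on views, nothing is computed yet.
notional = LazyView(raw, {"notional": lambda r: r["price"] * r["size"]})
flagged = LazyView(notional, {"is_large": lambda r: r["notional"] > 50})

first_read = flagged.rows()

# New data arrives in the source; the next read reflects it.
raw.append({"sym": "B", "price": 20.0, "size": 4})
second_read = flagged.rows()
```

This toy version recomputes every row on each read; it only shows the ordering of load and transform, not how a real engine would update results incrementally.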

ab6242 karma

any good tutorials/resources to get my hands dirty ?

DeephavenDataLabs2 karma

If we can say so ourselves, we've got a cool demo experience you can check out here:

https://demo.deephaven.app/

There are sample notebooks with runnable code that show off Python features, as well as data science, AI/ML.

Also, our Python tutorial:

https://deephaven.io/core/docs/tutorials/tutorial/

quietIntensity3 karma

How much of an impediment would a lack of a degree be in finding a job in your field, coming from a 25+ year professional background in IT, mainly Software Engineer roles and some SE/InfoSec overlapping roles? I dropped out of college during the dot com boom for financial reasons and promptly started an IT career, never going back to finish my degree, as I've always just taught myself whatever I need to learn when the need arises. Currently learning Kotlin and Python to expand my horizons, along with a ton of cloud stuff for my job role. Is there a potential to shift into a more DS/AI/ML focused role without spending $100K to go back to college in my mid 40s?

DeephavenDataLabs3 karma

For our interview process, we like to start off by giving candidates a programming test. We do this to make sure our candidates can solve technical problems.
Then we follow up with an interview to see what the candidates know. For more junior candidates, we care more about the ability to adapt and solve new problems than any specific knowledge. We actually don't care much about the degree and credentials on paper.

Find some good Udemy classes so you have a knowledge base to work from. Try out some projects to test your understanding. Show off your work on GitHub! Find a good resume writer to help you tell your story - this is a natural move, and use lots of action verbs from your career to show how the skills from your prior job make you a good fit. Talk to some head hunters with connections. 25+ years of experience is a huge asset.

At the end of the day, whether you have a degree or are self-taught, do you have the skills necessary to do the job?

neildmaster3 karma

What do you recommend studying now (middle school) or doing on your own time for a teenager that wants to get into data science or computer science?

DeephavenDataLabs9 karma

Definitely look at some introduction to programming courses. Python, Java, and C are solid fundamentals to have going into computer science. And see if your school has any clubs or after school activities for computer science!

Alpha_sands3 karma

Heyya! I'm a high school student currently pursuing a future in computer sciences and python programming; and I find the work y'all have done so COOOL. As such, I'm interested in what got y'all into this field and what sort of advice/notices/warnings would you give students (like myself) as we delve into this field?

DeephavenDataLabs3 karma

Thank you! It's a very diverse field with lots of problems to be solved. Lots of us had different reasons for getting into the field, ranging from an interest in technology and computers, interest in problem solving, and some of us even came from a scientific background.
One of the biggest pieces of advice for this field would be to be prepared for the collaborative aspect of it. Even with our 100% remote team, we are extremely collaborative. Don't be afraid to reach out to your colleagues when doing your work.

Tidalsky1143 karma

Best place for someone with no experience to start?

crazymoefaux2 karma

Not one of the AMA folks here, but anyone can start learning python through Khan Academy, Udemy, or any one of dozens of tutorials on youtube. Buy a book on python, if you really want, but there's enough free materials out there for anyone to get started. Python itself is Free Software, the official python packages are completely free to download and use to get started coding, and anyone can contribute to the open source project that runs the show.

Python runs on windows, *nix, apples, there are integrated developing environments (IDEs) that help things along, but all you need is any text editor and a python interpreter for your platform to get started.

I've been brushing up my python skill by doing the www.pythonchallenge.com, which is a bit esoteric in some spots, but with some determination and google-fu, you can find the python libraries, functions, and regular expressions (regex, a very powerful string analysis tool, comparable C++ code is just... gross and bloated compared to a clean python or even a perl regex) to pull secret messages out of images or data sets.

And the language was named after Monty Python, which will always be one of the best name origins in computing history (Lady Ada of Lovelace notwithstanding).

DeephavenDataLabs3 karma

Great answer! Thanks.

alexgand2 karma

Statistics vs other majors to work on the field?

DeephavenDataLabs5 karma

JJ - speaking from first-person experience, one of our interns was a Stats major and an absolute pleasure to work with. He did some awesome stuff we still work with up to this day, particularly with the deephaven.learn library.

DeephavenDataLabs7 karma

The specific major that you study isn't going to make a huge difference in your career. As long as you understand the math behind the algorithms, can solve problems and can write code you should be able to succeed independent of your major. No matter your study, knowledge of computer science fundamentals will help greatly with the work that you do.

FrontierPsycho2 karma

Given that python is not the fastest language, what methods are used to process large datasets? Are workloads divided into independent chunks and processed in parallel? Is a different method used? What python tools or libraries are used to accomplish this?

DeephavenDataLabs5 karma

  1. use efficient data types; use NumPy, SciPy for vectorized computing; use Arrow for efficient data interchange (which Deephaven fully supports/extends) etc.; drop unwanted columns.
  2. definitely break the workload into chunks (ingesting chunks of data instead of whole, process data in chunks, etc.)
  3. use distributed/parallel computing package such as DASK
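
As a rough illustration of point 2 only, here is a standard-library sketch that computes a column mean while holding one chunk of rows in memory at a time. The helper names (`iter_chunks`, `chunked_mean`) and the sample data are assumptions made for this example; pandas users can get similar behavior from `read_csv(chunksize=...)`.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(reader, column, chunk_size=1000):
    """Compute the mean of one column while holding only one chunk in memory."""
    total, count = 0.0, 0
    for chunk in iter_chunks(reader, chunk_size):
        values = [float(row[column]) for row in chunk]
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")


# Hypothetical in-memory data standing in for a large file on disk.
sample = io.StringIO("id,value\n1,2.0\n2,4.0\n3,6.0\n4,8.0\n5,10.0\n")
mean_value = chunked_mean(csv.DictReader(sample), "value", chunk_size=2)
```

Only running totals survive between chunks, so memory use depends on `chunk_size` rather than on the size of the whole dataset.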

DeephavenDataLabs2 karma

Our core engine is actually implemented in Java, with efficient columnar data structures, designed for high throughput and low latency.

https://deephaven.io/core/docs/conceptual/technical-building-blocks/#mechanical-sympathy goes into more detail.

IntroducingHagleton1 karma

Who is or was the Indiana Jones of the data science industry?

DeephavenDataLabs1 karma

Wes McKinney - the founder of pandas!

DeephavenDataLabs2 karma

Two others of note: Rudolf Kalman and Andrew Ng.

DeephavenDataLabs1 karma

Please find the full video here: https://youtu.be/8hmQ9DzTr-g

AwkwardExaltation1 karma

In general, what's the real time processing speed of the engine? For context, my workplace develops log management software, and we've discussed implementing AI/ML for security events. Our software can process terabytes of data per day on a single server, how much of that could we reasonably run analysis on?

DeephavenDataLabs2 karma

Deephaven was originally built to handle historical and real-time analysis for use cases that produce terabytes/day of ordered, structured data. Users drive real-time analyses and batch computations in production environments, often serving tens or hundreds of query engines.
Obviously, the format of your data may make the task easier or harder, but we would certainly expect that an appropriately sized deployment could be used to analyze your data, and our community team would love to help get you started and address any pain points you encounter.
It might be best to reach out to us on Slack: https://join.slack.com/t/deephavencommunity/shared_invite/zt-11x3hiufp-DmOMWDAvXv_pNDUlVkagLQ

nepaligirl1 karma

Any tips for someone with non tech background to make a career switch to Data Analytics? I’m a nurse looking into making that switch. And on that same vein, how hard is it for someone like me to break into the industry?

DeephavenDataLabs2 karma

Jake here: Start with learning how to code. Python is probably the more relevant language for you to learn, along with a few frameworks like Pandas. JJ will jump in also!

KoreanBoi32131 karma

If I wanted to get started, what resources is best to learn. Also what computer do you recommend using and how can I connect with others educated in this field?

DeephavenDataLabs3 karma

Udemy or any coding tutorial would be a good start. Any computer should be fine to get started, but a Linux distro would be ideal (dual-booting Linux and Windows would be a good start as well). Amanda needed processing power for her data science projects that used huge datasets, and rather than upgrade her computer, used Google Cloud. Something to consider depending on what you're doing. In terms of connections, local tech meetups are a great way to meet people.

DeephavenDataLabs1 karma

Chip adds: Mac is also a good option. Windows is a good option now that there is WSL. Linux is a good option. Chromebook is not a great option unless a cloud IDE like Cloud 9 is used. I lean to Mac if you want to spend time coding and not fiddling with the machine. Linux needs to be learned at some point, but it will require some fiddling to get everything working on a laptop.

SnickeringBear1 karma

Not a question, just general observations. I put in 9 months so far helping a large telecommunications provider build an analysis capability to track usage in their network and to predict required network upgrades. This requires real time data feeds along with ability to analyze static network elements. This system has nothing at all to do with monitoring individual calls and everything to do with analyzing huge volumes of calls to predict future network requirements. Here are a few of the hurdles we overcame.

  1. The collection platform has a limited capacity to stream large volumes of data. We had to limit data inputs to the most important items and implement routines to manage the data as it was imported and collected into a database. Long term, they will need a system with a lot more horsepower and much larger storage capacity. Takeaway that any large data analysis effort requires some serious hardware to run the collection and analysis routines.

  2. There was a huge lack of know-how about what should be collected and how to analyze it once collected. The provider paid to have me provide the technical know-how required. (hint, there are very good paying jobs in data analysis!)

  3. Lack of technical skills using Excel and Python in particular caused many delays. I am not talking about everyday skills like using spreadsheets, generating charts, etc. This is serious skills in writing routines that are both efficient and effective. Learn python and learn to use Excel VBA. It doesn't hurt to know a bit about SQL.

  4. The results are best presented in a visual such as a graph, chart, or in a timeline. The people involved are highly knowledgeable about their particular industry so were able to immediately spot choke points in their network once the data was visually displayed. Not familiar with a timeline? Best learn about this one. Timelines are the only effective means of displaying many types of data that change over time.

  5. Automation of the data collection and analysis is a huge part of working with large data sets. We are automating everything possible which directly translates to man-hours previously spent manually collecting and evaluating data. The amount of time saved with an automated process is a huge justification for the money spent implementing the automated processes.

  6. The trigger for this collection and evaluation system was a series of equipment failures that directly tracked back to inadequate analysis. The executives in the company called people on the carpet with intent to prevent future occurrences. It is far better to spend a few $K on a monitoring and reporting solution than to stand in front of your boss's boss's boss explaining.

DeephavenDataLabs1 karma

Thanks for your insights. We agree that these are important problems, and we think Deephaven is an important part of the software stack for solving them. Users of Deephaven are able to focus their efforts on domain-specific requirements, instead of re-inventing real-time data analysis tools.

ClassifiedName1 karma

I'm an electrical engineering student who's been coding with Python for 6 years, in C for 8 years, and I've received ML experience the last 2 years. I want to take a break off of school and either get an internship in programming or, preferably, a full-time software job. What do you recommend I work at to make my resume more appealing, and how likely is it that I could find a job in software without a bachelor's?

Thank you for your time, I've enjoyed reading your other responses in this thread!

DeephavenDataLabs2 karma

Per Ryan, our CTO: If I was the hiring manager, I would be looking at their resume with an eye for the hard problems they have helped solve, and their contribution to the solution. Whether this is at a paid position, a research project, open source, or a hobby portfolio, show me what you can do.
I have worked with people with no degree or non-CS degrees many times, and I firmly believe that the quality of thinking and motivation to learn and work hard is more important than any specific credential. That said, learning core computer science concepts, data structures, and algorithms is never time poorly spent.

DontLookAtMe_Thanks1 karma

Is your company hiring remote interns? I have experience with C++ but I also have used Python

DeephavenDataLabs1 karma

Thanks for your interest! We actually just finished hiring our round of summer interns. Keep an eye on our Careers page, as we'll likely hire interns (among other openings, of course) again in the future. https://deephaven.io/company/careers/

shortAAPL1 karma

I work at a systematic hedge fund and was just reading about your company the other day. What are some of the reasons why a company would use deep haven over kdb/q?

DeephavenDataLabs2 karma

Deephaven and kDB are the leading technologies one might consider for a general-purpose data system on Wall Street. They separate themselves from the field in regards to their performance. Think "single-threaded speed." Other technologies are either orders of magnitude slower or have little range; Silicon Valley data systems focus heavily on sharding to provide performance (and are also not good enough with real-time data), so Deephaven and kDB are the leaders in the capital markets.

kDB has brand because it has been around much longer.

The two systems are comparable on performance with historical data and real-time data and the combination of the two. For micro loads kDB is a bit faster for singular operations (think "on something small that is simple", kDB might take 15 millis and Deephaven 22 millis for example), but for 'real' loads with any complexity, each will win various races.

I'll itemize Deephaven advantages below, but the core value prop is simple: Deephaven allows people to get more done. It is not a close call. There are many examples of Deephaven customers evolving systems and innovating much more quickly with their team than they would have if they were using kDB, their own homegrown tech, or something else. The difference in business velocity and innovation capacity is 2-5X, not "20% more".

It matters. A lot.

There are significant differences in the 2 systems. Here are the first 10 that come to mind:

  1. Deephaven is open source. Its fundamental transport API (https://deephaven.io/barrage/docs/) and JavaScript Web-UI harness (https://github.com/deephaven/web-client-ui) are Apache-licensed; its core engine is source-available, with a single restriction that will have no impact on parties using it for their own interest.
  2. Deephaven embraces open formats. kDB requires you to marry their tech for life, because your data is in their proprietary format. That is not modern and it is really bad for the future evolution of your Wall Street business. By having your data in Parquet, Orc, Iceberg; and streaming it in real-time using something like an Apache-Flight-compatible format... you can use any tech you want with the data. That's true today and as the world turns in the future. Locking in with a commercial vendor really limits the pace of infrastructure evolution for your company 3-0 years in the future. BTW: We think #1 and #2 are really big deals.
  3. Deephaven is infinitely self-serve. kDB is (kind of) the opposite. The greatest advantage of Deephaven is its singular ability to bring everyone around the data -- in the case of Wall Street this means quants, traders, execution people, algo developers, surveillance, risk modelers, salespeople, quant PMs, management. kDB is the opposite, where very few people in an organization touch the data. You don't want bottlenecks.
  4. Amongst other things, #3 refers to 'how you program the thing'. We know a very small number of people love q and k. God bless them. Deephaven is the opposite. Though it is fantastic for quants (- think 'pandas-like, but real-time') and developers ('SQL-like, but it's a proper Python application or Java application').... 30% of users of Deephaven are the traders, PMs, surveillance people, and managers that only used Excel before. On a single system, you can have literally all these diverse personas getting work done, building apps, and streaming derived work product to one another.
  5. Deephaven has huge range. It is much more than a classic "tick database". At its core, Deephaven is a Java application... and the team has evolved a Python-Java bridge (https://github.com/jpy-consortium/jpy) so most people now use it as a Python-first experience. Apps and analytics are easy to write... as one combines Python (or Java/Groovy) with table operations and other Deephaven-Table-API capabilities... setting up a logical tree where data flows from one node to the next. This style of linear and iterative data-driven (imperative) development is powerful.
  6. Deephaven is organized to have nodes sending source and derived (streaming) data to one another and to clients. This easy ability to essentially have a mesh of independent workers can provide nice pipelining and parallelization of course, but it gets much more interesting as you think of different people writing different apps that automatically inherit updates from a variety of sources, add modeling or business logic, and then publish to downstream consumers -- whether other workers, web front ends, or general CS or DS tools.
  7. Deephaven user experiences are compelling. For Community, that means its Web-IDE, which is second-to-none for looking at real-time data and exploring... or building applications. In enterprise, additionally, there is a compelling workflow for creating apps (-- this is important!), handling data lifecycle, and sharing.
  8. Dashboarding with Deephaven is fantastic. They're easy to create and share (in Community or Enterprise).
  9. There is a comprehensive PlugIn system, so the sky's the limit for marrying real-time data to either (i) your customized JS widgets; or (ii) Python visualization or calculation libraries (i.e., matplotlib, seaborn, etc.).
  10. DH's interactive widgets, which update in real time and render in Jupyter Notebooks or your own web assets, create sharing flows that rock.

k0mmand0c0z1 karma

Any thoughts on how quantum computing is going to change the industry? Particularly with the cyber security space

DeephavenDataLabs2 karma

Quantum computing is making slow and steady progress. I expect that quantum computing may lead to interesting improvements in things like deep learning. ... but I am very concerned about the security implications. As a planet, we are far behind where we should be in protecting our digital security from quantum computing. Right now, there may already be quantum computers breaking existing security protocols. There are already many articles about encrypted data being stored so that it can be decrypted when the technology is available.

Ok-Category92491 karma

Can you really go to a trade school and get certified to immediately get a job in the data science industry?

DeephavenDataLabs2 karma

People asked us similar versions of this question. It doesn't matter so much what your background is, as much as your perseverance, desire to self-teach, problem-solving skills, etc. Employers want to see driven candidates who are willing to keep learning and are unafraid to ask questions. That's all general advice, but we talked a lot about this in the live feed on YouTube.

ab6241 karma

how is Deephaven different from pyspark ?

FamousTheme62482 karma

Not too sure on this one. When one of us used pyspark, it wasn't very good at the real-time data aspect, whereas Deephaven was built with that in mind. We need to look up pyspark for more information on this. Thanks for the new tool to me!

FamousTheme62482 karma

One of our devs adds: Spark is a stream-processing engine, whereas we have an updating table model; different paradigm, more powerful, describes arbitrary tabular changes.
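
To make the "updating table model" idea a bit more concrete, here is a toy sketch in plain Python. It is not Deephaven's API: `Update`, `UpdatingTable`, `FilteredView`, and `apply_update` are all hypothetical names made up for illustration. The point is only that a derived table can stay in sync with its source by consuming added/modified/removed deltas, rather than treating everything as an append-only stream.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Update:
    """One batch of changes, keyed by row key (hypothetical shape)."""
    added: Dict[int, dict] = field(default_factory=dict)
    modified: Dict[int, dict] = field(default_factory=dict)
    removed: List[int] = field(default_factory=list)


class UpdatingTable:
    """Toy source table that applies deltas and forwards them to listeners."""

    def __init__(self) -> None:
        self.rows: Dict[int, dict] = {}
        self.listeners: List[Callable[[Update], None]] = []

    def apply_update(self, update: Update) -> None:
        for key in update.removed:
            self.rows.pop(key, None)
        self.rows.update(update.added)
        self.rows.update(update.modified)
        for listener in self.listeners:
            listener(update)


class FilteredView:
    """Derived table: rows matching a predicate, maintained incrementally."""

    def __init__(self, source: UpdatingTable, predicate: Callable[[dict], bool]) -> None:
        self.predicate = predicate
        self.rows = {k: r for k, r in source.rows.items() if predicate(r)}
        source.listeners.append(self.on_update)

    def on_update(self, update: Update) -> None:
        # Only the changed rows are touched; nothing is recomputed from scratch.
        for key in update.removed:
            self.rows.pop(key, None)
        for key, row in {**update.added, **update.modified}.items():
            if self.predicate(row):
                self.rows[key] = row
            else:
                self.rows.pop(key, None)


trades = UpdatingTable()
big = FilteredView(trades, lambda r: r["size"] >= 100)

trades.apply_update(Update(added={1: {"sym": "A", "size": 150}, 2: {"sym": "B", "size": 10}}))
trades.apply_update(Update(modified={2: {"sym": "B", "size": 500}}))  # row 2 now qualifies
trades.apply_update(Update(removed=[1]))                              # row 1 disappears
```

After those three updates, `big.rows` holds only row 2, because the view saw a modify and a remove, which an append-only stream cannot express directly.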

ab6241 karma

can we compare it with ksqldb ?

DeephavenDataLabs2 karma

Compared to ksqlDB, we’re still getting better at schema ingest from Kafka, but we intend for Deephaven to be the go-to choice for table-oriented analysis of Kafka data in real time. Our capabilities at streaming tabular joins are unique.

Our core engine is actually implemented in Java, with efficient columnar data structures, designed for high throughput and low latency. https://deephaven.io/core/docs/conceptual/technical-building-blocks goes into more detail.

  1. Our table API is far more accessible for novice developers, and makes it easy to integrate with application code in Java, Groovy, or Python.
  2. We make real-time, incrementally-updating results a priority in our architecture at every level.
  3. We have terrific real-time visualization and query sharing experiences out of the box.
  4. Our community version has a very permissive, source-available license.
  5. Our enterprise version is licensed based on user count, rather than core count; we enable massive productivity for data engineers and data scientists.

ab6241 karma

say i have my streaming data coming into azure datalake .. how can i provision and leverage Deephaven ?

is there any storage functionality to it ?

DeephavenDataLabs1 karma

Our community project doesn't have persistent storage outside of being able to read and write Parquet files. However, our enterprise project has persistent storage capabilities.

ab6241 karma

say i have my streaming data coming into azure datalake .. how can i provision and leverage Deephaven ?

DeephavenDataLabs1 karma

Right now, it depends on what's available in the API. We are familiar with the lakehouse concept, so in the future we could potentially rewrite Parquet data files in a way that makes Deephaven Core compatible with Azure Data Lake specifically. In Community, our users roll their own environments.

DeephavenDataLabs1 karma

You can contact us on Slack to discuss either a custom or general Azure Data Lake integration: https://join.slack.com/t/deephavencommunity/shared_invite/zt-11x3hiufp-DmOMWDAvXv_pNDUlVkagLQ

DeephavenDataLabs1 karma

Spark is a stream-processing engine, whereas we have an updating table model; this paradigm is more powerful.

DeephavenDataLabs1 karma

- Once a context has been started, no new streaming computations can be set up or added to it.
- Once a context has been stopped, it cannot be restarted.
- Only one StreamingContext can be active in a JVM at the same time.
- stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
- A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.

from https://spark.apache.org/docs/latest/streaming-programming-guide.html
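
If it helps to see those rules as code, here is a toy model of the lifecycle in plain Python. It is not Spark and does not call any Spark API: `ToySparkContext`, `ToyStreamingContext`, and `add_computation` are hypothetical names, and `stop_spark_context` just mirrors the `stopSparkContext` parameter mentioned in the quoted list.

```python
class ToySparkContext:
    """Stand-in for a SparkContext; only tracks whether it was stopped."""

    def __init__(self):
        self.stopped = False


class ToyStreamingContext:
    # At most one active context per process (standing in for "per JVM").
    _active = None

    def __init__(self, spark_context):
        if spark_context.stopped:
            raise RuntimeError("SparkContext has already been stopped")
        self.spark_context = spark_context
        self.state = "initialized"
        self.computations = []

    def add_computation(self, name):
        # Rule: nothing can be added once the context has been started.
        if self.state != "initialized":
            raise RuntimeError("cannot add computations after start()")
        self.computations.append(name)

    def start(self):
        # Rule: a stopped context cannot be restarted.
        if self.state == "stopped":
            raise RuntimeError("a stopped context cannot be restarted")
        if self.state == "active":
            return
        # Rule: only one context can be active at a time.
        if ToyStreamingContext._active is not None:
            raise RuntimeError("another streaming context is already active")
        self.state = "active"
        ToyStreamingContext._active = self

    def stop(self, stop_spark_context=True):
        # Rule: stop() also stops the SparkContext unless told otherwise.
        if ToyStreamingContext._active is self:
            ToyStreamingContext._active = None
        self.state = "stopped"
        if stop_spark_context:
            self.spark_context.stopped = True


sc = ToySparkContext()
first = ToyStreamingContext(sc)
first.add_computation("word_count")
first.start()

try:
    first.add_computation("added_too_late")
    late_allowed = True
except RuntimeError:
    late_allowed = False

first.stop(stop_spark_context=False)  # keep the SparkContext for reuse
second = ToyStreamingContext(sc)      # allowed: the previous context is stopped
second.start()
```

Changing a running query in this model means stopping the context and building a new one.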

prpldrank1 karma

How do y'all think management of unstructured and heavy data, like video, will change in the next few years?

DeephavenDataLabs2 karma

Over the coming years, video will be considered a data stream. From the raw video, we will be creating all sorts of derived data that will also need management. Think about things like object locations, tagging of people in different parts of the video, etc.

JSA24221 karma

I've been considering joining a coding bootcamp that has a data science route. Do you have any opinion on this method of learning?

DeephavenDataLabs3 karma

We haven't done a data science bootcamp ourselves, but we know others who have done Python or other language bootcamps, spoke favorably of them, and felt they did well in interviews as a result. They are structured to teach even people without a lot of prior knowledge. If you do one, ask questions and learn about incentives: some bootcamps get paid when their students are placed, which could improve outcomes. Understand what you're getting into.

Amanda's husband actually runs a coding bootcamp, and it can build community. But some just want to take your money, so do your research.