3 hours ago by michelpp

This is an excellent example of what I call the Copy-Object-Copy effect. It's particularly apparent in frameworks with ORMs like Django. In pandas' case, devs will do 'SELECT *' and use pandas as a sort of pseudo-ORM.

You run a query to get your data as a bunch of objects, but you're copying the data over the DB connection from the Postgres wire protocol into Python objects in memory, which are typically garbage collected at the end of the transaction. Then you copy the result into a JSON buffer that is also garbage collected, and finally send the result to the browser, which has been waiting this whole time.

I regularly see Django apps require several gigs of RAM per worker and when you look at the queries, it's just a bunch of poorly rendered SELECTs. It's grotesque.

Contrast PostgREST: 100 megabytes of RAM per worker. I've seen Django DRF workers require 50x the memory of PostgREST for the same REST interface and result, with PostgREST being much faster. Since the database is generating the JSON, it can do so immediately upon producing the first result row, which is then streamed to the browser. No double buffering.
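
A rough sketch of the "database builds the JSON" half of that idea (the streaming part isn't shown; the connection string is hypothetical and emp is the article's example table):

  import psycopg2

  conn = psycopg2.connect("dbname=appdb")  # hypothetical connection string
  with conn.cursor() as cur:
      # Ask Postgres to build the JSON payload itself (json_agg/row_to_json),
      # so the app layer relays one already-serialized value instead of
      # materializing ORM objects and re-serializing them.
      cur.execute("""
          SELECT coalesce(json_agg(row_to_json(t)), '[]')
          FROM (SELECT name, role, department
                FROM emp
                WHERE department = %s) AS t
      """, ("Sales",))
      payload = cur.fetchone()[0]  # hand this straight to the HTTP response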

Unlike with Django, it's not that I don't like pandas; it's that people way overuse it for SQL tasks, as this article points out. I've seen pandas scripts that could have been a simpler psql script. If psql can't do what you want, resist the temptation to 'SELECT *' into a data frame; instead, break the problem up into stages where you get the database to do the maximum work before anything reaches the data frame.
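
For example, a rough sketch of the staged approach using the article's emp table (the connection string is made up):

  import pandas as pd
  from sqlalchemy import create_engine

  engine = create_engine("postgresql:///appdb")  # hypothetical DSN

  # Stage 1: the database filters and aggregates.
  per_role = pd.read_sql(
      """
      SELECT department, role, count(*) AS headcount
      FROM emp
      WHERE department <> 'R&D'
      GROUP BY department, role
      """,
      engine,
  )

  # Stage 2: only the reduced result ever lands in the data frame.
  wide = per_role.pivot(index="role", columns="department", values="headcount")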

2 hours ago by WhompingWindows

Within the VA hospital system, the data admins often got onto their "soapboxes" to dress down the 1400+ data analysts for writing inefficient SQL queries. If you need 6 million patients, joining data from 15 tables and gathering 250 variables, a beginning SQL user can end up with a pull that takes 15-20 hours when it could run in 1-2 if they do some up-front filtering and sub-grouping within SQL. If you already know you'll throw out NAs on certain variables, if you need a certain date or age range, or even in the way you format these filters: these all lead to important savings. And this keeps R from running out of memory, which would often happen if you fed it way too much data.
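
(Purely as a sketch of the kind of up-front filtering meant here, with invented table and column names; the same idea applies from R via DBI.)

  import pandas as pd
  from sqlalchemy import create_engine

  engine = create_engine("postgresql:///warehouse")  # hypothetical DSN

  # Push the NA, date, and age filters into the database instead of pulling
  # every row and column and dropping most of them client-side.
  cohort = pd.read_sql(
      """
      SELECT patient_id, visit_date, age, a1c
      FROM labs
      WHERE a1c IS NOT NULL
        AND visit_date >= '2015-01-01'
        AND age BETWEEN 40 AND 70
      """,
      engine,
  )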

Within a system of 1400 analysts, it makes a big difference if everyone's taking 2X or 4X the time pulling that they could be. Then, even the efficiently written pulls get slowed down, so you have to run things overnight...and what if you had an error on that overnight pull? Suffice to say, it'd have been much simpler if people wrote solid SQL from the start.

43 minutes ago by protomyth

I don't disagree with writing solid SQL. I would go so far as to say some things (most) need to be in stored procedures that are reviewed by competent people. But, some folks don't think about usage sometimes.

This is one of those things I just don't get about folks setting up their databases. If you have a rather large dataset that keeps growing via daily transactions, then it's time to recognize you really have some basic distinct scenarios and to plan for them.

The most common is adding or querying data about a single entity. Most application developers really only deal with this scenario since that is what most applications care about and how you get your transactional data. Basic database knowledge gets most people to do this ok with proper primary and secondary keys.

Next up is a simple rule: "if you put a state on an entity, expect someone to need to know all the entities with this state." This is a killer for application developers for some reason. It actually requires some database knowledge to set up correctly to be performant. If the data analysts have problems with those queries, then it's time to get the DBA to fix the damn schema and write some stored procedures for the app team.
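
(A hedged, Postgres-flavored sketch of what that setup might look like; the orders table and its states are invented.)

  import psycopg2

  conn = psycopg2.connect("dbname=appdb")  # hypothetical connection string
  with conn.cursor() as cur:
      # A partial index keeps "give me every entity in state X" queries fast
      # without indexing the mass of rows already in a terminal state.
      cur.execute("""
          CREATE INDEX IF NOT EXISTS orders_active_state_idx
          ON orders (state)
          WHERE state IN ('pending', 'failed')
      """)
  conn.commit()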

At some point, you will need to do actual reporting, excuse me, business intelligence. You really should have some process that takes the transactional data and puts it into a form the data analysts' queries can run against. In the old days that would mean loading up Red Brick or some equivalent. Transactional systems make horrid reporting systems. Running those types of queries on the same database the transactional system is currently trying to work on is just a bad idea.

Of course, if you are buying something like IBM DB2 EEE spreading queries against a room of 40 POWER servers, then ignore the above. IBM will fix it for you.

an hour ago by solaxun

I understand your point generally but I don't understand this example. If you need data from 15 tables, you need to do those joins, regardless of prefiltering or subgrouping, right?

42 minutes ago by protomyth

Well, yes, but why do you need 15 tables for analysis queries, and why isn't someone rolling those tables into something a bit easier to work with via some backend process?

2 hours ago by tomnipotent

> it's that people way overuse it for SQL tasks as this article points out

I'm very confused by this. I've used pandas for over a decade, and in most cases it's a massive time saver. I can do a single query to bring data into local memory in a Jupyter notebook, and from there re-use that memory across hundreds or more executions of pandas functions to further refine an analysis or whatever task I'm up to.

Your "copy-object-copy" is not relevant in data analysis use cases, and exists in pretty much any system that pulls data from an external service be it SQL or not.

2 hours ago by qsort

Pandas is great when you're working on the data manually because it lets you interleave python code and "queries", but it's strictly worse than a plain SQL statement if you're writing, say, a REST service that needs to output some data.

SQL executed by the database is orders of magnitude more efficient, way more expressive, and doesn't require you to memorize pandas' absurd API.

And I say this as someone who is by no means a SQL wizard, and fully acknowledging all of SQL's blemishes.

2 hours ago by tomnipotent

> a REST service

This is a post about data analysis, and everyone wants to point out that pandas isn't good at real-time service requests.

> SQL executed by the database is orders of magnitude more efficient

Compared to pandas? No, it's not. Once I have all the data I need locally, it's MUCH faster to use pandas than to re-issue queries to a remote database.

an hour ago by tarsinge

The point is that if you only need to join and aggregate data it's easier and more efficient to do it directly in SQL. Also in production or when your database doesn't fit in memory the idea is to first use SQL to generate an optimized view of your data from the database before further analysis and transformation.
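
A minimal sketch of that two-step flow (all table and column names are invented):

  import pandas as pd
  from sqlalchemy import create_engine, text

  engine = create_engine("postgresql:///appdb")  # hypothetical DSN

  # Step 1: materialize the joined/aggregated shape once, inside the database.
  with engine.begin() as conn:
      conn.execute(text("""
          CREATE MATERIALIZED VIEW IF NOT EXISTS sales_by_region AS
          SELECT r.region,
                 date_trunc('month', s.sold_at) AS month,
                 sum(s.amount) AS revenue
          FROM sales s
          JOIN regions r ON r.id = s.region_id
          GROUP BY 1, 2
      """))

  # Step 2: only the reduced view ever reaches pandas.
  df = pd.read_sql("SELECT * FROM sales_by_region", engine)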

an hour ago by jstrong

> The point is that if you only need to join and aggregate data it's easier and more efficient to do it directly in SQL

citation needed? I've seen plenty of cases where it would have taken the db ages to do something that pandas does fast, and I don't consider pandas to be particularly fast.

2 hours ago by mistrial9

parent post says "devs do SELECT *" and ...

relational databases can have a lot more data in one table, or related tables, than one query needs. It is often very wasteful of RAM to get ALL and filter again

39 minutes ago by protomyth

If your query does " * " it gets all the columns in all the tables. Often, when all the columns you need are on the index, the optimizer never visits the actual table (I remember the term "covering index"). " * " basically screws this up. "SELECT *" should never be used in an actual production environment.
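
(A rough sketch of the difference using the article's emp table; INCLUDE assumes Postgres 11+, and the connection string is hypothetical.)

  import psycopg2

  conn = psycopg2.connect("dbname=appdb")  # hypothetical connection string
  with conn.cursor() as cur:
      # Covering index: the narrow query below can be answered from the index
      # alone (an index-only scan); SELECT * has to go back to the table.
      cur.execute("""
          CREATE INDEX IF NOT EXISTS emp_dept_cover_idx
          ON emp (department) INCLUDE (name, role)
      """)
      cur.execute("EXPLAIN SELECT name, role FROM emp WHERE department = 'Sales'")
      for (line,) in cur.fetchall():
          print(line)  # expect an Index Only Scan here
  conn.commit()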

2 hours ago by tomnipotent

Not for data analysis, it's a pointless constraint unless you're having issues.

Most data analysis isn't against datasets larger than memory, and I'd rather have more data than I need than spend time waiting to bring it all local again because I forgot a few columns that turn out to be useful later on.

2 hours ago by geraneum

Well, it's true that bringing row(s) of data from the database to Python and then serializing it for any reason has extra overhead.

But in Django's case, the same arguments can be made as for pandas. Django is a big, complex framework, and an application using it might consume more memory for countless other reasons. There are also best practices for using the Django ORM.

But to say that if a given Django instance consumes more memory it is only because of the "Copy-Object-Copy effect"... I don't think so.

2 hours ago by mixmastamyk

Interesting. I'm guessing the extra layer may be needed for some manipulations of data at the application layer, but perhaps that is almost always avoidable?

Have you ever tried Hasura? I'm thinking of giving it a go in a future project.

2 hours ago by musingsole

It's only avoidable to the degree that your software architecture and company architecture allow you to make the more efficient decision to move the logic upstream.

The cases where this pattern emerges are invariably because someone with insufficient access to the DB and its configuration bites the bullet and builds whatever they need in the environment they do have complete access to.

I've seen similar problems lead to dead cycles in processor design, where the impact is critical. These problems aren't software technology problems. They're people coordination problems.

40 minutes ago by arcticfox

As a Hasura user, I highly recommend it for OLTP workloads. And for the specific subject at hand (memory usage) it is fantastic. Its (in-database) serialization speed just completely runs circles around anything we could do before in Ruby.

2 hours ago by tylerhannan

If/as you give it a go in the future...do let us know what is working and what could be better. We do, in fact, care.

<disclaimer: I work at Hasura>

3 hours ago by m16ghost

>Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job.

SQL is very useful, but there are some data manipulations which are much easier to perform in pandas/dplyr/data.table than in SQL. For example, the article discusses how to perform a pivot table, which takes data in a "long" format, and makes it "wider".

In the article, the pandas version is:

>pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count')

Compared to the SQL version:

>SELECT role, SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D", SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales" FROM emp GROUP BY role;

Not only does the SQL code require you to know up front how many distinct columns you are creating, it requires you to write a line out for each new column. This is okay in simple cases, but is untenable when you are pivoting on a column with hundreds or more distinct values, such as dates or zip codes.

There are some SQL dialects which provide pivot functions like in pandas, but they are not universal.

There are other examples in the article where the SQL code is much longer and less flexible, such as binning, where the bins are hardcoded into the query.

33 minutes ago by kbelder

I've been doing a lot of data analysis in Pandas recently. I started off thinking that for efficiency's sake, I should do as much initial processing in the DB as possible, and use Pandas just for the higher level functions that were difficult to do in SQL.

But after some trial and error, I find it much faster to pull relatively large, unprocessed datasets and do everything in Pandas on the local client. Faster both in total analysis time, and faster in DB cycles.

It seems like a couple of simple queries like "select * from cars" and "select * from drivers where age < 30", with all the joining, filtering, and summarizing done on my machine, are often less burdensome on the DB than doing it up front in SQL.
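
Roughly this pattern (the join key and summary columns are invented; the connection string is made up):

  import pandas as pd
  from sqlalchemy import create_engine

  engine = create_engine("postgresql:///appdb")  # hypothetical DSN

  # Two cheap, unprocessed pulls...
  cars = pd.read_sql("select * from cars", engine)
  drivers = pd.read_sql("select * from drivers where age < 30", engine)

  # ...then all the joining, filtering, and summarizing happens client-side.
  summary = (
      cars.merge(drivers, on="driver_id", how="inner")  # assumed join key
          .groupby("make")["premium"]                   # assumed columns
          .mean()
  )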

Of course, this can change depending on the specific dataset, how big it is, how you're indexed, and all that jazz. Just wanted to mention how my initial intuition was misguided.

an hour ago by michelpp

Does `tablefunc.crosstab()` do what you want?

https://www.postgresql.org/docs/13/tablefunc.html
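
Something like this, as an untested sketch (assumes CREATE EXTENSION tablefunc and the article's emp table), though you do still have to declare the output columns, which is the limitation raised elsewhere in the thread:

  import pandas as pd
  from sqlalchemy import create_engine

  engine = create_engine("postgresql:///appdb")  # hypothetical DSN

  # crosstab takes the long-format query plus a category query, but the
  # result columns still have to be spelled out in the AS clause.
  pivot = pd.read_sql(
      """
      SELECT * FROM crosstab(
          $$ SELECT role, department, count(*) FROM emp GROUP BY 1, 2 ORDER BY 1, 2 $$,
          $$ SELECT DISTINCT department FROM emp ORDER BY 1 $$
      ) AS ct(role text, "R&D" bigint, "Sales" bigint)
      """,
      engine,
  )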

an hour ago by papercrane

That's one of the non-standard ways to do it. MSSQL and Oracle also have a pivot function to do this. Unfortunately there is no standard way to do this.

2 hours ago by ziml77

I've always been disappointed by the SQL pivot. It's hardly useful for me if I have to know up-front all of the columns it's going to pivot out into. The solution would be to use another SQL query to generate a dynamic SQL query, but at that point I would rather just use Pandas

an hour ago by flakiness

The more I learn SQL, the less I write Python.

Although SQL syntax is weird and dated, its portability across tools trumps everything else. You finished the EDA and decided to port the insights to a dashboard? With SQL it's trivial. With Python... well, you'll probably have to port it to SQL unless you have Netflix-like, Jupyter-backed dashboard infrastructure in place. For many of us who only have the much-more-prevalent SQL-based dashboard platforms, why not start from SQL? Copy-and-paste is your friend!

I still hate SQL as a programmer, but as a non-expert data analyst I have now accepted it.

38 minutes ago by arcticfox

After 15 years programming I have come to love SQL - 10 yrs ago I used to do everything in the book to avoid it out of hate, even though I knew how to use it, but over time I have moved more and more into the database / SQL and my code is just better and better as a result.

3 hours ago by beforeolives

Nice article. Pandas gets the job done but it's such a step backwards in terms of usability, API consistency and code feel. You can do anything you could possibly need with it, but you regularly have to look up things that you've looked up before because the different parts of the library are patched together and don't work consistently in an intuitive way. And then you end up with long lines of().chained().['expressions'].like_this(0).

an hour ago by carabiner

Can you honestly say you'd prefer to debug thousands of lines of SQL versus the usual <100 lines of Python/pandas that do the same thing? It's no contest. A histogram in pandas: df.col.hist(). Unique values: df.col.value_counts(). Off the top of your head, what is the cleanest way of doing this in SQL, and how does it compare? How anyone can say that SQL is objectively better (in readability, density, any metric), other than the fact that they learned it first and are now frustrated that they have to learn another new tool, baffles me.

I learned Pandas first. I have no issue with indexing, different ways of referencing cells, modifying individual rows and columns, numerous ways of slicing and dicing. It gets a little sprawling but there's a method to the madness. I can come back to it months later and easily debug. With SQL, it's just madness and 10x more verbose.

an hour ago by beforeolives

> Can you honestly say you'd prefer to be debug thousands of lines of SQL versus the usual <100 lines of Python/pandas that does the same thing?

That's fair, I was using the opportunity to complain about pandas and didn't point out all the problems SQL has for some tasks. What I really want is a dataframe library for Python that's designed in a more sensible way than pandas.

3 hours ago by nerdponx

I much prefer writing actual functions in a real programming language to the verbose non-reusable non-composable relic that is SQL syntax.

Give me a better query language and I will gladly drop Pandas and Data.Table.

3 hours ago by semitones

SQL is a real programming language.

3 hours ago by beforeolives

Yes, SQL does have the problem of bad composability and some things being less explicit than they are when working with dataframes. Dplyr is probably the most sane API out of all of them.

2 hours ago by alexilliamson

+1 for dplyr. I've used both pandas and dplyr daily for a couple of years at different times in my career, and there is no comparison in my mind when it comes to usability/verbosity/number of times I need to look at the documentation.

2 hours ago by nvilcins

You might be thinking of specific functionality that you find is implemented in an overly long/verbose fashion. But generally speaking, how are

> long lines of().chained().['expressions'].like_this(0)

a _bad thing_?

IMHO these pandas chains are easy to read and communicate quite clearly what's being done. If anything, I've found that in my day-to-day reading of pandas I parse the meaning of those chains at least as efficiently as I do comments of any level of specificity, or the equivalent in other languages I have experience with.

2 hours ago by musingsole

People don't like them because the information density of pandas chains is soooo much higher than the rest of the surrounding code. So, they're reading along at a happy place, consuming a few concepts per statement

...and then BOOM, pandas chain! One statement containing 29+ concepts and their implications.

Followed by more low density code. The rollercoaster leads to complaints because it feels harder. Not because of any actual change in difficulty.

/that's my current working theory, anyway

2 hours ago by disgruntledphd2

Hmmm, interesting. I don't mind the information density, coming from R which is even more terse, but the API itself is just not that well thought out (which is fair enough, he was learning as he went).

2 hours ago by isoprophlex

This really grinds my gears too. There's something about the pandas API that makes it impossible for me to do basic ops without tedious manual browsing to get inplace or index arguments right... assignments and conditionals are needlessly verbose too.

Pyspark on the other hand just sticks in my brain, somehow. Chained pyspark method calls looks much neater.

2 hours ago by ziml77

Setting inplace=True isn't too bad, but I definitely have had many issues with working with indexes in Pandas. I don't understand why they didn't design it so that the index can be referenced like any other column. It overcomplicates things, like having to know the subtle difference between join() and merge().

2 hours ago by disgruntledphd2

Pandas is just a bad API, to be honest. I know base-R very, very well (which was one of the inspirations, I believe) and I still spend most of my time looking stuff up.

It's such a shame that python doesn't have a better DF library.

2 hours ago by shankr

I also switched from R to Python/pandas. I remember always being frustrated with pandas since it tries to emulate data.frame, but then just does its own thing without being consistent.

an hour ago by EForEndeavour

I've used pandas regularly for the past ~5 years and find its API intuitive enough not to complain. I can write and read decently long pandas chained expressions fluently, but I barely know any R. Am I unwittingly a hostage of an inferior API and don't know what I'm missing?

3 hours ago by nerdponx

SQL syntax sucks for doing non-trivial data analysis. I've tried it. Verbose, no composability or code reuse, not really portable across different databases, no easy interoperability with other tools, limited editor/IDE support, etc.

I guess if you have huge amounts of data (10m+ rows) already loaded into a database then sure, do your basic summary stats in SQL.

For everything else, I'll continue using SQL to get the data from the database and use Pandas or Data.Table to actually analyze it.

That said, this is a very comprehensive review of SQL techniques which I think could be very useful for when you do have bigger datasets and/or just need to get data out of a database efficiently. Great writeup and IMO required reading for anyone looking to be a serious "independent" data scientist (i.e. not relying on data engineers to do basic ETL for you).

I'd be a huge fan of something like the PySpark DataFrame API that "compiles" to SQL* (but that doesn't require you to actually be using PySpark, which is its own can of worms). I think this would be a lot nicer than any traditional ORM style of API, at least for data analysis, while providing better composability and IDE support than writing raw SQL.

*I also want this for Numexpr: https://numexpr.readthedocs.io/en/latest/user_guide.html, but I also want a lot of other things, like Arrow-backed data frames and a Numexpr-like C library to interact with them.

3 hours ago by ramraj07

Completely disagree. I have a choice between Snowflake and Spark/pandas for my EDA and I'll choose SQL every time. The code is significantly more readable once you get used to it, and you can most definitely do things one step at a time using UDFs and temp tables / CTEs. I've come back to EDA I did a year back, and it's always easier to read a long SQL script than a notebook with pandas code.

2 hours ago by disgruntledphd2

There's pluses and minuses to both. That being said, how do I write a function in SQL which can abstract over something I do a lot (like the pivoting example earlier), or even some date function like iff(date>'important_date', 'before', 'after') as grouper.

Honestly, that's what ends up making me move away from doing analytics in SQL.

29 minutes ago by saltcured

Edit: oops, I think I replied a level deeper than intended. I was responding to the composition/abstraction topic. I think I should just leave this here now though?

I assume you are talking about composition of generic set-processing routines, but I wonder if others realize that? It is easy enough to write a set-returning function and wrap it in arbitrary SQL queries to consume its output. But, it is not easy to write a set-consuming function that can be invoked on an arbitrary SQL query to define its input. Thus, you cannot easily build a library of set-manipulation functions and then compose them into different pipelines.

I think different RDBMS dialects have different approaches here, but none feel like natural use of SQL. You might do something terrible with cursors. Or you might start passing around SQL string arguments to EXECUTE within the generic function, much like an eval() step in other interpreted languages. Other workarounds are to do everything as macro-processing (write your compositions in a different programming language and "compile" to SQL you pass to the query engine) or to abuse arrays or other variable-sized types (abuse some bloated "scalar" value in SQL as a quasi-set).

What's missing is some nice, first-class query (closure) and type system. It would be nice to be able to write a CTE and use it as a named input to a function and to have a sub-query syntax to pass an anonymous input to a function. Instead, all we can do is expand the library of scalar and aggregate functions but constantly repeat ourselves with the boilerplate SQL query structures that orchestrate these row-level operations.
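
To make the asymmetry concrete, a minimal Postgres-style sketch of the easy direction (the function name is invented, reusing the article's emp table):

  import psycopg2

  conn = psycopg2.connect("dbname=appdb")  # hypothetical connection string
  with conn.cursor() as cur:
      # The easy direction: a set-returning function that any query can wrap.
      cur.execute("""
          CREATE OR REPLACE FUNCTION dept_headcount()
          RETURNS TABLE (department text, headcount bigint)
          LANGUAGE sql AS $$
              SELECT e.department, count(*) FROM emp e GROUP BY e.department
          $$
      """)
      # Consuming its output composes naturally in plain SQL...
      cur.execute("SELECT * FROM dept_headcount() WHERE headcount > 10")
      # ...but there is no comparably clean way to hand an arbitrary query
      # *into* a generic set-consuming function without EXECUTE-style dynamic SQL.
  conn.commit()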

3 hours ago by dreyfan

> SQL syntax sucks for doing non-trivial data analysis. I've tried it. Verbose, no composability or code reuse, not really portable across different databases, no easy interoperability with other tools, limited editor/IDE support, etc.

You're entitled to your opinion and tooling choices of course, but the problem is you don't know SQL.

an hour ago by default-kramer

"No composability or code reuse" is definitely a valid criticism of SQL. Check out the very first example of my personal project https://docs.racket-lang.org/plisqin/Read_Me_First.html and let me know how you would implement something akin to `(CategoryName p)` and `(TotalSales p)` in SQL. A view does not count, because then you either have one view per derived field and that is very cumbersome to consume, or you have one view with all the derived fields and you will end up paying performance costs for fields that you are not using.

3 hours ago by dunefox

That could be said about any tool, framework, language, library, etc. Invest enough time and you can do everything with malbolge but that doesn't mean it doesn't suck.

2 hours ago by mixmastamyk

A more useful comment would be to illustrate how. I'd like to know as well as I need to reuse queries in SQL occasionally but am not an expert. Currently I believe learning pl/pgsql is the answer, but even it reeks of punch-cards and other sixtiesisms :-). Tough sell when you're used to Python etc.

3 hours ago by pandasusr

I am a pandas user considering refactoring part of my ETL pipeline to SQL. I see the trade off as memory efficiency vs expressiveness, and for simple queries on big data, SQL wins. Would you disagree that Pandas/Python is more expressive than SQL? I’m less experienced in SQL but based on my limited experience there, it seems Pandas is clearly more expressive. What is the SQL equivalent of Pandas .apply(lambda x) ?

an hour ago by hodgesrm

ClickHouse has lambdas for arrays. They are very useful. Here's an example.

  WITH ['a', 'bc', 'def', 'g'] AS array
  SELECT arrayFilter(v -> (length(v) > 1), array) AS filtered
  
  β”Œβ”€filtered─────┐
  β”‚ ['bc','def'] β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
The lambda in this case is a selector for strings with more than one character. I would not argue that they are as general as Pandas, but they are mighty useful. More examples are in the following article.

https://altinity.com/blog/harnessing-the-power-of-clickhouse...

2 hours ago by higeorge13

Please define an example lambda function, in order to see whether there is an SQL equivalent. IMO >90% of the data issues I have seen can be solved with SQL queries. I have seen some in the finance sector which would require super complex UDFs, but other than those, SQL is the first choice for solving a data issue.

41 minutes ago by hodgesrm

> I guess if you have huge amounts of data (10m+ rows) already loaded into a database then sure, do your basic summary stats in SQL.

This is not huge. This is quite small. Pandas can't handle data that runs into billions of rows, nor can it deliver sub-second responses on arbitrary queries at that scale. Both are common requirements in many analytic applications.

I like Pandas. It's flexible and powerful if you have experience with it. But there's no question that SQL databases (especially data warehouses) handle large datasets and low latency response far better than anything in the Python ecosystem.

3 hours ago by rdedev

For Apache Arrow-backed dataframes, check out Polars (https://github.com/ritchie46/polars).

3 hours ago by simonw

This article is so useful. It starts out with SQL basics but then quickly leaps into all kinds of PostgreSQL tricks that I didn't know about - binning, efficient sampling, calculating, even linear regression.

3 hours ago by spamalot159

I think for a lot of people, SQL is a skill that doesn't stick. You learn enough to do the queries you need for your project, they work, and then you forget about them as you work on the rest of the project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore.

The days of dedicated SQL programmers are mostly gone.

7 minutes ago by slver

We have a lot more clueless developers on average, which makes it seem like no one uses SQL anymore beyond the basics. But we're still here, believe me. No one today is a dedicated X programmer, but that doesn't mean we're all clueless about X.

3 hours ago by simonw

I started learning SQL twenty years ago and it's one of the only skills from the start of my career that has been consistently useful ever since.

3 hours ago by nisegami

I held this view before graduating and moving back to my 3rd world home country. For some reason, people here (the established ones with 5+ years of experience) see SQL as a general purpose programming language with which they do as much as possible.

3 hours ago by zabzonk

> The days of dedicated SQL programers are mostly gone.

I've never met a "dedicated SQL programmer" - all the C++ programmers I've worked with in the investment banking world were also expected to know, and did know, SQL pretty well - asking questions about it was normally part of the interview process.

34 minutes ago by muxator

> This benchmark does not mention the memory consumed by the database itself - this is intentional. [...] Whether you decide to use the database or not, the memory is already paid for, so you might as well use it!

This sentence is a big red flag for me. An analysis of a strategy that pushes work towards a subsystem, and then purposely ignores the performance implications on that subsystem, is methodologically unsound.

Personally, I am all for using the DB and writing good SQL. But if I weren't, this argument would not convince me.
