Codeball's precision is 0.99. That simply means that 99% of the PRs Codeball predicted to be approvable were actually approved. In layman's terms: if Codeball says that a PR is approvable, you can be 99% sure that it is.
But recall is 48%, meaning that only 48% of the actually approved PRs were predicted to be approvable. So Codeball incorrectly flagged the other 52% of approvable PRs as un-approvable, just to be safe.
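To see how the two numbers fit together, here's a toy confusion matrix in Python (the counts are made up to match the reported rates; only the 0.99 and the 48% come from Codeball):

    tp = 99   # approvable PRs that Codeball flagged as approvable
    fp = 1    # non-approvable PRs that Codeball flagged anyway
    fn = 107  # approvable PRs that Codeball declined to flag

    precision = tp / (tp + fp)  # 99 / 100  = 0.99
    recall    = tp / (tp + fn)  # 99 / 206 ~= 0.48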
So Codeball is like a strict bartender who only serves you when they are absolutely sure you're old enough. You may well be of age, but Codeball's not serving you.
I want systems with low recall that "flag" things but ultra ultra high precision. Many times, we get exactly the opposite - which is far worse!
I’m assuming most PRs are approvable. If that’s the case, then this should cut down on time spent doing reviews by a lot.
PR reviews are a way of learning from each other, keeping up with how the codebase evolves, sharing progress and ideas, giving feedback and asking questions. For example, at $job we approve ~90% of PRs, with various levels of pleas, suggestions, nitpicks and questions. We approve because of trust (each PR contains a demo video of a working feature or fix) and not to block each other, but there might be important feedback or suggestions among the comments. A "rubber stamp bot" would be hard to train in such a review system and simply misses the point of what reviews are about.
What happens if there is a mistake (hidden y2k bomb, deployment issue, incident, regression, security bug, bad database migration, wrong config) in a PR that passes a human review? At a toxic company you get finger-pointing, but on a healthy team, people can learn a lot when something bad passes a review. But you can't discuss anything with a nondeterministic review bot. There's no responsibility there.
Another question is the review culture. If this app is trained on some repo (whether PRs were approved or not), past reviews reflect the review culture of the company. What happens when a black-box AI takes that over? Is it going to train itself on its own reviews? People and review culture can be changed, but a black-box AI is hard to change in a predictable way.
I'd rather set up code conventions, automated linters (i.e. deterministic checks), etc. than have a review bot allow code into production. Or just let go of PR reviews altogether; there were some articles shared on HN about that recently. :)
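To sketch what I mean by deterministic checks, here's a minimal CI gate script (the specific tools, black and flake8, are just examples; swap in whatever your stack uses):

    # deterministic_gate.py - run deterministic checks; nonzero exit blocks the merge in CI
    import subprocess
    import sys

    CHECKS = [
        ["black", "--check", "."],  # formatting must already be applied
        ["flake8", "."],            # lint rules must pass
    ]

    # Run every check (no short-circuiting), so the author sees all failures at once.
    failures = [cmd for cmd in CHECKS if subprocess.run(cmd).returncode != 0]
    sys.exit(1 if failures else 0)

Unlike a model, the same diff always gets the same verdict, and the rules are readable by everyone on the team.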
> why are we reviewing code in the first place?
It being part of engineering culture is spot on. I think of it as two things: (1) quality gate and (2) knowledge sharing. Because of (1), by default reviews can feel a bit like submitting homework - not all contributions carry the same risk level, but they follow the same process.
The idea behind Codeball is unassuming - identify and approve the bulk of easy contributions so that devs can focus their energy on reviewing the trickier ones. This can be especially nice in a high-trust environment, keeping up the momentum for devs to ship small & often.
Another thing is - models can incorporate a surprising number of indicators, for example not just the outcome of the PR but also what happens to the contribution after merging (was the code retained as-is, or was it hot-fixed a day later, etc.).
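For illustration, one such post-merge signal could look roughly like this (a sketch of the general idea, not our actual pipeline; merge_sha and files would come from the PR being scored):

    import subprocess

    def followup_commits(repo: str, merge_sha: str, files: list[str]) -> int:
        # Count commits after the merge that touched the same files.
        # Few or none suggests the contribution was retained as-is;
        # a burst right after merging suggests it needed hot-fixing.
        log = subprocess.run(
            ["git", "-C", repo, "log", "--oneline", f"{merge_sha}..HEAD", "--", *files],
            capture_output=True, text=True, check=True,
        )
        return len(log.stdout.splitlines())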
If anything, I think code review as a "nothing bad will happen" check gives a really false sense of security - unless you have a super-strict, bus-factor, crazy-smart, kind-of-asshole engineer on the team, who is probably going to piss everyone off with strict code reviews that are mostly about personal preference but sometimes actually do catch the edge cases.
But those large-scale changes are also usually systematic, so they wouldn’t have much to do with coding conventions or styles.
Sometimes we compare new things against their hypothetical ideal rather than the status quo. The latter is significantly more tractable.
On a one character code change? I’m inclined to think so.
In my experience, most "issues" in code review are not technical errors but business-logic errors, and most of the time there isn't even enough context in the code to know what the right answer is. It's in a PM's or salesperson's head.
Skipping code review deprives the team of an opportunity to learn about the incoming change, and deprives them of shared knowledge (better implementations, business context, etc.).
Codeball is the result of a hack week at Sturdy - we were thinking about ways to reduce the waiting-for-code-review, and were curious exactly how predictable the entire process is. It turned out: very predictable!
Happy to answer any questions.
A proper code review isn't simply catching API or style errors--it seeks to understand how the change affects the architecture and structure of the existing code. I'm sure AI can help with that, and for a broad class of changes it's likely somewhat to very predictable--but I'm skeptical that it's predictable for enough use cases to make it worth spending money on, for now.
Put another way: "approves code reviews a human would have approved" isn't exactly the standard I'd want automated reviews to aspire to. Human approval, in my experience, mostly doesn't indicate a good-quality review.
Good to know that it's now doable in a week, with such good precision! Or do you have humans in the backend? ;)
How do you compare yourself to PullRequest, who dug at this for 5 years as well and recently folded? [Fun fact: we were interviewed in the same YC batch, which always makes me wonder if YC liked the idea enough to have it implemented by another team ;)]
>How do you compare yourself to PullRequest
So it turns out that most code contributions nowadays get merged without fixes or feedback during the review (about 2/3). I think this is because of the increased focus on continuous delivery and shipping small & often. Codeball's purpose is to identify and approve those 'easy' PRs so that humans get to deal with the trickier ones. The cool part about it is being blocked less.
Without something that semantically understands the code under review (which all but requires general AI, or at the least a strong static analyzer), this does little more than add noise to the process, or worse, leads to certain groups of developers effectively being given a free pass.
Code reviews should be an interrupt for everything except downtime mitigation.
Reviewing your peers' code quickly will cause them to do the same for you. It is a virtuous circle.
Be the change you want to see.
For projects with trusted contributors only, PRs are usually approvable anyway; a one-bit black-box signal telling you that some of them are (with zero explanation, it seems?) isn't very valuable.
Not sure why you would use this.
    def review(diff):
        if len(diff.splitlines()) > 500:
            return "Looks good to me"
Given that it's a model, is there a feedback mechanism through which one could advise it (or you) of false positives?
I would be thrilled to see what it would have said about: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/76318 (q.v. https://news.ycombinator.com/item?id=30872415)
Codeball did not approve the PR! https://codeball.ai/prediction/8cc54ce2-9f50-4e5c-9a16-3bc48...
On the site you can give it GitHub repos where it will test the last 50 PRs and show you what it would have done (false negatives and false positives included).
You can also give it a link to an individual PR, but GitLab is not yet supported.
Really bringing out the big guns here!
With that said, an adversarial change from somebody within the team/organisation would be very difficult to detect.
Tone down the marketing page :) This page makes it sound like a non-serious person built the tool.
"Codeball approves Pull Requests that a human would approve. Reduce waiting for reviews, save time and money."
And make the download button: "Download"
On a related note, I'm working on https://denigma.app, an AI that tries to explain code, giving a second opinion on what the code appears to do. One company said they found it useful for code review. Maybe the clarity of an AI explanation is a decent metric of code quality.
For the projects I review, this bot would either:

- be over-confident, providing negative value, because the proportion of PRs which are simply “LGTM” is extraordinarily low, and my increasingly deep familiarity with the code and areas of risk makes me even more suspicious when something looks that safe
- never gain confidence in any PR, providing no value
I can’t think of a scenario where I’d use this for these projects. But I can certainly imagine it in the abstract, under circumstances where baseline safety of changes is much higher.
Any linter is more useful than this.
Deriving features about the code contributions is probably the most challenging aspect of the project so far.
Can't tell if it's something like formatting and code style or "bad code" or what. Even as a first line reviewer I can't tell if this is valuable or not without any details on why it would approve something.
The PRs it would approve here were all super minor. You could probably get a similar number of them approved just by checking lines of code changed + "has it been linted".
It's really hard to tell if this is valuable or not yet.
With that said, there are ways of exposing more details to developers. For example, scoring is done per-file, and Codeball can tell you which files it was not confident in.
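As a sketch of how that could surface (the threshold and the all-files-must-pass rule here are my simplifications, not exactly how Codeball aggregates):

    def pr_verdict(file_scores: dict[str, float], threshold: float = 0.95):
        # Approve only if every file clears the confidence bar;
        # otherwise report which files held the approval back.
        low_confidence = [f for f, s in file_scores.items() if s < threshold]
        return (not low_confidence), low_confidence

    approve, flagged = pr_verdict({"api/routes.py": 0.99, "db/migration.sql": 0.61})
    # approve == False, flagged == ["db/migration.sql"]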
I can feel it... it wants to be free!