TL;DR: Given 4 points on a two-dimensional plane, representing a rectangle seen from an unknown perspective, can we deduce the width/height ratio of the rectangle?
From a picture and some OpenCV work (Canny, Hough lines, bucketing to tell apart “lines” and “columns”, choosing interesting lines, math to deduce line intersections), I get this:
From this step, it’s easy to warp it to a “from the top” view, using OpenCV’s
warpPerspective to “remove” the perspective, as if looking at the rectangle from above.
My goal now is to keep its aspect ratio, which I lose during the warping, because I don’t know what ratio it should have.
For this I have to give
getPerspectiveTransform the 4 destination points where I want my 4 detected red points to end up after warping, not just 4 arbitrary points like
(0, 0), (0, 100), (100, 100), (100, 0), which leads to a deformation if my 4 red points do not form a square.
So is there a known way to compute the width/height ratio, or even better the size, of this rectangle seen through a perspective?
For the record and the curious, work-in-progress is here: https://github.com/JulienPalard/grid-finder
Dropbox has an extensive article on their tech blog describing how they solved this problem for their document scanner app.
Rectifying a Document
We assume that the input document is rectangular in the physical world, but if it is not exactly facing the camera, the resulting corners in the image will be a general convex quadrilateral. So to satisfy our first goal, we must undo the geometric transform applied by the capture process. This transformation depends on the viewpoint of the camera relative to the document (these are the so-called extrinsic parameters), in addition to things like the focal length of the camera (the intrinsic parameters). Here’s a diagram of the capture scenario:
In order to undo the geometric transform, we must first determine the said parameters. If we assume a nicely symmetric camera (no astigmatism, no skew, et cetera), the unknowns in this model are:
- the 3D location of the camera relative to the document (3 degrees of freedom),
- the 3D orientation of the camera relative to the document (3 degrees of freedom),
- the dimensions of the document (2 degrees of freedom), and
- the focal length of the camera (1 degree of freedom).
On the flip side, the x- and y-coordinates of the four detected document corners give us effectively eight constraints. While there are seemingly more unknowns (9) than constraints (8), the unknowns are not entirely free variables—one could imagine scaling the document physically and placing it further from the camera, to obtain an identical photo. This relation places an additional constraint, so we have a fully constrained system to be solved. (The actual system of equations we solve involves a few other considerations; the relevant Wikipedia article gives a good summary: https://en.wikipedia.org/wiki/Camera_resectioning)
Once the parameters have been recovered, we can undo the geometric transform applied by the capture process to obtain a nice rectangular image. However, this is potentially a time-consuming process: one would look up, for each output pixel, the value of the corresponding input pixel in the source image. Of course, GPUs are specifically designed for tasks like this: rendering a texture in a virtual space. There exists a view transform—which happens to be the inverse of the camera transform we just solved for!—with which one can render the full input image and obtain the rectified document. (An easy way to see this is to note that once you have the full input image on the screen of your phone, you can tilt and translate the phone such that the projection of the document region on the screen appears rectilinear to you.)
Lastly, recall that there was an ambiguity with respect to scale: we can’t tell whether the document was a letter-sized paper (8.5” x 11”) or a poster board (17” x 22”), for instance. What should the dimensions of the output image be? To resolve this ambiguity, we count the number of pixels within the quadrilateral in the input image, and set the output resolution so as to match this pixel count. The idea is that we don’t want to upsample or downsample the image too much.
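A sketch of that last step (my reading of it, not Dropbox’s actual code): compute the quadrilateral’s pixel area with the shoelace formula, then pick output dimensions that match both that pixel count and a previously estimated aspect ratio:

```python
import math

def quad_area(pts):
    # shoelace formula for the area of a polygon,
    # vertices given in order (either winding)
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2

def output_size(corners, aspect):
    # choose width x height so that width / height == aspect and
    # width * height == pixel count inside the input quadrilateral
    area = quad_area(corners)
    height = math.sqrt(area / aspect)
    return round(aspect * height), round(height)
```

For example, for an axis-aligned 300x200 quadrilateral and an estimated ratio of 1.5, this returns 300x200 back unchanged.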
Yes, here’s a pen-and-paper method:
Find the points $P,Q$ where the “parallel” sides intersect. The line through $P,Q$ is the “horizon” of the plane containing the rectangle. Find $R$ such that $\angle QRP=90^\circ$ and $RP=RQ$. Then the parallel to $PQ$ through $R$ intersects your pairs of “parallels” $AB,CD$ resp. $BC,AD$ in points whose distances are proportional to the rectangle’s side lengths.
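A numeric sketch of this construction (my own translation to code, not part of the original answer; homogeneous coordinates are used for the line intersections, and the sample quadrilateral in the usage note is arbitrary):

```python
import numpy as np

def line_through(p1, p2):
    # homogeneous line through two 2D points
    return np.cross([p1[0], p1[1], 1.0], [p2[0], p2[1], 1.0])

def intersect(l1, l2):
    # intersection of two homogeneous lines, back to 2D
    p = np.cross(l1, l2)
    return p[:2] / p[2]

def construct_R(P, Q):
    # apex of the right isosceles triangle over segment PQ:
    # angle PRQ = 90 degrees and RP = RQ, i.e. the midpoint of PQ
    # offset by |PQ| / 2 along the perpendicular
    d = Q - P
    return (P + Q) / 2 + np.array([-d[1], d[0]]) / 2

def aspect_ratio(A, B, C, D):
    # vanishing points of the two pairs of "parallel" sides
    P = intersect(line_through(A, B), line_through(C, D))
    Q = intersect(line_through(B, C), line_through(A, D))
    R = construct_R(P, Q)
    # measuring line: the parallel to the horizon PQ through R
    measuring = line_through(R, R + (Q - P))
    x1 = intersect(measuring, line_through(A, B))
    x2 = intersect(measuring, line_through(C, D))
    y1 = intersect(measuring, line_through(B, C))
    y2 = intersect(measuring, line_through(A, D))
    # the two distances are proportional to the side lengths
    return np.linalg.norm(x1 - x2) / np.linalg.norm(y1 - y2)
```

Usage: `aspect_ratio(np.array([0.0, 0.0]), np.array([4.0, 0.5]), np.array([3.8, 2.5]), np.array([0.3, 3.0]))` returns a ratio estimate for that quadrilateral. Note it degenerates when a pair of sides is exactly parallel in the image (vanishing point at infinity), so a robust version would need to handle that case separately.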