Highest Rated Comments


isj483 karma

We have a split between the backend and the frontend.

Backend:

  • the web crawler and search engine is open-source-search-engine (https://github.com/privacore/open-source-search-engine)
  • the backend machines are split into 20 dedicated to fulfilling search requests and 10 dedicated to crawling the web. The machines are not identical: we use SSDs in the query machines and spinning rust in the crawler machines. Each machine runs a varying number of engine instances depending on the resources available (CPU cores, memory, ...)
  • we have a dedicated news scanner that uses special logic to quickly discover new articles on major news sites.
  • we have a "Cap'n Crunch" machine that chews through data offline, calculating things such as page temperature, linkability, high-frequency terms, indicators for link farms, ... This is our "secret sauce".
  • The backend machines are located in Denmark.

Frontend:

  • The frontends consist of a cluster of machines running CoreOS with Kubernetes, React, Docker, Concourse, Logstash, ...
  • The frontend is currently located in France, but we can create more frontend clusters in other locations closer to the users as needed.

isj423 karma

Partially correct. When you send a query to us someone must know what your IP-address is for you to ever get the answer back. The question is where that information is disassociated from the query string. When the HTTP request hits our frontend the requesting IP-address is not logged. The user-agent string is not logged.
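A minimal sketch of that idea (hypothetical Python, not our actual frontend code; the field names are made up): the log entry is built from the query alone, so the IP address and user-agent never reach the log in the first place.

```python
import time

def log_request(query: str) -> dict:
    """Build a log entry for a search request.

    The requesting IP address and the User-Agent header are deliberately
    never passed into this function, so they cannot end up in the log.
    (Illustrative sketch only -- not the real logging code.)
    """
    return {
        "ts": int(time.time()),   # when the query arrived
        "query": query,           # the query string itself
    }

entry = log_request("privacy search engine")
assert "ip" not in entry and "user_agent" not in entry
```

The point is structural: data that is never handed to the logger cannot be correlated later.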

Inserting a proxy between your machine and our frontends would mean that we won't see your IP-address, but then you have to trust the proxy owner not to cooperate with us to correlate the two information sets. An alternative is to perform a privacy audit, but then you have to trust the auditor. Btw, we have been looking into official certifications (e.g. the EuroPriSe privacy seal) but they are crazy expensive. If a professional privacy auditor is willing to do it for free then please contact us - we will buy you lunch.

We chose a different way that isn't proxies, trust and turtles all the way down: Make a business model that does not entice us to track you. Thus, we are not an advertising agency; we are not big-data number crunchers; and we are certainly not an analytics company.

isj49 karma

We currently don't fix typos and misspellings. Yes, we are planning on implementing that.

What we want to do is: if the words you type have a suspiciously low frequency (or zero), suggest an alternate search with the typos and misspellings fixed. We don't want to be annoying, presume we know better, and immediately override your search with whatever would give more results.
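The suggest-but-don't-override behaviour could look roughly like this (a hypothetical Python sketch; the function names and the frequency threshold are made up for illustration):

```python
def suggest_correction(terms, term_freq, corrector, min_freq=5):
    """Return a corrected query to *suggest*, or None if nothing looks off.

    The original query is still executed as-is either way; the
    correction is only offered to the user.

    terms:     the query, split into words
    term_freq: mapping from term to its frequency in the index
    corrector: function mapping a rare term to its most likely fix
    (All names here are illustrative, not a real API.)
    """
    if all(term_freq.get(t, 0) >= min_freq for t in terms):
        return None  # every term is common enough; nothing looks misspelled
    # Fix only the suspiciously rare terms, keep the rest untouched.
    return [corrector(t) if term_freq.get(t, 0) < min_freq else t
            for t in terms]

freq = {"search": 1000, "engine": 800}
fixes = {"serach": "search"}
suggestion = suggest_correction(["serach", "engine"], freq,
                                lambda t: fixes.get(t, t))
# suggestion == ["search", "engine"]; the original query still runs unchanged
```

Returning `None` when all terms are frequent is what keeps the feature unobtrusive: the suggestion only appears when the index itself hints at a typo.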

isj46 karma

We love Canada too. A Canadian site is crawled if it is in the .ca TLD and is in English or French. We are using a whitelist of languages and a blacklist of TLDs.

Note: it seems that we currently don't crawl any of the Eskimo–Aleut languages. We'll have to look into that.
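The whitelist/blacklist decision is simple to express (a hypothetical Python sketch; the set contents are illustrative only, not our actual lists):

```python
# Illustrative values only -- not the real whitelist/blacklist.
LANGUAGE_WHITELIST = {"en", "fr", "da"}
TLD_BLACKLIST = {"test", "invalid"}

def should_crawl(tld: str, detected_lang: str) -> bool:
    """Whitelist of languages, blacklist of TLDs, as described above:
    a page is crawled only if its TLD is not blacklisted AND its
    detected language is whitelisted."""
    return tld not in TLD_BLACKLIST and detected_lang in LANGUAGE_WHITELIST

assert should_crawl("ca", "fr")        # Canadian French site: crawled
assert not should_crawl("ca", "iu")    # language not on the whitelist
```

Note how a language missing from the whitelist silently excludes a whole body of pages, which is exactly how an Eskimo–Aleut language could be overlooked.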

isj45 karma

I'll take your question in two meanings:

Bare metal servers versus cloud-like servers (e.g. OpenStack): Each instance is tied to its data shard, which is stored on local SSD. We can't just spin up an extra instance, because the disk space has to be allocated locally too. So cloud-like/virtualization doesn't give us any significant benefits.
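The instance-to-shard coupling can be sketched like this (hypothetical Python; the hashing scheme and shard count are made-up illustrations, not the actual engine's sharding):

```python
import hashlib

NUM_SHARDS = 20  # illustrative; echoes the 20 query machines mentioned earlier

def shard_for(doc_url: str) -> int:
    """Map a document to the shard whose local SSD holds its index data.

    Because the instance serving shard N must have shard N's data on
    its own disk, adding an instance means copying shard data onto new
    local storage first -- which is why elastically spinning up extra
    instances buys little here.
    """
    digest = hashlib.sha1(doc_url.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

The mapping is deterministic, so every query machine agrees on which shard owns a given document without any coordination.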

Own servers versus public cloud: A rule of thumb is that a public cloud is 20% more expensive than your own servers. And you may lose some capabilities, such as NUMA-aware process placement. It isn't black/white, because we do rent dedicated servers for the frontends (they scale in a different way).