

Black Friday: How a risky experiment saved the day

Last Friday, Digitec Galaxus AG was in a state of emergency throughout Switzerland. The biggest concern was whether the servers could cope with the onslaught. Software engineer Enes Poyraz was on the front line. He talks about a day when the engineers put all their eggs in one basket.
17 seconds. That is how long it took digitec.ch to go offline for the first time last Friday, known internationally as Black Friday. The reason: too many user requests. Our company's servers collapsed after 17 seconds under the load from you and other customers, because too many people wanted to take advantage of the special offers on the international sales day.
"We actually thought that the servers would stay down for longer," says Junior Software Engineer Enes Poyraz.
The engineer recounts a day on which he and his team forced a company to become unproductive and saved your deal with an act of desperation.
Black Friday is 17 seconds old before our servers give up
Enes is present the second the servers go down for the first time. That is because a troop of engineers (all teams are named after James Bond films) is on stand-by every year for Black Friday: in the new offices on Förrlibuckstrasse, working from home, or somewhere out there on laptops with mobile internet. They are waiting for the servers to give in, ready to mitigate if they can.
But at midnight they reach their limits. The otherwise proud engineers have to take a hard knock. The men and women, for whom even a few seconds of downtime is too long, only manage to get the servers back online after a good two hours. Even then the shop is not stable: the site goes offline again and again, though only very briefly. Only a livestream from Digital Marketing is active.
"It's a bit of a hassle for customers, but it's expected behaviour on the engineering side," says Enes. He shrugs his shoulders. Sure, it's unpleasant and the engineers try to minimise these times. But when a nation hammers a website, it can happen.
The second wave
It gets quiet. After 2 a.m., the inhabitants of Switzerland are asleep, having snapped up their deals. Later in the day, I hear from the Zurich shop that one person has been assigned solely to handling smartphone plans, because last Friday there was a 50% discount on every smartphone bought together with a plan. The plans can also be purchased online.
At around 9 a.m., Enes is back at his desk and describes himself as "well-rested". The servers are holding. Barely.
"The load increased continuously over the course of the morning," says Enes, "and we realised that if it continued like this, we wouldn't be able to make it through the day."
But it's all in vain: at almost exactly 12 noon, the servers give up again. It is not as bad as the night before, as the site comes and goes every second, but it is still too unstable for a good shopping experience.
The engineers forget all about lunch and set about getting the site back online.
An experiment saves the day
While Oliver and a team of engineers from all engineering departments take internal services offline, Enes and two other engineers are busy hatching a plan. What to do if taking the internal services offline is not enough?
The three engineers are tasked with finding a solution in case all else fails.
"It doesn't get any more out-of-the-box than that," says Enes. He is a little proud to have been part of the team that saved the day. The usually quiet man suddenly speaks a little louder.
"I haven't tried Redis on a large scale yet," he says, "but I spent two days testing it out." That was never enough to entrust the largest online shop in Switzerland to the system. When Enes looks at his two-week-old notes, he can't help but grin.
I'm sure there are several areas of application here at
The engineers decide to take Redis live without testing on a demo system. A risky endeavour. Before a company puts software into a live system, it is put through its paces by internal departments and often also by external parties. After all, just because software sounds like exactly what a system needs according to the manufacturer's marketing material doesn't mean that it will deliver the promised added value. Sometimes using it only makes things worse.
"What do you think, will it be good?" Enes is asked.
The engineer nods.
Redis takes over
From then on, everything happens quickly. According to Enes, Redis is "damn quick and easy", so the server is up and running within 20 minutes. Whereas digitec's old cache system runs locally on each server, Redis is centralised on one server and also handles requests from the other servers. In other words, every server writes to the same Redis cache, which makes the solution more scalable and reduces the load on the reading servers.
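The difference between per-server caches and one central cache can be illustrated with a minimal sketch. This is not digitec's actual code: plain Python dictionaries stand in for both the local caches and the central Redis cache, and the names `Backend`, `WebServer` and `backend_calls` are hypothetical. The sketch counts how often the slow backend is hit in each setup.

```python
from typing import Dict


class Backend:
    """Stand-in for the slow system of record (e.g. a database); counts its own calls."""

    def __init__(self) -> None:
        self.calls = 0

    def load(self, key: str) -> str:
        self.calls += 1
        return f"value-for-{key}"


class WebServer:
    """A web server that answers requests through whichever cache dict it is given."""

    def __init__(self, cache: Dict[str, str], backend: Backend) -> None:
        self.cache = cache
        self.backend = backend

    def get(self, key: str) -> str:
        if key not in self.cache:  # cache miss: fetch from the backend and remember it
            self.cache[key] = self.backend.load(key)
        return self.cache[key]


def backend_calls(shared: bool, servers: int = 4, keys: int = 100) -> int:
    """Request every key once on every server and return the number of backend hits."""
    backend = Backend()
    shared_cache: Dict[str, str] = {}  # assumption: a dict plays the role of the central cache
    fleet = [
        WebServer(shared_cache if shared else {}, backend)  # own dict = local cache
        for _ in range(servers)
    ]
    for server in fleet:
        for i in range(keys):
            server.get(f"product:{i}")
    return backend.calls


local_calls = backend_calls(shared=False)
shared_calls = backend_calls(shared=True)
print(local_calls, shared_calls)
```

With four servers and 100 products, the per-server setup loads every product once per server, while the shared cache loads each product only once. A real central cache adds a network hop to every lookup, which this sketch ignores.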
"To ensure that we don't start a completely reckless #yolo campaign, we have also set up a SwitchBit," says Enes. A small team from IT Operations had to agree to constantly monitor the servers so that the shop is not completely shut down. This is because IT Ops is already busy monitoring server performance all day long. According to Enes, they are the ones who had his team's back for the experiment with Redis.
The problem comes with the go-live, which is as uncoordinated as possible. Enes wants to launch Redis on a managed Microsoft server so that the internal load is not too great.
"Unfortunately, Redis is running on port 6380, which is closed here," he says. He makes an emergency request to open the port, but this is not possible because this port is blocked by Microsoft's server. But Enes has learnt one thing: he always has a second plan up his sleeve.
"At the same time, I tried to get Redis running on Google's cloud," he says. But even that was more difficult than expected. With the help of senior software engineer Michal Nebes, he manages to get a Redis cluster up and running.
At 4 pm it's time to get serious. A short code review. According to Senior Software Engineer Boško Stupar, the code is sound. He then measures the round trip of the data, i.e. the time it takes for data to be sent to the server and returned to the user's computer.
- Normal: 50 to 500 milliseconds. Too slow.
- Redis, local test system: 8 to 9 milliseconds
- Redis, production: 16 to 19 milliseconds
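A round-trip measurement of this kind can be approximated with a small timing helper. This is a sketch under assumptions, not the engineers' tooling: `median_round_trip_ms` and `fake_lookup` are hypothetical names, and the lookup is simulated with a short sleep rather than a real request to a cache or shop server.

```python
import statistics
import time
from typing import Callable


def median_round_trip_ms(request: Callable[[], object], samples: int = 20) -> float:
    """Call `request` several times and return the median duration in milliseconds."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        request()  # send the request and wait for the answer
        durations.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(durations)


def fake_lookup() -> str:
    """Hypothetical stand-in for a cache lookup; sleeps about 2 ms instead of using a network."""
    time.sleep(0.002)
    return "cached-page"


if __name__ == "__main__":
    print(f"median round trip: {median_round_trip_ms(fake_lookup):.1f} ms")
```

The median is used here because a single slow sample would distort an average.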
Redis goes online. Evening sales are placed in the hands of an untested system with minimal security.
Anxious seconds.
Redis reads requests, builds up a cache.
The load on the servers decreases noticeably. The shop stabilises.
The engineers breathe a sigh of relief.
Boško Stupar later wrote on Facebook:
Black Friday survived. That was the greatest day of my career! So stressful, so hard, so rewarding. For nerds: we implemented two-stage caching with a centralised Redis farm. Open heart surgery without painkillers. PS: I hate Black Friday.
The engineers meet up after sunset for a beer on the fifth floor of Pfingstweidstrasse, one eye always on the shop. It is hardly a lively party, but they relax.
At 6.55 pm, Enes gives the all-clear via WhatsApp:
The servers are running normally, under normal load. Meanwhile, people are queuing in the shop, making smartphone plans and picking up orders. Store Manager Adrian Maier stands at the shop entrance in Zurich and provides information about waiting times. Twenty minutes, forty minutes. He is tired, just like the crew behind the tills.
The engineers are halfway through their day, but the shop employees are the last ones who have to hold out. Because at 8 pm, the day is also over at the tills. No more plans, no more deliveries, no more questions.
Quo vadis, Redis?
The Redis system will remain online over the weekend until Monday evening so that it can take part in Cyber Monday. The engineers are proud that the shop will be stable on Monday. But then Redis has done its job and the cluster is taken offline again.
"Because when it's all said and done, we're still talking about a largely untested system here," says Enes.
The engineers have a long list of questions that they need to answer before they can permanently connect Redis to the network in good conscience. Among them are many that sound something like this: "Why does Redis do $thing?"
Now that the situation has calmed down again, the engineers can turn their attention to these questions. Because if Black Friday has shown them anything, it's that Redis has potential. It would be a shame not to utilise it.


Journalist. Author. Hacker. A storyteller searching for boundaries, secrets and taboos – putting the world to paper. Not because I can but because I can’t not.