Inside the Big Data Team

David He, Director of Software Engineering, Cortex Data Lake


As the Director of Engineering in the Cortex Data Lake organization at Palo Alto Networks, I am working in some of the most interesting and challenging areas of technology today. Our team of more than 20 people covers a variety of roles, including Development Engineers, Site Reliability Engineers, Software Quality Testing Engineers, Product Managers, Program Managers, and more. We’re all focused on ingesting a massive amount of data into the data lake: roughly 3 million data requests per second at peak, which is on the high end of the industry. We provide real-time streaming data and queries to our customers and internal apps for analytics, machine learning, and rapid, effective security responses. It’s our job to ensure that the data is delivered on time with low latency and that its quality remains high.

This is my first role in cybersecurity. I’ve worked in Big Data for eBay and LinkedIn, which are both consumer-based systems with high volume, but Palo Alto Networks is an enterprise-based company, meaning that the volume of data is much higher, and the way we use the data and work with customers is different. Instead of being used to make marketing decisions, this data helps us determine where a company’s vulnerabilities are and what potential threats it faces. People typically don’t associate Big Data processing with security, but the two are in fact connected. We’re using artificial intelligence and machine learning to make the data more intelligent. These tools are often used to analyze human behavior, but we’re leveraging them for security purposes. For me, the combination of these things makes this a very interesting place to be, and I feel I’ve joined the right company.

For someone early in their career, this is a great opportunity to work with technology you’ve never used before. We use Google Cloud Platform (GCP) and its streaming engine, Dataflow, along with the industry-standard Apache Beam, to stream data to our customers and internal apps in real time. We use BigQuery in GCP to build indexes and handle complex SQL queries with fast response times. We use Java heavily, as well as Kafka for temporarily storing data, and a number of other cutting-edge technologies, all to ensure that we’re handling enormous amounts of data in the most secure way possible and providing the most value for our customers. This company is a pioneer in firewall technology and a growing player in cloud-based security, so this is a chance for someone to explore technology from a whole new perspective.

This is not work we perform in isolation. We communicate with each other frequently, within the team and with others outside our team, to solve problems and develop new products. The communication channels are always open at all levels in the organization, all the way up to the Senior Vice President, who is very involved and supportive of what we do. We also work directly with our enterprise customers, upwards of 5,000 of them, to ensure we’re addressing their issues properly, and we give presentations to customers and to other departments within the company. Anyone who works within the Cortex Data Lake team should be comfortable communicating with others on a regular basis.

My colleagues and I feel motivated when we walk into work every day. We see new things happening and new challenges to meet, and we truly want to improve and grow. We’re excited by the importance of the work we’re doing, and anyone who’s looking to join this team should be ready to embrace that. Whether you’re coming in with only basic training in Java and looking for an entry-level position, or you’re a senior-level architect with strong skills in streaming and Big Data who is prepared to mentor junior engineers, this growing team needs motivated people at all levels who are ready for the challenges of Big Data in a security environment.