Sr. Big Data Engineer Resume
Schaumburg, IL
PROFESSIONAL SUMMARY:
- A goal-oriented and self-motivated IT expert with more than 8 years of experience who can implement strategies in challenging situations. A quick learner with experience in Data Lake, Data Warehousing, Data Mart, Data Modeling, ETL data pipelines, and Data Visualization.
- Knowledge of the software development life cycle, scalable platform architecture, object-oriented programming, database design, and agile and waterfall approaches.
- Understanding of Hadoop architecture and the various Apache Hadoop ecosystem components, including HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm.
- Worked on a variety of Hadoop architectures and the underlying Hadoop storage management, and used Apache Flume to feed streaming data into Hadoop clusters for faster processing.
- Experience in using Hadoop ecosystem components like Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, Zookeeper, Flume, Kafka, MapReduce framework, Yarn, and Scala.
- Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and persist the streamed data to Hadoop file systems, and expertise in using Spark SQL with data sources like JSON, Parquet, and Hive (a minimal sketch follows this summary).
- Experience with Pig Latin Script and Hive Query Language development. Used Hive tables to store data in HDFS and HiveQL to process the data.
- Good knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
- Working knowledge of data migration, data profiling, data cleaning, transformation, integration, data import, and data export utilizing a variety of ETL technologies such as Informatica and SSIS.
- Understanding of Python, Django, CSS, HTML, JavaScript, and jQuery for web-based applications; used Django models and Cassandra to create all database mapping classes.
- Extensive experience with MongoDB, MySQL, and PostgreSQL databases, including Sub Queries, Stored Procedures, Triggers, Cursors, and Functions.
- Designed and implemented an ETL framework with Sqoop, Pig, and Hive to automate regularly bringing in data from the source and making it available for consumption.
- Ingested data into the Snowflake cloud data warehouse using Snowpipe. Extensive experience in working with micro-batching to ingest millions of files into Snowflake as they arrive in the staging area (a Snowpipe sketch follows this summary).
- Designed, configured, and deployed Microsoft Azure for a multitude of applications utilizing the Azure stack (including Compute, Web & Mobile, Blobs, Resource Groups, Azure SQL, Cloud Services, and ARM), focusing on high availability, fault tolerance, and auto-scaling.
- Experience in Microsoft Azure platform services like Virtual Machines (VM), App Service, Virtual Machine Scale Sets, Logic Apps, Service Fabric, container services, Batch, Cloud Services, Queue Storage, File Storage, Disk Storage, MSBuild, MSDeploy, etc.
- Expertise in Microsoft Azure Cloud Services (PaaS & IaaS), Application Insights, Document DB, Internet of Things (IoT), Azure Monitoring, Key Vault, Visual Studio Online (VSO), and SQL Azure.
- Knowledge in working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Using the Google cloud platform to configure and establish a virtual data center for Enterprise Data Warehouse hosting, including Virtual Private Cloud (VPC), Security Groups, Route Tables, Public, and Private Subnets, and Google Cloud Load Balancing.
- Experienced in Providing support on AWS Cloud infrastructure automation with multiple tools including Gradle, Chef, Nexus, Docker, and monitoring tools such as Splunk and CloudWatch.
- Working knowledge of Amazon EC2, S3, RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other Amazon services
- Extensive experience in Text Analytics, developing different Statistical Machine Learning solutions to various business problems, and generating data visualizations using Python and R.
- Proficient in Python scripting; worked with statistical functions in NumPy, visualization using Matplotlib, and Pandas for organizing data. Involved in loading structured and semi-structured data into Spark clusters using the Spark SQL and DataFrames application programming interfaces (APIs).
- Experience in constructing various Kafka producers and consumers in accordance with software requirements, and in capturing streaming data through Kafka.
- Skilled in using Kerberos, Azure AD, Sentry, and Ranger for maintaining authentication and authorization and experience in using Visualization tools like Tableau, and Power BI.
- Experience with Integrated Development Environments (IDEs) such as Eclipse, NetBeans, IntelliJ, PyCharm, Vi / Vim, Sublime Text, Visual Studio Code, and Jupyter Notebook.
- Experience implementing CI/CD (continuous integration and continuous deployment) and automation for ML model deployment using Jenkins, Git, Docker, and Kubernetes.
- Used Jenkins pipelines to drive all microservice builds out to the Docker registry and deploy them on Kubernetes; created and managed Pods, and built a private cloud using Kubernetes that supports DEV, TEST, and PROD environments.
- Working knowledge of issue-tracking software such as JIRA and Buganizer and version control systems such as Git and Bitbucket.
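The following is a minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS pattern summarized above. It is illustrative only: the broker address, topic name, schema, and output paths are hypothetical placeholders, and the job assumes the spark-sql-kafka package is on the classpath.

    # Read JSON events from Kafka, parse them, and persist the stream to HDFS as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
           .option("subscribe", "events")                        # placeholder topic
           .option("startingOffsets", "latest")
           .load())

    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/raw/events")              # placeholder output path
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())
    query.awaitTermination()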
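A hedged sketch of the Snowpipe auto-ingest setup mentioned above, issued through the Snowflake Python connector; the account, credentials, stage, table, and pipe names are hypothetical placeholders, not the actual project objects.

    # Create an auto-ingest Snowpipe that copies staged JSON files into a target
    # table as they land in the stage. All identifiers are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",        # placeholder
        user="etl_user",             # placeholder
        password="***",              # placeholder
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )

    create_pipe_sql = """
    CREATE PIPE IF NOT EXISTS raw.events_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO raw.events
      FROM @raw.events_stage
      FILE_FORMAT = (TYPE = 'JSON')
    """

    cur = conn.cursor()
    cur.execute(create_pipe_sql)   # Snowflake then loads each new staged file in micro-batches
    cur.close()
    conn.close()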
TECHNICAL SKILLS:
Languages: Python, Java, Scala, R, C++, SQL, PL/SQL.
Big Data Ecosystem: HDFS, Apache NiFi, MapReduce, Sqoop, Cloudera Manager, Hortonworks, HBase, Flume, Pig, Hive, Oozie, Impala, Zookeeper, Ambari, Storm, Spark, and Kafka
Cloud Technologies: Compute Engine, Cloud Functions, BigQuery, GCR, GKE, Dataproc, Dataflow, App Engine, Knative, Cloud Storage, Cloud Datastore, AWS S3, AWS Redshift, AWS EMR, AWS EC2, AWS Lambda, AWS Glue, Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory
Web Technologies: HTML, Ajax, CSS, Bootstrap, XML, JSON, jQuery, JavaScript, Bash, Shell Scripting, Ruby, Groovy, YAML.
Web Services: SOAP, RESTful.
Web Server: Apache HTTP Server.
Databases: Oracle, MySQL, PostgreSQL, MS SQL Server, Cassandra, MongoDB, CouchDB.
Editors: Notepad++, Sublime Text 3, PyCharm, Visual Studio.
Visualization Tools: Tableau, Power BI, Qlik Sense.
Operating System: Mac OS, Ubuntu, CentOS, Red Hat, Windows, Linux.
Version Control: GitHub, Git, Bitbucket.
SDLC Methods: Scrum, Agile.
PROFESSIONAL EXPERIENCE:
Confidential, Schaumburg, IL
Sr. Big Data Engineer
Responsibilities:
- Insightful about GCP platform services like Compute Engine, Cloud Functions, Container Security, App Engine, Knative, Cloud Storage, Persistent Disk, Google Kubernetes Engine, and Container Registry.
- Designed and architected the various layers of the Data Lake, designed the star schema in BigQuery, and loaded data every 15 minutes on an incremental basis into the BigQuery raw and UDM layers using SOQL, Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil, and shell scripts.
- Proficient in creating GCP firewall rules to allow or deny traffic to and from VM instances based on specified configuration, and configured GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and reducing latency.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (a DAG sketch follows this list).
- Wrote a program to download a SQL dump from the equipment maintenance site and load it into a GCS bucket; on the other side, loaded this SQL dump from the GCS bucket into MySQL, and loaded the data from MySQL into BigQuery using Python, Scala, Spark, and Dataproc (the Spark-to-BigQuery step is sketched after this list).
- Used Pub/Sub topics and subscriptions so that when a file is dropped in the GCS bucket, the topic is triggered and the subscribers of that topic start executing the processing scripts (a subscriber sketch follows this list).
- Experience in reducing latency and integrating with Cloud Monitoring and Cloud Logging for getting latency metrics.
- Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and in coordinating tasks among the team.
- Significantly reduced the execution time of migrating a cluster from the Hortonworks distribution to Cloudera, and used Cloudera Manager to manage and monitor the Hadoop cluster. Worked with Cloudera, Hortonworks, and other distributions on both on-premises and cloud clusters.
- Installed, configured, and maintained the Hadoop cluster and Hadoop ecosystem components such as Hive, Pig, HBase, Zookeeper, and Sqoop for application development.
- Ingested data from Oracle RDBMS and Teradata systems into the Hadoop data lake; familiar with Spark RDDs, the DataFrame, Dataset, and Data Source APIs, Spark SQL, SparkContext, and Spark Streaming.
- For structured data, used Dataproc, which allows running Spark DataFrames on Dataproc tables; loaded Hive tables with data, wrote Hive queries that run on MapReduce, and created a customized BI tool for management teams that performs query analytics using HiveQL.
- Spark experience, including using Spark context, Spark-SQL, Data Frames, pair RDDs, and Spark YARN to improve the performance and optimization of existing Hadoop methods.
- Load D-Stream data into Spark RDD and compute in memory to generate an output and Developed Spark code for quicker processing and data transformations using Scala and Spark-SQL.
- Extracted real-time data feeds using Kafka, processed the core job using Spark Streaming into Resilient Distributed Datasets (RDDs), processed them as DataFrames, and saved the results in Parquet format in HDFS and NoSQL databases.
- Worked with delimited text files, click stream log files, Apache log files, Avro files, JSON files, and XML files, among other file types. Mastered the use of many columnar file formats such as RC, ORC, and Parquet.
- Responsible for loading and transforming huge sets of structured, semi-structured, and unstructured data by satisfying all the V’s of Big data technology.
- Developed code to extract data from databases such as Oracle and MySQL and feed it into the GCP Data Pipeline platform.
- Experience tuning Big Data clusters such as Hadoop for performance and reducing the execution time of the files to be loaded and retrieved.
- Used Spark SQL to implement rule-based fraud detection. Developed complicated Hive queries to extract data from a variety of sources (Data Lake) and store it in HDFS.
- Used Airflow to automate loading data into HDFS and pre-processing it with Pig, and worked on Hive table creation and partitioning.
- Used the Kafka producer and consumer APIs in Java to transport data from an application to Spark without losing any data.
- Managed the Metadata for the ETL operations that were utilized to populate the Data Warehouse and have experience with Docker, Mesos, and orchestrating clusters using Kubernetes.
- Used CA Agile Rally to create features and use cases, track bugs, add test cases from Red Hat Studio via Jenkins, and keep track of the project.
- Easily manage and collaborate on the project by using GIT version control to keep track of changes in source code.
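A hedged Airflow DAG sketch for the GCP ETL flow referenced in the bullets above: submit a PySpark transform to Dataproc, then load the resulting files from GCS into BigQuery. The project, region, cluster, bucket, dataset, and file names are hypothetical placeholders, and the operators assume the apache-airflow-providers-google package.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    PROJECT_ID = "my-gcp-project"          # placeholder
    REGION = "us-central1"
    CLUSTER = "etl-cluster"                # placeholder

    pyspark_job = {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},  # placeholder
    }

    with DAG(
        dag_id="gcs_to_bigquery_incremental",
        start_date=datetime(2023, 1, 1),
        schedule_interval="*/15 * * * *",   # incremental load every 15 minutes
        catchup=False,
    ) as dag:
        transform = DataprocSubmitJobOperator(
            task_id="spark_transform",
            project_id=PROJECT_ID,
            region=REGION,
            job=pyspark_job,
        )
        load = GCSToBigQueryOperator(
            task_id="load_to_bq_raw",
            bucket="my-bucket",                               # placeholder landing bucket
            source_objects=["udm/events/*.parquet"],
            destination_project_dataset_table=f"{PROJECT_ID}.raw.events",
            source_format="PARQUET",
            write_disposition="WRITE_APPEND",
        )
        transform >> load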
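A hedged PySpark sketch of the MySQL-to-BigQuery step described above, as it might run on Dataproc: read a table over JDBC, apply a light transform, and write to BigQuery with the spark-bigquery connector. Hostnames, credentials, table and bucket names are hypothetical placeholders; the MySQL JDBC driver and the connector jar are assumed to be on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import current_timestamp

    spark = SparkSession.builder.appName("mysql-to-bigquery").getOrCreate()

    # Read the restored SQL dump from MySQL over JDBC (placeholder host, db, table, credentials).
    maintenance = (spark.read.format("jdbc")
                   .option("url", "jdbc:mysql://10.0.0.5:3306/maintenance")
                   .option("driver", "com.mysql.cj.jdbc.Driver")
                   .option("dbtable", "equipment_events")
                   .option("user", "etl_user")
                   .option("password", "***")
                   .load())

    # Tag each row with a load timestamp before landing it in the raw layer.
    enriched = maintenance.withColumn("load_ts", current_timestamp())

    # Write to BigQuery via the spark-bigquery connector, staging through GCS.
    (enriched.write.format("bigquery")
     .option("table", "my-gcp-project.raw.equipment_events")   # placeholder dataset.table
     .option("temporaryGcsBucket", "my-etl-temp-bucket")        # placeholder staging bucket
     .mode("append")
     .save())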
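A hedged sketch of the Pub/Sub-driven trigger described above: the GCS bucket publishes object notifications to a topic, and this subscriber kicks off a processing script for each new file. The project, subscription, and script names are hypothetical placeholders.

    import json
    import subprocess

    from google.cloud import pubsub_v1

    PROJECT_ID = "my-gcp-project"                # placeholder
    SUBSCRIPTION_ID = "gcs-file-arrivals-sub"    # placeholder

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

    def callback(message):
        # GCS notifications carry the bucket and object name as a JSON payload.
        event = json.loads(message.data.decode("utf-8"))
        gcs_uri = f"gs://{event['bucket']}/{event['name']}"
        # Hypothetical processing script invoked for the newly arrived file.
        subprocess.run(["bash", "process_file.sh", gcs_uri], check=True)
        message.ack()

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Listening on {subscription_path} ...")
    try:
        streaming_pull.result()      # block and process messages as they arrive
    except KeyboardInterrupt:
        streaming_pull.cancel()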
Environment: Hadoop, Scala, Spark, Spark SQL, Sqoop, HBase/MapR DB, Apache Drill, Hive, MapReduce, HDFS, Databricks, Jupyter Notebook, PyCharm, Maven, Jenkins, Java 7 (JDK 1.7), Eclipse, Oracle 10g, PL/SQL, Linux.
Confidential, Richmond, Virginia
Big Data Engineer
Responsibilities:
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework (an illustrative sketch follows this list).
- Created AWS EC2 instances to execute Hadoop/Spark jobs on AWS Elastic MapReduce (EMR) to store the results in S3 buckets and used JIT servers.
- Used Azure Data Factory, SQL API, Mongo API, integrated data from MongoDB, MS SQL, cloud (Blob, Azure SQL DB, Azure Cosmos DB).
- Modeled complex ETL jobs that transform data visually with data flows or by using compute services such as Azure Databricks, Azure Blob Storage, Azure SQL Database, and Cosmos DB.
- Created Big Data Solutions that allowed the business and technology teams to make data-driven decisions about how to best acquire customers and provide them with business solutions.
- Build and maintain data pipelines and data products to ingest and process large amounts of structured and unstructured data from a variety of sources.
- Analyzing the data needs, migrating the data into an Enterprise data lake, building data products, and generating reports using tools.
- Build real-time and batch-based ETL pipelines with a strong understanding of big data technologies and distributed processing frameworks.
- Strong understanding of big data clusters and their architecture, and experience in building and optimizing big data ETL pipelines.
- Created producer and consumer APIs, managed by ZooKeeper, while working with messaging systems such as Kafka.
- Built workflows and supporting applications for Continuous Integration/Continuous Delivery using Jenkins; working knowledge of MapReduce, Hive, and Spark jobs.
- Used Databricks to handle Spark tasks for faster processing, since its caching layer allows data access even faster than plain Apache Spark.
- Used Scala, Python, and Java to develop and maintain Hadoop applications, converted PL/SQL packages to Scala objects, and used Sqoop to load data into a Cloudera Hadoop cluster.
- Spark performance and optimization experience using Spark context, Spark-SQL, Data Frames, pair RDDs, and Spark YARN.
- Utilized Spark RDD to generate the output from the given data using in-memory data computation and Worked with a variety of file types, including text, Avro, ORC, and parquet.
- Created batch and real-time ETL workflows to migrate data from a variety of data sources, such as SQL Server, Netezza, and Kafka.
- Developed Scala scripts, UDFs, and queries for data aggregation and transformation, and wrote data back into the RDBMS using Sqoop, working with both DataFrames/Spark SQL and RDD/MapReduce in Spark (a UDF sketch follows this list).
- Worked with the production support team to ensure that the monthly production jobs' SLAs were met and to fix any issues that arose.
- Participated in systems engineering operations such as deploying code to production, branching code in GIT, and configuring environments.
- Developed Linux Shell Scripts (Bash) to perform Hadoop jobs in non-production and production environments.
- Loaded data from numerous sources into HDFS and created reports using Tableau, and exported the relevant business data to an RDBMS using Sqoop, allowing the BI team to build reports based on the data.
- Used Hive, Map Reduce, and HDFS to perform transformations, cleaning, and filtering on imported data, and then loaded the finished data into HDFS.
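A hedged sketch of the S3 batch-transform pattern described above (the production jobs were written in Scala; this PySpark version is illustrative only). Bucket names, paths, and column names are hypothetical placeholders; on EMR the s3:// scheme is handled by EMRFS, elsewhere s3a:// with hadoop-aws is the usual choice.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("s3-batch-transform").getOrCreate()

    # Read raw CSV files from the landing bucket (placeholder path).
    orders = spark.read.option("header", "true").csv("s3://my-raw-bucket/orders/2023-01/")

    # Basic cleansing: drop null keys, derive a date column, and de-duplicate.
    cleaned = (orders
               .filter(col("order_id").isNotNull())
               .withColumn("order_date", to_date(col("order_ts")))
               .dropDuplicates(["order_id"]))

    # Write curated Parquet back to S3, partitioned by date (placeholder bucket).
    (cleaned.write.mode("overwrite")
     .partitionBy("order_date")
     .parquet("s3://my-curated-bucket/orders/"))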
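A hedged sketch of a Spark UDF used for the kind of aggregation and transformation work described above (the original UDFs were in Scala; this PySpark version is illustrative). The column names and sample rows are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col, sum as spark_sum
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    def normalize_region(raw):
        # Collapse inconsistent source codes into a canonical region label.
        return (raw or "").strip().upper().replace(" ", "_")

    normalize_region_udf = udf(normalize_region, StringType())

    sales = spark.createDataFrame(
        [("us east ", 120.0), ("US_EAST", 80.0), ("eu-west", 45.5)],
        ["region_raw", "amount"],
    )

    # Apply the UDF, then aggregate totals per normalized region.
    summary = (sales
               .withColumn("region", normalize_region_udf(col("region_raw")))
               .groupBy("region")
               .agg(spark_sum("amount").alias("total_amount")))

    summary.show()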
Environment: Hadoop, Scala, Spark, Spark SQL, Sqoop, HBase/MapR DB, Apache Drill, Hive, MapReduce, HDFS, Maven, Jenkins, Java 7 (JDK 1.7), Eclipse, Oracle 10g, PL/SQL, Linux, Tidal
Confidential
Big Data Developer
Responsibilities:
- Utilized Azure's ETL service, Azure Data Factory (ADF), to ingest data from legacy, disparate data stores - SAP (HANA), SFTP servers, and Cloudera Hadoop HDFS - into Azure Data Lake Storage (Gen2).
- Modeled complex ETL jobs that transform data visually with data flows or by using compute services such as Azure Databricks, Azure Blob Storage, Azure SQL Database, and Cosmos DB.
- Executed ETL procedures to analyze the business requirements for health care service data.
- Performed data integration and created data pipelines using Java to ensure data dependability and availability, allowing for continuous data flow.
- Created and deployed service-oriented applications using Apache Spark and the Scala programming language.
- Used Hive DDLs and Hive Query Language (HQL) extensively to analyze partitioned and bucketed data and compute various metrics for reporting, as well as write MapReduce and Pig jobs to conduct data transformations as needed.
- Created Hive tables (external and internal) with static and dynamic partitions, performed bucketing on the tables to improve efficiency, and developed Pig scripts to help do analytics on JSON and XML data (a partitioned-table sketch follows this list).
- Performed complete data analysis using Hive custom UDFs and used bulk and non-bulk loads to load data into an HBase/MapR database.
- Used HBase Shell and HBase Client API to import data into HBase NoSQL and used Java extensively to work on HBase/MapR DB Admin and Client APIs.
- Configured Tidal workflows to automate data extraction, processing, and analysis. For continuous integration and continuous delivery (CI/CD), used Git, Jenkins, TeamCity, and SonarQube.
- Tested features and functionalities for the health care application and upgraded the project in accordance with business users' specifications and requirements.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available for analysis in downstream systems, as well as Kafka Streams to configure Spark streaming to acquire data and then store it in HDFS.
- Migrated data into Hadoop (HDFS) from traditional databases including MS SQL Server, MySQL, and Oracle.
- Streamed data from numerous sources using the Spark Streaming API and created Apache Spark apps to clean and validate data before it was uploaded to the cloud.
- Worked on converting MapReduce jobs to Spark jobs and loading structured and semi-structured data into Spark clusters using DataFrames and SparkSQL.
- Created a pipeline to automate the operations of importing data into HDFS and pre-processing it in Oozie with Pig.
- Responsible for cluster maintenance, monitoring, commissioning and decommissioning data nodes, troubleshooting, managing and reviewing data backups, and managing and reviewing log files.
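A hedged sketch of the partitioned and bucketed Hive table work described above, issued through spark.sql with Hive support enabled. The database, table, column names, paths, and bucket count are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-ddl-example")
             .enableHiveSupport()
             .getOrCreate())

    # External table over raw claim files, partitioned by ingest date and
    # bucketed by member_id to speed up joins and sampling.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS healthcare.claims (
            claim_id   STRING,
            member_id  STRING,
            amount     DOUBLE
        )
        PARTITIONED BY (ingest_date STRING)
        CLUSTERED BY (member_id) INTO 32 BUCKETS
        STORED AS ORC
        LOCATION '/data/healthcare/claims'
    """)

    # Dynamic-partition insert from a staging table (partition column last).
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT INTO TABLE healthcare.claims PARTITION (ingest_date)
        SELECT claim_id, member_id, amount, ingest_date
        FROM healthcare.claims_staging
    """)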
Environment: Hadoop, Scala, Spark, Spark SQL, Sqoop, HBase/MapR DB, Hive, MapReduce, HDFS, Maven, Jenkins, Java 7 (JDK 1.7), Eclipse, Oracle 10g, PL/SQL, Linux, Tidal
Confidential
Data Analyst
Responsibilities:
- Collaborated closely with Data Analysts on test designs, test cases, and test execution, as well as fully documenting and reporting the results.
- Worked diligently with cross-functional Data warehouse members to establish a connection to a SQL Server for the purpose of creating spreadsheets by importing data into the SQL Server.
- Experience in functional, regression, and end-to-end testing, as well as planning, designing, and executing complex testing solutions.
- Developed SQL queries based on desired data from several relational customer databases using joins, grouping, nested sub-queries, and aggregation (a query sketch follows this list).
- Developed numerous dashboards across all areas to analyze data from the client's business and worked on reporting. Performed data profiling and data validation to ensure data accuracy between the warehouse and source systems.
- Manipulated and cleansed the data by sub-setting, sorting, and pivoting on a need basis and worked on CSV files to get input from the SQL Server database.
- Created data reports in Excel for simple sharing and used SSRS to aid with statistical data analysis and report decision-making.
- Performed data analysis and data profiling, worked on data transformations and data quality norms, addressed data quality concerns, and participated in back-end testing.
- Responsible for analysing and interpreting complex data reporting and/or performance trend analysis.
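A hedged sketch of the kind of join/aggregation query described above, run against SQL Server through pyodbc. The server, database, table, and column names are hypothetical placeholders.

    import pyodbc

    # Connect to the reporting database (placeholder DSN details).
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=reporting-db;DATABASE=CustomerDW;"
        "Trusted_Connection=yes;"
    )

    # Join customers to orders, restrict to the last three months, and
    # aggregate revenue per region.
    query = """
    SELECT c.region,
           COUNT(DISTINCT c.customer_id) AS customers,
           SUM(o.order_total)            AS total_revenue
    FROM   dbo.Customers c
    JOIN   dbo.Orders    o ON o.customer_id = c.customer_id
    WHERE  o.order_date >= DATEADD(MONTH, -3, GETDATE())
    GROUP BY c.region
    HAVING SUM(o.order_total) > 10000
    ORDER BY total_revenue DESC;
    """

    cursor = conn.cursor()
    for region, customers, total_revenue in cursor.execute(query):
        print(region, customers, total_revenue)
    conn.close()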
Environment: SQL Server, MS Excel 2010, VLOOKUP, T-SQL, SSRS, SSIS, OLAP.