Sr Big Data Engineer Resume Pittsburgh, Pennsylvania - Hire IT People

SUMMARY

8 years of overall experience as a Big Data Engineer, ETL Developer and Java Developer, comprising design, development and implementation of data models for enterprise-level applications.
Collaborated closely with business product, production support and engineering teams on a regular basis to dive deep into data, enable effective decision making and support analytics platforms.
Excellent knowledge of Hadoop cluster architecture and its key concepts - Distributed file systems, Parallel processing, High availability, Fault tolerance and Scalability.
Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
Extensive experience in Text Analytics, developing different Statistical Machine Learning solutions to various business problems and generating data visualizations using Python and R.
Strong experience working with Amazon cloud services like EMR, Redshift, DynamoDB, Lambda, Athena, Glue, S3, API Gateway, RDS, CloudWatch for efficient processing of Big Data.
Hands-on experience building PySpark, Spark Java and Scala applications for batch and stream processing involving transformations, actions and Spark SQL queries on RDDs, DataFrames and Datasets.
Extensively used automation tools like Docker, Jenkins, Terraform, Ansible, Puppet.
Hands on experience in J2SE, J2EE, JSP, Servlets, EJB, WebLogic, WebSphere, Tomcat, JDBC, Python and JavaScript.
Strong database skills in IBM DB2 and Oracle; proficient in database development, including constraints, indexes, views, stored procedures, triggers and cursors.
Highly involved in all facets of SDLC using Waterfall and Agile Scrum methodologies.
Experience in working with Data warehousing concepts like Star Schema, Snowflake Schema, DataMarts, Kimball Methodology used in Relational and Multidimensional data modeling.
Used IAM, Kerberos and Ranger for security compliance.
Proficient in handling and ingesting terabytes of streaming data (Kafka, Spark Streaming, Storm) and batch data, and in automation and scheduling (Oozie, Airflow).
Strong experience leveraging different file formats like Avro, ORC, Parquet, JSON and Flat files.
Sound knowledge of normalization and de-normalization techniques on OLAP and OLTP systems.
Worked on Google Cloud Platform (GCP) services like Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring and Cloud Deployment Manager.
Good experience with version control tools: Bitbucket, GitHub and Git.
Experience with Jira, Confluence and Rally for project management, and with the Oozie and Airflow scheduling tools.
Expertise in building data pipelines using PySpark, Kafka, Presto, Airflow, Azure Data Factory and SQL Server.
Created scripts to load data into the Teradata database using load utilities (FastLoad, MultiLoad, FastExport, BTEQ, ctl).
Transferred, monitored and ingested data using Apache NiFi.
Sound knowledge of and hands-on experience with NLP, MapR, IBM InfoSphere suite, Storm, Flink, Talend, ER Studio and Ansible.
Experienced in providing support for AWS cloud infrastructure automation with multiple tools including Gradle, Chef, Nexus and Docker, and monitoring tools such as Splunk and CloudWatch.
Experience using different IDEs like Eclipse, PyCharm, IntelliJ and Google Colab.
Built reporting stories using BI tools like Microsoft Power BI, Tableau, Alteryx, Qlik, SAP, SAS and Looker.
Created Airflow Scheduling scripts in Python.
Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
Hands-on working experience with RESTful APIs, API life cycle management and consuming RESTful services.
Good working experience in Agile/Scrum methodologies, including participating in scrum calls on project analysis and development.
Extensive experience in developing Bash, T-SQL and PL/SQL scripts.
Work successfully in a fast-paced environment, both independently and collaboratively. Expertise in complex troubleshooting, root-cause analysis and solution development.
Real-time experience using Azure services: Azure Portal, Azure Cosmos DB, Azure Synapse Analytics, Azure Data Lake Storage, Azure Data Factory, Azure Stream Analytics, Azure Databricks, Azure Log Analytics and Azure Blob Storage.

TECHNICAL SKILLS

Programming Languages: Python, Scala, SQL, Java, Shell Scripting

Web Technologies: HTML, CSS, XML, AJAX, JSP, Servlets, JavaScript, REST

Big Data Stack: Hadoop, Spark, MapReduce, Hive, Pig, Yarn, Sqoop, Flume, Oozie, Kafka, Impala, Storm

Amazon Web Services: S3, VPC, EC2, EMR, Redshift, DynamoDB, RDS, IAM, Lambda, Athena, Glue, CloudWatch, Kinesis

Azure Cloud Services: Databricks, ADF, Azure CDN, Cosmos DB

Relational databases: Oracle, MySQL, SQL Server, PostgreSQL, Teradata, Snowflake

NoSQL databases: MongoDB, Cassandra, HBase

Version Control Systems: Bitbucket, Git, SVN, GitHub

Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Tweepy, Seaborn, TensorFlow, Keras, MLlib, Boto3

IDEs: PyCharm, IntelliJ IDEA, Jupyter Notebooks, Google Colab, Eclipse

Operating Systems: Unix, Linux, Windows, LOCUS

PROFESSIONAL EXPERIENCE

Confidential, Pittsburgh, Pennsylvania

Sr Big Data Engineer

Responsibilities:

Applied knowledge of architecting and designing streamlined solutions in line with the overall roadmap.
Built various data pipelines using Hadoop, Spark with Python, Azure Data Factory and Informatica to ingest data from various types of sources.
Built a Spark pipeline using PySpark to evaluate complex business rules such as health risk score, member match and predictive analytics to analyze members' health.
Worked on scheduling tools like Tivoli Workload Scheduler (TWS) to configure and automate script execution.
Wrote reporting stories on various dashboards using data visualization tools (Microsoft Power BI, Tableau, Alteryx, Qlik).
Delivered a complete end-to-end SOX-compliant solution that takes data from source systems via ODBC connectors into Azure Data Lake using Databricks / Azure Data Factory, builds business transformation logic with alerts, and reduces costs by automating report builds.
Created Azure Data Factory (ADF) pipelines using Azure Data Lake Gen2.
Created Informatica Cloud mappings by accessing multiple systems and merged them into one database.
Developed and designed Snowflake scripts and executed them through Unix and Windows batch files.
Used Python scripting/PySpark/Spark SQL to parse unstructured data in Azure Databricks.
Migrated on-premise SQL Server ETL process to Azure SQL Server.
Used Stored Procedure, Lookup, Execute Pipeline, Data Flow, Copy Data and Azure Function features in ADF.
Extracted keywords with 60% accuracy from unstructured data using Talend Open Studio and Java, performed data fusion of the unstructured data, and loaded the results into the analytical zone of the Snowflake data lake.
Built Informatica jobs to extract data from multiple Telegence billing sources and perform complex computations before loading into Teradata eCDW.
Worked with Erwin Data Modeler on conceptual/logical dimensional models.
Used Sqoop to load data from Teradata into HDFS.
Used Bitbucket for version control, Jira for project management and for tracking issues and bugs, and Jenkins for continuous integration.
Reduced access time by refactoring data models, optimizing queries and implementing a Redis cache to support Snowflake.

Environment: Hadoop, Spark, Azure Databricks, Informatica, Power BI, Tableau, Qlikview, Azure Data Lake, Azure SQL Server, Snowflake, Sqoop, Teradata, Bitbucket, Jenkins, Jira

Confidential

Big Data Developer

Responsibilities:

Installed Hadoop, MapReduce, HDFS and AWS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
Architected, designed and developed business applications and data marts for reporting.
Implemented Spark applications in Scala using the Spark SQL API, DataFrames and pair RDDs for faster processing of data, and created RDDs, DataFrames and Datasets.
Worked extensively on building NiFi data pipelines in a Docker container environment during the development phase.
Designed both 3NF data models for ODS, OLTP and OLAP systems and dimensional data models using star and snowflake schemas.
Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) on EC2.
Analyzed and optimized pertinent data stored in Snowflake using PySpark and SparkSQL.
Ingested data into the Snowflake cloud data warehouse using Snowpipe. Extensive experience with micro-batching to ingest millions of files into Snowflake as files arrive in the staging area.
Worked on migrating data from Teradata to Google BigQuery on Google Cloud and developed reports using Looker (a third-generation visualization tool for analytics).
Performed data scrubbing and processing with Apache NiFi, and used it for workflow automation and coordination.
Architected enterprise data models & subsystems for optimal storage and retrieval to assist in marketing & email campaigns.
Implemented custom solution to match & merge customer records from different data sources using Record Linkage Lib in Python.
Used Zookeeper to provide coordination services to the cluster.
Responsible for migrating terabytes of on-premise enterprise data to AWS S3.
Involved in ingesting large volumes of credit data from multiple provider data sources to AWS S3.
Implemented data warehouse solutions in AWS Redshift by migrating the data to Redshift from S3.
Automated the jobs and data pipelines using AWS Step Functions and AWS Lambda, and configured various performance metrics using AWS CloudWatch.

Environment: HDFS, Scala, Pig, Hive, Apache NiFi, Snowflake SaaS, Apache Zookeeper, AWS S3, AWS Redshift, AWS Lambda, AWS CloudWatch, Python 3.3

Confidential

Big Data Engineer

Responsibilities:

Developed Kafka-Spark streaming jobs to read real-time streaming messages and write them to target systems such as Solace, HDFS and HBase, giving many teams near-real-time data for real-time analytics or batch processing.
Contributed to developing a REST API to collect clickstream, service and event-log data from various endpoints.
Stored incoming data in the Snowflake staging area. Created numerous ODI interfaces and loaded data into Snowflake DB.
Worked on Amazon Redshift to consolidate all data warehouses into one.
Worked on various business requirements, analyzing large data sets using Hive, Spark SQL and MapReduce and loading the results back to Elasticsearch, which helped the business build weekly and monthly reports to make business decisions.
Developed reusable UDFs in Java for Hive and Pig, which saved time for other team members.
Designed SQL, SSIS, and Python based batch and real-time ETL pipelines to extract data from transactional and operational databases and load the data into target databases/data warehouses.
Created Pig scripts to join HDFS files and create the file structures needed to load records (tuples and bags) into the HBase model.
Created Ruby scripts to map the tuple fields from the joined data files. Executed Impala and Hive queries to validate Pig and Ruby jobs.
Optimized query performance by modifying T-SQL queries, removing unnecessary columns, and eliminating redundant and inconsistent data.
Implemented Event Handlers and Error Handling in SSIS packages and notified process results to various user communities.

Environment: Apache Kafka, HDFS, HBase, REST Web Services, AWS Elasticsearch, SSIS, Impala, Hive, Pig, Apache Spark, Python, Java, T-SQL, Ruby

Confidential

Database Engineer

Responsibilities:

Worked on data pipelining from MySQL to Hive using Sqoop.
Developed an IAM application using Spring, Java EE, Oracle, Okta, Redis and Postman with a microservices architecture and REST services.
Developed ETL transformations for aggregation of key-value pairs in the MapReduce framework.
Worked on designing data model and dimensional model for Oracle database and data warehouse.
Worked on ETL job scheduling using the Pan and Kitchen ETL tools to synchronize heterogeneous databases.
Worked on performance tuning and optimization through indexes using Microsoft SSRS.
Involved in design, development and integration of databases and data marts using PDI and PL/SQL procedures for diagnosis studies.
Worked on OLAP, hierarchies, performance tuning, views, materialized views and trigger auditing.
Provided high-performance solutions for application users to enroll and record diagnoses using PL/SQL cursors, procedures and functions.

Environment: MySQL, Hive, Sqoop, Hadoop, Oracle 11g, SSRS, Command Prompt, Git
